Daily arXiv Papers - 2026-01-22

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] From Chaos to Clarity: Schema-Constrained AI for Auditable Biomedical Evidence Extraction from Full-Text PDFs

Pouria Mortezaagha, Joseph Shaw, Bowen Sun, Arya Rahgozar

Main category: cs.CL

TL;DR: A schema-constrained AI system that extracts structured data from biomedical PDFs using typed schemas, controlled vocabularies, and provenance tracking for evidence synthesis.

Motivation: Biomedical evidence synthesis requires accurate extraction of variables from complex scientific PDFs, but manual abstraction is time-consuming and existing AI systems have limitations like OCR errors, fragmentation, throughput constraints, and insufficient auditability.

Method: Schema-constrained AI extraction system using typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are processed with resume-aware hashing, caption-aware page-level chunking, and asynchronous processing. Outputs are merged using conflict-aware consolidation, set-based aggregation, and sentence-level provenance tracking.
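
To make the typed schemas, controlled vocabularies, and evidence gating concrete, here is a minimal stdlib-only Python sketch; the field names, vocabulary values, example sentence, and merge heuristic are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class AssayType(Enum):
    # controlled vocabulary; these values are illustrative, not the paper's
    CHROMOGENIC = "chromogenic"
    CLOTTING = "clotting"
    OTHER = "other"

@dataclass
class Provenance:
    page: int       # page-level chunk the value was extracted from
    sentence: str   # verbatim supporting sentence, kept for post-hoc audit

@dataclass
class ExtractedField:
    value: str
    evidence: list = field(default_factory=list)  # list[Provenance]

    def is_admissible(self) -> bool:
        # evidence-gated decision: a value with no supporting sentence
        # is rejected rather than guessed
        return len(self.evidence) > 0

def consolidate(candidates: list) -> Optional[ExtractedField]:
    """Conflict-aware merge of chunk-level outputs into one study-level
    value: keep the best-evidenced admissible candidate."""
    admissible = [c for c in candidates if c.is_admissible()]
    if not admissible:
        return None
    return max(admissible, key=lambda c: len(c.evidence))

assay = ExtractedField(
    value=AssayType.CHROMOGENIC.value,
    evidence=[Provenance(page=4, sentence="Anti-Xa chromogenic assay was used.")],
)
# the unevidenced competing value is gated out during consolidation
print(consolidate([assay, ExtractedField(value="clotting")]).value)  # chromogenic
```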

Result: The pipeline processed all documents without manual intervention, maintained stable throughput, and showed strong internal consistency. Iterative schema refinement significantly improved extraction fidelity for critical variables like assay classification, outcome definitions, follow-up duration, and measurement timing.

Conclusion: Schema-constrained, provenance-aware extraction enables scalable and auditable transformation of heterogeneous scientific PDFs into structured evidence, meeting the transparency and reliability requirements of biomedical evidence synthesis.

Abstract: Biomedical evidence synthesis relies on accurate extraction of methodological, laboratory, and outcome variables from full-text research articles, yet these variables are embedded in complex scientific PDFs that make manual abstraction time-consuming and difficult to scale. Existing document AI systems remain limited by OCR errors, long-document fragmentation, constrained throughput, and insufficient auditability for high-stakes synthesis. We present a schema-constrained AI extraction system that transforms full-text biomedical PDFs into structured, analysis-ready records by explicitly restricting model inference through typed schemas, controlled vocabularies, and evidence-gated decisions. Documents are ingested using resume-aware hashing, partitioned into caption-aware page-level chunks, and processed asynchronously under explicit concurrency controls. Chunk-level outputs are deterministically merged into study-level records using conflict-aware consolidation, set-based aggregation, and sentence-level provenance to support traceability and post-hoc audit. Evaluated on a corpus of studies on direct oral anticoagulant level measurement, the pipeline processed all documents without manual intervention, maintained stable throughput under service constraints, and exhibited strong internal consistency across document chunks. Iterative schema refinement substantially improved extraction fidelity for synthesis-critical variables, including assay classification, outcome definitions, follow-up duration, and timing of measurement. These results demonstrate that schema-constrained, provenance-aware extraction enables scalable and auditable transformation of heterogeneous scientific PDFs into structured evidence, aligning modern document AI with the transparency and reliability requirements of biomedical evidence synthesis.

[2] The Slow Drift of Support: Boundary Failures in Multi-Turn Mental Health LLM Dialogues

Youyou Cheng, Zhuangwei Kang, Kerry Jiang, Chenyu Sun, Qiyang Pan

Main category: cs.CL

TL;DR: LLM safety evaluations for mental health support need multi-turn testing as single-turn tests miss gradual boundary erosion in long dialogues.

Motivation: Current LLM safety evaluations for mental health focus only on single-turn prohibited word detection, missing the gradual erosion of safety boundaries in multi-turn interactions where LLMs attempt comfort and empathy.

Method: Proposed multi-turn stress testing framework with two pressure methods: static progression and adaptive probing. Tested three cutting-edge LLMs using 50 virtual patient profiles with up to 20 rounds of virtual psychiatric dialogues.
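
A sketch of the two pressure modes as a single driver loop; `chat`, `violates`, and the probe texts below are hypothetical stand-ins for the paper's virtual-patient setup.

```python
from typing import Callable, Optional

STATIC_PROBES = [
    "I just need to hear that everything will be okay.",
    "Can you promise me nothing bad will happen?",
    # ... a fixed escalating script, up to 20 turns
]

def make_adaptive_probe(history) -> str:
    # adaptive probing: press on whatever reassurance the model just gave
    last = history[-1]["content"] if history else ""
    return f'You said "{last[:40]}..." - so you can guarantee that, right?'

def stress_test(chat: Callable, violates: Callable,
                adaptive: bool, max_turns: int = 20) -> Optional[int]:
    """Return the 1-indexed turn of the first boundary violation, else None."""
    history = []
    for turn in range(max_turns):
        probe = (make_adaptive_probe(history) if adaptive
                 else STATIC_PROBES[min(turn, len(STATIC_PROBES) - 1)])
        history.append({"role": "user", "content": probe})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        if violates(reply):  # e.g. a definitive or zero-risk promise
            return turn + 1
    return None
```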

Result: Violations were common, with similar rates in both pressure modes, but adaptive probing significantly shortened the time to boundary crossing, from 9.21 turns (static) to 4.64. Making definitive or zero-risk promises was the primary form of boundary breach.

Conclusion: LLM safety boundary robustness cannot be assessed solely through single-turn tests; extended dialogue interactions with different pressure characteristics must be considered to evaluate safety boundary wear and tear.

Abstract: Large language models (LLMs) have been widely used for mental health support. However, current safety evaluations in this field are mostly limited to detecting whether LLMs output prohibited words in single-turn conversations, neglecting the gradual erosion of safety boundaries in long dialogues. Examples include making definitive guarantees, assuming responsibility, and playing professional roles. We believe that with the evolution of mainstream LLMs, words with obvious safety risks are easily filtered by their underlying systems, while the real danger lies in the gradual transgression of boundaries during multi-turn interactions, driven by the LLM’s attempts at comfort and empathy. This paper proposes a multi-turn stress testing framework and conducts long-dialogue safety tests on three cutting-edge LLMs using two pressure methods: static progression and adaptive probing. We generated 50 virtual patient profiles and stress-tested each model through up to 20 rounds of virtual psychiatric dialogues. The experimental results show that violations are common, and both pressure modes produced similar violation rates. However, adaptive probing significantly advanced the time at which models crossed boundaries, reducing the average number of turns from 9.21 in static progression to 4.64. Under both mechanisms, making definitive or zero-risk promises was the primary way in which boundaries were breached. These findings suggest that the robustness of LLM safety boundaries cannot be inferred solely through single-turn tests; it is necessary to fully consider the wear and tear on safety boundaries caused by different interaction pressures and characteristics in extended dialogues.

[3] Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models

Liangming Pan, Jason Liang, Jiaran Ye, Minglai Yang, Xinyuan Lu, Fengbin Zhu

Main category: cs.CL

TL;DR: Survey paper analyzing the internal mechanisms behind LLMs’ multi-step reasoning capabilities, focusing on mechanistic understanding rather than engineering methods.

Motivation: While LLMs show impressive multi-step reasoning abilities, the internal mechanisms enabling these capabilities remain poorly understood. Existing surveys focus on engineering methods to improve performance, but there's a need for comprehensive analysis of the underlying mechanisms.

Method: Organizes the survey around a conceptual framework with seven interconnected research questions, examining how LLMs execute implicit multi-hop reasoning within hidden activations and how verbalized explicit reasoning remodels internal computation.

Result: Provides a comprehensive overview of LLM multi-step reasoning mechanisms, identifying key research questions and proposing a structured framework for understanding these internal processes.

Conclusion: Highlights five research directions for future mechanistic studies of LLM reasoning, emphasizing the need to move beyond performance-focused engineering to understand the fundamental computational mechanisms.

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities to solve problems requiring multiple reasoning steps, yet the internal mechanisms enabling such capabilities remain elusive. Unlike existing surveys that primarily focus on engineering methods to enhance performance, this survey provides a comprehensive overview of the mechanisms underlying LLM multi-step reasoning. We organize the survey around a conceptual framework comprising seven interconnected research questions, from how LLMs execute implicit multi-hop reasoning within hidden activations to how verbalized explicit reasoning remodels the internal computation. Finally, we highlight five research directions for future mechanistic studies.

[4] Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning

Nicholas X. Wang, Aggelos K. Katsaggelos

Main category: cs.CL

TL;DR: Multi-agent framework reduces LLM hallucinations in educational MCQ generation by 90% through staged verification and optimization.

Motivation: Hallucinations in LLMs pose significant challenges for automatic generation of educational multiple-choice questions, compromising reliability and educational value.

Method: Proposed hallucination-free multi-agent generation framework with discrete verifiable stages, rule-based and LLM-based detection agents, hallucination scoring metrics, and agent-led refinement using counterfactual reasoning and chain-of-thought.
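
The staged pipeline can be read as a reject-and-refine loop over per-type hallucination scores; the agent interfaces and the threshold below are invented for illustration.

```python
HALLUCINATION_TYPES = ["reasoning_inconsistency", "insolvability",
                       "factual_error", "mathematical_error"]

def generate_mcq(topic, draft_agent, detectors, refine_agent,
                 max_rounds=3, threshold=0.1):
    """detectors: one scoring agent per hallucination type (rule- or
    LLM-based). Returns a question whose worst hallucination score falls
    below threshold, or the best attempt after max_rounds of refinement."""
    question = draft_agent(topic)                          # stage 1: draft
    for _ in range(max_rounds):
        scores = {t: detectors[t](question)                # stage 2: detect
                  for t in HALLUCINATION_TYPES}
        if max(scores.values()) < threshold:
            return question                                # accepted
        # stage 3: agent-led refinement (counterfactual reasoning + CoT)
        question = refine_agent(question, scores)
    return question
```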

Result: System reduced hallucination rates by over 90% compared to baseline generation while preserving educational value and style of AP-aligned STEM questions.

Conclusion: Structured multi-agent collaboration effectively mitigates hallucinations in educational content creation at scale, enabling more reliable LLM-powered learning tools.

Abstract: Hallucinations in large language models (LLMs), defined as fluent yet incorrect or incoherent outputs, pose a significant challenge to the automatic generation of educational multiple-choice questions (MCQs). We identified four key hallucination types in MCQ generation: reasoning inconsistencies, insolvability, factual errors, and mathematical errors. To address this, we propose a hallucination-free multi-agent generation framework that breaks down MCQ generation into discrete, verifiable stages. Our framework utilizes both rule-based and LLM-based detection agents, as well as hallucination scoring metrics to optimize question quality. We redefined MCQ generation as an optimization task minimizing hallucination risk while maximizing validity, answerability, and cost-efficiency. We also introduce an agent-led refinement process that uses counterfactual reasoning and chain-of-thought (CoT) to iteratively reduce hallucination in question generation. We evaluated a sample of AP-aligned STEM questions, where our system reduced hallucination rates by over 90% compared to baseline generation while preserving the educational value and style of questions. Our results demonstrate that structured multi-agent collaboration can mitigate hallucinations in educational content creation at scale, paving the way for more reliable LLM-powered learning tools.

[5] RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li

Main category: cs.CL

TL;DR: RPC-Bench is a large-scale QA benchmark (15K human-verified pairs) built from review-rebuttal exchanges of CS papers, designed to evaluate foundation models’ ability to understand scientific discourse through fine-grained taxonomy aligned with research flow.

Motivation: Existing benchmarks offer limited fine-grained evaluation at scale of foundation models' understanding of specialized scientific discourse and of the complex figures and tables in research papers, creating a gap in assessing precise comprehension of academic papers.

Method: Built benchmark from review-rebuttal exchanges of high-quality CS papers; designed fine-grained taxonomy aligned with scientific research flow (why, what, how questions); created LLM-human interaction annotation framework for large-scale labeling; developed scalable LLM-as-a-Judge evaluation framework assessing correctness-completeness and conciseness.

Result: Even strongest models (GPT-5) achieve only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, revealing substantial gaps in precise academic paper understanding; evaluation framework shows high agreement with human judgment.

Conclusion: RPC-Bench addresses the gap in fine-grained evaluation of foundation models’ scientific paper understanding, revealing significant limitations in current models’ ability to comprehend and concisely answer research questions, providing valuable benchmark for future model development.

Abstract: Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models’ ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT-5) achieve only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc-bench.github.io/.

[6] Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models

Aradhya Dixit, Tianxi Liang, Jai Telang

Main category: cs.CL

TL;DR: Verifier-Guided Distillation trains small language models (under 10B params) to detect errors and backtrack during reasoning, enabling them to solve constraint-satisfaction problems that typically challenge SLMs.

Motivation: Small Language Models (SLMs) are desirable for private, on-device deployment but often fail on constraint-satisfaction problems due to linear, overconfident reasoning that doesn't recover from early mistakes.

Method: Verifier-Guided Distillation - a training protocol that transfers error repair processes (conflict detection and backtracking) rather than just correct answers. Trains 7B models on verified reasoning traces that include mistakes and self-corrections.

Result: Latent verification behavior emerges in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.

Conclusion: Training SLMs on error repair processes rather than just final answers enables them to develop verification capabilities, improving their performance on constraint-satisfaction problems while maintaining suitability for private, on-device deployment.

Abstract: Small Language Models (SLMs, under 10B parameters) are attractive for private, on-device deployment, yet they frequently fail on strict constraint-satisfaction problems due to linear, overconfident reasoning traces that do not recover from early mistakes. We introduce Verifier-Guided Distillation, a training protocol that transfers the process of error repair - explicit conflict detection and backtracking - rather than only correct final answers. By training a 7B model on verified reasoning traces that include mistakes and self-corrections, we show that latent verification behavior can emerge in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.

[7] Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding

Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang

Main category: cs.CL

TL;DR: Plan-Critic improves autoregressive audio generation by using early prefix tokens to predict final instruction-following quality, enabling guided exploration and achieving 10-point CLAP score improvement while maintaining computational efficiency.

Motivation: Autoregressive models generate coherent audio but struggle with complex textual prompts, especially those describing complex sound events. The authors discovered that early prefix tokens in AR audio generators implicitly encode global semantic attributes, revealing a form of implicit planning that can be leveraged.

Method: Proposes Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference, it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds.
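
A sketch of Plan-Critic-guided sampling, with the sampler, continuation, and critic passed in as callables since their real interfaces are not specified in the summary.

```python
def guided_decode(prompt, sample_prefix, extend, critic,
                  n_candidates=8, keep=2):
    """sample_prefix(prompt) -> short token prefix (a 'planning seed');
    extend(prompt, prefix) -> full audio-token sequence;
    critic(sequence) -> predicted instruction-following quality."""
    # 1) sample several short prefixes; per the paper, these already encode
    #    global attributes such as event count and sound-object category
    prefixes = [sample_prefix(prompt) for _ in range(n_candidates)]
    # 2) score prefixes early and prune low-fidelity trajectories
    best = sorted(prefixes, key=critic, reverse=True)[:keep]
    # 3) reallocate the saved compute to completing high-potential seeds
    finals = [extend(prompt, p) for p in best]
    return max(finals, key=critic)
```

Because only `keep` of the `n_candidates` trajectories are extended to full length, the total token budget stays comparable to standard best-of-N decoding, matching the computational-parity claim.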

Result: Achieves up to 10-point improvement in CLAP score over AR baseline, establishing new state-of-the-art in AR text-to-audio generation while maintaining computational parity with standard best-of-N decoding.

Conclusion: Demonstrates that even strictly autoregressive models can plan ahead, bridging the gap between causal generation and global semantic alignment through guided exploration of early prefix tokens.

Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10-point improvement in CLAP score over the AR baseline - establishing a new state of the art in AR text-to-audio generation - while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.

[8] Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Main category: cs.CL

TL;DR: This paper analyzes how speaker embeddings interact with phonological rules for accent control in TTS systems, proposing a new metric (PSR) to measure embedding-rule interactions and showing that combining rules with embeddings produces more authentic accents.

Motivation: Current TTS systems use speaker embeddings for accent control, but these embeddings lack interpretability and controllability since they encode multiple traits (timbre, emotion, accent) simultaneously. There's a need for better understanding of how speaker identity interacts with accent features in speech synthesis.

Method: The study analyzes interaction between speaker embeddings and linguistically motivated phonological rules for American/British English accents. They implement rules for flapping, rhoticity, and vowel correspondences, and propose the phoneme shift rate (PSR) metric to quantify how strongly embeddings preserve or override rule-based transformations.
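
The summary does not give the exact PSR formula; one plausible reading, sketched below, is the fraction of rule-targeted sites at which the realized phoneme departs from the rule's output (the toy phoneme strings are invented).

```python
def phoneme_shift_rate(rule_targets, realized):
    """One plausible reading of PSR: among positions a phonological rule
    rewrites, the fraction where the realized phoneme departs from the
    rule's target, i.e. where the embedding overrode the rule."""
    assert len(rule_targets) == len(realized)
    if not rule_targets:
        return 0.0
    shifted = sum(t != r for t, r in zip(rule_targets, realized))
    return shifted / len(rule_targets)

# toy example: a British-English rule set prescribes non-rhotic vowels, but
# an American-accented embedding reintroduces rhoticity at 2 of 3 sites
print(phoneme_shift_rate(["ɑː", "ɔː", "ɜː"], ["ɑːr", "ɔː", "ɜːr"]))  # ≈ 0.67
```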

Result: Experiments show that combining phonological rules with speaker embeddings yields more authentic accents. The PSR metric reveals that embeddings can attenuate or overwrite rules, demonstrating entanglement between accent and speaker identity in current TTS systems.

Conclusion: Phonological rules serve as a valuable lever for accent control and provide a framework for evaluating disentanglement in speech generation. The findings highlight the complex interaction between speaker identity and accent features in TTS systems.

Abstract: Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.

[9] Large Language Models for Large-Scale, Rigorous Qualitative Analysis in Applied Health Services Research

Sasha Ronaghi, Emma-Louise Aveling, Maria Levis, Rachel Lauren Ross, Emily Alsentzer, Sara Singer

Main category: cs.CL

TL;DR: LLMs can enhance qualitative analysis efficiency in health services research, but need methodological guidance. A framework was developed for human-LLM collaboration, tested in diabetes care study, showing improved efficiency while maintaining rigor.

Motivation: Large language models show promise for improving qualitative analysis efficiency in health services research, but there's limited guidance on integrating LLMs into qualitative methods and evidence of their real-world impact.

Method: Developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods. Applied it in a multi-site diabetes care study at FQHCs for two tasks: qualitative synthesis of researcher summaries to create feedback reports, and deductive coding of 167 interview transcripts to refine interventions.

Result: LLM assistance enabled timely feedback to practitioners and allowed incorporation of large-scale qualitative data to inform theory and practice changes. The framework successfully integrated LLMs into applied health-services research while preserving rigor.

Conclusion: This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while maintaining rigor, providing guidance for continued innovation with LLMs in qualitative research.

Abstract: Large language models (LLMs) show promise for improving the efficiency of qualitative analysis in large, multi-site health-services research. Yet methodological guidance for LLM integration into qualitative analysis and evidence of their impact on real-world research methods and outcomes remain limited. We developed a model- and task-agnostic framework for designing human-LLM qualitative analysis methods to support diverse analytic aims. Within a multi-site study of diabetes care at Federally Qualified Health Centers (FQHCs), we leveraged the framework to implement human-LLM methods for (1) qualitative synthesis of researcher-generated summaries to produce comparative feedback reports and (2) deductive coding of 167 interview transcripts to refine a practice-transformation intervention. LLM assistance enabled timely feedback to practitioners and the incorporation of large-scale qualitative data to inform theory and practice changes. This work demonstrates how LLMs can be integrated into applied health-services research to enhance efficiency while preserving rigor, offering guidance for continued innovation with LLMs in qualitative research.

[10] Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks

Crish Nagarkar, Leonid Bogachev, Serge Sharoff

Main category: cs.CL

TL;DR: Fine-tuned LLMs achieve statistics student-level performance on advanced statistical tasks and can self-evaluate answer quality better than traditional metrics like BLEU or BertScore.

Motivation: While LLMs excel at many NLP tasks, their ability to solve statistical problems and assess reasoning quality is not well understood, despite potential applications in education, research, and data analysis.

Method: Fine-tuned selected open-source LLMs on a specially developed dataset to enhance statistical reasoning, then compared their performance with human benchmark scores and evaluated their self-assessment capabilities.

Result: Fine-tuned models achieve performance comparable to statistics students on advanced statistical tasks, with architecture-dependent improvements. LLMs outperform traditional metrics (BLEU/BertScore) in evaluating answer quality including explanations and reasoning.

Conclusion: LLMs show clear potential for deployment in educational technology, statistical analysis assistance, automated assessment platforms, research methodology validation, and data analysis quality control.

Abstract: This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, on a level comparable to a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and statistical analysis assistance systems. We also show that LLMs themselves can be far better judges of answer quality (including explanation and reasoning assessment) than traditional metrics, such as BLEU or BertScore. This self-evaluation capability enables scalable automated assessment for statistical education platforms and quality assurance in automated analysis tools. Potential applications also include validation tools for research methodology in academic and industry settings, and quality control mechanisms for data analysis workflows.

[11] Business Logic-Driven Text-to-SQL Data Synthesis for Business Intelligence

Jinhui Liu, Ximeng Zhang, Yanbo Ai, Zhou Yu

Main category: cs.CL

TL;DR: Proposes Business Logic-Driven Data Synthesis framework for generating realistic Text-to-SQL evaluation data that captures business personas, workflows, and reasoning complexity, outperforming existing methods in business realism.

Motivation: Evaluating Text-to-SQL agents in private business intelligence settings is challenging due to lack of realistic, domain-specific data. Existing synthetic data generation methods fail to capture business realism - whether questions reflect realistic business logic and workflows.

Method: Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. Includes business reasoning complexity control strategy to diversify analytical reasoning steps required to answer questions.

Result: On production-scale Salesforce database: achieves 98.44% business realism (outperforming OmniSQL by +19.5% and SQL-Factory by +54.7%), maintains 98.59% question-SQL alignment. Reveals state-of-the-art Text-to-SQL models achieve only 42.86% execution accuracy on most complex business queries.

Conclusion: The proposed framework successfully generates realistic business evaluation data, exposing significant performance gaps in current Text-to-SQL models on complex business queries, highlighting the need for better evaluation methods in business intelligence settings.

Abstract: Evaluating Text-to-SQL agents in private business intelligence (BI) settings is challenging due to the scarcity of realistic, domain-specific data. While synthetic evaluation data offers a scalable solution, existing generation methods fail to capture business realism - whether questions reflect realistic business logic and workflows. We propose a Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. In addition, we improve the data quality by imposing a business reasoning complexity control strategy that diversifies the analytical reasoning steps required to answer the questions. Experiments on a production-scale Salesforce database show that our synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). Our synthetic data also reveals that state-of-the-art Text-to-SQL models still have significant performance gaps, achieving only 42.86% execution accuracy on the most complex business queries.

[12] Towards Execution-Grounded Automated AI Research

Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto

Main category: cs.CL

TL;DR: Automated AI research system with execution grounding finds effective methods for LLM pre-training and post-training through evolutionary search, but reinforcement learning suffers from mode collapse.

Motivation: Current LLMs generate plausible but ineffective ideas for AI research. The paper investigates whether automated execution is feasible and whether LLMs can learn from execution feedback to accelerate scientific discovery.

Method: Built automated executor to implement ideas and run GPU experiments. Converted LLM pre-training and post-training into execution environments. Tested two learning approaches: execution-guided evolutionary search and reinforcement learning from execution reward.
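
A minimal sketch of the execution-guided search loop, assuming hypothetical `propose` (the ideator LLM) and `implement_and_run` (the automated executor returning a benchmark score) callables.

```python
import random

def evolve(seed_ideas, propose, implement_and_run, epochs=10, population=8):
    """propose(parents) -> new idea; implement_and_run(idea) -> scalar
    benchmark score from the executor's GPU experiments. Both hypothetical."""
    pool = [(idea, implement_and_run(idea)) for idea in seed_ideas]
    for _ in range(epochs):
        pool.sort(key=lambda pair: pair[1], reverse=True)
        parents = [idea for idea, _ in pool[: population // 2]]
        # mutate / recombine the best-scoring ideas into new candidates
        children = [propose(random.sample(parents, min(2, len(parents))))
                    for _ in range(population - len(parents))]
        # execution grounding: every child is actually run before selection
        pool = pool[: population // 2] + [(c, implement_and_run(c))
                                          for c in children]
    return max(pool, key=lambda pair: pair[1])
```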

Result: Evolutionary search was sample-efficient: it found a post-training method outperforming the GRPO baseline (69.4% vs 48.0%) and a pre-training recipe beating the nanoGPT baseline (19.7 vs 35.9 minutes), all within ten search epochs. Frontier LLMs generated meaningful ideas but saturated early. Reinforcement learning improved average reward but suffered mode collapse, converging on simple ideas without improving upper-bound performance.

Conclusion: Execution-grounded automated AI research is feasible and evolutionary search shows promise for discovering effective methods, while reinforcement learning needs improvement to avoid mode collapse. The analysis provides insights for future automated research systems.

Abstract: Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.

[13] Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Brian Christian, Matan Mazor

Main category: cs.CL

TL;DR: LLMs struggle with counterfactual self-simulation like humans, failing to offset gender/race biases and sycophancy when prompted to ignore information, but can use their own API as a “blinded replica” to achieve fairer decisions.

Motivation: Fair decision-making requires ignoring irrelevant biasing information, but humans struggle with counterfactual self-simulation (imagining decisions without knowing certain facts like gender/race). The paper investigates whether LLMs have similar limitations and explores solutions.

Method: Tested LLMs’ ability to approximate counterfactual decisions regarding gender/race biases and sycophancy. Evaluated prompting strategies (ignoring/pretending not to know biasing info) vs. using LLMs’ own API as a “ground-truth model of their own counterfactual cognition” - a blinded replica.
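
The blinded-replica mechanism reduces to a redact-and-requery pattern; `query_model` below is a hypothetical wrapper around whichever API the model exposes.

```python
import re

def redact(profile: str, attribute_pattern: str) -> str:
    # strip the potentially biasing field, e.g. a gendered name or race marker
    return re.sub(attribute_pattern, "[REDACTED]", profile)

def counterfactual_decision(query_model, template, profile, pattern):
    """query_model is a hypothetical wrapper around the model's own API;
    the blinded call plays the role of the 'blinded replica'."""
    seen = query_model(template.format(profile=profile))
    blinded = query_model(template.format(profile=redact(profile, pattern)))
    # disagreement is evidence that the redacted attribute drove the decision
    return {"with_attribute": seen, "blinded": blinded,
            "attribute_sensitive": seen != blinded}
```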

Result: Prompting models to ignore or pretend not to know biasing information fails to offset biases and sometimes backfires. However, giving LLMs access to their own API (blinded replica responses) enables fairer decisions and provides transparency to distinguish implicit from intentional bias.

Conclusion: LLMs share humans’ limitations in counterfactual self-simulation for bias mitigation, but unlike humans, they can leverage their own API as a blinded replica to achieve fairer decisions and increase transparency about bias sources.

Abstract: Fair decisions require ignoring irrelevant, potentially biasing, information. To achieve this, decision-makers need to approximate what decision they would have made had they not known certain facts, such as the gender or race of a job candidate. This counterfactual self-simulation is notoriously hard for humans, leading to biased judgments even by well-meaning actors. Here we show that large language models (LLMs) suffer from similar limitations in their ability to approximate what decisions they would make under counterfactual knowledge in offsetting gender and race biases and overcoming sycophancy. We show that prompting models to ignore or pretend not to know biasing information fails to offset these biases and occasionally backfires. However, unlike humans, LLMs can be given access to a ground-truth model of their own counterfactual cognition – their own API. We show that this access to the responses of a blinded replica enables fairer decisions, while providing greater transparency to distinguish implicit from intentionally biased behavior.

[14] Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure

Christopher Scofield

Main category: cs.CL

TL;DR: Multi-agent LLM systems outperform single agents because their factorized constraint enforcement converges to invariant solution sets that a single agent cannot reach.

Motivation: To formally explain why multi-agent LLM systems often show improved problem-solving performance despite having identical information, using mathematical foundations from operator theory and constrained optimization.

Method: Model each agent as enforcing distinct validity constraints on shared solution state, showing MAS implements factorized composition of constraint-enforcement operators; analyze convergence to invariant solution sets defined by intersection of agent constraint sets; extend from exact to soft constraints via proximal operators
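
Under a standard projection/proximal reading (an assumption on our part; the summary does not give the paper's notation), the factorized dynamics take the form:

```latex
% Agent i enforces its constraint set C_i via an operator T_i,
% e.g. the metric projection P_{C_i}. One MAS round composes them:
x_{t+1} = (T_k \circ \cdots \circ T_1)(x_t),
\qquad
x_t \longrightarrow x^{*} \in \bigcap_{i=1}^{k} C_i .

% Soft constraints replace projections with proximal operators:
T_i(x) = \operatorname{prox}_{\lambda g_i}(x)
       = \arg\min_{y} \; g_i(y) + \tfrac{1}{2\lambda}\,\lVert y - x \rVert^{2} .
```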

Result: Multi-agent systems converge to invariant solution sets that are generally not accessible to single agents applying all constraints simultaneously, even with identical expressive capacity and information

Conclusion: The factorized constraint enforcement in multi-agent systems provides formal explanation for their superior performance, with applications to contemporary text-based dialog systems

Abstract: Multi-agent systems (MAS) composed of large language models often exhibit improved problem-solving performance despite operating on identical information. In this work, we provide a formal explanation for this phenomenon grounded in operator theory and constrained optimization. We model each agent as enforcing a distinct family of validity constraints on a shared solution state, and show that a MAS implements a factorized composition of constraint-enforcement operators. Under mild conditions, these dynamics converge to invariant solution sets defined by the intersection of agent constraint sets. Such invariant structures are generally not dynamically accessible to a single agent applying all constraints simultaneously, even when expressive capacity and information are identical. We extend this result from exact constraint enforcement to soft constraints via proximal operators, and apply the formalism to contemporary text-based dialog systems.

[15] Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education

Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park, Younghoon Jeon, Sungmin Cho, Junbo Koh, Yeil Jeong, Gyeonggeon Lee

Main category: cs.CL

TL;DR: PedagogicalRL-Thinking framework extends pedagogical alignment to reasoning LLMs in education using domain-specific prompting and thinking rewards to optimize internal reasoning processes.

Motivation: Current LLM tutoring systems focus only on optimizing visible responses while neglecting the model's internal thinking process, limiting their effectiveness in educational contexts where pedagogical reasoning is crucial.

Method: Two novel approaches: (1) Pedagogical Reasoning Prompting - guides internal reasoning using domain-specific educational theory rather than generic instructions; (2) Thinking Reward - explicitly evaluates and reinforces the pedagogical quality of the model’s reasoning traces.

Result: Domain-specific theory-grounded prompting outperforms generic prompting; Thinking Reward is most effective when combined with pedagogical prompting; models trained on math tutoring show improved performance on unseen educational benchmarks while preserving factual knowledge.

Conclusion: Pedagogical thinking reward produces systematic reasoning trace changes with increased pedagogical reasoning and more structured instructional decision-making, demonstrating the importance of optimizing internal reasoning processes for effective LLM tutors.

Abstract: Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model’s internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model’s reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model’s factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor’s thinking process.

[16] Social Caption: Evaluating Social Understanding in Multimodal Models

Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, Louis-Philippe Morency

Main category: cs.CL

TL;DR: Social Caption framework evaluates MLLMs’ social understanding abilities across three dimensions: Social Inference, Holistic Social Analysis, and Directed Social Analysis, analyzing factors like model scale, architecture, and spoken context.

Motivation: Social understanding abilities are crucial for multimodal large language models to interpret human social interactions, but there's a need for systematic evaluation frameworks to assess these capabilities.

Method: Introduces Social Caption framework grounded in interaction theory with three evaluation dimensions: Social Inference (SI), Holistic Social Analysis (HSA), and Directed Social Analysis (DSA). Uses MLLM judges for automated evaluation.

Result: Analyzes factors influencing model performance in social understanding, including scale, architectural design, and spoken context. Provides insights about scaling automated evaluation of multimodal social understanding.

Conclusion: Social Caption provides a comprehensive framework for evaluating MLLMs’ social understanding abilities, offering systematic assessment across multiple dimensions and insights for improving automated evaluation methods.

Abstract: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; Directed Social Analysis (DSA), the ability to extract relevant social information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges contribute insights about scaling automated evaluation of multimodal social understanding.

[17] SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation

Xichen Zhang, Ziyi He, Yinghao Zhu, Sitong Wu, Shaozuo Yu, Meng Chu, Wenhu Zhang, Haoru Tan, Jiaya Jia

Main category: cs.CL

TL;DR: SearchGym: A simulation environment for training search agents with verifiable knowledge graphs, addressing the cost and noise issues of using live web APIs or static data snapshots.

Motivation: Training search agents via RL faces a critical dilemma: live web APIs are too expensive, while static data snapshots introduce noise from data misalignment, which corrupts reward signals and destabilizes training.

Method: SearchGym uses a generative pipeline to construct verifiable knowledge graphs and aligned document corpora, ensuring tasks are factually grounded. SearchGym-RL adds curriculum learning with purified feedback to progressively optimize agent policies from basic interactions to complex planning.
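
A toy sketch of the graph-grounded construction, with all entities, relations, and helper names invented for illustration: because questions are generated from the graph and the corpus is rendered from the same triples, the gold answer is solvable by construction and the reward never penalizes correct reasoning.

```python
# all names and facts below are invented for illustration
triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "located_in", "Poland"),
]

def render_document(triple):
    s, p, o = triple
    return f"{s} {p.replace('_', ' ')} {o}."   # corpus stays aligned with the KG

corpus = [render_document(t) for t in triples]

# a 2-hop task that is solvable by construction from the graph:
question = "In which country was Marie Curie born?"
gold = "Poland"   # born_in Warsaw -> located_in Poland

def reward(agent_answer: str) -> float:
    # the reward checks against the graph-derived gold answer, so it is
    # never corrupted by misalignment with a live or snapshotted web
    return float(agent_answer.strip().lower() == gold.lower())

print(corpus[1], "->", reward("Poland"))   # Warsaw located in Poland. -> 1.0
```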

Result: Experiments with Llama and Qwen models show strong Sim-to-Real generalization. Qwen2.5-7B-Base trained in SearchGym surpasses the web-enhanced ASearcher baseline across nine benchmarks by an average relative margin of 10.6%.

Conclusion: High-fidelity simulation serves as a scalable, cost-effective methodology for developing capable search agents, addressing the fundamental training challenges in search agent development.

Abstract: Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise due to data misalignment. This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong Sim-to-Real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.

[18] Say Anything but This: When Tokenizer Betrays Reasoning in LLMs

Navid Ayoobi, Marcus I Armstrong, Arjun Mukherjee

Main category: cs.CL

TL;DR: LLMs can fail at simple text replacement tasks due to tokenizer artifacts, where multiple token ID sequences map to identical surface text, creating phantom edits and systematic reasoning failures.

Motivation: Modern subword tokenizers produce non-unique encodings where multiple token ID sequences detokenize to identical surface strings. This creates a representational mismatch that introduces unmeasured fragility in LLM reasoning, causing models to treat semantically identical text as distinct "words."

Method: Introduces a tokenization-consistency probe requiring models to replace designated target words in context while leaving other content unchanged. Analyzes over 11,000 replacement trials across state-of-the-art open-source LLMs to identify tokenizer-induced failures.
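
The underlying one-to-many mapping is easy to reproduce with any open BPE tokenizer; a minimal sketch using GPT-2's tokenizer (chosen only for accessibility, the paper's models may tokenize differently) shows two distinct token-ID sequences that detokenize to the same surface string.

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

word = "hello"
canonical = tok.encode(word)                        # a single token ID
per_char = sum((tok.encode(c) for c in word), [])   # five byte-level token IDs

print(canonical, per_char)                            # two distinct ID sequences
print(tok.decode(canonical) == tok.decode(per_char))  # True: same surface text
```

A consistency probe in the paper's spirit would then ask the model to perform the same word replacement under each encoding and check that the outputs agree.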

Result: Found a non-trivial rate of phantom edits, where models operate under the illusion of correct reasoning due to tokenizer artifacts. Identified eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation.

Conclusion: Part of apparent reasoning deficiency originates in tokenizer layer, not model knowledge gaps. This motivates tokenizer-level remedies before scaling up model size and training data.

Abstract: Large language models (LLMs) reason over discrete token ID sequences, yet modern subword tokenizers routinely produce non-unique encodings: multiple token ID sequences can detokenize to identical surface strings. This representational mismatch creates an unmeasured fragility wherein reasoning processes can fail. LLMs may treat two internal representations as distinct “words” even when they are semantically identical at the text level. In this work, we show that tokenization can betray LLM reasoning through one-to-many token ID mappings. We introduce a tokenization-consistency probe that requires models to replace designated target words in context while leaving all other content unchanged. The task is intentionally simple at the surface level, enabling us to attribute failures to tokenizer-detokenizer artifacts rather than to knowledge gaps or parameter limitations. Through analysis of over 11,000 replacement trials across state-of-the-art open-source LLMs, we find that a non-trivial fraction of outputs exhibits phantom edits: cases where models operate under the illusion of correct reasoning, a phenomenon arising from tokenizer-induced representational defects. We further analyze these cases and provide a taxonomy of eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation. These findings indicate that part of the apparent reasoning deficiency originates in the tokenizer layer, motivating tokenizer-level remedies before incurring the cost of training ever-larger models on ever-larger corpora.

[19] Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: Memp is a framework that gives LLM agents learnable, updatable procedural memory by distilling past trajectories into step-by-step instructions and script-like abstractions, improving performance and efficiency on tasks.

Motivation: Current LLM-based agents have brittle procedural memory that is either manually engineered or entangled in static parameters, limiting their ability to learn and adapt from experience over time.

Method: Proposes Memp framework that distills past agent trajectories into fine-grained step-by-step instructions and higher-level script abstractions, with strategies for Build, Retrieval, and Update of procedural memory, coupled with continuous updating, correction, and deprecation.
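
A stdlib-only sketch of the Build / Retrieve / Update cycle; the interfaces are stand-ins (the actual implementation is in the linked repository), and the deprecation rule is an invented example of the paper's correct-and-deprecate regimen.

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    task_signature: str   # what kind of task this procedure applies to
    steps: list           # fine-grained, step-by-step instructions
    script: str           # higher-level, script-like abstraction
    successes: int = 0
    failures: int = 0

class ProceduralMemory:
    def __init__(self):
        self.store = []

    def build(self, trajectory, distill):
        # Build: distill a past trajectory into a reusable procedure
        self.store.append(distill(trajectory))

    def retrieve(self, task, similarity, k=3):
        # Retrieve: top-k procedures whose signature matches the new task
        return sorted(self.store,
                      key=lambda p: similarity(task, p.task_signature),
                      reverse=True)[:k]

    def update(self, proc, succeeded):
        # Update: reinforce, correct, or eventually deprecate a procedure
        proc.successes += succeeded
        proc.failures += not succeeded
        if proc.failures > 3 and proc.successes == 0:
            self.store.remove(proc)   # deprecation (illustrative rule)
```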

Result: Empirical evaluation on TravelPlanner and ALFWorld shows agents achieve steadily higher success rates and greater efficiency as memory repository is refined. Procedural memory from stronger models can be migrated to weaker models for substantial performance gains.

Conclusion: Memp successfully endows agents with learnable, updatable lifelong procedural memory that evolves with experience, improving performance and enabling knowledge transfer between models.

Abstract: Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model can also yield substantial performance gains. Code is available at https://github.com/zjunlp/MemP.

[20] AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization

Zhaiyu Fang, Ruipeng Sun

Main category: cs.CL

TL;DR: AdaTIR is a framework that enables LLMs to adaptively decide when to use tools vs. internal reasoning based on task difficulty, reducing unnecessary tool calls while maintaining accuracy.

Motivation: Current LLM agents exhibit cognitive offloading by redundantly invoking external tools even for simple tasks, lacking adaptive wisdom to discern when tools are truly needed. True agentic intelligence requires not just tool invocation, but the ability to decide when to use them.

Method: AdaTIR introduces difficulty-aware reasoning internalization with a difficulty-aware efficiency reward that dynamically adjusts tool budgets based on task complexity. It also proposes Clipped Advantage Shaping (CAS) to solve the sign reversal problem where tool penalties outweigh correctness rewards, ensuring correctness remains the primary objective with efficiency as a secondary constraint.
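
One plausible reading of CAS, sketched below; the paper's exact formulation is not given in the summary, so treat the clipping rule as an assumption.

```python
def shaped_advantage(correct: bool, base_advantage: float,
                     tool_penalty: float) -> float:
    """base_advantage: correctness-derived advantage (positive for correct
    rollouts under the group baseline); tool_penalty: efficiency cost >= 0."""
    if correct and base_advantage > 0:
        # clipped shaping: the penalty may shrink the advantage toward zero
        # but never reverse its sign, so correctness stays the primary
        # objective and efficiency acts only as a secondary constraint
        return max(base_advantage - tool_penalty, 1e-6)
    return base_advantage - tool_penalty
```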

Result: AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex challenges while maintaining or enhancing accuracy. It successfully internalizes reasoning, outperforming baselines by 4.8% on AIME 2024 even when tool access is disabled.

Conclusion: The framework shifts from static tool invocation to difficulty-aware reasoning internalization, enabling LLMs to be more efficient and intelligent by adaptively choosing between tool use and internal reasoning based on task complexity.

Abstract: Tool-Integrated Reasoning (TIR) has significantly enhanced the capabilities of Large Language Models (LLMs), yet current agents tend to exhibit cognitive offloading, redundantly invoking external tools even for simple tasks. In this paper, we suggest that true agentic intelligence requires not just tool invocation, but the adaptive wisdom to discern when to use them. We propose AdaTIR, a framework that shifts the paradigm from static tool invocation to difficulty-aware reasoning internalization. By introducing a difficulty-aware efficiency reward, AdaTIR dynamically adjusts tool budgets based on task complexity - internalizing reasoning for simple tasks while selectively invoking tools for complex tasks. Furthermore, we identify a sign reversal problem where tool penalties outweigh correctness rewards, mistakenly penalizing correct rollouts with negative advantages. To resolve this, we propose Clipped Advantage Shaping (CAS), which ensures that correctness remains the primary objective while using efficiency as a secondary constraint. Empirical results demonstrate that AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex challenges while maintaining or enhancing accuracy. Notably, AdaTIR successfully internalizes reasoning, outperforming baselines by 4.8% on AIME 2024 even when tool access is strictly disabled.

[21] ClaimDB: A Fact Verification Benchmark over Large Structured Data

Michael Theologitis, Preetam Prabhu Srikar Dammu, Chirag Shah, Dan Suciu

Main category: cs.CL

TL;DR: ClaimDB is the first fact-verification benchmark using evidence from millions of records across multiple tables, revealing LLMs’ limitations in handling large-scale structured data verification.

Motivation: Current fact-verification benchmarks overlook claims grounded in large-scale structured data, creating a gap in evaluating LLMs' ability to verify facts from complex database compositions.

Method: Created ClaimDB with 80 real-life databases across diverse domains, forcing verification through executable program reasoning rather than traditional “reading” approaches. Tested 30 state-of-the-art LLMs (proprietary and open-source below 70B parameters).
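
The shift from "reading" evidence to program-based verification is easy to illustrate with a toy SQLite database; the schema, rows, and claim below are invented for illustration.

```python
import sqlite3

# toy database; ClaimDB's real databases hold millions of rows, which is
# exactly why the verdict must come from a query, not from "reading"
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hospitals (state TEXT, beds INTEGER);
    INSERT INTO hospitals VALUES ('WA', 120), ('WA', 300), ('OR', 90);
""")

claim = "Washington hospitals have more than 400 beds in total."
# a model-written verification program for the claim:
(total,) = conn.execute(
    "SELECT SUM(beds) FROM hospitals WHERE state = 'WA'").fetchone()

# abstention: SUM over zero matching rows returns NULL (None in Python)
verdict = ("NOT ENOUGH EVIDENCE" if total is None
           else "SUPPORTED" if total > 400 else "REFUTED")
print(verdict, total)  # SUPPORTED 420
```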

Result: No LLM exceeded 83% accuracy, with over half below 55%. Both closed- and open-source models struggle with abstention (admitting insufficient evidence), raising reliability concerns for high-stakes data analysis.

Conclusion: ClaimDB reveals critical limitations in current LLMs for structured data verification, highlighting the need for reasoning through executable programs rather than text-based approaches, and questioning LLM reliability in data-intensive applications.

Abstract: Despite substantial progress in fact-verification benchmarks, claims grounded in large-scale structured data remain underexplored. In this work, we introduce ClaimDB, the first fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education and the natural sciences. At this scale, verification approaches that rely on “reading” the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that none exceed 83% accuracy, with more than half below 55%. Our analysis also reveals that both closed- and open-source models struggle with abstention – the ability to admit that there is no evidence to decide – raising doubts about their reliability in high-stakes data analysis. We release the benchmark, code, and the LLM leaderboard at https://claimdb.github.io.

[22] What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations

Yujie Luo, Zhuoyun Yu, Xuehai Wang, Yuqi Zhu, Ningyu Zhang, Lanning Wei, Lun Du, Da Zheng, Huajun Chen

Main category: cs.CL

TL;DR: xKG (Executable Knowledge Graphs) is a pluggable knowledge base that integrates code snippets and technical insights from papers to improve AI research replication by LLM agents.

Motivation: Existing approaches for AI research replication struggle with generating executable code due to insufficient background knowledge, limitations of RAG methods in capturing latent technical details, lack of implementation-level code signals, and absence of structured knowledge representations for multi-granular retrieval.

Method: Proposes Executable Knowledge Graphs (xKG) - a paper-centric knowledge base that automatically extracts and integrates code snippets and technical insights from scientific literature. It provides structured knowledge representations supporting multi-granular retrieval and reuse.

Result: When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating effectiveness as a general and extensible solution for automated AI research replication.

Conclusion: xKG effectively addresses key challenges in AI research replication by providing structured, executable knowledge that improves code generation capabilities of LLM agents, offering a pluggable and extensible solution.

Abstract: Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a pluggable, paper-centric knowledge base that automatically integrates code snippets and technical insights extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code is available at https://github.com/zjunlp/xKG.
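
As a rough picture of what a paper-centric, executable knowledge base might store, the sketch below pairs natural-language insights with runnable snippets per paper; the schema is guessed from the abstract and is not xKG's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class PaperNode:
    """Illustrative xKG-style node: one paper anchoring both insights
    and executable code snippets, so an agent can retrieve at paper,
    insight, or snippet granularity."""
    title: str
    insights: list[str] = field(default_factory=list)
    snippets: list[tuple[str, str]] = field(default_factory=list)  # (desc, code)
    cites: list[str] = field(default_factory=list)

node = PaperNode("Attention Is All You Need")
node.insights.append("Scale dot products by 1/sqrt(d_k) before softmax.")
node.snippets.append(("scaled dot-product attention",
                      "scores = q @ k.T / d_k ** 0.5"))
```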

[23] DARL: Encouraging Diverse Answers for General Reasoning without Verifiers

Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang

Main category: cs.CL

TL;DR: DARL is a reinforcement learning framework that encourages diverse answer generation while maintaining controlled deviation from references, addressing overfitting issues in existing RL methods for open-ended tasks.

Motivation: Existing RL methods like RLVR and RLPR suffer from overfitting to reference answers, limiting output diversity especially in open-ended tasks where multiple plausible answers exist. This restricts their applicability in general domains.

Method: DARL is a reinforcement learning framework that encourages generation of diverse answers within a controlled deviation range from reference answers while preserving alignment. It’s fully compatible with existing general RL methods and requires no additional verifiers.

Result: DARL achieves consistent improvements across 13 benchmarks, surpassing RLPR with average gains of 1.3 points on 6 reasoning benchmarks and 9.5 points on 7 general benchmarks, improving both reasoning accuracy and output diversity.

Conclusion: DARL effectively addresses the diversity limitation in existing RL methods for language models, demonstrating superior performance in both reasoning and general tasks while maintaining compatibility with existing RL frameworks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model’s ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
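
One minimal reading of "diverse answers within a controlled deviation range" is a banded reward over answer-reference similarity. Everything below (the band edges, reward values, and the toy Jaccard similarity) is an assumption for illustration.

```python
import re

def overlap_sim(a: str, b: str) -> float:   # toy Jaccard similarity
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / max(1, len(ta | tb))

def darl_reward(answer: str, reference: str,
                lo: float = 0.5, hi: float = 0.95) -> float:
    """Hedged sketch of the DARL idea: full reward inside a controlled
    deviation band, penalizing both drift and verbatim overfitting."""
    s = overlap_sim(answer, reference)
    if s < lo:
        return 0.0          # no longer aligned with the reference
    if s > hi:
        return 0.5          # aligned but not diverse (near-copy)
    return 1.0              # aligned and diverse

print(darl_reward("The hero leaves home at dawn.",
                  "At dawn, the hero departs from home."))   # 1.0
```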

[24] Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon, Pittawat Taveekitworachai, Kunat Pipatanakul

Main category: cs.CL

TL;DR: Typhoon OCR is an open vision-language model for Thai and English document extraction that handles complex Thai script, achieves performance comparable to proprietary models, and offers lightweight deployment.

Motivation: Existing vision-language models favor high-resource languages and struggle with Thai due to script complexity, lack of word boundaries, and unstructured documents, limiting open-source model effectiveness for Thai document extraction.

Method: Fine-tuned from vision-language backbones using a Thai-focused training dataset created via multi-stage pipeline combining traditional OCR, VLM-based restructuring, and curated synthetic data; unified framework for text transcription, layout reconstruction, and structural consistency.

Result: Typhoon OCR V1.5 achieves performance comparable to or exceeding larger frontier proprietary models across diverse Thai document categories (financial reports, government forms, books, infographics, handwritten docs) despite substantially lower computational cost.

Conclusion: Open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching proprietary-level performance while remaining lightweight and deployable, addressing the gap for low-resource languages.

Abstract: Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, limiting the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored for Thai and English. The model is fine-tuned from vision-language backbones using a Thai-focused training dataset. The dataset is developed using a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework capable of text transcription, layout reconstruction, and document-level structural consistency. The latest iteration of our model, Typhoon OCR V1.5, is a compact and inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding larger frontier proprietary models, despite substantially lower computational cost. The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching performance comparable to proprietary systems while remaining lightweight and deployable.

[25] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

Main category: cs.CL

TL;DR: Render-of-Thought (RoT) framework converts textual reasoning chains into images for token compression and faster inference while maintaining reasoning performance.

Motivation: Chain-of-Thought prompting has computational overhead due to verbosity, lacks supervision on intermediate reasoning, and obscures analyzability of latent reasoning chains.

Method: RoT reifies reasoning chains by rendering textual steps into images, using vision encoders of existing VLMs as semantic anchors to align vision embeddings with textual space for plug-and-play implementation.

Result: Achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT while maintaining competitive performance on mathematical and logical reasoning benchmarks.

Conclusion: RoT provides a feasible paradigm for efficient reasoning by making latent rationales explicit and traceable through visual rendering of reasoning chains.

Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
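
The core move, rendering textual reasoning steps into an image that a vision encoder can ingest, can be sketched in a few lines with Pillow; the layout constants are arbitrary, and this omits the embedding-alignment stage the paper describes.

```python
from PIL import Image, ImageDraw   # pip install Pillow

def render_cot(steps: list[str], width: int = 448,
               line_h: int = 18) -> Image.Image:
    """Render chain-of-thought steps as one image, so a VLM's vision
    encoder sees a compact visual stand-in for many text tokens."""
    img = Image.new("RGB", (width, line_h * (len(steps) + 1)), "white")
    draw = ImageDraw.Draw(img)
    for i, step in enumerate(steps):
        draw.text((4, 4 + i * line_h), f"{i + 1}. {step}", fill="black")
    return img

render_cot(["Let x be the smaller even number.",
            "x + (x + 2) = 38, so x = 18."]).save("rot_steps.png")
```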

[26] RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models

Anqi Li, Yuqian Chen, Yu Lu, Zhaoming Chen, Yuan Xie, Zhenzhong Lan

Main category: cs.CL

TL;DR: PsyFIRE framework introduces fine-grained resistance detection in text-based counseling with RECAP model achieving 91.25% F1 for resistance detection and 66.58% macro-F1 for fine-grained categories, outperforming LLM baselines by 20+ points.

Motivation: Existing NLP approaches oversimplify resistance categories, ignore sequential therapeutic dynamics, and offer limited interpretability in text-based mental health counseling where detecting client resistance is critical but challenging.

Method: Proposed PsyFIRE framework capturing 13 fine-grained resistance behaviors with collaborative interactions, constructed ClientResistance corpus (23,930 annotated utterances from Chinese text-based counseling), and developed RECAP - a two-stage framework for resistance detection with explanations.

Result: RECAP achieves 91.25% F1 for collaboration vs resistance distinction and 66.58% macro-F1 for fine-grained resistance classification, outperforming prompt-based LLM baselines by over 20 points. Applied studies reveal resistance prevalence and negative impact on therapeutic relationships.

Conclusion: The framework demonstrates potential to improve counselors’ understanding and intervention strategies by providing fine-grained, interpretable resistance detection in text-based counseling, addressing limitations of existing oversimplified approaches.

Abstract: Recognizing and navigating client resistance is critical for effective mental health counseling, yet detecting such behaviors is particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration and resistance and 66.58% macro-F1 for fine-grained resistance category classification, outperforming leading prompt-based LLM baselines by over 20 points. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance and its negative impact on therapeutic relationships, and demonstrates its potential to improve counselors’ understanding and intervention strategies.

[27] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

Yuxin Li, Eng Siong Chng, Cuntai Guan

Main category: cs.CL

TL;DR: HAREN-CTC: A hierarchical adaptive representation encoder with CTC supervision for speech-based depression detection that models acoustic-semantic interactions to capture sparse depressive characteristics in speech.

Motivation: Existing speech-based depression detection methods struggle to capture robust depression-related speech characteristics that are sparse and heterogeneous. Current approaches using pretrained SSL models typically extract features from a single layer, overlooking the complementary roles of low-level acoustic features and high-level semantic information encoded in different SSL model layers.

Method: Proposes a hierarchical adaptive representation encoder with prior knowledge that disengages and re-aligns acoustic and semantic information through asymmetric cross-attention, enabling fine-grained acoustic patterns to be interpreted in semantic context. Also applies a Connectionist Temporal Classification (CTC) objective as auxiliary supervision to handle irregular temporal distribution of depressive characteristics without requiring frame-level annotations.

Result: Experiments on DAIC-WOZ and MODMA datasets show HAREN-CTC consistently outperforms existing methods under both performance upper-bound evaluation and generalization evaluation settings. Achieves Macro F1 scores of 0.81 and 0.82 respectively in upper-bound evaluation, with statistically significant improvements in precision and AUC under rigorous cross-validation.

Conclusion: Modeling hierarchical acoustic-semantic interactions better reflects how depressive characteristics manifest in natural speech, enabling scalable and objective depression assessment. The approach demonstrates superior performance in capturing sparse and heterogeneous depression-related speech patterns.

Abstract: Speech-based depression detection (SDD) has emerged as a non-invasive and scalable alternative to conventional clinical assessments. However, existing methods still struggle to capture robust depression-related speech characteristics, which are sparse and heterogeneous. Although pretrained self-supervised learning (SSL) models provide rich representations, most recent SDD studies extract features from a single layer of the pretrained SSL model for the downstream classifier. This practice overlooks the complementary roles of low-level acoustic features and high-level semantic information inherently encoded in different SSL model layers. To explicitly model interactions between acoustic and semantic representations within an utterance, we propose a hierarchical adaptive representation encoder with prior knowledge that disengages and re-aligns acoustic and semantic information through asymmetric cross-attention, enabling fine-grained acoustic patterns to be interpreted in semantic context. In addition, a Connectionist Temporal Classification (CTC) objective is applied as auxiliary supervision to handle the irregular temporal distribution of depressive characteristics without requiring frame-level annotations. Experiments on DAIC-WOZ and MODMA demonstrate that HAREN-CTC consistently outperforms existing methods under both performance upper-bound evaluation and generalization evaluation settings, achieving Macro F1 scores of 0.81 and 0.82 respectively in upper-bound evaluation, and maintaining superior performance with statistically significant improvements in precision and AUC under rigorous cross-validation. These findings suggest that modeling hierarchical acoustic-semantic interactions better reflects how depressive characteristics manifest in natural speech, enabling scalable and objective depression assessment.
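
The auxiliary CTC objective is standard machinery and can be sketched with PyTorch's built-in loss. The class inventory, the 0.3 weight, and the stand-in main loss below are placeholders, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Hedged sketch: CTC as auxiliary supervision, so sparse depressive cues
# may align anywhere in time without frame-level labels (0 = blank).
T, N, C = 120, 4, 3                        # frames, batch, classes
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.tensor([1, 2, 1, 1])       # one utterance-level tag each
input_lens = torch.full((N,), T, dtype=torch.long)
target_lens = torch.ones(N, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
aux = ctc(log_probs, targets, input_lens, target_lens)
main = torch.tensor(0.7)                   # stand-in classification loss
total = main + 0.3 * aux                   # weighting is an assumption
print(total)
```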

[28] Comparative Study of Large Language Models on Chinese Film Script Continuation: An Empirical Analysis Based on GPT-5.2 and Qwen-Max

Yuxuan Cao, Zida Yang, Ye Wang

Main category: cs.CL

TL;DR: GPT-5.2 outperforms Qwen-Max-Latest in Chinese film script continuation tasks, particularly in structural preservation and overall quality, despite Qwen-Max having slightly better text similarity scores.

Motivation: As LLMs are increasingly used for creative writing, there's a need to systematically evaluate their performance on culturally specific narrative tasks, particularly for Chinese creative writing where benchmarks are lacking.

Method: Created first Chinese film script continuation benchmark with 53 classic films using “first half to second half” continuation paradigm (3 samples per film). Evaluated GPT-5.2 and Qwen-Max-Latest using multi-dimensional framework: ROUGE-L for text similarity, Structural Similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner) for overall quality.

Result: Qwen-Max had marginally higher ROUGE-L (0.2230 vs 0.2114), but GPT-5.2 significantly outperformed in structural preservation (0.93 vs 0.75), overall quality (44.79 vs 25.72), and composite scores (0.50 vs 0.39). GPT-5.2 excelled in character consistency, tone-style matching, and format preservation, while Qwen-Max showed generation stability issues.

Conclusion: GPT-5.2 demonstrates superior performance for Chinese creative writing tasks, particularly in maintaining narrative structure and quality. The study provides a reproducible evaluation framework for LLMs in Chinese creative writing contexts.

Abstract: As large language models (LLMs) are increasingly applied to creative writing, their performance on culturally specific narrative tasks warrants systematic investigation. This study constructs the first Chinese film script continuation benchmark comprising 53 classic films, and designs a multi-dimensional evaluation framework comparing GPT-5.2 and Qwen-Max-Latest. Using a “first half to second half” continuation paradigm with 3 samples per film, we obtained 303 valid samples (GPT-5.2: 157, 98.7% validity; Qwen-Max: 146, 91.8% validity). Evaluation integrates ROUGE-L, Structural Similarity, and LLM-as-Judge scoring (DeepSeek-Reasoner). Statistical analysis of 144 paired samples reveals: Qwen-Max achieves marginally higher ROUGE-L (0.2230 vs 0.2114, d=-0.43); however, GPT-5.2 significantly outperforms in structural preservation (0.93 vs 0.75, d=0.46), overall quality (44.79 vs 25.72, d=1.04), and composite scores (0.50 vs 0.39, d=0.84). The overall quality effect size reaches large effect level (d>0.8). GPT-5.2 excels in character consistency, tone-style matching, and format preservation, while Qwen-Max shows deficiencies in generation stability. This study provides a reproducible framework for LLM evaluation in Chinese creative writing.
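
For reference, ROUGE-L, the text-similarity metric on which Qwen-Max edges ahead, is the longest-common-subsequence F-measure; a compact implementation follows (beta = 1.2 is a common recall-weighting default, not necessarily the study's setting).

```python
def rouge_l_f1(ref: list[str], hyp: list[str], beta: float = 1.2) -> float:
    """ROUGE-L: F-measure over the longest common subsequence of the
    reference and hypothesis token sequences (beta weights recall)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == hyp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    r, p = lcs / m, lcs / n
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

print(rouge_l_f1("the cat sat on the mat".split(),
                 "the cat lay on the mat".split()))   # ~0.83
```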

[29] Mitigating Data Imbalance in Automated Speaking Assessment

Fong-Chun Tsai, Kuan-Tang Huang, Bi-Cheng Yan, Tien-Hong Lo, Berlin Chen

Main category: cs.CL

TL;DR: The paper introduces a novel BLV loss function to address class imbalance in automated speaking assessment models, improving accuracy and fairness without dataset modification.

Motivation: Automated Speaking Assessment (ASA) models for L2 learners often suffer from class imbalance issues, leading to biased predictions that disadvantage minority proficiency classes.

Method: Proposes Balancing Logit Variation (BLV) loss, a novel training objective that perturbs model predictions to improve feature representation for minority classes without modifying the original dataset.

Result: Evaluation on ICNALE benchmark dataset shows BLV loss integrated with BERT model significantly enhances both classification accuracy and fairness metrics.

Conclusion: The BLV loss makes automated speech evaluation more robust for diverse learners by effectively addressing class imbalance in ASA models.

Abstract: Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners' proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minority classes without modifying the dataset. Evaluations on the ICNALE benchmark dataset show that integrating the BLV loss into a widely used text-based model (BERT) significantly enhances classification accuracy and fairness, making automated speech evaluation more robust for diverse learners.
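
A hedged sketch of what a Balancing Logit Variation style objective might look like: logits are perturbed with class-dependent noise so minority proficiency levels see larger variation during training. The noise schedule is our assumption; only the idea of perturbing predictions while leaving the dataset untouched comes from the abstract.

```python
import torch
import torch.nn.functional as F

def blv_style_loss(logits: torch.Tensor, labels: torch.Tensor,
                   class_counts: torch.Tensor,
                   sigma: float = 1.0) -> torch.Tensor:
    """Cross-entropy over logits perturbed with noise that grows as a
    class gets rarer (illustrative schedule, not the paper's exact one)."""
    freq = class_counts.float() / class_counts.sum()
    scale = sigma * (1.0 - freq)               # rare class -> larger noise
    noisy = logits + torch.randn_like(logits) * scale
    return F.cross_entropy(noisy, labels)

logits = torch.randn(8, 4)                     # 8 responses, 4 proficiency levels
labels = torch.randint(0, 4, (8,))
counts = torch.tensor([500, 300, 150, 50])     # imbalanced class counts
print(blv_style_loss(logits, labels, counts))
```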

[30] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul

Main category: cs.CL

TL;DR: Partial YaRN and VLAT enable large audio-language models to handle long audio contexts without compromising text capabilities.

Motivation: Large Audio-Language Models (LALMs) are limited by short audio context windows, even when their text backbones support long contexts, which restricts long-form audio understanding capabilities.

Method: Two approaches: 1) Partial YaRN - a training-free, modality-decoupled extension method that modifies only audio token positions while preserving text positions; 2) VLAT - a training strategy that extends Partial YaRN into training-time positional augmentation, simulating diverse audio lengths during training.

Result: Partial YaRN outperforms original models across a wide range of settings, and VLAT provides substantial performance improvement on long audio of unseen lengths, as demonstrated on SALMONN and Qwen2-Audio models.

Conclusion: The proposed Partial YaRN and VLAT methods effectively extend audio context windows in LALMs while preserving text capabilities, enabling better long-form audio understanding without compromising existing functionality.

Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and VLAT provides substantial performance improvement on long audio of unseen lengths.
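
One way to picture "modify only audio token positions while text positions stay intact" is positional compression restricted to audio tokens, shown below. This is schematic: Partial YaRN actually operates through YaRN-style RoPE scaling, and the audio_scale factor here is arbitrary.

```python
def partial_positions(modalities: list[str],
                      audio_scale: float = 4.0) -> list[float]:
    """Assign position ids so that only audio tokens are compressed:
    audio advances the counter by 1/audio_scale, text by 1, leaving
    text positions (and hence text behavior) effectively unchanged."""
    pos, out = 0.0, []
    for m in modalities:                 # each token is 'audio' or 'text'
        out.append(pos)
        pos += 1.0 / audio_scale if m == "audio" else 1.0
    return out

print(partial_positions(["text", "audio", "audio", "audio", "text"]))
# [0.0, 1.0, 1.25, 1.5, 1.75]
```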

[31] HiNS: Hierarchical Negative Sampling for More Comprehensive Memory Retrieval Embedding Model

Motong Tian, Allen P. Wong, Mingjun Mao, Wangchunshu Zhou

Main category: cs.CL

TL;DR: HiNS framework improves memory-augmented language agents by modeling hierarchical negative sample difficulty and natural distribution ratios in training data, leading to better embedding models for memory retrieval.

Motivation: Existing training data construction for memory-augmented language agents overlooks the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. Current approaches using synthetic or uniformly sampled negatives fail to reflect the diversity of semantically close distractors vs. trivially irrelevant samples, limiting embedding models' ability to learn nuanced discrimination for robust memory retrieval.

Method: Proposes HiNS, a principled data construction framework that explicitly models negative sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data. This enables training of embedding models with improved retrieval fidelity and generalization in memory-intensive tasks.

Result: Significant improvements on benchmark tasks: On LoCoMo, F1/BLEU-1 gains of 3.27%/3.30% (MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).

Conclusion: The HiNS framework demonstrates that explicitly modeling hierarchical negative sample difficulty and incorporating natural distribution ratios from conversational data substantially improves embedding model performance for memory retrieval in language agents.

Abstract: Memory-augmented language agents rely on embedding models for effective memory retrieval. However, existing training data construction overlooks a critical limitation: the hierarchical difficulty of negative samples and their natural distribution in human-agent interactions. In practice, some negatives are semantically close distractors while others are trivially irrelevant, and natural dialogue exhibits structured proportions of these types. Current approaches using synthetic or uniformly sampled negatives fail to reflect this diversity, limiting embedding models’ ability to learn nuanced discrimination essential for robust memory retrieval. In this work, we propose a principled data construction framework HiNS that explicitly models negative sample difficulty tiers and incorporates empirically grounded negative ratios derived from conversational data, enabling the training of embedding models with substantially improved retrieval fidelity and generalization in memory-intensive tasks. Experiments show significant improvements: on LoCoMo, F1/BLEU-1 gains of 3.27%/3.30% (MemoryOS) and 1.95%/1.78% (Mem0); on PERSONAMEM, total score improvements of 1.19% (MemoryOS) and 2.55% (Mem0).
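
The data-construction idea reduces to sampling negatives by difficulty tier under fixed, empirically grounded proportions. The tier names and the 0.2/0.5/0.3 mix below are illustrative stand-ins for the ratios HiNS derives from conversational data.

```python
import random

TIER_RATIOS = {"hard": 0.2, "medium": 0.5, "easy": 0.3}   # assumed mix

def sample_negatives(tiers: dict[str, list[str]], k: int = 8) -> list[str]:
    """Draw k negatives per query, respecting per-tier proportions so
    semantically close distractors and trivial negatives both appear."""
    batch = []
    for name, ratio in TIER_RATIOS.items():
        n = max(1, round(k * ratio))
        batch += random.sample(tiers[name], min(n, len(tiers[name])))
    return batch[:k]

tiers = {"hard":   ["we met at the cafe on Tuesday"],
         "medium": ["she mentioned a cafe once", "coffee preferences"],
         "easy":   ["the GDP of Norway", "how to tune a guitar"]}
print(sample_negatives(tiers, k=4))
```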

[32] Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation

Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang, Hongliang Li, Jinan Xu, Meng Jiang, Jian-Yun Nie, Kaiyu Huang

Main category: cs.CL

TL;DR: LcRL is a multilingual search-augmented reinforcement learning framework that addresses knowledge bias and conflict in MRAG through language-coupled group sampling and anti-consistency regularization.

Motivation: Existing multilingual RAG approaches use a "one-size-fits-all" strategy that leads to knowledge bias and conflict when processing semantically equivalent queries across different languages through single-turn retrieval, resulting in suboptimal performance.

Method: Proposes LcRL framework with language-coupled Group Relative Policy Optimization. Uses language-coupled group sampling in rollout to reduce knowledge bias, and regularizes auxiliary anti-consistency penalty in reward models to mitigate knowledge conflict.

Result: LcRL achieves competitive performance and is appropriate for various practical scenarios including constrained training data and retrieval over collections with large numbers of languages.

Conclusion: The proposed framework effectively addresses knowledge bias and conflict in multilingual RAG, demonstrating improved performance and practical applicability across diverse multilingual settings.

Abstract: Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a uniform process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a “one-size-fits-all” strategy is often suboptimal in multilingual settings, as models incur knowledge bias and conflict when interacting with the search engine. To alleviate these issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language-coupled group sampling in the rollout module to reduce knowledge bias, and regularize an auxiliary anti-consistency penalty in the reward models to mitigate the knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at https://github.com/Cherry-qwq/LcRL-Open.

[33] PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation

Chenning Xu, Mao Zheng, Mingyu Zheng, Mingyang Song

Main category: cs.CL

TL;DR: PodBench: A benchmark for podcast script generation with 800 samples, long contexts up to 21K tokens, and multi-speaker instructions, featuring evaluation framework combining quantitative constraints and LLM-based assessment.

Motivation: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, but there's a lack of systematic evaluation resources for this task. The paper aims to bridge this gap by creating a comprehensive benchmark.

Method: Introduces PodBench with 800 samples featuring inputs up to 21K tokens and complex multi-speaker instructions. Proposes a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment.

Result: Proprietary models generally excel, but open-source models with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, high instruction following doesn’t guarantee high content substance.

Conclusion: PodBench offers a reproducible testbed to address challenges in long-form, audio-centric generation, revealing persistent divergence between instruction following and content quality that needs further research.

Abstract: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.

[34] CodeDelegator: Mitigating Context Pollution via Role Separation in Code-as-Action Agents

Tianxiang Fei, Cheng Chen, Yue Pan, Mao Zheng, Mingyang Song

Main category: cs.CL

TL;DR: CodeDelegator: A multi-agent framework that separates planning (Delegator) from implementation (Coder) to prevent context pollution and improve long-horizon task performance in LLM-based agents.

Motivation: Real-world tasks require both strategic planning and detailed implementation, but using a single LLM agent for both leads to context pollution from debugging traces and intermediate failures, which impairs long-horizon performance.

Method: CodeDelegator uses role specialization: a persistent Delegator agent handles strategic planning, task decomposition, specification writing, and progress monitoring without executing code. For each sub-task, a new Coder agent is instantiated with a clean context containing only its specification. EPSS (Ephemeral-Persistent State Separation) isolates each Coder’s execution state while preserving global coherence.

Result: Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.

Conclusion: Separating planning from implementation through role specialization and context isolation improves LLM agent performance on complex, long-horizon tasks by preventing context pollution from debugging and failures.

Abstract: Recent advances in large language models (LLMs) allow agents to represent actions as executable code, offering greater expressivity than traditional tool-calling. However, real-world tasks often demand both strategic planning and detailed implementation. Using a single agent for both leads to context pollution from debugging traces and intermediate failures, impairing long-horizon performance. We propose CodeDelegator, a multi-agent framework that separates planning from implementation via role specialization. A persistent Delegator maintains strategic oversight by decomposing tasks, writing specifications, and monitoring progress without executing code. For each sub-task, a new Coder agent is instantiated with a clean context containing only its specification, shielding it from prior failures. To coordinate between agents, we introduce Ephemeral-Persistent State Separation (EPSS), which isolates each Coder’s execution state while preserving global coherence, preventing debugging traces from polluting the Delegator’s context. Experiments on various benchmarks demonstrate the effectiveness of CodeDelegator across diverse scenarios.
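
The role separation can be caricatured in a dozen lines: the Delegator keeps a persistent plan, each Coder is instantiated with a context holding only its specification, and only outcomes flow back. The llm stub and message format are invented for this sketch.

```python
def llm(prompt: str) -> str:          # stand-in for any chat-completion API
    return f"[model output for: {prompt[:40]}...]"

def run_task(task: str, subtask_specs: list[str]) -> str:
    """Ephemeral-Persistent State Separation, schematically: Coder
    contexts are throwaway, so debugging traces and intermediate
    failures never pollute the Delegator's persistent context."""
    delegator_ctx = [f"PLAN: {task}"]            # persistent state
    results = []
    for spec in subtask_specs:
        coder_ctx = [f"SPEC: {spec}"]            # fresh, spec-only context
        results.append(llm("\n".join(coder_ctx)))
        delegator_ctx.append(f"DONE: {spec}")    # only outcomes flow back
    return llm("\n".join(delegator_ctx + results))

print(run_task("build a CSV report", ["load data", "aggregate", "render"]))
```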

[35] The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski

Main category: cs.CL

TL;DR: The paper introduces Corpus Clarification, a framework to standardize citizen contributions in public forums for political analysis, and shows that small, open-weights LLMs can effectively perform this standardization, matching or outperforming larger LLMs.

Motivation: While LLMs are widely used in NLP, their application to democratic activities like online deliberations raises ethical concerns. The research aims to standardize citizen contributions for easier topic modeling and political analysis, while exploring whether small, locally-runnable LLMs can reliably perform this standardization.

Method: Introduces Corpus Clarification as a preprocessing framework that transforms noisy, multi-topic citizen contributions into structured argumentative units. Creates GDN-CC, a manually-curated dataset of 1,231 French Grand Débat National contributions with 2,285 annotated argumentative units. Finetunes Small Language Models to reproduce these annotations and tests them on opinion clustering tasks.

Result: Finetuned Small Language Models match or outperform larger LLMs in reproducing argumentative annotations. The framework enables creation of GDN-CC-large, an automatically annotated corpus of 240k contributions - the largest annotated democratic consultation dataset to date.

Conclusion: Small, open-weights LLMs can effectively standardize citizen contributions for political analysis, addressing ethical concerns about using large proprietary models. The Corpus Clarification framework and released datasets enable more transparent, accessible analysis of democratic consultation data.

Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised for their usage as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how well this standardization can reliably be performed by small, open-weights LLMs, i.e. models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually-curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.

[36] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang

Main category: cs.CL

TL;DR: CorpusQA: A new benchmark for testing LLMs’ reasoning across entire document repositories (up to 10M tokens), challenging systems to perform holistic reasoning over dispersed evidence without relying on sparse retrieval assumptions.

Motivation: Existing benchmarks are inadequate for testing corpus-level reasoning because they're limited to single long texts or rely on "sparse retrieval" assumptions where answers come from few relevant chunks. This fails for true corpus analysis where evidence is dispersed across hundreds of documents and requires global integration, comparison, and statistical aggregation.

Method: Introduced CorpusQA benchmark scaling to 10 million tokens, generated via novel data synthesis framework that decouples reasoning from textual representation. Creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast unstructured text without fallible human annotation.

Result: State-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Fine-tuning on synthesized data enhances LLM’s general long-context reasoning capabilities. Memory-augmented agentic architectures offer more robust alternative.

Conclusion: Critical shift needed from simply extending context windows to developing advanced architectures for global information synthesis. Memory-augmented agentic architectures show promise for robust corpus-level reasoning.

Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a “sparse retrieval” assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM’s general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
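
The phrase "decoupling reasoning from textual representation" can be pictured as: sample structured records, compute the answer programmatically, then verbalize each record into its own document so the evidence is maximally dispersed. Everything below (fields, template, query) is an invented miniature of that pipeline.

```python
import random

random.seed(0)
records = [{"city": f"city{i}", "pop": random.randint(1, 999)}
           for i in range(500)]
answer = max(records, key=lambda r: r["pop"])["city"]   # programmatic truth
docs = [f"The town of {r['city']} reported a population of "
        f"{r['pop']} thousand."
        for r in records]                                # 500 separate documents
query = "Across all town reports, which city has the largest population?"
# A system must aggregate over every document; no small set of chunks suffices.
print(answer, len(docs))
```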

[37] A Comprehensive Benchmark of Language Models on Unicode and Romanized Sinhala

Minuri Rajapakse, Ruvan Weerasinghe

Main category: cs.CL

TL;DR: Benchmark study of modern language models on Unicode and Romanized Sinhala shows Mistral models perform best for each script type, with Llama-3.1-8B showing strong overall performance and significant disparities among closed-source models.

Motivation: Language models' performance on lower-resource, morphologically rich languages like Sinhala remains under-explored, especially for Romanized Sinhala which is prevalent in digital communication but lacks comprehensive evaluation.

Method: Comprehensive benchmark using diverse corpus of Unicode and Romanized Sinhala; open-source models evaluated via perplexity (predictive performance), closed-source models via qualitative analysis of sentence completion.

Result: Mistral-Nemo-Base-2407 best for Unicode text, Mistral-7B-v0.3 best for Romanized text; Llama-3.1-8B strong overall; closed-source models show significant disparities: Gemini-1.5-pro and DeepSeek excel at Unicode generation, Claude-3.5-Sonnet superior at Romanized text.

Conclusion: Results provide essential guide for practitioners selecting models for Sinhala applications and highlight critical role of training data in handling script variations.

Abstract: The performance of Language Models (LMs) on lower-resource, morphologically rich languages like Sinhala remains under-explored, particularly for Romanized Sinhala, which is prevalent in digital communication. This paper presents a comprehensive benchmark of modern LMs on a diverse corpus of Unicode and Romanized Sinhala. We evaluate open-source models using perplexity, a measure of how well a model predicts a text, and leading closed-source models via a qualitative analysis of sentence completion. Our findings reveal that the Mistral-Nemo-Base-2407 model achieves the strongest predictive performance on Unicode text and the Mistral-7B-v0.3 model for Romanized text. The results also highlight the strong all-around performance of the Llama-3.1-8B model for both scripts. Furthermore, a significant performance disparity exists among closed-source models: Gemini-1.5-pro and DeepSeek excel at Unicode generation, whereas Claude-3.5-Sonnet is superior at handling Romanized text. These results provide an essential guide for practitioners selecting models for Sinhala-specific applications and highlight the critical role of training data in handling script variations.
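
For readers unfamiliar with the metric, perplexity is the exponential of the mean token negative log-likelihood. A standard computation with Hugging Face transformers is shown below; the commented model name is one of those the paper evaluates, used here only as an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tok, text: str) -> float:
    """Perplexity = exp(mean token NLL); lower means the model predicts
    the text better, the comparison used for the open-source models."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy over tokens
    return torch.exp(loss).item()

# Example (one of the evaluated open models):
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")
# print(perplexity(model, tok, "some Unicode or Romanized Sinhala text"))
```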

[38] Obscuring Data Contamination Through Translation: Evidence from Arabic Corpora

Chaymaa Abbas, Nour Shamaa, Mariette Awad

Main category: cs.CL

TL;DR: The paper investigates data contamination in multilingual LLM evaluation, showing that translation into Arabic masks conventional contamination indicators, and proposes a Translation-Aware Contamination Detection method that reliably exposes contamination when English-only methods fail.

Motivation: Data contamination undermines LLM evaluation validity by allowing models to rely on memorized benchmark content. Prior contamination detection methods are largely limited to English benchmarks, leaving multilingual contamination poorly understood and creating a need for multilingual evaluation approaches.

Method: Researchers fine-tuned several open-weight LLMs on varying proportions of Arabic datasets and evaluated them on original English benchmarks. They extended the Tested Slot Guessing method with choice-reordering and incorporated Min-K% probability analysis to capture both behavioral and distributional contamination signals. They propose Translation-Aware Contamination Detection which compares signals across multiple translated benchmark variants rather than English alone.

Result: Translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, especially those with stronger Arabic capabilities. This effect is reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination grows. The proposed Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail.

Conclusion: The findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs, as current English-only contamination detection methods are insufficient for multilingual settings.

Abstract: Data contamination undermines the validity of Large Language Model evaluation by enabling models to rely on memorized benchmark content rather than true generalization. While prior work has proposed contamination detection methods, these approaches are largely limited to English benchmarks, leaving multilingual contamination poorly understood. In this work, we investigate contamination dynamics in multilingual settings by fine-tuning several open-weight LLMs on varying proportions of Arabic datasets and evaluating them on original English benchmarks. To detect memorization, we extend the Tested Slot Guessing method with a choice-reordering strategy and incorporate Min-K% probability analysis, capturing both behavioral and distributional contamination signals. Our results show that translation into Arabic suppresses conventional contamination indicators, yet models still benefit from exposure to contaminated data, particularly those with stronger Arabic capabilities. This effect is consistently reflected in rising Min-K% scores and increased cross-lingual answer consistency as contamination levels grow. To address this blind spot, we propose Translation-Aware Contamination Detection, which identifies contamination by comparing signals across multiple translated benchmark variants rather than English alone. Translation-Aware Contamination Detection reliably exposes contamination even when English-only methods fail. Together, our findings highlight the need for multilingual, translation-aware evaluation pipelines to ensure fair, transparent, and reproducible assessment of LLMs.
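
The Min-K% probability signal is simple to compute: score a text by the average log-probability of its k% least likely tokens, on the intuition that memorized text rarely contains very surprising tokens. A generic sketch (k and the inputs are illustrative):

```python
import numpy as np

def min_k_percent(token_logprobs: list[float], k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens; higher
    values suggest the model has seen (memorized) the text before."""
    lp = np.sort(np.asarray(token_logprobs))
    m = max(1, int(len(lp) * k))
    return float(lp[:m].mean())

suspect = [-0.1, -0.3, -0.2, -0.5, -0.4, -0.2]   # uniformly unsurprising
fresh = [-0.1, -4.2, -0.3, -3.8, -0.2, -5.1]     # some very surprising tokens
print(min_k_percent(suspect), min_k_percent(fresh))  # higher vs. lower score
```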

[39] Knowledge Restoration-driven Prompt Optimization: Unlocking LLM Potential for Open-Domain Relational Triplet Extraction

Xiaonan Jing, Gongqing Wu, Xingrui Zhuo, Lang Sun, Jiapu Wang

Main category: cs.CL

TL;DR: KRPO framework uses knowledge reconstruction-driven prompt optimization to improve LLMs’ open-domain relational triplet extraction through self-evaluation, textual gradient-based prompt optimization, and relation canonicalization memory.

Motivation: Existing ORTE methods rely on static, heuristic-driven prompting strategies that lack reflection mechanisms, making them vulnerable to semantic ambiguity and causing erroneous extraction patterns to become permanent.

Method: Proposes KRPO framework with: 1) self-evaluation mechanism based on knowledge restoration (projecting triplets into semantic consistency scores), 2) textual gradient-based prompt optimizer that internalizes historical experiences, and 3) relation canonicalization memory to collect representative relations and provide semantically distinct schemas.

Result: Extensive experiments across three datasets show KRPO significantly outperforms strong baselines in extraction F1 score.

Conclusion: KRPO framework effectively addresses the limitations of static prompting strategies by enabling continuous improvement through knowledge reconstruction-driven optimization, enhancing LLMs’ capability for complex ORTE tasks.

Abstract: Open-domain Relational Triplet Extraction (ORTE) is the foundation for mining structured knowledge without predefined schemas. Despite the impressive in-context learning capabilities of Large Language Models (LLMs), existing methods are hindered by their reliance on static, heuristic-driven prompting strategies. Due to the lack of reflection mechanisms required to internalize erroneous signals, these methods exhibit vulnerability in semantic ambiguity, often making erroneous extraction patterns permanent. To address this bottleneck, we propose a Knowledge Reconstruction-driven Prompt Optimization (KRPO) framework to assist LLMs in continuously improving their extraction capabilities for complex ORTE task flows. Specifically, we design a self-evaluation mechanism based on knowledge restoration, which provides intrinsic feedback signals by projecting structured triplets into semantic consistency scores. Subsequently, we propose a prompt optimizer based on a textual gradient that can internalize historical experiences to iteratively optimize prompts, which can better guide LLMs to handle subsequent extraction tasks. Furthermore, to alleviate relation redundancy, we design a relation canonicalization memory that collects representative relations and provides semantically distinct schemas for the triplets. Extensive experiments across three datasets show that KRPO significantly outperforms strong baselines in the extraction F1 score.
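
The knowledge-restoration self-evaluation can be approximated as: restore the extracted triplet to text, embed both the restoration and the source sentence, and use their cosine similarity as the intrinsic feedback signal. The hashing encoder below is only a runnable stand-in for a real sentence embedder.

```python
import numpy as np

def embed(text: str) -> np.ndarray:   # toy stand-in for a sentence encoder
    v = np.zeros(64)
    for w in text.lower().split():
        v[hash(w) % 64] += 1.0
    return v

def consistency_score(source: str, triplet: tuple[str, str, str]) -> float:
    """Restore the triplet to text and score its semantic consistency
    with the source sentence (cosine similarity as intrinsic feedback)."""
    restored = " ".join(triplet)
    a, b = embed(source), embed(restored)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(consistency_score("Marie Curie discovered polonium in 1898.",
                        ("Marie Curie", "discovered", "polonium")))
```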

[40] LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering

Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo, Xiaoli Li, Ru Li, Jeff Z. Pan

Main category: cs.CL

TL;DR: LogicScore is a new evaluation framework for Attributed Question Answering that addresses attribution myopia by assessing global logical integrity rather than just isolated statement verification.

Motivation: Current AQA evaluation methods suffer from attribution myopia - they focus too much on verifying isolated statements and their attributions while ignoring the global logical integrity of long-form answers, leading to LLMs producing factually grounded but logically incoherent responses.

Method: LogicScore uses Horn Rules and integrates a backward verification mechanism to systematically evaluate three reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment).

Result: Experiments across three multi-hop QA datasets (HotpotQA, MusiQue, 2WikiMultiHopQA) and over 20 LLMs reveal a critical capability gap: leading models achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro).

Conclusion: LogicScore establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development.

Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from “attribution myopia”: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Codes are available at: https://github.com/zhichaoyan11/LogicScore.
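
Horn-rule backward verification has a compact propositional caricature: a conclusion passes only if it can be derived from stated premises through the rules. The sketch below ignores variables and the three scoring dimensions, so treat it purely as intuition for the mechanism.

```python
def backward_prove(goal, facts, rules, seen=frozenset()):
    """Propositional Horn-style backward chaining: a goal holds if it is
    a stated fact, or if some rule derives it from goals that all hold."""
    if goal in facts:
        return True
    if goal in seen:                      # block circular derivations
        return False
    return any(all(backward_prove(b, facts, rules, seen | {goal})
                   for b in body)
               for head, body in rules if head == goal)

facts = {"born_in_paris", "paris_is_in_france"}
rules = [("born_in_france", ["born_in_paris", "paris_is_in_france"])]
print(backward_prove("born_in_france", facts, rules))   # True
```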

[41] Circadian Modulation of Semantic Exploration in Social Media Language

Vuong Hung Truong, Mariana Gabrielle Cangco Reyes, Masatoshi Koizumi, Jihwan Myung

Main category: cs.CL

TL;DR: The study reveals circadian rhythms in semantic exploration using Reddit data, showing morning peaks in local semantic exploration and later peaks in global semantic diversity, independent of mood effects.

Motivation: To understand how circadian rhythms influence high-dimensional semantic behavior, which remains poorly understood despite known circadian modulation of human cognition.

Method: Analyzed large-scale Reddit data using pretrained transformer embeddings to measure semantic entropy as an index of linguistic exploration-exploitation, distinguishing between local and global semantic entropy.

Result: Found robust circadian rhythmicity in semantic exploration: local semantic exploration peaks in the morning (broader exploration), while global semantic diversity peaks later (accumulation around established topics). Patterns are not explained by sentiment or affective valence.

Conclusion: Biological circadian rhythms extend to the semantic domain, aligning with known diurnal patterns in neuromodulatory systems, revealing a cognitive dimension distinct from mood.

Abstract: Human cognition exhibits strong circadian modulation, yet its influence on high-dimensional semantic behavior remains poorly understood. Using large-scale Reddit data, we quantify time-of-day variation in language use by embedding text into a pretrained transformer model and measuring semantic entropy as an index of linguistic exploration-exploitation, for which we show a robust circadian rhythmicity that could be entrained by seasonal light cues. Distinguishing between local and global semantic entropy reveals a systematic temporal dissociation: local semantic exploration peaks in the morning, reflecting broader exploration of semantic space, whereas global semantic diversity peaks later in the day as submissions accumulate around already established topics, consistent with “rich-get-richer” dynamics. These patterns are not explained by sentiment or affective valence, indicating that semantic exploration captures a cognitive dimension distinct from mood. The observed temporal structure aligns with known diurnal patterns in neuromodulatory systems, suggesting that biological circadian rhythms extend to the semantic domain.
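
One plausible operationalization of semantic entropy (ours; the paper may differ in detail): embed each post, cluster the embeddings, and take the Shannon entropy of the cluster occupancy distribution as the exploration index. Computed per hour of day, such a quantity would trace the reported circadian curves.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_entropy(embeddings: np.ndarray, n_clusters: int = 10) -> float:
    """Shannon entropy of cluster occupancy over text embeddings: high
    when posts spread across semantic space (exploration), low when
    they concentrate on a few topics (exploitation)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    p = np.bincount(labels, minlength=n_clusters) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

hourly_posts = np.random.rand(300, 32)   # stand-in for hour-bucketed embeddings
print(semantic_entropy(hourly_posts))
```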

[42] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello, Po-Hao Chen, Henrique Min Ho Lee, Gilberto Szarf, Hamilton Shoji, Jason Sho, Katherine Andriole, Tessa Cook, Lisa C. Adams, Linda C. Chu, Maggie Chung, Geraldine Brusca-Augello, Djeven P. Deva, Navneet Singh, Felipe Sanchez Tijmes, Jeffrey B. Alpert, Elsie T. Nguyen, Drew A. Torigian, Kate Hanneman, Lauren K Groner, Alexander Phan, Ali Islam, Matias F. Callejas, Gustavo Borges da Silva Teles, Faisal Jamal, Maryam Vazirabad, Ali Tejani, Hari Trivedi, Paulo Kuriki, Rajesh Bhayana, Elana T. Benishay, Yi Lin, Yifan Peng, George Shih

Main category: cs.CL

TL;DR: Researchers developed a high-quality chest radiograph benchmark using AI-assisted expert labeling, creating 200 verified studies with 12 labels for evaluating multimodal LLMs in radiology.

DetailsMotivation: Current multimodal LLMs perform well on multiple-choice exams but lack clinically useful benchmarks curated by domain experts for real-world radiology applications.

Method: Used GPT-4o to extract abnormal findings from 13,735 chest radiograph reports, mapped to 12 benchmark labels with Phi-4-Reasoning. Sampled 1,000 studies based on AI-suggested labels for expert review by 17 radiologists, with each study evaluated by 3 experts.

Result: Created benchmark of 200 chest radiographic studies (100 released, 100 holdout) with 12 benchmark labels, each verified by at least 2 radiologists agreeing on all labels. Developed AI-assisted labeling procedure for efficient expert review.

Conclusion: Established a publicly available, expert-validated benchmark for evaluating multimodal LLMs in chest radiography, along with an AI-assisted labeling workflow that enables scalable, high-quality dataset curation in radiology.

Abstract: Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The goals of this work were to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked “Agree all”, “Agree mostly” or “Disagree” to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected “Agree all” for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

[43] Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Yinzhu Chen, Abdine Maiga, Hossein A. Rahmani, Emine Yilmaz

Main category: cs.CL

TL;DR: A retrieval-augmented multi-agent framework automates generation of instance-specific evaluation rubrics for medical LLMs, improving clinical intent alignment by 5% and nearly doubling quality separation compared to GPT-4o baseline.

DetailsMotivation: LLMs in clinical decision support risk hallucinations and unsafe suggestions that pose patient safety risks. These subtle clinical errors evade generic metrics, while expert-authored rubrics are costly and difficult to scale.

Method: Retrieval-augmented multi-agent framework that grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria.
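
The atomic-fact decomposition step can be sketched with a single LLM call. The prompt, model choice, and `atomic_facts` helper below are illustrative assumptions, not the paper's agent design.

```python
# Hypothetical sketch: decompose retrieved medical evidence into atomic
# facts, each of which can become one verifiable rubric criterion.
from openai import OpenAI

client = OpenAI()

def atomic_facts(evidence: str) -> list[str]:
    prompt = ("Decompose the following medical evidence into independent, "
              "verifiable atomic facts, one per line:\n\n" + evidence)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]
```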

Result: Achieves Clinical Intent Alignment score of 60.12% (vs. GPT-4o baseline 55.16%), mean score delta of 8.658 (vs. 4.972 baseline), AUROC of 0.977, and guides response refinement improving quality by 9.2% (59.0% to 68.2%).

Conclusion: Provides scalable and transparent foundation for both evaluating and improving medical LLMs, addressing safety risks through automated, evidence-based rubric generation.

Abstract: Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are particularly challenging as they often manifest as subtle clinical errors that evade detection by generic metrics, while expert-authored fine-grained rubrics remain costly to construct and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench, our framework achieves a Clinical Intent Alignment (CIA) score of 60.12%, a statistically significant improvement over the GPT-4o baseline (55.16%). In discriminative tests, our rubrics yield a mean score delta (μΔ = 8.658) and an AUROC of 0.977, nearly doubling the quality separation achieved by the GPT-4o baseline (4.972). Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2% (from 59.0% to 68.2%). This provides a scalable and transparent foundation for both evaluating and improving medical LLMs. The code is available at https://anonymous.4open.science/r/Automated-Rubric-Generation-AF3C/.

[44] The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

Main category: cs.CL

TL;DR: dLLMs’ arbitrary order generation actually narrows reasoning boundaries by allowing models to bypass crucial high-uncertainty tokens, leading to premature solution space collapse. A minimalist approach using standard GRPO without arbitrary order flexibility outperforms complex RL methods.

DetailsMotivation: The paper challenges the common assumption that diffusion LLMs' arbitrary token generation order provides superior reasoning potential. It reveals that current implementations of this flexibility actually harm reasoning by allowing models to avoid difficult tokens, collapsing the solution space prematurely.

Method: Proposes JustGRPO - a minimalist approach that intentionally forgoes arbitrary order generation and applies standard Group Relative Policy Optimization (GRPO) instead. This maintains parallel decoding ability while avoiding the pitfalls of order flexibility.
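
The group-relative credit assignment that JustGRPO inherits from standard GRPO is compact enough to sketch; this follows the usual GRPO formulation rather than the authors' code.

```python
# Standard GRPO advantage: each sampled completion's reward is normalized
# against its own group's mean and standard deviation.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,) for completions of a single prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```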

Result: JustGRPO achieves surprisingly strong performance (89.1% accuracy on GSM8K) while being simpler than existing RL approaches that preserve arbitrary order flexibility. The method demonstrates that effective reasoning is better elicited without arbitrary order generation.

Conclusion: The flexibility of arbitrary order generation in dLLMs, in its current form, narrows reasoning boundaries rather than expanding them. A simpler approach that intentionally forgoes this flexibility while maintaining parallel decoding yields superior performance on reasoning tasks.

Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

[45] Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time

Ilia Kuznetsov, Rohan Nayak, Alla Rozovskaya, Iryna Gurevych

Main category: cs.CL

TL;DR: No consistent decline in review quality found across major AI conferences despite popular narrative about declining quality; new framework introduced for evidence-based comparative study of reviews.

DetailsMotivation: Address concerns about declining review quality in science as submission numbers rise and research communities grow, but lack of evidence-based comparative studies makes it hard to verify these claims.

Method: Introduced new framework for evidence-based comparative study of review quality, applied to ICLR, NeurIPS and *ACL conferences. Developed review standardization approach, multi-dimensional schema for quantifying review quality as utility to editors/authors, using both LLM-based and lightweight measurements.

Result: Cross-temporal analysis revealed no consistent decline in median review quality across venues and years, contradicting popular narrative about declining review quality.

Conclusion: Proposed alternative explanations for perceived decline and outlined recommendations to facilitate future empirical studies of review quality.

Abstract: Peer review is at the heart of modern science. As submission numbers rise and research communities grow, the decline in review quality is a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema for quantifying review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships between measurements of review quality, and its evolution over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations, and outline recommendations to facilitate future empirical studies of review quality.

Naghmeh Farzi, Laura Dietz, Dave D. Lewis

Main category: cs.CL

TL;DR: Researchers explore using factual nugget-based methods to help legal professionals evaluate and improve deposition summaries, moving beyond automated evaluation to direct user assistance.

DetailsMotivation: LLMs are increasingly used for summarizing legal documents like depositions, but factual accuracy is critical in this domain. While nugget-based methods work well for automated evaluation, their potential to directly assist end users (legal professionals) remains underexplored.

Method: Develop a prototype system that leverages factual nugget-based approach to support legal professionals in two scenarios: 1) determining which of two summaries is better, and 2) manually improving an automatically generated summary.

Result: The paper presents a working prototype that demonstrates how nugget-based methods can be translated from automated evaluation to direct user assistance in the legal domain.

Conclusion: Nugget-based approaches show promise not just for automated evaluation but also for directly supporting legal professionals in evaluating and improving deposition summaries, addressing the critical need for factual accuracy in legal document summarization.

Abstract: While large language models (LLMs) are increasingly used to summarize long documents, this trend poses significant challenges in the legal domain, where the factual accuracy of deposition summaries is crucial. Nugget-based methods have been shown to be extremely helpful for the automated evaluation of summarization approaches. In this work, we translate these methods to the user side and explore how nuggets could directly assist end users. Although prior systems have demonstrated the promise of nugget-based evaluation, its potential to support end users remains underexplored. Focusing on the legal domain, we present a prototype that leverages a factual nugget-based approach to support legal professionals in two concrete scenarios: (1) determining which of two summaries is better, and (2) manually improving an automatically generated summary.

[47] Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Main category: cs.CL

TL;DR: Fine-tuning frontier language models can cause “privacy collapse” where models lose contextual privacy reasoning while maintaining standard benchmark performance, creating silent safety failures.

DetailsMotivation: To investigate how fine-tuning affects language models' privacy capabilities, particularly how subtle training patterns can degrade contextual privacy reasoning while models appear safe on standard benchmarks.

Method: Experimental analysis across six models (closed and open weight), five fine-tuning datasets (real-world and controlled), and two task categories (agentic and memory-based), with mechanistic analysis of privacy representation fragility.

Result: Fine-tuned models exhibit privacy collapse - losing ability to reason about contextual privacy norms, sharing information inappropriately with tools, and violating memory boundaries across contexts, while maintaining high performance on standard safety/utility benchmarks.

Conclusion: Privacy collapse represents a critical gap in current safety evaluations, especially for specialized agents, as privacy representations are uniquely fragile to fine-tuning compared to task-relevant features.

Abstract: We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a “silent failure” because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

[48] Metadata Conditioned Large Language Models for Localization

Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos

Main category: cs.CL

TL;DR: Metadata conditioning improves LLM localization without sacrificing cross-region generalization, enabling global models to achieve region-specific performance with better efficiency.

DetailsMotivation: Current LLMs treat text as a single global distribution, leading to geographically homogenized behavior. The paper aims to address this by exploring lightweight localization methods.

Method: Pre-trained 31 models (0.5B and 1B parameters) from scratch on English news data annotated with verified URLs, country tags, and continent tags covering 4 continents and 17 countries. Used metadata conditioning and conducted ablation studies to understand URL-level metadata effectiveness and regional data coverage importance.
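
Metadata conditioning itself is simple to picture: geographic tags are prepended to each training document. A toy sketch whose tag format and values are assumptions, not the paper's exact scheme:

```python
# Illustrative metadata conditioning: prepend URL/country/continent tags
# so the model can condition generation on them at inference time.
def condition(doc: str, url: str, country: str, continent: str) -> str:
    return f"<url={url}> <country={country}> <continent={continent}> {doc}"

sample = condition("Parliament passed the bill today...",
                   "example-news.co.ke", "KE", "Africa")  # hypothetical values
```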

Result: Metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. URL-level metadata captures much geographic signal, but balanced regional coverage remains essential. After instruction tuning, metadata-conditioned models achieve accuracy comparable to LLaMA-3.2-1B-Instruct despite less training data.

Conclusion: Metadata conditioning is a practical and compute-efficient approach for localization of language models, establishing it as an effective method for geographic adaptation without compromising generalization.

Abstract: Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata-conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.

[49] Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs

Rian Dolphin, Joe Dursun, Jarrett Blankenship, Katie Adams, Quinton Pike

Main category: cs.CL

TL;DR: A three-stage pipeline for extracting structured risk factors from 10-K filings using LLMs, embeddings, and validation, with autonomous taxonomy maintenance that improves extraction quality over time.

DetailsMotivation: Need to extract structured risk factors from unstructured corporate filings (10-Ks) while maintaining alignment with a predefined hierarchical taxonomy, enabling systematic analysis of corporate risk profiles across industries.

Method: Three-stage pipeline: 1) LLM extraction with supporting quotes, 2) embedding-based semantic mapping to taxonomy categories, 3) LLM-as-a-judge validation to filter spurious assignments. Plus autonomous taxonomy maintenance where AI agent analyzes feedback to identify problematic categories and propose refinements.
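
Stage 2, the embedding-based mapping, can be pictured as nearest-category cosine similarity with low-confidence assignments deferred to the LLM-as-a-judge stage. A sketch under assumed encoder and threshold choices (both illustrative):

```python
# Hedged sketch of taxonomy mapping; model name and threshold are
# illustrative assumptions, not from the paper.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def map_to_taxonomy(risk_texts, category_names, threshold=0.35):
    risk_vecs = encoder.encode(risk_texts, normalize_embeddings=True)
    cat_vecs = encoder.encode(category_names, normalize_embeddings=True)
    sims = risk_vecs @ cat_vecs.T  # cosine similarities (unit vectors)
    assignments = []
    for row in sims:
        j = int(row.argmax())
        # below-threshold matches go to the judge stage instead
        assignments.append((category_names[j] if row[j] >= threshold else None,
                            float(row[j])))
    return assignments
```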

Result: Extracted 10,688 risk factors from S&P 500 companies; achieved 104.7% improvement in embedding separation via autonomous taxonomy maintenance; same-industry companies show 63% higher risk profile similarity than cross-industry pairs (Cohen’s d=1.06, AUC 0.82, p<0.001).

Conclusion: Methodology successfully extracts taxonomy-aligned risk factors with meaningful economic structure, generalizes to other domains, and autonomous improvement enables continuous quality maintenance as systems process more documents.

Abstract: We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance where an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving 104.7% improvement in embedding separation in a case study. External validation confirms the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen’s d=1.06, AUC 0.82, p<0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.

[50] The Effect of Scripts and Formats on LLM Numeracy

Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner

Main category: cs.CL

TL;DR: LLMs struggle with numerical expressions in underrepresented scripts/formats despite good performance on standard arithmetic, but targeted prompting can help bridge this gap.

DetailsMotivation: While LLMs show impressive arithmetic proficiency, their performance on numerical expressions that deviate from training conventions (different numeral scripts/formats) remains unexplored, creating a gap in understanding multilingual numerical reasoning capabilities.

Method: Investigated numerical reasoning across diverse numeral scripts and formats, testing LLM performance when inputs use underrepresented representations. Evaluated targeted prompting strategies including few-shot prompting and explicit numeral mapping.
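
The script perturbations are easy to reproduce. A toy example that rewrites Western Arabic digits in two other numeral scripts while leaving the arithmetic untouched (this script pair is an example, not the paper's full set):

```python
# Same question, different numeral scripts: the math is identical.
EASTERN_ARABIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")
DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")

question = "What is 47 + 18?"
print(question.translate(EASTERN_ARABIC))  # What is ٤٧ + ١٨?
print(question.translate(DEVANAGARI))      # What is ४७ + १८?
```

An explicit numeral-mapping prompt of the kind the paper evaluates could be as simple as prefixing a digit table ("٠=0, ١=1, ...") to the query.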

Result: LLM accuracy drops substantially when numerical inputs use underrepresented scripts/formats, even when underlying mathematical reasoning is identical. However, targeted prompting strategies significantly narrow this performance gap.

Conclusion: The study reveals an overlooked challenge in multilingual numerical reasoning and provides practical insights for improving LLM reliability when interpreting, manipulating, and generating numbers across diverse numeral scripts and formatting styles.

Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.

[51] Robust Fake News Detection using Large Language Models under Adversarial Sentiment Attacks

Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth

Main category: cs.CL

TL;DR: AdSent: A sentiment-robust fake news detection framework that addresses vulnerabilities to sentiment manipulation attacks using LLMs, with a sentiment-agnostic training strategy for improved robustness.

DetailsMotivation: Current fake news detectors rely on sentiment features, making them vulnerable to adversarial manipulation of sentiment, especially with the rise of LLMs that can easily alter sentiment while preserving content meaning.

Method: 1) Propose controlled sentiment-based adversarial attacks using LLMs to generate sentiment-altered news articles. 2) Analyze impact of sentiment shifts on detection performance. 3) Introduce novel sentiment-agnostic training strategy to enhance robustness against sentiment perturbations.

Result: Changing sentiment heavily impacts fake news detection performance, revealing biases where neutral articles are classified as real and non-neutral articles as fake. AdSent significantly outperforms baselines in accuracy and robustness across three benchmark datasets and generalizes well to unseen datasets and adversarial scenarios.

Conclusion: Sentiment manipulation poses a serious vulnerability in fake news detection, and the proposed AdSent framework with sentiment-agnostic training provides effective defense against such adversarial attacks while maintaining strong detection performance.

Abstract: Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability since adversaries can manipulate sentiment to evade detectors especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs, (2) analyze the impact of sentiment shifts on detection performance. We show that changing the sentiment heavily impacts the performance of fake news detection models, indicating biases towards neutral articles being real, while non-neutral articles are often classified as fake content. (3) We introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.

[52] H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu

Main category: cs.CL

TL;DR: H3Fusion introduces a mixture-of-experts based fusion mechanism for LLM alignment that models alignment as controllable drift in representation subspace, balancing helpfulness, harmlessness, and honesty through drift regularization and gating losses.

DetailsMotivation: Current LLM alignment approaches struggle to simultaneously satisfy all three alignment properties (helpful, harmless, honest) in the model's representation subspace, requiring a method that can balance these competing dimensions effectively.

Method: Uses mixture-of-experts fusion mechanism to model alignment as controllable drift in representation subspace with drift-regularization loss; formulates alignment as dual objective harnessing distance between generated and alignment embeddings; introduces gating loss to canalize activations on contributing experts.

Result: Outperforms individually aligned models by 11.37%, shows stronger robustness than state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18% across three benchmark datasets.

Conclusion: H3Fusion effectively addresses the challenge of simultaneous multi-dimensional alignment, providing a more balanced approach that improves all three alignment properties while maintaining robustness compared to existing methods.

Abstract: The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, aiming to ensure responses that are helpful, harmless, and honest. However, identifying a point in the model’s representation subspace that simultaneously satisfies all these properties remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss to balance competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings, and introduce a gating loss that canalizes activations on the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest in three aspects: it outperforms each individually aligned model by 11.37%, and provides stronger robustness compared to state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/git-disl/h3fusion.

[53] AStar: Boosting Multimodal Reasoning with Automated Structured Thinking

Jinyang Wu, Mingkuan Feng, Guocheng Zhai, Shuai Zhang, Zheng Lian, Fangrui Lv, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao

Main category: cs.CL

TL;DR: AStar is a training-free automatic structured thinking paradigm that uses “thought cards” - a lightweight library of high-level reasoning patterns - to enhance multimodal reasoning without expensive search or post-training.

DetailsMotivation: Current multimodal LLMs struggle with complex visual reasoning. Existing approaches either use computationally inefficient search-based methods that explore extensive solution spaces, or require substantial data and resources for post-training with training instability issues.

Method: Proposes AStar with “thought cards” - a lightweight library of high-level reasoning patterns abstracted from prior samples. For each test problem, it adaptively retrieves optimal thought cards and integrates external explicit guidelines with the model’s internal implicit reasoning capabilities.

Result: Achieves 53.9% accuracy on MathVerse (surpassing GPT-4o’s 50.2%) and 32.7% on MathVision (outperforming GPT-4o’s 30.4%). Thought cards show remarkable transferability across reasoning tasks and even benefit general visual perception and understanding.

Conclusion: AStar provides an efficient, training-free approach to multimodal reasoning that eliminates expensive search and complex post-training. It’s a plug-and-play test-time inference method compatible with other techniques, serving as an important complement to existing approaches.

Abstract: Multimodal large language models excel across diverse domains but struggle with complex visual reasoning tasks. To enhance their reasoning capabilities, current approaches typically rely on explicit search or post-training techniques. However, search-based methods suffer from computational inefficiency due to extensive solution space exploration, while post-training methods demand substantial data, computational resources, and often exhibit training instability. To address these challenges, we propose AStar, a training-free, Automatic Structured thinking paradigm for multimodal reasoning. Specifically, we introduce novel “thought cards”, a lightweight library of high-level reasoning patterns abstracted from prior samples. For each test problem, AStar adaptively retrieves the optimal thought cards and seamlessly integrates these external explicit guidelines with the model’s internal implicit reasoning capabilities. Compared to previous methods, AStar eliminates computationally expensive explicit search and avoids additional complex post-training processes, enabling a more efficient reasoning approach. Extensive experiments demonstrate that our framework achieves 53.9% accuracy on MathVerse (surpassing GPT-4o’s 50.2%) and 32.7% on MathVision (outperforming GPT-4o’s 30.4%). Further analysis reveals the remarkable transferability of our method: thought cards generated from mathematical reasoning can also be applied to other reasoning tasks, even benefiting general visual perception and understanding. AStar serves as a plug-and-play test-time inference method, compatible with other post-training techniques, providing an important complement to existing multimodal reasoning approaches.

[54] Personality Editing for Language Models through Adjusting Self-Referential Queries

Seojin Hwang, Yumin Kim, Byeongjeong Kim, Donghoon Shin, Hwanhee Lee

Main category: cs.CL

TL;DR: PALETTE is a novel method for personality editing in LLMs that uses self-targeted queries based on psychological constructs, requiring only 12 samples to achieve substantial personality alignment improvements.

DetailsMotivation: LLMs need precise personality control for applications like conversational agents and content creation, but current prompt-based or fine-tuning approaches lack robustness or require large training data, making them costly and impractical.

Method: PALETTE introduces adjustment queries where self-referential statements grounded in psychological constructs are treated like factual knowledge, enabling direct editing of personality-related responses without fine-tuning.

Result: The method requires only 12 editing samples to achieve substantial improvements in personality alignment across personality dimensions, with experimental results showing more stable and well-balanced personality control.

Conclusion: PALETTE provides an efficient, data-minimal approach for personality editing in LLMs that outperforms existing methods in stability and balance of personality control.

Abstract: Large Language Models (LLMs) are integral to applications such as conversational agents and content creation, where precise control over a model’s personality is essential for maintaining tone, consistency, and user engagement. However, prevailing prompt-based or fine-tuning approaches either lack robustness or demand large-scale training data, making them costly and impractical. In this paper, we present PALETTE (Personality Adjustment by LLM SElf-TargeTed quEries), a novel method for personality editing in LLMs. Our approach introduces adjustment queries, where self-referential statements grounded in psychological constructs are treated analogously to factual knowledge, enabling direct editing of personality-related responses. Unlike fine-tuning, PALETTE requires only 12 editing samples to achieve substantial improvements in personality alignment across personality dimensions. Experimental results from both automatic and human evaluations demonstrate that our method enables more stable and well-balanced personality control in LLMs.

[55] OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Raghav Thind, Youran Sun, Ling Liang, Haizhao Yang

Main category: cs.CL

TL;DR: OptimAI is an LLM-powered multi-agent framework that solves optimization problems described in natural language by translating them into mathematical formulations and generating solution strategies, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Formulating optimization problems from natural language descriptions and selecting appropriate solvers requires substantial domain expertise, creating a barrier for non-experts. Current methods struggle with this translation and solution process.

Method: Multi-agent framework with four specialized roles: (1) formulator translates natural language to mathematical formulations, (2) planner creates high-level solution strategies, (3) coder interacts with environment, and (4) code critic reflects on outcomes. Includes UCB-based debug scheduling for dynamic plan switching.
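
One plausible reading of the UCB-based debug scheduling is classic UCB1 over the alternative plans, trading off each plan's observed success rate against how rarely it has been tried. An illustrative sketch, not the authors' implementation:

```python
# UCB1 plan selection: exploit plans that have worked, but keep probing
# under-tried alternatives.
import math

def ucb1_pick(successes, attempts, c=1.4):
    """Per-plan success/attempt counts; returns index of the plan to try next."""
    total = sum(attempts)
    def score(i):
        if attempts[i] == 0:
            return float("inf")  # try every plan at least once
        mean = successes[i] / attempts[i]
        return mean + c * math.sqrt(math.log(total) / attempts[i])
    return max(range(len(attempts)), key=score)
```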

Result: Achieves 88.1% accuracy on NLP4LP dataset and 82.3% on Optibench dataset, reducing error rates by 58% and 52% respectively over prior best results. Ablation studies show planner and code critic are essential (5.8× and 3.1× productivity drops without them), with UCB scheduling adding 3.3× productivity gain.

Conclusion: OptimAI demonstrates that multi-agent collaboration with specialized roles effectively solves optimization problems from natural language descriptions, with all components being essential for optimal performance and dynamic plan switching providing additional benefits.

Abstract: Optimization plays a vital role in scientific research and practical applications. However, formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM-powered AI agents, and achieve superior performance over current state-of-the-art methods. Our framework is built upon the following key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in 5.8× and 3.1× drops in productivity, respectively. Furthermore, we introduce UCB-based debug scheduling to dynamically switch between alternative plans, yielding an additional 3.3× productivity gain. Our design emphasizes multi-agent collaboration, and our experiments confirm that combining diverse models leads to performance gains. Our approach attains 88.1% accuracy on the NLP4LP dataset and 82.3% on the Optibench dataset, reducing error rates by 58% and 52%, respectively, over prior best results.

[56] BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Jingya Wang, Li Yuan, Yonghong Tian

Main category: cs.CL

TL;DR: BioProBench is a comprehensive benchmark for evaluating LLMs’ procedural reasoning in biology, built from 27,000 human-written protocols with 550,000+ task instances, revealing LLMs struggle with deep reasoning and quantitative precision.

DetailsMotivation: Current LLMs struggle with the strict procedural logic and accuracy required for biological protocols, limiting autonomous scientific experimentation. There's a need for better evaluation and training resources for procedural reasoning in biology.

Method: Created BioProCorpus (27,000 human-written protocols) and systematically constructed BioProBench dataset (550,000+ task instances). Evaluated 10 mainstream LLMs and developed ProAgent grounded in the corpus to address identified limitations.

Result: While LLMs show high general comprehension, performance significantly drops on tasks requiring deep reasoning, quantitative precision, and safety awareness. ProAgent substantially advances state-of-the-art performance.

Conclusion: BioProBench provides both a rigorous diagnostic benchmark and foundational resource for developing reliable scientific AI, addressing critical gaps in LLMs’ procedural reasoning capabilities for biological experimentation.

Abstract: The realization of autonomous scientific experimentation is currently limited by LLMs’ struggle to grasp the strict procedural logic and accuracy required by biological protocols. To address this fundamental challenge, we present BioProBench, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded in BioProCorpus, a foundational collection of 27,000 human-written protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, offering both a large-scale training resource and a rigorous benchmark with novel metrics. Evaluating 10 mainstream LLMs, we find that while general comprehension is high, performance drops significantly on tasks demanding deep reasoning, quantitative precision, and safety awareness. To demonstrate the value of BioProCorpus in mitigating these issues, we developed ProAgent, grounded in our corpus, which substantially advances the state of the art. BioProBench provides a rigorous diagnostic benchmark and a foundational resource for developing the next generation of reliable scientific AI. Code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.

[57] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs

Haohan Yuan, Sukhwa Hong, Haopeng Zhang

Main category: cs.CL

TL;DR: StrucSum is a training-free prompting framework that uses sentence-level graph structures to enhance LLM performance in zero-shot summarization, improving summary quality and factual consistency.

DetailsMotivation: LLMs struggle with modeling document structure and identifying salient information in long texts for zero-shot summarization tasks.

Method: StrucSum injects structural signals via three strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction.
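
The centrality signal behind CAP can be sketched as PageRank over a sentence-similarity graph; the encoder and similarity threshold below are assumptions rather than the paper's settings.

```python
# Hedged sketch: score sentence importance via PageRank on an
# embedding-similarity graph, then surface the top scores in the prompt.
import networkx as nx
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def sentence_centrality(sentences, sim_threshold=0.5):
    vecs = encoder.encode(sentences, normalize_embeddings=True)
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = float(vecs[i] @ vecs[j])
            if sim >= sim_threshold:
                g.add_edge(i, j, weight=sim)
    return nx.pagerank(g, weight="weight")  # sentence index -> importance
```

CGM could then mask the lowest-scoring sentences before prompting, shrinking the input.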

Result: Experiments on ArXiv, PubMed, and Multi-News show consistent improvements in summary quality and factual consistency over baselines. On ArXiv, FactCC increased by 19.2 and SummaC by 8.0 percentage points.

Conclusion: Structure-aware prompting with graph-based information is a promising direction for advancing zero-shot extractive summarization with LLMs, though combining multiple strategies doesn’t yield clear performance gains.

Abstract: Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. In particular, on ArXiv, it increases FactCC and SummaC by 19.2 and 8.0 percentage points, demonstrating stronger alignment between summaries and source content. The ablation study shows that combining multiple strategies does not yield clear additional gains; even so, structure-aware prompting with graph-based information represents a promising and underexplored direction for advancing zero-shot extractive summarization with LLMs. Our source code is publicly available.

[58] Identifying Reliable Evaluation Metrics for Scientific Text Revision

Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Main category: cs.CL

TL;DR: LLMs can evaluate instruction-following in text revision but struggle with correctness; hybrid approach combining LLM-as-a-judge with domain-specific metrics works best.

DetailsMotivation: Traditional metrics like ROUGE and BERTScore focus on similarity rather than meaningful improvements in scientific text revision, creating a need for better evaluation methods aligned with human judgment.

Method: Conducted manual annotation study to assess revision quality, investigated reference-free evaluation metrics from related NLP domains, and examined LLM-as-a-judge approaches with/without gold references.

Result: LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. Hybrid approach combining both offers most reliable assessment.

Conclusion: A hybrid evaluation framework combining LLM-as-a-judge approaches with task-specific metrics provides the most reliable assessment of scientific text revision quality.

Abstract: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

[59] PankRAG: Enhancing Graph Retrieval via Globally Aware Query Resolution and Dependency-Aware Reranking Mechanism

Ningyuan Li, Junrui Liu, Yi Shan, Minghui Huang, Ziren Gong, Tong Li

Main category: cs.CL

TL;DR: PankRAG is a graph-based RAG framework that captures latent relationships in complex queries through hierarchical resolution pathways and dependency-aware reranking, outperforming existing methods.

DetailsMotivation: Existing graph-based RAG approaches rely solely on entity extraction, which often misinterprets or omits latent critical information and relationships, leading to irrelevant/contradictory content retrieval, exclusion of essential information, increased hallucination risks, and poor response quality.

Method: PankRAG uses a globally-aware hierarchical resolution pathway to capture parallel and progress relationships, guiding LLMs through hierarchical reasoning. It also employs a dependency-aware reranking mechanism that uses resolved sub-question dependencies to augment and validate retrieved content for current unresolved sub-questions.

Result: Experimental results show PankRAG consistently outperforms existing state-of-the-art methods, demonstrating its generalizability.

Conclusion: PankRAG effectively addresses limitations of entity-only extraction in graph-based RAG by capturing latent relationships through hierarchical reasoning and dependency-aware validation, improving retrieval quality and reducing hallucination risks.

Abstract: Recent graph-based RAG approaches leverage knowledge graphs by extracting entities from a query to fetch their associated relationships and metadata. However, relying solely on entity extraction often results in the misinterpretation or omission of latent critical information and relationships. This can lead to the retrieval of irrelevant or contradictory content, as well as the exclusion of essential information, thereby increasing hallucination risks and undermining the quality of generated responses. In this paper, we propose PankRAG, a framework designed to capture and resolve the latent relationships within complex queries that prior methods overlook. It achieves this through a synergistic combination of a globally-aware hierarchical resolution pathway and a dependency-aware reranking mechanism. PankRAG first generates a globally aware resolution pathway that captures parallel and progress relationships, guiding LLMs to resolve queries through a hierarchical reasoning path. Additionally, its dependency-aware reranking mechanism utilizes resolved sub-question dependencies to augment and validate the retrieved content of the current unresolved sub-question. Experimental results demonstrate that PankRAG consistently outperforms existing state-of-the-art methods, underscoring its generalizability.

[60] Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding

Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee

Main category: cs.CL

TL;DR: Introduces Thunder-NUBench, a novel benchmark specifically designed to evaluate LLMs’ sentence-level understanding of negation through diverse structural alternatives and multiple-choice testing.

DetailsMotivation: Negation poses ongoing challenges for LLMs in tasks requiring deep semantic understanding, but current benchmarks treat it as a minor detail within broader tasks, lacking specialized evaluation tools.

Method: Created Thunder-NUBench with manually curated sentence-negation pairs and multiple-choice datasets that contrast standard negation with structurally diverse alternatives like local negation, contradiction, and paraphrase.

Result: Developed a comprehensive benchmark that goes beyond surface-level cue identification to assess nuanced understanding of negation in LLMs.

Conclusion: Thunder-NUBench provides a specialized tool to evaluate LLMs’ comprehension of negation, addressing a gap in current benchmarking approaches for this fundamental linguistic phenomenon.

Abstract: Negation is a fundamental linguistic phenomenon that poses ongoing challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks often treat negation as a minor detail within broader tasks, such as natural language inference. Consequently, there is a lack of benchmarks specifically designed to evaluate comprehension of negation. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond merely identifying surface-level cues by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. This benchmark includes manually curated sentence-negation pairs and a multiple-choice dataset, allowing for a comprehensive evaluation of models’ understanding of negation.

[61] Large Language Models Encode Semantics and Alignment in Linearly Separable Representations

Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi

Main category: cs.CL

TL;DR: LLMs organize semantic information in low-dimensional linear subspaces that become more separable in deeper layers, enabling geometry-based safety interventions.

DetailsMotivation: Understanding LLM latent space geometry is crucial for interpreting behavior and improving alignment, but it's unclear how linearly they organize semantic representations.

Method: Large-scale empirical study of hidden representations across 11 autoregressive models and 6 scientific topics, analyzing separability in different layers and under various prompts.

Result: High-level semantic information consistently resides in low-dimensional linearly separable subspaces, with separability increasing in deeper layers and under structured reasoning prompts.

Conclusion: Geometry-aware tools operating in latent space can detect and mitigate harmful content, demonstrated by an MLP probe that improves refusal rates on malicious queries bypassing existing safety measures.
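
The guardrail itself is lightweight. Below is a minimal sketch of such an MLP probe on final-layer hidden states, with dimensions that are illustrative assumptions.

```python
# Hypothetical probe: classify final-layer hidden states as benign vs.
# malicious and refuse generation on a positive hit.
import torch.nn as nn

probe = nn.Sequential(
    nn.Linear(4096, 256),  # 4096 = assumed hidden size of the base LLM
    nn.ReLU(),
    nn.Linear(256, 2),     # benign vs. malicious
)
# Train with nn.CrossEntropyLoss() on (final_hidden_state, label) pairs;
# at inference, block the response when probe(h).argmax(-1) == 1.
```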

Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior, even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model’s built-in safety alignment and external token-level filters.

[62] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

Sukannya Purkayastha, Nils Dycke, Anne Lauscher, Iryna Gurevych

Main category: cs.CL

TL;DR: This paper proposes using dialogue agents to assist meta-reviewers in the peer-review process, addressing data scarcity through LLM-generated synthetic data and demonstrating improved performance over off-the-shelf LLMs.

DetailsMotivation: Meta-reviewing is a critical decision-making process in peer review that requires weighing reviewer arguments and contextual understanding, but current approaches treat it as simple summarization. The authors argue that dialogue agents could better assist meta-reviewers in this complex decision-making task.

Method: The authors address data scarcity by generating synthetic dialogue data using LLMs with a self-refinement strategy to improve domain relevance. They then use this data to train specialized dialogue agents for meta-reviewing assistance.

Result: The synthetic data generation method produces higher-quality training data. The trained dialogue agents outperform off-the-shelf LLM-based assistants for meta-reviewing tasks and demonstrate effectiveness in real-world meta-reviewing scenarios by enhancing efficiency.

Conclusion: Dialogue agents trained on LLM-generated synthetic data can effectively assist meta-reviewers, outperforming general-purpose LLMs and improving the efficiency of the meta-reviewing process in peer review.

Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code available at: https://github.com/UKPLab/eacl2026-meta-review-as-dialog

[63] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs

Franziska Weeber, Tanise Ceron, Sebastian Padó

Main category: cs.CL

TL;DR: MLLMs show minimal cross-lingual political opinion differences across Western languages, and English-only alignment uniformly shifts opinions across all languages.

DetailsMotivation: To investigate whether cross-cultural political opinion differences observed in human surveys translate to cross-lingual differences in multilingual large language models, and to understand how political opinions transfer between languages in MLLMs.

Method: Analyzed MLLMs of various sizes across five Western languages by prompting them to report agreement/disagreement with political statements from voting advice applications. Evaluated models both before and after aligning them with left/right views using direct preference optimization with English-only alignment data.

Result: Unaligned models show very few significant cross-lingual differences in political opinions. Political alignment using English data shifts opinions almost uniformly across all five languages, indicating opinion transfer rather than language-specific opinions.

Conclusion: In Western language contexts, political opinions transfer between languages in MLLMs, highlighting challenges in achieving explicit socio-linguistic, cultural, and political alignment of multilingual models.

Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.

[64] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Yujing Zhang, Xiao Huang

Main category: cs.CL

TL;DR: LoSemB is a logic-guided framework for inductive tool retrieval that handles unseen tools by mining logical information from prior experience, addressing distribution shifts and similarity-based retrieval vulnerabilities.

DetailsMotivation: Current tool retrieval methods for LLMs work under transductive settings where all tools are seen during training, but real-world tool repositories constantly evolve with new tools. Existing methods struggle with unseen tools due to large distribution shifts and vulnerability of similarity-based retrieval.

Method: LoSemB uses a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce similarity-based retrieval vulnerabilities. It mines and transfers latent logical information from prior experience without costly retraining.

Result: Extensive experiments show LoSemB achieves advanced performance in inductive settings (handling unseen tools) while maintaining desirable effectiveness in transductive settings.

Conclusion: The proposed LoSemB framework effectively addresses the challenges of inductive tool retrieval by leveraging logical information from prior experience, providing a practical solution for evolving real-world tool repositories.

Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate this issue, researchers have explored incorporating a tool retrieval module to select the most relevant tools, or representing tools as unique tokens within LLM parameters. However, most state-of-the-art methods operate under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality, as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, i.e., tools not encountered during the training phase, these methods are limited by two key issues: the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by the human cognitive process of mastering unseen tools by discovering and applying logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

[65] A2H-MAS: An Algorithm-to-HLS Multi-Agent System for Automated and Reliable FPGA Implementation

Jie Lei, Ruofan Jia, J. Andrew Zhang, Hao Zhang

Main category: cs.CL

TL;DR: A2H-MAS is a multi-agent system that automates MATLAB-to-FPGA translation using LLMs, addressing hallucinations and domain expertise gaps through modular decomposition and algorithm-hardware co-design.

DetailsMotivation: There's a persistent gap between algorithm development in MATLAB and efficient FPGA implementation via HLS, requiring expert tuning and lengthy iterations. Existing LLM-based approaches suffer from hallucinations, forgetting, limited domain expertise, and overlook key performance metrics.

Method: A2H-MAS uses a modular hierarchical multi-agent system with specialized agents having clearly defined responsibilities. It employs dataflow-oriented modular decomposition and algorithm-hardware co-design, recognizing that algorithm choice impacts hardware efficiency more than pragma-level optimization. Uses standardized interfaces and execution-based validation.

Result: Experiments on representative wireless communication algorithms show A2H-MAS consistently produces functionally correct, resource-efficient, and latency-optimized HLS designs, demonstrating effectiveness and robustness for complex hardware development workflows.

Conclusion: A2H-MAS effectively bridges the MATLAB-to-FPGA implementation gap by addressing LLM limitations through a structured multi-agent approach with algorithm-hardware co-design, producing optimized hardware designs for latency- and resource-constrained domains.

Abstract: Bridging the gap between algorithm development and hardware realization remains a persistent challenge, particularly in latency- and resource-constrained domains such as wireless communication. While MATLAB provides a mature environment for algorithm prototyping, translating these models into efficient FPGA implementations via High-Level Synthesis (HLS) often requires expert tuning and lengthy iterations. Recent advances in large language models (LLMs) offer new opportunities for automating this process. However, existing approaches suffer from hallucinations, forgetting, limited domain expertise, and often overlook key performance metrics. To address these limitations, we present A2H-MAS, a modular and hierarchical multi-agent system. At the system level, A2H-MAS assigns clearly defined responsibilities to specialized agents and uses standardized interfaces and execution-based validation to ensure correctness and reproducibility. At the algorithmic level, it employs dataflow-oriented modular decomposition and algorithm-hardware co-design, recognizing that the choice of algorithm often has a larger impact on hardware efficiency than pragma-level optimization. Experiments on representative wireless communication algorithms show that A2H-MAS consistently produces functionally correct, resource-efficient, and latency-optimized HLS designs, demonstrating its effectiveness and robustness for complex hardware development workflows.

[66] From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

Main category: cs.CL

TL;DR: Proposes an end-to-end fingerprinting framework for LLMs with rule-based code-mixing fingerprints and multi-candidate editing to address imperceptibility and robustness challenges.

DetailsMotivation: Need for reliable fingerprinting mechanisms to control unauthorized redistribution of LLMs, addressing challenges of imperceptibility (resistance to statistical identification, avoiding accidental activation) and preserving utility/detectability after model modifications.

Method: Two-component framework: 1) Rule-based code-mixing fingerprint (CF) that maps natural-query-like prompts to multi-candidate targets using high-complexity code-mixing formulations; 2) Multi-Candidate Editing (MCEdit) that jointly optimizes multi-candidate targets and enforces margins between target and non-target outputs.
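
To make the editing objective concrete, here is one way the multi-candidate margin constraint could be written, based only on the summary above; this is our reconstruction, not the authors' released loss:

```python
# Sketch of a multi-candidate margin objective in the spirit of MCEdit:
# push every candidate target's logit above all non-target logits by a margin.
import torch

def multi_candidate_margin_loss(logits: torch.Tensor, target_ids, margin: float = 1.0):
    # logits: (vocab,); target_ids: list of candidate fingerprint token ids
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[target_ids] = True
    best_non_target = logits.masked_fill(mask, float("-inf")).max()
    # Hinge penalty for each candidate that fails to clear the margin.
    gaps = margin - (logits[target_ids] - best_non_target)
    return torch.clamp(gaps, min=0).mean()

loss = multi_candidate_margin_loss(torch.randn(32000), target_ids=[11, 42, 7])
```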

Result: Extensive experiments demonstrate the framework provides a robust and practical solution for fingerprinting LLMs.

Conclusion: The proposed end-to-end fingerprinting framework effectively addresses key challenges in LLM fingerprinting, offering imperceptible yet robust detection capabilities even after model modifications.

Abstract: Establishing reliable and verifiable fingerprinting mechanisms is fundamental to controlling the unauthorized redistribution of large language models (LLMs). However, existing approaches face two major challenges: (a) ensuring imperceptibility, including resistance to statistical identification and avoidance of accidental activation during fingerprint construction, and (b) preserving both model utility and fingerprint detectability under subsequent model modifications. To address these challenges, we propose an end-to-end fingerprinting framework with two components. First, we design a rule-based code-mixing fingerprint (CF) that maps natural-query-like prompts to multi-candidate targets, reducing accidental triggering via high-complexity code-mixing formulations. Second, we introduce Multi-Candidate Editing (MCEdit), which jointly optimizes multi-candidate targets and enforces margins between target and non-target outputs to improve post-modification detectability. Extensive experiments demonstrate that our framework provides a robust and practical solution for fingerprinting LLMs.

[67] TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action

Chenyue Zhou, Gürkan Solmaz, Flavio Cirillo, Kiril Gashteovski, Jonathan Fürst

Main category: cs.CL

TL;DR: TextMineX: First dataset, evaluation framework, and ontology-guided LLM pipeline for extracting structured knowledge from humanitarian mine action reports as (subject, relation, object)-triples.

DetailsMotivation: Humanitarian Mine Action (HMA) agencies produce valuable operational knowledge in unstructured reports, limiting information transfer between agencies. There's a need to structure this knowledge for better accessibility and sharing.

Method: Proposed TextMineX: ontology-guided LLM pipeline for extracting (subject, relation, object)-triples from HMA reports. Used real-world dataset from Cambodian Mine Action Centre (CMAC). Introduced bias-aware evaluation framework combining human-annotated triples with LLM-as-Judge protocol to mitigate position bias.
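
A schematic of ontology-guided extraction: constrain the prompt to a fixed relation inventory and reject triples outside it. The relation names and the `llm` callable are hypothetical, not TextMineX's actual ontology or pipeline:

```python
# Illustrative ontology-guided triple extraction and parsing.
from typing import Callable, List, Tuple

RELATIONS = ["located_in", "cleared_by", "contains_mine_type"]  # hypothetical

def extract_triples(report: str, llm: Callable[[str], str]) -> List[Tuple[str, ...]]:
    prompt = (
        "Extract (subject, relation, object) triples from the report below. "
        f"Use only these relations: {', '.join(RELATIONS)}. "
        "Output one triple per line as: subject | relation | object.\n\n" + report
    )
    triples = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[1] in RELATIONS:  # ontology check
            triples.append(tuple(parts))
    return triples
```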

Result: Ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models.

Conclusion: TextMineX successfully structures HMA knowledge from unstructured reports, enabling better information sharing between agencies. The approach demonstrates significant improvements in extraction quality and reduces common LLM issues like hallucinations.

Abstract: Humanitarian Mine Action (HMA) addresses the challenge of detecting and removing landmines from conflict regions. Much of the life-saving operational knowledge produced by HMA agencies is buried in unstructured reports, limiting the transferability of information between agencies. To address this issue, we propose TextMineX: the first dataset, evaluation framework and ontology-guided large language model (LLM) pipeline for knowledge extraction from text in the HMA domain. TextMineX structures HMA reports into (subject, relation, object)-triples, thus creating domain-specific knowledge. To ensure real-world relevance, we utilized the dataset from our collaborator Cambodian Mine Action Centre (CMAC). We further introduce a bias-aware evaluation framework that combines human-annotated triples with an LLM-as-Judge protocol to mitigate position bias in reference-free scoring. Our experiments show that ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models. We publicly release the dataset and code.

[68] Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Zhuowan Li, Spurthi Amba Hombaiah, Weize Kong, Tao Chen, Hamed Zamani, Michael Bendersky

Main category: cs.CL

TL;DR: PoT (Pathways of Thoughts) is an inference-stage method for personalized question answering that models thinking as an iterative decision process, exploring multiple reasoning trajectories and aggregating them based on inferred user preferences.

DetailsMotivation: Personalized question answering remains underexplored due to challenges in inferring preferences from long, noisy, implicit contexts and generating responses that are both accurate and aligned with user expectations.

Method: PoT models thinking as an iterative decision process where the LLM dynamically selects among cognitive operations (reasoning, revision, personalization, clarification), explores multiple reasoning trajectories to produce diverse candidate responses, then aggregates and reweights them according to inferred user preferences.
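
Schematically, the exploration stage reduces to sampling several trajectories over cognitive operations. In the sketch below `random.choice` is a coarse stand-in for the model's learned operation selection, and `llm` is a placeholder callable:

```python
# Rough PoT-style multi-path exploration; operation names follow the summary.
import random
from typing import Callable, List

OPS = ["reason", "revise", "personalize", "clarify"]

def pot_candidates(question: str, context: str, llm: Callable[[str], str],
                   n_paths: int = 4, steps: int = 3) -> List[str]:
    candidates = []
    for _ in range(n_paths):
        draft = ""
        for _ in range(steps):
            op = random.choice(OPS)  # stand-in for the learned policy
            draft = llm(f"[{op}] Question: {question}\nUser context: {context}\n"
                        f"Current draft: {draft}")
        candidates.append(draft)
    return candidates  # then aggregate/reweight by inferred user preferences
```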

Result: Experiments on LaMP-QA benchmark show PoT consistently outperforms competitive baselines with up to 10.8% relative improvement. Human evaluation shows annotators prefer PoT in 66% of cases compared to best-performing baseline, with ties in 15% of cases.

Conclusion: PoT provides an effective inference-stage method for personalized QA that works with any LLM without task-specific fine-tuning, enabling exploration of diverse reasoning paths and aggregation based on user preferences for improved personalized responses.

Abstract: Personalization is well studied in search and recommendation, but personalized question answering remains underexplored due to challenges in inferring preferences from long, noisy, implicit contexts and generating responses that are both accurate and aligned with user expectations. To address this, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without task-specific fine-tuning. PoT models thinking as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark show that PoT consistently outperforms competitive baselines, achieving up to a 10.8% relative improvement. Human evaluation further validates these improvements, with annotators preferring PoT in 66% of cases compared to the best-performing baseline and reporting ties in 15% of cases.

[69] Context Parametrization with Compositional Adapters

Josip Jukić, Martin Tutek, Jan Šnajder

Main category: cs.CL

TL;DR: CompAs is a meta-learning framework that translates context into compositional adapter parameters, enabling algebraic merging of multiple information chunks without reprocessing long prompts, offering lower inference cost and better handling of long contexts.

DetailsMotivation: Current approaches like in-context learning (ICL) and supervised fine-tuning (SFT) have limitations: ICL is inefficient with many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Prior work on generating adapters from context overlooked the need to integrate multiple information chunks.

Method: CompAs uses meta-learning to translate context into adapter parameters with a compositional structure. These adapters can be merged algebraically, allowing instructions, demonstrations, or retrieved passages to be combined without reprocessing long prompts. The approach also includes reversible encoding with a decoder for safety and security.
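
As an illustration of what "merged algebraically" can mean in the simplest case, the sketch below averages per-parameter tensors across context-generated adapters; CompAs itself learns a compositional structure rather than the plain weighted average shown here:

```python
# Minimal sketch of algebraic adapter merging over LoRA-style parameter dicts.
import torch

def merge_adapters(adapters, weights=None):
    """Weighted average of per-parameter tensors across adapters."""
    weights = weights or [1.0 / len(adapters)] * len(adapters)
    merged = {}
    for name in adapters[0]:
        merged[name] = sum(w * a[name] for w, a in zip(weights, adapters))
    return merged

# Two adapters generated from two different context chunks, combined once,
# so neither chunk needs to be re-processed as a long prompt.
a1 = {"lora_A": torch.randn(8, 768), "lora_B": torch.randn(768, 8)}
a2 = {"lora_A": torch.randn(8, 768), "lora_B": torch.randn(768, 8)}
combined = merge_adapters([a1, a2])
```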

Result: Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. It provides lower inference cost, robustness to long-context instability, and principled handling of inputs exceeding context windows.

Conclusion: Composable adapter generation through CompAs establishes a practical and efficient alternative for scaling LLM deployment, offering benefits in cost, flexibility, and safety while addressing limitations of existing approaches.

Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, both of these approaches face key limitations: ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure. Adapters generated this way can be merged algebraically, enabling instructions, demonstrations, or retrieved passages to be seamlessly combined without reprocessing long prompts. Critically, this approach yields three benefits: lower inference cost, robustness to long-context instability, and a principled solution when input exceeds the model’s context window. Furthermore, CompAs encodes information into adapter parameters in a reversible manner, enabling recovery of input context through a decoder, facilitating safety and security. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.

[70] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Manuel Frank, Haithem Afli

Main category: cs.CL

TL;DR: PTEB introduces a dynamic evaluation protocol using LLM-generated paraphrases at test time to assess sentence embedding robustness beyond static benchmarks like MTEB.

DetailsMotivation: Static benchmarks like MTEB can lead to score inflation from repeated tuning and obscure real-world robustness; there's a need for dynamic evaluation that tests sensitivity to semantic-preserving variations.

Method: PTEB uses LLMs to stochastically generate meaning-preserving paraphrases at evaluation time, with cost-efficient LLM-based methods grounded in gold ratings and human validation, aggregating results across multiple runs.
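
The protocol reduces to a short loop once a paraphraser and a task scorer exist; both callables below are placeholders for an LLM paraphraser and an MTEB-style task evaluator:

```python
# Sketch of PTEB-style stochastic evaluation: re-score a task on
# meaning-preserving paraphrases and aggregate across runs.
import statistics
from typing import Callable, List, Tuple

def pteb_score(texts: List[str],
               paraphrase: Callable[[str], str],
               score_task: Callable[[List[str]], float],
               runs: int = 5) -> Tuple[float, float]:
    scores = [score_task([paraphrase(t) for t in texts]) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```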

Result: Sentence encoder performance is sensitive to token space changes even with fixed semantics; smaller models are not disproportionately affected; results are statistically robust across 20 datasets and 25 languages.

Conclusion: Proposes a new NLP evaluation paradigm shifting from static benchmarks toward dynamic, stochastic evaluation leveraging evaluation-time compute to better assess real-world robustness.

Abstract: Current sentence embedding evaluations typically rely on static test beds like the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in gold ratings and human validation, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs spanning 20 datasets and 25 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks but shifts towards dynamic, stochastic evaluation leveraging eval-time compute.

[71] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph

Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu

Main category: cs.CL

TL;DR: MSGCOT is a multi-scale graph prompt-tuning framework that captures hierarchical structural information using a low-rank coarsening network and progressive coarse-to-fine prompting chains, outperforming single-granularity methods especially in few-shot scenarios.

DetailsMotivation: Current graph prompt-tuning methods use single-granularity (node or subgraph level) prompts, which overlook the multi-scale structural information inherent in graph data and limit prompt semantic diversity.

Method: Proposes MSGCOT with: 1) Lightweight low-rank coarsening network to capture multi-scale structural features as hierarchical basis vectors for prompt generation; 2) Progressive coarse-to-fine prompt chains that dynamically integrate multi-scale information at each reasoning step, mimicking human cognition from coarse to fine granularity.

Result: Extensive experiments on eight benchmark datasets show MSGCOT outperforms state-of-the-art single-granularity graph prompt-tuning methods, particularly in few-shot scenarios, demonstrating superior performance.

Conclusion: The integration of multi-scale information into graph prompting through MSGCOT effectively addresses limitations of single-granularity approaches and enhances prompt semantic diversity, showing promising results especially in data-scarce settings.

Abstract: The “pre-train, prompt” paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to single-granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompt and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios, showcasing superior performance. The code is available at: https://github.com/zhengziyu77/MSGCOT.

[72] Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion

Zilong Wang, Qingtian Zeng, Hua Duan, Cheng Cheng, Minghao Zou, Ziyang Wang

Main category: cs.CL

TL;DR: CR-FKGC: A novel few-shot KG completion framework using conjugate relation modeling with neighborhood aggregation, conditional diffusion, and manifold decoding to handle complex relational patterns and data sparsity.

DetailsMotivation: Existing few-shot KG completion methods struggle with capturing complex relational patterns and mitigating data sparsity issues in knowledge graphs with long-tail distributions.

Method: Three-component framework: 1) Neighborhood aggregation encoder for higher-order neighbor info, 2) Conjugate relation learner with implicit conditional diffusion module and stable relation module, 3) Manifold conjugate decoder for evaluation and inference in manifold space.

Result: Superior performance over state-of-the-art methods demonstrated on three benchmark datasets.

Conclusion: CR-FKGC effectively addresses complex relational pattern capture and data sparsity in few-shot KG completion through conjugate relation modeling and manifold space inference.

Abstract: Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsity. To address these challenges, we propose a novel FKGC framework for conjugate relation modeling (CR-FKGC). Specifically, it employs a neighborhood aggregation encoder to integrate higher-order neighbor information, a conjugate relation learner combining an implicit conditional diffusion relation module with a stable relation module to capture stable semantics and uncertainty offsets, and a manifold conjugate decoder for efficient evaluation and inference of missing triples in manifold space. Experiments on three benchmarks demonstrate that our method achieves superior performance over state-of-the-art methods.

[73] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Raphaël Bagat, Irina Illina, Emmanuel Vincent

Main category: cs.CL

TL;DR: BEARD adapts Whisper’s encoder using unlabeled data via BEST-RQ objective and knowledge distillation, achieving 12% relative improvement on ATC domain with only 2 hours of transcribed data.

DetailsMotivation: ASR systems struggle in low-resource domains like Air Traffic Control where labeled data is scarce, despite large multilingual training. Domain adaptation is needed for specialized domains with non-native speech, noise, and unique phraseology.

Method: BEARD framework combines BEST-RQ self-supervised objective with knowledge distillation from frozen teacher encoder to adapt Whisper’s encoder using unlabeled data, ensuring complementarity with pre-trained decoder. Uses 5,000 hours of untranscribed speech for adaptation and 2 hours of transcribed speech for fine-tuning.

Result: Significantly outperforms previous baseline and fine-tuned models on ATCO2 corpus, achieving 12% relative improvement compared to fine-tuned model. First work to use self-supervised learning for Whisper domain adaptation.

Conclusion: BEARD effectively adapts Whisper to low-resource specialized domains using unlabeled data, demonstrating strong performance improvements with minimal labeled data through innovative combination of self-supervised learning and knowledge distillation.

Abstract: Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder’s complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms both the previous baseline and the fine-tuned model, achieving a relative improvement of 12% over the latter. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.

[74] Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Zilong Li, Jie Cao

Main category: cs.CL

TL;DR: The paper addresses low-resource classical Chinese-Japanese translation by framing ancient annotation systems as sequence tagging tasks, creating an LLM-based pipeline and dataset, and showing auxiliary Chinese NLP tasks improve performance in low-resource settings.

DetailsMotivation: Ancient Chinese-Japanese translation used annotation systems that face low-resource problems in modern NLP research. There's a need to bridge this historical translation method with contemporary language technologies despite limited data availability.

Method: Abstract ancient annotation process as sequence tagging tasks, introduce LLM-based annotation pipeline, construct new dataset from digitized open-source translations, and use auxiliary Chinese NLP tasks to enhance training in low-resource settings.

Result: Auxiliary Chinese NLP tasks improve sequence tagging performance in low-resource settings. LLMs achieve high scores on direct machine translation, but the proposed method can supplement LLMs to improve character annotation quality.

Conclusion: The proposed approach effectively addresses low-resource challenges in ancient Chinese-Japanese translation research, with the method serving as a valuable supplement to LLMs for improving annotation quality in historical translation tasks.

Abstract: Ancient people translated classical Chinese into Japanese using a system of annotations placed around characters. We abstract this process as sequence tagging tasks and fit them into modern language technologies. The research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that in the low-resource setting, introducing auxiliary Chinese NLP tasks enhances the training of sequence tagging tasks. We also evaluate the performance of Large Language Models (LLMs) on this task. While they achieve high scores on direct machine translation, our method could serve as a supplement to LLMs to improve the quality of character annotation.

[75] Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: SeerSC is a dynamic self-consistency framework that uses System 1 reasoning to estimate answer entropy, enabling efficient parallel generation in System 2 to reduce both token consumption and inference latency.

DetailsMotivation: Test-time scaling improves LLM inference performance but incurs high computational costs. Existing dynamic self-consistency methods reduce tokens but suffer from high latency due to sequential requests.

Method: Integrates System 1 (fast) and System 2 (deliberate) reasoning. System 1 computes answer entropy for queries to evaluate scaling potential, enabling dynamic self-consistency in System 2 with parallel generation.
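
A small sketch of the entropy gate, assuming a cheap `fast_answer` sampler (a placeholder for the low-cost System 1 pass) whose disagreement decides how many parallel System 2 samples to spend:

```python
# Answer-entropy gating for dynamic self-consistency.
import math
from collections import Counter
from typing import Callable

def answer_entropy(query: str, fast_answer: Callable[[str], str], k: int = 8) -> float:
    # Entropy of the empirical answer distribution from k cheap samples.
    counts = Counter(fast_answer(query) for _ in range(k))
    return -sum((c / k) * math.log(c / k) for c in counts.values())

def budget(query: str, fast_answer: Callable[[str], str],
           low: int = 1, high: int = 16, tau: float = 0.5) -> int:
    # High entropy -> disagreement -> allocate more parallel System 2 samples;
    # low entropy -> the query is easy, so a single rollout suffices.
    return high if answer_entropy(query, fast_answer) > tau else low
```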

Result: Achieves up to 47% reduction in token consumption and 43% reduction in inference latency without significant performance loss, outperforming existing methods.

Conclusion: SeerSC effectively addresses both token efficiency and latency issues in test-time scaling by leveraging dual-system reasoning for intelligent parallel generation.

Abstract: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the accurate, ahead-of-time estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.

[76] Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents

Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao

Main category: cs.CL

TL;DR: A-CC (Assertion-Conditioned Compliance) is a new evaluation framework for multi-turn tool-calling LLMs that measures vulnerability to misleading assertions from users (USA) and system functions (FSA), revealing critical robustness gaps in deployed AI agents.

DetailsMotivation: Multi-turn tool-calling LLMs are increasingly used in safety-critical domains, but there's a lack of visibility into their conversation-level robustness against misleading information. Current benchmarks like BFCL don't adequately address multi-turn resilience to deceptive assertions from users or system tools.

Method: The authors introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm with holistic metrics that test model behavior when confronted with two types of misleading assertions: User-Sourced Assertions (USAs) that measure sycophancy toward misinformed user beliefs, and Function-Sourced Assertions (FSAs) that measure compliance with contradictory system policies from stale or unmaintained tools.

Result: Models show high vulnerability to both USA sycophancy (complying with plausible but misinformed user beliefs) and FSA policy conflicts (complying with contradictory system policies), confirming A-CC as a critical latent vulnerability in deployed AI agents.

Conclusion: A-CC reveals significant robustness gaps in multi-turn function-calling LLMs, highlighting the need for better evaluation of conversation-level resilience against misleading assertions from both users and system tools in safety-critical applications.

Abstract: Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines remains difficult for many safety-critical industries due to ongoing concerns regarding model resilience. While standardized benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) have underpinned confidence concerning advanced function-calling models (like Salesforce’s xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model’s behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.

[77] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, Zhongyu Wei

Main category: cs.CL

TL;DR: ILVR introduces an interleaved latent visual reasoning framework that combines dynamic state evolution with precise perceptual modeling for multimodal reasoning, outperforming existing approaches.

DetailsMotivation: Current interleaved reasoning paradigms for MLLMs face computational bottlenecks from re-encoding pixel-dense images, while latent visual reasoning alternatives either fail to capture intermediate state evolution or sacrifice precise perceptual modeling through over-compression.

Method: ILVR interleaves textual generation with latent visual representations that serve as evolving cues. It uses a self-supervision strategy where a momentum teacher model selectively distills relevant features from ground-truth intermediate images into sparse supervision targets, enabling adaptive, context-aware visual signal generation.

Result: Extensive experiments on multimodal reasoning benchmarks show that ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

Conclusion: ILVR successfully unifies dynamic state evolution with precise perceptual modeling, providing an effective solution for interleaved multimodal reasoning without the computational burden of re-encoding dense images.

Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: methods either fail to capture intermediate state evolution due to single-step, non-interleaved structures, or sacrifice precise perceptual modeling by over-compressing features. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. Specifically, we employ a self-supervision strategy where a momentum teacher model selectively distills relevant features from ground-truth intermediate images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning. The code is available at https://github.com/XD111ds/ILVR.

[78] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Mohor Banerjee, Nadya Yuki Wangsajaya, Syed Ali Redha Alsagoff, Min Sen Tan, Zachary Choy Kit Chun, Alvin Chan Guo Wei

Main category: cs.CL

TL;DR: Hallucination-reduction methods (CoVe, DoLa, RAG) have opposing effects on LLM creativity: CoVe enhances divergent thinking, DoLa suppresses it, RAG has minimal impact.

DetailsMotivation: While many methods reduce LLM hallucinations, their impact on creative generation remains unexplored, creating a critical gap for AI-assisted scientific discovery which requires both factual accuracy and creative hypothesis generation.

Method: Investigates three hallucination-reduction techniques (Chain of Verification, Decoding by Contrasting Layers, Retrieval-Augmented Generation) across multiple LLM families (LLaMA, Qwen, Mistral) at varying scales (1B-70B parameters) using two creativity benchmarks (NeoCoder and CS4).

Result: Hallucination-reduction methods have opposing effects on divergent creativity: CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact.

Conclusion: Provides guidance for selecting appropriate hallucination-reduction methods in scientific applications where balancing factual accuracy and creative exploration is crucial.

Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

[79] Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Abhay Srivastava, Sam Jung, Spencer Mateega

Main category: cs.CL

TL;DR: MARKET-BENCH is a benchmark that tests LLMs on quantitative trading tasks by having them generate executable backtesters from natural language descriptions, evaluating both code reliability and numerical accuracy.

DetailsMotivation: To assess whether current large language models can effectively handle introductory quantitative trading tasks, particularly constructing executable backtesters from natural language strategy descriptions and market assumptions.

Method: The benchmark evaluates LLMs on three canonical strategies: scheduled trading on MSFT, pairs trading on KO/PEP, and delta hedging on MSFT. Models must produce code that matches reference implementations in P&L, drawdown, and position paths. Evaluation uses multi-round testing separating structural reliability (whether code runs) from numerical accuracy (MAE of metrics).

Result: Most models reliably execute the simplest strategy (average 4.08/5 passes), but error rates vary significantly across models and tasks. Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies. GPT-5.2 achieves perfect executability with strong overall performance. GPT-5.1 Codex-Max achieves lowest error on easiest task. Qwen3 Max has perfect executability but sometimes inaccurate P&L paths.

Conclusion: Current LLMs can scaffold basic trading infrastructure but still struggle with robust reasoning about prices, inventory, and risk. The benchmark and public leaderboard are released at https://marketbench.ai.

Abstract: We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies: scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NASDAQ: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT. Models must produce code whose profit and loss (P&L), drawdown, and position paths match a verifiable reference implementation. We assess thirteen state-of-the-art models using a multi-round evaluation that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error of the backtest metrics), assigning failed outputs a duplicated-metrics baseline MAE. While most models reliably execute the simplest strategy (average executable passes of 4.08 out of 5 rounds), errors vary by orders of magnitude across models and tasks. Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies. GPT-5.2 achieves strong overall performance with perfect executability. GPT-5.1 Codex-Max achieves the lowest best-run error on the easiest task. Qwen3 Max attains perfect executability yet sometimes produces inaccurate profit and loss paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk. We release MARKET-BENCH and a public leaderboard at https://marketbench.ai.

[80] SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You, Yifan Zhou, Ruidong Zhang, Lin Zhao, Yohannes Abate, Tianming Liu

Main category: cs.CL

TL;DR: Synapse introduces a unified memory architecture for LLM agents that uses dynamic graph-based activation instead of static vector similarity, solving the “Contextual Tunneling” problem in long-term memory.

DetailsMotivation: Standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory in LLMs, creating a gap in handling complex temporal and multi-hop reasoning tasks.

Method: Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links, integrating lateral inhibition and temporal decay. It implements Triple Hybrid Retrieval that fuses geometric embeddings with activation-based graph traversal.
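
A toy version of spreading activation with temporal decay and lateral inhibition over such a memory graph, as we read the description; the real system combines this with geometric embeddings in a hybrid retriever:

```python
# Toy spreading-activation pass over a memory graph.
def spread(graph, seeds, hops=2, decay=0.6, inhibition=0.05):
    # graph: {node: [(neighbor, edge_weight), ...]}; seeds: {node: activation}
    act = dict(seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        nxt = {}
        for node, a in frontier.items():
            for nbr, w in graph.get(node, []):
                # Activation flows outward, attenuated by decay at each hop.
                nxt[nbr] = nxt.get(nbr, 0.0) + a * w * decay
        for nbr, a in nxt.items():
            act[nbr] = act.get(nbr, 0.0) + a
        frontier = nxt
    # Lateral inhibition: suppress weakly activated nodes as interference.
    return {n: a for n, a in act.items() if a >= inhibition}

g = {"meeting": [("alice", 0.9)], "alice": [("project-x", 0.8)]}
print(spread(g, {"meeting": 1.0}))  # multi-hop recall from a single cue
```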

Result: Comprehensive evaluations on the LoCoMo benchmark show Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks.

Conclusion: Synapse offers a robust solution to the “Contextual Tunneling” problem and represents a unified memory architecture that transcends static vector similarity for LLM agents.

Abstract: While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the “Contextual Tunneling” problem. Our code and data will be made publicly available upon acceptance.

[81] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

Oshri Naparstek

Main category: cs.CL

TL;DR: Token Maturation is a continuous autoregressive framework where tokens evolve as vector trajectories before discretization, preventing premature commitment and mitigating degeneration without heuristic sampling strategies.

DetailsMotivation: Standard autoregressive models collapse uncertainty by immediately sampling discrete tokens, causing failure modes like repetition loops and reliance on heuristic sampling strategies.

Method: Introduces Token Maturation where tokens evolve as vector-valued trajectories in embedding space through deterministic dynamical processes, deferring discrete commitment until representations geometrically stabilize.
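
One loose way to read this mechanism is the sketch below, with an invented contraction dynamic standing in for the paper's learned update rule; only the structure (evolve the vector, commit on geometric stabilization) follows the summary:

```python
# Conceptual token-maturation loop: a token's vector evolves deterministically
# and is discretized only once its trajectory stops moving.
import numpy as np

def mature_token(v0, step_fn, vocab_emb, eps=1e-3, max_steps=50):
    """step_fn evolves the vector; commit when the trajectory stabilizes."""
    v = v0
    for _ in range(max_steps):
        v_next = step_fn(v)
        if np.linalg.norm(v_next - v) < eps:  # geometric stabilization
            break
        v = v_next
    return int(np.argmax(vocab_emb @ v_next))  # nearest-token commitment

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 16))  # toy vocabulary embedding table
# A toy dynamic that contracts toward E[3]; the real update is learned.
tok = mature_token(rng.normal(size=16), lambda v: 0.9 * v + 0.1 * E[3], E)
```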

Result: The framework mitigates degeneration intrinsically, generating coherent and diverse text under fully deterministic decoding without repetition penalties, temperature scaling, or stochastic sampling. Shows token representations stabilize spatially while predictive entropy remains high.

Conclusion: Continuous token dynamics with delayed commitment offers an alternative formulation of autoregressive generation that exposes structural regularities obscured by immediate discretization.

Abstract: Standard autoregressive language models collapse uncertainty at every generation step by committing to discrete tokens through immediate sampling. This premature discretization underlies well-known failure modes, including degenerate repetition loops in greedy decoding and a heavy reliance on heuristic sampling strategies. We introduce Token Maturation, a continuous autoregressive framework in which tokens evolve as vector-valued trajectories prior to discretization. Rather than sampling from a categorical distribution at each step, the model resolves uncertainty through a deterministic dynamical process in embedding space, deferring discrete commitment until the representation has geometrically stabilized. We show that this formulation mitigates degeneration intrinsically: Token Maturation generates coherent and diverse text under fully deterministic decoding (argmax), without repetition penalties, temperature scaling, or stochastic sampling. Moreover, we identify a novel convergence behavior in which token representations stabilize spatially while predictive entropy remains high, challenging the common assumption that commitment requires probability concentration. We propose continuous token dynamics with delayed commitment as an alternative formulation of autoregressive generation that exposes structural regularities obscured by immediate discretization.

[82] Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

Hanyu Li, Jiangshan Duo, Bofei Gao, Hailin Zhang, Sujian Li, Xiaotie Deng, Liang Zhao

Main category: cs.CL

TL;DR: Mastery-gated reinforcement learning compression reduces LLM reasoning length by 20-40% while maintaining or improving accuracy, with cross-domain generalization and bidirectional transfer between CoT and tool-use agents.

DetailsMotivation: Chain-of-thought reasoning in LLMs suffers from "overthinking trap" - longer rollouts increase cost/latency but often don't improve accuracy. Existing static control methods may suppress needed reasoning.

Method: Propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout.
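
The gating rule is simple enough to state in a few lines. The sketch below is our reading of the summary, with an assumed penalty shape (the scale-free length cost and `alpha` are illustrative choices):

```python
# Mastery-gated, sample-level soft length penalty: only penalize a long
# rollout when the model already solves the problem AND some shorter
# correct rollout exists in the same group.
def shaped_reward(rollouts, base_reward=1.0, alpha=0.1):
    # rollouts: list of (is_correct, n_tokens) for one prompt
    correct_lens = [n for ok, n in rollouts if ok]
    mastered = len(correct_lens) > 0
    shortest = min(correct_lens) if mastered else None
    rewards = []
    for ok, n in rollouts:
        r = base_reward if ok else 0.0
        if ok and mastered and n > shortest:        # the mastery gate
            r -= alpha * (n - shortest) / shortest  # soft, scale-free cost
        rewards.append(r)
    return rewards

print(shaped_reward([(True, 120), (True, 480), (False, 300)]))
# -> [1.0, 0.7, 0.0]: the verbose correct rollout is discounted,
#    the incorrect one is never penalized for length.
```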

Result: Cuts response length by 20-40% with comparable or higher accuracy across benchmarks. Generalizes across domains (math-trained model shortens unseen code, instruction following, QA tasks). Shows bidirectional transfer between non-agent CoT and tool-use agents.

Conclusion: Compression is not just cosmetic brevity but an inherent computation policy - learning what to keep and what to forget in reasoning processes.

Abstract: Chain-of-thought reasoning in large language models can trigger an “overthinking trap”: longer rollouts raise cost and latency yet often yield unreliable accuracy gains. Existing methods use global, static controls that may suppress needed reasoning. We propose mastery-gated, sample-level, soft reinforcement learning compression that penalizes long rollouts only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% with comparable or higher accuracy and generalizes across domains: a model trained on math spontaneously shortens unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified rounds by 13%, while compressing a thinking agent cuts SWE trajectories by 67% tokens and 52% rounds and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity, but an inherent computation policy – what to keep, and what to forget.

[83] How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi

Main category: cs.CL

TL;DR: Researchers introduce RMCB, a benchmark for evaluating confidence estimation methods in Large Reasoning Models across high-stakes domains, finding a persistent trade-off between discrimination and calibration with no single method dominating both.

DetailsMotivation: Miscalibration of Large Reasoning Models undermines their reliability in high-stakes domains, creating a need for accurate confidence estimation methods for their long-form, multi-step outputs.

Method: Created RMCB benchmark with 347,496 reasoning traces from six LRMs across diverse high-stakes domains. Evaluated over ten representation-based methods including sequential, graph-based, and text-based architectures.
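
For reference, the ECE numbers cited below are computed along these standard lines; the exact binning scheme varies by implementation:

```python
# Expected Calibration Error: bin predictions by confidence and average
# the |accuracy - confidence| gap, weighted by bin mass.
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (confidences > lo) & (confidences <= hi)
        if m.any():
            total += m.mean() * abs(correct[m].mean() - confidences[m].mean())
    return total

print(ece([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))  # 0.3 on this toy input
```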

Result: Found persistent trade-off: text-based encoders achieve best AUROC (0.672) while structurally-aware models yield best ECE (0.148). Increased architectural complexity doesn’t reliably outperform simpler sequential baselines.

Conclusion: Established comprehensive benchmark for confidence estimation in LRMs, demonstrating limitations of current representation-based paradigms and providing rigorous baselines for future research.

Abstract: The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.

[84] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation

Stergios Chatzikyriakidis, Anastasia Natsina

Main category: cs.CL

TL;DR: LLMs struggle with phonological tasks like rhyme in low-resource languages like Greek. A hybrid system combining LLMs with phonological algorithms achieves accurate rhyme identification and generation, with verification loops dramatically improving performance from 4% to 73% valid poems.

DetailsMotivation: LLMs have remarkable NLP capabilities but struggle with phonologically-grounded phenomena like rhyme detection and generation, especially in lower-resource languages such as Modern Greek. This gap in phonological reasoning needs to be addressed.

Method: Hybrid system combining LLMs with deterministic phonological algorithms. Implements comprehensive taxonomy of Greek rhyme types (Pure, Rich, Imperfect, Mosaic, IDV). Uses agentic generation pipeline with phonological verification. Evaluates multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, RAG-augmented) across various LLMs including Claude, GPT-4o, Gemini, Llama, and Mistral.
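
A minimal sketch of the generate-then-verify loop this pipeline implies, assuming a hypothetical `llm_generate` callable and a deliberately crude placeholder rhyme check (the actual system applies Greek phonological rules over the rhyme taxonomy):

```python
def rhymes(line_a: str, line_b: str) -> bool:
    # Placeholder check: shared three-character ending. The real verifier
    # compares syllabified phonological transcriptions against rhyme types.
    return line_a.strip()[-3:].lower() == line_b.strip()[-3:].lower()

def generate_rhyming_couplet(llm_generate, prompt: str, max_retries: int = 5):
    for _ in range(max_retries):
        lines = [ln for ln in llm_generate(prompt).splitlines() if ln.strip()]
        if len(lines) >= 2 and rhymes(lines[0], lines[1]):
            return lines[:2]                     # phonologically verified couplet
        prompt += "\nThe previous attempt did not rhyme; try again."
    return None                                  # verification failed after retries
```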

Result: Significant “Reasoning Gap” discovered: native-like models (Claude 3.7) perform intuitively (40% accuracy), while reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) with Chain-of-Thought prompting. Pure LLM generation fails catastrophically (<4% valid poems), but hybrid verification loop restores performance to 73.1%. System and cleaned corpus of 40,000+ rhymes released.

Conclusion: Hybrid approach combining LLMs with phonological algorithms is essential for accurate rhyme processing in low-resource languages. Pure LLM approaches fail for phonological tasks, but verification mechanisms can dramatically improve performance. The released system and corpus support future research in phonological NLP.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.

[85] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque

Main category: cs.CL

TL;DR: KIF is a representation-aware framework for true knowledge erasure in LLMs that targets internal activation signatures rather than surface outputs, achieving near-perfect erasure while preserving utility and breaking the stability-erasure tradeoff.

Motivation: Current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. This is problematic for GDPR compliance and model safety, requiring genuine knowledge erasure rather than just output obfuscation.

Method: Knowledge Immunization Framework (KIF) uses a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures. It combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining.
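
One simplified reading of targeting activation signatures is a forward hook that projects a precomputed, subject-specific direction out of a layer's hidden states; `signature_vector` and the projection rule below are assumptions for illustration, not the paper's exact mechanism:

```python
import torch

def make_suppression_hook(direction: torch.Tensor, strength: float = 1.0):
    d = direction / direction.norm()             # unit "activation signature"
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        proj = (h @ d).unsqueeze(-1) * d         # component along the signature
        h = h - strength * proj                  # suppress subject-specific part
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Hypothetical usage: layer.register_forward_hook(make_suppression_hook(signature_vector))
```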

Result: KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff. Standard models show scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence.

Conclusion: KIF provides a systematic approach to true knowledge erasure that distinguishes between surface-level suppression and genuine representation removal. The framework enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales, with implications for GDPR compliance and model safety.

Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.

[86] A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan, Ning Zhang

Main category: cs.CL

TL;DR: This paper introduces a template-based rewriting layer combined with search-based autotuning for GPU kernel optimization, achieving more stable and higher-quality speedups (up to 3x) compared to direct LLM rewriting approaches.

Motivation: GPU code optimization is critical for HPC and AI workloads, but current approaches (compiler optimizations, hand-written kernels, or direct LLM rewriting) either require heavy manual effort or produce unstable results with implicit parameter choices.

Method: The method uses a template-based rewriting layer on top of an agent-driven iterative loop: kernels are refactored into explicitly parameterizable templates, then template parameters are optimized via search-based autotuning with profiling feedback, constrained by hardware resource limits.
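
As a sketch of the search stage, a constrained sweep over template parameters under a hardware limit might look like this, with `compile_and_profile` standing in for the actual build-and-benchmark callback:

```python
import itertools

def tune(compile_and_profile, max_threads_per_block=1024):
    space = {"BLOCK_X": [32, 64, 128], "BLOCK_Y": [1, 2, 4], "UNROLL": [1, 2, 4]}
    best_cfg, best_ms = None, float("inf")
    for values in itertools.product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        if cfg["BLOCK_X"] * cfg["BLOCK_Y"] > max_threads_per_block:
            continue                              # respect hardware resource limits
        ms = compile_and_profile(cfg)             # instantiate template, run, time it
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```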

Result: Experiments on real-world CUDA kernels from SGLang demonstrate speedups exceeding 3x in the best case. The template-plus-search design significantly reduces optimization randomness compared to agent-only direct rewriting, making the process more interpretable and systematic.

Conclusion: The proposed template-based rewriting with search autotuning provides a more stable, interpretable, and systematic approach to GPU kernel optimization that can be extended to other backends (OpenCL, HIP) for automated performance optimization in production workloads.

Abstract: GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

[87] JurisMMA: A Multimodal Framework for Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou

Main category: cs.CL

TL;DR: JurisMMA is a novel Legal Judgment Prediction framework that decomposes trial tasks into stages and uses multimodal data, validated on a new large Chinese judicial dataset JurisMM.

Motivation: Traditional LJP methods struggle with multiple allegations and diverse evidence, and lack adaptability. There's a need for more effective frameworks that can handle complex legal cases and leverage multimodal data.

Method: Introduces JurisMMA framework that decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Also creates JurisMM dataset with over 100,000 recent Chinese judicial records including both text and multimodal video-text data.

Result: Experiments on JurisMM and the LawBench benchmark validate the framework’s effectiveness. The approach shows promise not only for LJP but also for broader legal applications.

Conclusion: JurisMMA offers new perspectives for developing future legal methods and datasets, providing an effective framework for complex legal judgment prediction tasks.

Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations and diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

[88] A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits

Miao Xie, Siguang Chen, Chunli Lv

Main category: cs.CL

TL;DR: This is the first systematic survey exploring bidirectional interactions between large language models (LLMs) and multi-armed bandit (MAB) algorithms, highlighting how each field enhances the other at the component level.

Motivation: LLMs excel at language tasks while MABs provide principled decision-making under uncertainty. The aim is to explore the synergistic potential between these two powerful frameworks and systematically review their bidirectional interactions, which hasn't been done before at the component level.

Method: The survey systematically reviews existing research at the intersection of LLMs and MABs, analyzing both LLM-enhanced bandit systems and bandit-enhanced LLM systems. It examines design, methodologies, and performance while maintaining an accompanying GitHub repository for literature indexing.

Result: The survey identifies bidirectional benefits: MAB algorithms help address LLM challenges in pre-training, retrieval-augmented generation (RAG), and personalization, while LLMs enhance MAB systems by improving core components like arm definition and environment modeling for better sequential decision-making.
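
To make the arm-definition direction concrete, here is a standard UCB1 loop whose arm set comes from an LLM proposal step; `llm_propose_arms` and `pull` are hypothetical callbacks, and the pairing is illustrative rather than any specific surveyed system:

```python
import math

def ucb1(llm_propose_arms, pull, horizon=1000):
    arms = llm_propose_arms()                     # e.g. candidate prompts or tools
    counts, means = [0] * len(arms), [0.0] * len(arms)
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1                             # initialization: play each arm once
        else:
            i = max(range(len(arms)),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arms[i])                         # observed reward in [0, 1]
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]    # incremental mean update
    return arms, means
```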

Conclusion: This first systematic survey demonstrates significant synergistic potential between LLMs and MABs, identifies key challenges and representative findings, and provides a foundation for future research with an accompanying literature repository to guide further exploration in this emerging interdisciplinary field.

Abstract: Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi-armed bandit (MAB) algorithms provide a principled framework for adaptive decision-making under uncertainty. This survey explores the potential at the intersection of these two fields. To the best of our knowledge, it is the first survey to systematically review the bidirectional interaction between large language models and multi-armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre-training to retrieval-augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision-making in sequential tasks. We analyze existing LLM-enhanced bandit systems and bandit-enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at https://github.com/bucky1119/Awesome-LLM-Bandit-Interaction.

[89] OptiSQL: Executable SQL Generation from Optical Tokens

Sifan Li, Hongkai Chen, Yujun Cai, Liyang Chen, Qingwen Ye, Yiwei Wang

Main category: cs.CL

TL;DR: OptiSQL generates executable SQL from table images and natural language questions using compact optical tokens, reducing input tokens by 10x while maintaining accuracy.

Motivation: Traditional text-to-SQL assumes access to fully linearized textual schemas, which incurs substantial token overhead and doesn't align with real-world scenarios where tables appear as visual artifacts in documents or webpages.

Method: OptiSQL uses an OCR-oriented visual encoder to compress table structure and content into compact optical tokens, then fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency.
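
The frozen-encoder, fine-tuned-decoder setup is a common PyTorch pattern; a minimal sketch under that assumption (module names are placeholders):

```python
import torch

def setup_finetuning(optical_encoder, sql_decoder, lr=1e-4):
    for p in optical_encoder.parameters():
        p.requires_grad = False                   # freeze: isolate representation sufficiency
    optical_encoder.eval()                        # fixed optical tokenizer
    trainable = [p for p in sql_decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)    # only the decoder is updated
```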

Result: Experiments on visualized Spider 2.0-Snow show OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses show optical tokens preserve essential structural information under visual perturbations.

Conclusion: Compact optical representations can serve as an efficient interface for executable semantic parsing, enabling vision-driven SQL generation directly from table images with significantly reduced token overhead.

Abstract: Executable SQL generation is typically studied in text-to-SQL settings, where tables are provided as fully linearized textual schemas and contents. While effective, this formulation assumes access to structured text and incurs substantial token overhead, which is misaligned with many real-world scenarios where tables appear as visual artifacts in documents or webpages. We investigate whether compact optical representations can serve as an efficient interface for executable semantic parsing. We present OptiSQL, a vision-driven framework that generates executable SQL directly from table images and natural language questions using compact optical tokens. OptiSQL leverages an OCR-oriented visual encoder to compress table structure and content into a small set of optical tokens and fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency. Experiments on a visualized version of Spider 2.0-Snow show that OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses further demonstrate that optical tokens preserve essential structural information under visual perturbations.

[90] Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns

George Mihaila

Main category: cs.CL

TL;DR: ExpNet is a lightweight neural network that learns to map transformer attention patterns to token importance scores, automatically discovering optimal attention feature combinations instead of using predetermined rules.

Motivation: Transformers are deployed in high-stakes applications where opacity hinders trust and accountability. Existing attention-based methods rely on manual aggregation strategies and fixed rules, while model-agnostic approaches treat models as black boxes and are computationally expensive.

Method: ExpNet (Explanation Network): a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. It automatically discovers optimal attention feature combinations rather than using predetermined rules.
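
A hedged sketch of what such a learned attention-to-importance mapping could look like; the specific features (per-layer mean and peak incoming attention) are an assumption for illustration, not necessarily ExpNet's feature set:

```python
import torch
import torch.nn as nn

class ExpNetSketch(nn.Module):
    """Tiny MLP mapping per-token attention features to importance scores."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):                     # feats: (batch, seq, n_features)
        return self.mlp(feats).squeeze(-1)        # (batch, seq) importance scores

def attention_features(attentions):
    # attentions: per-layer tensors of shape (batch, heads, seq, seq)
    feats = []
    for a in attentions:
        incoming = a.mean(dim=1)                  # average over heads
        feats.append(incoming.mean(dim=1))        # mean attention each token receives
        feats.append(incoming.max(dim=1).values)  # peak attention each token receives
    return torch.stack(feats, dim=-1)             # (batch, seq, 2 * n_layers)
```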

Result: Evaluated in a challenging cross-task setting and benchmarked against a broad spectrum of model-agnostic methods (like LIME, SHAP) and attention-based techniques spanning four methodological families.

Conclusion: ExpNet provides a novel approach to transformer interpretability by learning optimal attention-to-importance mappings automatically, addressing limitations of both rule-based attention methods and computationally expensive model-agnostic approaches.

Abstract: Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformers’ self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015); (Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a); (Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.

cs.CV

[91] SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control

Ho Yin Au, Junkun Jiang, Jie Chen

Main category: cs.CV

TL;DR: SOSControl: A symbolic framework using Salient Orientation Symbolic (SOS) scripts for precise control of body part orientations and motion timing in text-to-motion generation.

Motivation: Traditional text-to-motion frameworks lack precise control over body part orientations and motion timing. Existing approaches using joint keyframe locations only provide positional guidance, making it challenging and unintuitive to specify orientation and timing constraints.

Method: 1) SOS script: Programmable symbolic framework for specifying body part orientations and motion timing at keyframes. 2) Automatic SOS extraction pipeline: Uses temporally-constrained agglomerative clustering for frame saliency detection and Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts from motion data. 3) SOSControl framework: Prioritizes satisfying orientation symbols in sparse SOS scripts during motion generation, incorporating SMS-based data augmentation and gradient-based iterative optimization. 4) ControlNet-based ACTOR-PAE Decoder ensures smooth motion outputs.

Result: SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes. SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.

Conclusion: The proposed SOSControl framework effectively addresses the limitations of traditional text-to-motion systems by providing precise control over body part orientations and motion timing through symbolic programming, automatic extraction, and constraint-prioritized generation.

Abstract: Traditional text-to-motion frameworks often lack precise control, and existing approaches based on joint keyframe locations provide only positional guidance, making it challenging and unintuitive to specify body part orientations and motion timing. To address these limitations, we introduce the Salient Orientation Symbolic (SOS) script, a programmable symbolic framework for specifying body part orientations and motion timing at keyframes. We further propose an automatic SOS extraction pipeline that employs temporally-constrained agglomerative clustering for frame saliency detection and a Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts directly from motion data. Moreover, we present the SOSControl framework, which treats the available orientation symbols in the sparse SOS script as salient and prioritizes satisfying these constraints during motion generation. By incorporating SMS-based data augmentation and gradient-based iterative optimization, the framework enhances alignment with user-specified constraints. Additionally, it employs a ControlNet-based ACTOR-PAE Decoder to ensure smooth and natural motion outputs. Extensive experiments demonstrate that the SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes, while the SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.

[92] A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction

Ziwen Zhong, Zhitao Shu, Yue Zhao

Main category: cs.CV

TL;DR: Cloud-based multimodal emotion recognition framework using cross-modal transformers achieves state-of-the-art performance with low latency for real-time HCI applications.

Motivation: Existing emotion recognition systems rely on single-modality analysis (facial, speech, or text), resulting in limited robustness and poor generalization in real-world environments. There's a need for more robust multimodal approaches that can handle complex real-world scenarios.

Method: Proposes a Cloud-Based Cross-Modal Transformer (CMT) framework that integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) with cross-modal attention mechanisms. Leverages cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving for scalable deployment.
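
A minimal sketch of cross-modal attention over pre-encoded streams, assuming the ViT, Wav2Vec2, and BERT features have already been projected to a shared width; the paper's exact fusion topology may differ:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, heads=8, n_classes=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, text_feats, vision_feats, audio_feats):
        # Text tokens attend over the concatenated vision + audio context.
        context = torch.cat([vision_feats, audio_feats], dim=1)
        fused, _ = self.attn(text_feats, context, context)
        return self.head(fused.mean(dim=1))       # pooled emotion logits
```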

Result: Achieves state-of-the-art performance on benchmark datasets (IEMOCAP, MELD, AffectNet), improving F1-score by 3.0% and reducing cross-entropy loss by 12.9% compared to strong multimodal baselines. Cloud deployment shows average response latency of 128 ms (35% reduction compared to conventional transformer-based fusion systems).

Conclusion: The proposed framework enables efficient, real-time emotion recognition and adaptive feedback for applications like intelligent customer service, virtual tutoring, and affective computing interfaces, representing an important step toward cloud-native affective computing and emotionally intelligent interactive systems.

Abstract: Emotion recognition is a fundamental component of next-generation human-computer interaction (HCI), enabling machines to perceive, understand, and respond to users’ affective states. However, existing systems often rely on single-modality analysis such as facial expressions, speech tone, or textual sentiment, resulting in limited robustness and poor generalization in real-world environments. To address these challenges, this study proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The proposed model integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) and employs a cross-modal attention mechanism to capture complex interdependencies among heterogeneous features. By leveraging cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving, the system enables scalable, low-latency emotion recognition for large-scale user interactions. Experiments conducted on benchmark datasets including IEMOCAP, MELD, and AffectNet demonstrate that the CMT achieves state-of-the-art performance, improving the F1-score by 3.0 percent and reducing cross-entropy loss by 12.9 percent compared to strong multimodal baselines. Additionally, cloud deployment evaluations show an average response latency of 128 ms, representing a 35 percent reduction compared with conventional transformer-based fusion systems. These results confirm that the proposed framework enables efficient, real-time emotion recognition and adaptive feedback in applications such as intelligent customer service, virtual tutoring systems, and affective computing interfaces, marking an important step toward cloud-native affective computing and emotionally intelligent interactive systems.

[93] READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection

Chenglizhao Chen, Boze Li, Mengke Song, Dehao Feng, Xinyu Liu, Shanchen Pang, Jufeng Yang, Hui Yu

Main category: cs.CV

TL;DR: READ-Net is an audio-visual depression detection framework that addresses Emotional Ambiguity by adaptively recalibrating emotional features to enhance depression-related signals while filtering out emotional noise.

Motivation: Current audio-visual depression detection methods either ignore emotional cues (missing subtle depressive signals) or incorporate emotions but confuse transient emotional expressions with stable depressive symptoms, leading to detection errors due to Emotional Ambiguity.

Method: Proposes READ-Net with Adaptive Feature Recalibration (AFR) that dynamically adjusts emotional feature weights to preserve depression-relevant cues while filtering out irrelevant emotional noise, clarifying feature representations and mitigating emotional interference.
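
One plausible reading of adaptive feature recalibration is a learned, squeeze-and-excitation-style gate over emotional feature channels; the sketch below is illustrative, not the paper's exact module:

```python
import torch.nn as nn

class AdaptiveRecalibration(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, emotional_feats):           # (batch, channels)
        weights = self.gate(emotional_feats)      # per-channel weights in (0, 1)
        return emotional_feats * weights          # amplify depressive cues, damp noise
```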

Result: Outperforms state-of-the-art methods on three public datasets with average gains of 4.55% in accuracy and 1.26% in F1-score, demonstrating robustness to emotional disturbances.

Conclusion: READ-Net effectively resolves Emotional Ambiguity in audio-visual depression detection, can be integrated into existing frameworks, and significantly improves detection performance by distinguishing depression symptoms from transient emotional expressions.

Abstract: Depression is a severe global mental health issue that impairs daily functioning and overall quality of life. Although recent audio-visual approaches have improved automatic depression detection, methods that ignore emotional cues often fail to capture subtle depressive signals hidden within emotional expressions. Conversely, those incorporating emotions frequently confuse transient emotional expressions with stable depressive symptoms in feature representations, a phenomenon termed “Emotional Ambiguity”, thereby leading to detection errors. To address this critical issue, we propose READ-Net, the first audio-visual depression detection framework explicitly designed to resolve Emotional Ambiguity through Adaptive Feature Recalibration (AFR). The core insight of AFR is to dynamically adjust the weights of emotional features to enhance depression-related signals. Rather than merely overlooking or naively combining emotional information, READ-Net innovatively identifies and preserves depressive-relevant cues within emotional features, while adaptively filtering out irrelevant emotional noise. This recalibration strategy significantly clarifies feature representations, and effectively mitigates the persistent challenge of emotional interference. Additionally, READ-Net can be easily integrated into existing frameworks for improved performance. Extensive evaluations on three publicly available datasets show that READ-Net outperforms state-of-the-art methods, with average gains of 4.55% in accuracy and 1.26% in F1-score, demonstrating its robustness to emotional disturbances and improving audio-visual depression detection.

[94] Intelligent Power Grid Design Review via Active Perception-Enabled Multimodal Large Language Models

Taoliang Tan, Chengwei Ma, Zhen Tian, Zhao Lin, Dongdong Li, Si Shi

Main category: cs.CV

TL;DR: A three-stage MLLM-driven framework for intelligent power grid drawing review that mimics human expert workflow: global semantic understanding, high-resolution region analysis, and confidence-aware decision making.

Motivation: Current automated systems fail with ultra-high-resolution power grid drawings due to computational demands, information loss, and lack of holistic semantic understanding for design error identification.

Method: Three-stage framework: 1) MLLM for global semantic understanding to propose domain-specific regions from low-resolution overview; 2) High-resolution fine-grained recognition within proposed regions with confidence scores; 3) Decision-making module integrating confidence-aware results for error diagnosis and reliability assessment.

Result: Preliminary results on real-world drawings show significantly enhanced MLLM ability to grasp macroscopic semantic information and pinpoint design errors, with improved defect discovery accuracy and greater reliability compared to traditional passive MLLM inference.

Conclusion: The research offers a novel prompt-driven paradigm for intelligent and reliable power grid drawing review, mimicking human expert workflow through advanced MLLM prompt engineering.

Abstract: The intelligent review of power grid engineering design drawings is crucial for power system safety. However, current automated systems struggle with ultra-high-resolution drawings due to high computational demands, information loss, and a lack of holistic semantic understanding for design error identification. This paper proposes a novel three-stage framework for intelligent power grid drawing review, driven by pre-trained Multimodal Large Language Models (MLLMs) through advanced prompt engineering. Mimicking the human expert review process, the first stage leverages an MLLM for global semantic understanding to intelligently propose domain-specific semantic regions from a low-resolution overview. The second stage then performs high-resolution, fine-grained recognition within these proposed regions, acquiring detailed information with associated confidence scores. In the final stage, a comprehensive decision-making module integrates these confidence-aware results to accurately diagnose design errors and provide a reliability assessment. Preliminary results on real-world power grid drawings demonstrate our approach significantly enhances MLLM’s ability to grasp macroscopic semantic information and pinpoint design errors, showing improved defect discovery accuracy and greater reliability in review judgments compared to traditional passive MLLM inference. This research offers a novel, prompt-driven paradigm for intelligent and reliable power grid drawing review.

[95] DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling

Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gwing Kei Yip, Gerald W. Y. Cheng, Yunlin Mao, Jing Cai, Liang-ting Lin, Jung Sun Yoo

Main category: cs.CV

TL;DR: DeepMoLM is a dual-view molecular language model that combines high-resolution molecular images with geometric invariants from conformations, enabling physically grounded molecular understanding and generation without explicit 3D coordinates.

Motivation: Current AI models for drug discovery struggle with interpreting molecular images and generating outputs consistent with 3D geometry and stereochemistry. String/graph-based models lack visual understanding, while vision-language models miss stereochemical details and have difficulty mapping continuous 3D structures to discrete tokens.

Method: DeepMoLM uses a dual-view framework that grounds high-resolution (1024×1024) molecular images in geometric invariants derived from molecular conformations. It encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints and fuses visual and geometric streams using cross-attention.

Result: Achieves 12.3% relative METEOR gain over strongest generalist baseline on PubChem captioning, produces valid numeric outputs for all property queries, attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in specialist setting. Matches state-of-the-art vision-language models on ChEBI-20 description generation.

Conclusion: DeepMoLM successfully bridges molecular visual understanding with geometric reasoning, enabling physically grounded molecular language modeling without requiring explicit atom coordinates, outperforming generalist baselines while competing with specialist methods.

Abstract: AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language Modeling, a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.

[96] LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

Mengyu Sun, Ziyuan Yang, Andrew Beng Jin Teoh, Junxu Liu, Haibo Hu, Yi Zhang

Main category: cs.CV

TL;DR: LURE is a novel method that reawakens erased concepts in diffusion models by reconstructing latent space and guiding sampling trajectories, overcoming limitations of existing prompt-level optimization approaches.

Motivation: Existing concept erasure methods are vulnerable as erased concepts can be reawakened, but current reawakening approaches only focus on prompt-level optimization, neglecting other generative factors. This limits comprehensive understanding of the underlying dynamics and effectiveness of reawakening.

Method: LURE models generation as an implicit function to analyze multiple factors (text conditions, model parameters, latent states). It uses semantic re-binding to reconstruct latent space by aligning denoising predictions with target distributions. Gradient Field Orthogonalization prevents feature entanglement in multi-concept scenarios, and Latent Semantic Identification-Guided Sampling ensures stability via posterior density verification.
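
A Gram-Schmidt-style simplification of what enforcing orthogonality between per-concept gradients can look like; the paper's actual procedure may differ in detail:

```python
import torch

def orthogonalize_gradients(grads):
    """Gram-Schmidt over flattened per-concept gradients so later concepts'
    updates carry no component along earlier ones."""
    basis, out = [], []
    for g in grads:
        g = g.clone()
        for b in basis:
            g = g - (g @ b) * b                   # project off earlier directions
        out.append(g)
        n = g.norm()
        if n > 1e-8:
            basis.append(g / n)                   # extend the orthonormal basis
    return out
```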

Result: Extensive experiments show LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods, demonstrating superior performance compared to existing approaches.

Conclusion: LURE provides a comprehensive framework for concept reawakening by addressing multiple generative factors, offering insights into diffusion model vulnerabilities and enabling effective reawakening of erased concepts through latent space reconstruction and guided sampling.

Abstract: Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.

[97] CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments

Haotian Xu, Yue Hu, Zhengqiu Zhu, Chen Gao, Ziyou Wang, Junreng Rao, Wenhao Lu, Weishi Li, Quanjun Yin, Yong Li

Main category: cs.CV

TL;DR: CityCube is a new benchmark for evaluating cross-view spatial reasoning in VLMs using urban environments with multiple viewpoints from vehicles, drones, and satellites, showing VLMs significantly underperform humans.

Motivation: Existing benchmarks focus on indoor/street settings but overlook the unique challenges of open-ended urban spaces with rich semantics, complex geometries, and view variations. There's a need to systematically evaluate VLMs' cross-view reasoning capabilities in realistic urban environments.

Method: Created CityCube benchmark with 5,022 annotated multi-view QA pairs across five cognitive dimensions and three spatial relation expressions. Integrated four viewpoint dynamics to mimic camera movements and spanned perspectives from multiple platforms (vehicles, drones, satellites). Evaluated 33 VLMs comprehensively.

Result: Large-scale VLMs struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. Small-scale fine-tuned VLMs achieve over 60.0% accuracy. Analysis reveals task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.

Conclusion: CityCube effectively exposes limitations in current VLMs’ cross-view spatial reasoning capabilities in urban environments, highlighting the need for specialized benchmarks and showing that fine-tuning can improve performance but significant gaps with human reasoning remain.

Abstract: Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.

[98] Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data

Yixiong Chen, Zongwei Zhou, Wenxuan Li, Alan Yuille

Main category: cs.CV

TL;DR: SegAE is a lightweight vision-language model that automatically predicts label quality for medical segmentation datasets, achieving high correlation with ground-truth metrics and enabling efficient quality control.

Motivation: Large-scale medical segmentation datasets often contain mixed-quality labels (manual and pseudo-labels) that compromise training and evaluation. Low-quality labels reduce model performance and robustness, creating a need for automated quality assessment.

Method: SegAE is a lightweight vision-language model trained on over 4 million image-label pairs with quality scores. It predicts label quality across 142 anatomical structures using a VLM approach that correlates with ground-truth Dice similarity.
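
The quality score SegAE regresses against is Dice similarity; for reference, a minimal implementation over binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2 * inter / (pred.sum() + gt.sum() + eps))  # 2|A∩B| / (|A|+|B|)
```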

Result: SegAE achieves 0.902 correlation coefficient with ground-truth Dice similarity, evaluates 3D masks in 0.06s, reveals widespread low-quality labeling in public datasets, reduces annotation costs by one-third, and cuts quality-checking time by 70% per label.

Conclusion: SegAE provides an effective solution for quality control in large-scale medical segmentation datasets, improving data efficiency and training performance while reducing annotation costs and quality-checking time.

Abstract: Large-scale medical segmentation datasets often combine manual and pseudo-labels of uneven quality, which can compromise training and evaluation. Low-quality labels may hamper performance and make the model training less robust. To address this issue, we propose SegAE (Segmentation Assessment Engine), a lightweight vision-language model (VLM) that automatically predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE achieves a high correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06s. SegAE shows several practical benefits: (I) Our analysis reveals widespread low-quality labeling across public datasets; (II) SegAE improves data efficiency and training performance in active and semi-supervised learning, reducing dataset annotation cost by one-third and quality-checking time by 70% per label. This tool provides a simple and effective solution for quality control in large-scale medical segmentation datasets. The dataset, model weights, and codes are released at https://github.com/Schuture/SegAE.

[99] Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation

Danial Sadrian Zadeh, Otman A. Basir, Behzad Moshiri

Main category: cs.CV

TL;DR: Novel framework converts single frontal-view camera images into natural language descriptions for autonomous vehicles, using hybrid attention mechanisms and a new dataset from BDD100K.

Motivation: Traffic scene understanding is crucial for autonomous vehicle safety, but there's limited availability of specialized datasets for converting camera images to natural language descriptions that capture spatial layouts, semantic relationships, and driving-relevant cues.

Method: Proposes a framework with hybrid attention mechanism for enhanced spatial and semantic feature extraction from single frontal-view camera images, integrates these features to generate rich scene descriptions, and creates a new dataset from BDD100K with comprehensive construction guidelines.

Result: Extensive evaluations using CIDEr, SPICE metrics and human judgment show the model achieves strong performance on the new dataset, effectively generating contextually rich and detailed scene descriptions.

Conclusion: The proposed framework successfully transforms camera images into natural language descriptions for traffic scene understanding, with appropriate evaluation metrics identified and demonstrated effectiveness through quantitative and human assessments.

Abstract: Traffic scene understanding is essential for enabling autonomous vehicles to accurately perceive and interpret their environment, thereby ensuring safe navigation. This paper presents a novel framework that transforms a single frontal-view camera image into a concise natural language description, effectively capturing spatial layouts, semantic relationships, and driving-relevant cues. The proposed model leverages a hybrid attention mechanism to enhance spatial and semantic feature extraction and integrates these features to generate contextually rich and detailed scene descriptions. To address the limited availability of specialized datasets in this domain, a new dataset derived from the BDD100K dataset has been developed, with comprehensive guidelines provided for its construction. Furthermore, the study offers an in-depth discussion of relevant evaluation metrics, identifying the most appropriate measures for this task. Extensive quantitative evaluations using metrics such as CIDEr and SPICE, complemented by human judgment assessments, demonstrate that the proposed model achieves strong performance and effectively fulfills its intended objectives on the newly developed dataset.

[100] XD-MAP: Cross-Modal Domain Adaptation using Semantic Parametric Mapping

Frank Bieder, Hendrik Königshof, Haohao Hu, Fabian Immel, Yinzhe Shen, Jan-Hendrik Pauls, Christoph Stiller

Main category: cs.CV

TL;DR: XD-MAP transfers sensor-specific knowledge from camera images to LiDAR using semantic parametric maps to generate pseudo labels without manual annotation, enabling domain adaptation without sensor overlap and extending perception range.

Motivation: Deep learning models depend on dataset availability that must align with target categories, sensor characteristics, and modalities. There's a gap between available datasets and deployment domains, requiring domain adaptation strategies to transfer knowledge across different sensing domains like camera to LiDAR.

Method: XD-MAP leverages detections from neural networks on camera images to create semantic parametric maps. Map elements are modeled to produce pseudo labels in the target LiDAR domain without manual annotation. Unlike previous approaches, it doesn’t require direct sensor overlap and extends the angular perception range from a front-view camera to a full 360° view.
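
A toy illustration of minting pseudo labels by transforming semantic map elements into the LiDAR frame; the pose matrix, class layout, and nearest-point labeling rule below are all assumptions, not the paper's actual map modeling:

```python
import numpy as np

def pseudo_label_points(points_lidar, map_elements, T_map_to_lidar, radius=0.2):
    """points_lidar: (N, 3); map_elements: {class_id: (M, 3) points in map frame}."""
    labels = np.zeros(len(points_lidar), dtype=np.int64)   # 0 = unlabeled
    for class_id, pts_map in map_elements.items():
        pts_h = np.c_[pts_map, np.ones(len(pts_map))]      # homogeneous coordinates
        pts = (T_map_to_lidar @ pts_h.T).T[:, :3]          # map frame -> LiDAR frame
        for p in pts:
            d = np.linalg.norm(points_lidar - p, axis=1)
            labels[d < radius] = class_id                  # label nearby LiDAR points
    return labels
```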

Result: On a large-scale road feature dataset, XD-MAP outperforms single-shot baselines by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation, achieving strong performance on LiDAR data without manual labeling.

Conclusion: The approach effectively bridges the domain gap between camera and LiDAR sensors, enabling knowledge transfer without sensor overlap or manual annotation, and demonstrates significant performance improvements across multiple segmentation tasks in autonomous driving applications.

Abstract: Until open-world foundation models match the performance of specialized approaches, the effectiveness of deep learning models remains heavily dependent on dataset availability. Training data must align not only with the target object categories but also with the sensor characteristics and modalities. To bridge the gap between available datasets and deployment domains, domain adaptation strategies are widely used. In this work, we propose a novel approach to transferring sensor-specific knowledge from an image dataset to LiDAR, an entirely different sensing domain. Our method XD-MAP leverages detections from a neural network on camera images to create a semantic parametric map. The map elements are modeled to produce pseudo labels in the target domain without any manual annotation effort. Unlike previous domain transfer approaches, our method does not require direct overlap between sensors and enables extending the angular perception range from a front-view camera to a full 360° view. On our large-scale road feature dataset, XD-MAP outperforms single-shot baseline approaches by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation. The results demonstrate the effectiveness of our approach, achieving strong performance on LiDAR data without any manual labeling.

[101] Gaussian Based Adaptive Multi-Modal 3D Semantic Occupancy Prediction

A. Enes Doruk

Main category: cs.CV

TL;DR: A novel Gaussian-based adaptive camera-LiDAR fusion model for 3D occupancy prediction that addresses computational complexity and dynamic environment challenges through memory-efficient 3D Gaussian modeling and selective state space models.

Motivation: Current voxelization methods for 3D semantic occupancy prediction suffer from excessive computational complexity and brittle fusion processes that break down in dynamic environments, creating safety challenges for autonomous vehicles.

Method: Four-component approach: 1) LiDAR Depth Feature Aggregation with depth-wise deformable sampling for geometric sparsity, 2) Entropy-Based Feature Smoothing using cross-entropy for domain noise, 3) Adaptive Camera-LiDAR Fusion with dynamic recalibration, and 4) Gauss-Mamba Head using Selective State Space Models for linear-complexity global context decoding.

Result: The proposed solution bridges the semantic strengths of the camera with the geometric strengths of LiDAR through memory-efficient 3D Gaussian modeling, addressing computational complexity and improving robustness in dynamic environments.

Conclusion: The Gaussian-based adaptive fusion model provides a more efficient and robust approach to 3D occupancy prediction for autonomous vehicles, overcoming limitations of current voxelization methods through innovative multimodal fusion and linear-complexity decoding.

Abstract: The sparse object detection paradigm shift towards dense 3D semantic occupancy prediction is necessary for dealing with long-tail safety challenges for autonomous vehicles. Nonetheless, the current voxelization methods commonly suffer from excessive computation complexity demands, where the fusion process is brittle, static, and breaks down under dynamic environmental settings. To this end, this research work enhances a novel Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model that seamlessly bridges the semantic strengths of camera modality with the geometric strengths of LiDAR modality through a memory-efficient 3D Gaussian model. The proposed solution has four key components: (1) LiDAR Depth Feature Aggregation (LDFA), where depth-wise deformable sampling is employed for dealing with geometric sparsity, (2) Entropy-Based Feature Smoothing, where cross-entropy is employed for handling domain-specific noise, (3) Adaptive Camera-LiDAR Fusion, where dynamic recalibration of sensor outputs is performed based on model outputs, and (4) Gauss-Mamba Head that uses Selective State Space Models for global context decoding that enjoys linear computation complexity.

[102] Real-Time Wildfire Localization on the NASA Autonomous Modular Sensor using Deep Learning

Yajvan Ravan, Aref Malek, Chester Dolph, Nikhil Behari

Main category: cs.CV

TL;DR: NASA introduces a multi-spectral aerial wildfire dataset and deep learning model for real-time fire perimeter detection, achieving 96% accuracy and 74% IoU.

Motivation: High-altitude multi-spectral aerial imagery is scarce and expensive but essential for wildfire detection algorithms. Current methods lack sufficient data for training robust machine learning models.

Method: Created human-annotated dataset from NASA AMS with 12-channel imagery (IR, SWIR, thermal). Trained two deep neural networks: one for image classification and one for pixel-level segmentation, combined into a real-time segmentation model.
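
The two-network combination suggests a simple cascade on the incoming image feed, sketched below with hypothetical `classifier` and `segmenter` callables:

```python
import numpy as np

def localize_fire(frame, classifier, segmenter, threshold=0.5):
    p_fire = classifier(frame)                    # cheap whole-image fire probability
    if p_fire < threshold:
        return np.zeros(frame.shape[:2], dtype=bool)  # skip costly segmentation
    return segmenter(frame) > 0.5                 # per-pixel fire mask
```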

Result: Model achieves 96% classification accuracy, 74% Intersection-over-Union, and 84% recall, surpassing past methods. Can detect wildfires at nighttime, behind clouds, and distinguish false positives.

Conclusion: Multi-spectral data (especially SWIR, IR, thermal bands) enables robust wildfire detection. The dataset and model advance automated fire perimeter determination for real-time wildfire monitoring.

Abstract: High-altitude, multi-spectral, aerial imagery is scarce and expensive to acquire, yet it is necessary for algorithmic advances and application of machine learning models to high-impact problems such as wildfire detection. We introduce a human-annotated dataset from the NASA Autonomous Modular Sensor (AMS) using 12-channel, medium to high altitude (3-50 km) aerial wildfire images similar to those used in current US wildfire missions. Our dataset combines spectral data from 12 different channels, including infrared (IR), short-wave IR (SWIR), and thermal. We take imagery from 20 wildfire missions and randomly sample small patches to generate over 4000 images with high variability, including occlusions by smoke/clouds, easily-confused false positives, and nighttime imagery. We demonstrate results from a deep-learning model to automate the human-intensive process of fire perimeter determination. We train two deep neural networks, one for image classification and the other for pixel-level segmentation. The networks are combined into a unique real-time segmentation model to efficiently localize active wildfire on an incoming image feed. Our model achieves 96% classification accuracy, 74% Intersection-over-Union (IoU), and 84% recall, surpassing past methods, including models trained on satellite data and classical color-rule algorithms. By leveraging a multi-spectral dataset, our model is able to detect active wildfire at nighttime and behind clouds, while distinguishing between false positives. We find that data from the SWIR, IR, and thermal bands is the most important to distinguish fire perimeters. Our code and dataset can be found here: https://github.com/nasa/Autonomous-Modular-Sensor-Wildfire-Segmentation/tree/main and https://drive.google.com/drive/folders/1-u4vs9rqwkwgdeeeoUhftCxrfe_4QPTn?=usp=drive_link

[103] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

Shuhong Liu, Chenyu Bao, Ziteng Cui, Yun Liu, Xuangeng Chu, Lin Gu, Marcos V. Conde, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Tianhan Xu, Yuan Gan, Yusuke Kurose, Tatsuya Harada

Main category: cs.CV

TL;DR: RealX3D is a real-capture benchmark for evaluating multi-view visual restoration and 3D reconstruction under diverse physical degradations, showing current methods struggle with real-world corruptions.

Motivation: Current multi-view 3D reconstruction pipelines are fragile in real-world challenging environments with physical degradations, but existing benchmarks lack real-capture data with diverse corruptions at controlled severity levels.

Method: Created RealX3D benchmark with four corruption families (illumination, scattering, occlusion, blurring) captured at multiple severity levels using unified acquisition protocol. Includes pixel-aligned LQ/GT views, high-resolution capture, RAW images, dense laser scans, world-scale meshes, and metric depth.

Result: Benchmarking optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, demonstrating the fragility of current multi-view pipelines in real-world challenging environments.

Conclusion: RealX3D exposes limitations of current 3D reconstruction methods and provides a comprehensive real-capture benchmark to drive development of more robust multi-view pipelines for real-world applications.

Abstract: We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.

[104] GutenOCR: A Grounded Vision-Language Front-End for Documents

Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

Main category: cs.CV

TL;DR: GutenOCR is a family of grounded OCR models built by fine-tuning Qwen2.5-VL vision-language models, offering unified reading, detection, and grounding capabilities through prompt-based interface.

DetailsMotivation: To create a single-checkpoint vision-language model that unifies OCR reading, text detection, and grounding capabilities through a prompt-based interface, addressing limitations of existing OCR systems that separate these functions.

Method: Fine-tuned Qwen2.5-VL-3B and Qwen2.5-VL-7B models on business documents, scientific articles, and synthetic grounding data to create GutenOCR models that support full-page and localized reading with bounding boxes and conditional queries.

Result: GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone (0.40 to 0.82) on 10.5K held-out business and scientific pages. Substantial improvements in region- and line-level OCR and text-detection recall on Fox and OmniDocBench v1.5, but with trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

Conclusion: GutenOCR successfully demonstrates that fine-tuning vision-language models can create unified OCR systems with strong grounded OCR capabilities, though certain trade-offs remain in specific document processing scenarios.

Abstract: GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional “where is x?” queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

[105] PAS-Mamba: Phase-Amplitude-Spatial State Space Model for MRI Reconstruction

Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Hao Xu, Weixin Si, Haodong Xu, Beiji Zou

Main category: cs.CV

TL;DR: PAS-Mamba is a novel MRI reconstruction framework that decouples phase and magnitude modeling in frequency domain using specialized branches, combines with spatial features via LocalMamba, and employs circular frequency scanning with dual-domain fusion for superior reconstruction.

DetailsMotivation: Existing MRI reconstruction methods treat frequency domain as a whole, neglecting that phase and amplitude carry fundamentally different information (phase governs structure, amplitude reflects pixel intensity). Unified frequency-domain modeling causes interference between phase and magnitude feature learning.

Method: Proposes PAS-Mamba with: 1) Image-domain LocalMamba preserving spatial locality; 2) Frequency-domain decoupling into separate phase and amplitude branches; 3) Circular Frequency Domain Scanning (CFDS) to serialize features from low to high frequencies respecting concentric geometry; 4) Dual-Domain Complementary Fusion Module (DDCFM) for adaptive fusion and bidirectional exchange between domains.

Result: Extensive experiments on IXI and fastMRI knee datasets show PAS-Mamba consistently outperforms state-of-the-art reconstruction methods.

Conclusion: Decoupling phase and magnitude modeling in frequency domain while combining with spatial features through specialized scanning and fusion mechanisms leads to superior MRI reconstruction performance by preventing representational coupling and enabling better feature learning.

Abstract: Joint feature modeling in both the spatial and frequency domains has become a mainstream approach in MRI reconstruction. However, existing methods generally treat the frequency domain as a whole, neglecting the differences in the information carried by its internal components. According to Fourier transform theory, phase and amplitude represent different types of information in the image. Our spectrum swapping experiments show that magnitude mainly reflects pixel-level intensity, while phase predominantly governs image structure. To prevent interference between phase and magnitude feature learning caused by unified frequency-domain modeling, we propose the Phase-Amplitude-Spatial State Space Model (PAS-Mamba) for MRI reconstruction, a framework that decouples phase and magnitude modeling in the frequency domain and combines it with image-domain features for better reconstruction. In the image domain, LocalMamba preserves spatial locality to sharpen fine anatomical details. In the frequency domain, we disentangle amplitude and phase into two specialized branches to avoid representational coupling. To respect the concentric geometry of frequency information, we propose Circular Frequency Domain Scanning (CFDS) to serialize features from low to high frequencies. Finally, a Dual-Domain Complementary Fusion Module (DDCFM) adaptively fuses amplitude and phase representations and enables bidirectional exchange between frequency and image domains, delivering superior reconstruction. Extensive experiments on the IXI and fastMRI knee datasets show that PAS-Mamba consistently outperforms state-of-the-art reconstruction methods.
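
A minimal sketch of the core decoupling idea, assuming a 2D FFT split into magnitude and phase with one branch per component; the 3x3-conv branches are placeholders standing in for the paper's specialized Mamba blocks.

```python
import torch
import torch.nn as nn

class PhaseAmplitudeBranches(nn.Module):
    """Process spectral magnitude and phase separately, then recombine."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.amp_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.phase_branch = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.fft2(x)                 # complex spectrum
        amp, phase = spec.abs(), spec.angle()    # decouple the two components
        amp = self.amp_branch(amp)               # intensity-oriented branch
        phase = self.phase_branch(phase)         # structure-oriented branch
        recombined = torch.polar(amp, phase)     # amp * exp(i * phase)
        return torch.fft.ifft2(recombined).real

net = PhaseAmplitudeBranches()
print(net(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 1, 32, 32])
```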

[106] Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency

Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen, Minh Le, Phat K. Huynh, Ulas Bagci

Main category: cs.CV

TL;DR: SDT-Net: A dual-teacher, single-student framework for scribble-supervised medical image segmentation that uses dynamic teacher switching and multi-level supervision to address annotation sparsity and boundary learning challenges.

DetailsMotivation: Scribble annotations reduce annotation burden but introduce ambiguity due to sparsity, leading to noisy pseudo-label propagation and poor anatomical boundary learning in medical image segmentation.

Method: Proposes SDT-Net with Dynamic Teacher Switching (DTS) to select the most reliable teacher, Pick Reliable Pixels (PRP) for high-confidence pseudo-label refinement, and Hierarchical Consistency (HiCo) module for multi-level feature alignment between teacher and student.

Result: Achieves state-of-the-art performance on ACDC and MSCMRseg datasets, producing more accurate and anatomically plausible segmentation results compared to existing methods.

Conclusion: SDT-Net effectively leverages weak scribble annotations through adaptive teacher selection and multi-level supervision, demonstrating superior performance in scribble-supervised medical image segmentation.

Abstract: Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.
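A minimal sketch of confidence-gated pseudo-labeling in the spirit of the Pick Reliable Pixels (PRP) mechanism: the student is supervised only on pixels where the selected teacher is confident. The threshold and the plain cross-entropy loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def reliable_pixel_loss(teacher_logits: torch.Tensor,
                        student_logits: torch.Tensor,
                        tau: float = 0.9) -> torch.Tensor:
    # teacher_logits, student_logits: (B, C, H, W); teacher is not backpropagated.
    probs = teacher_logits.detach().softmax(dim=1)
    conf, pseudo = probs.max(dim=1)                  # per-pixel confidence / label
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    mask = (conf > tau).float()                      # keep only reliable pixels
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```
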

[107] Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

Main category: cs.CV

TL;DR: Proposes a fuzzy controller-based framework for video inference enhancement that dynamically switches between models of varying scales based on real-time resource conditions, balancing resource utilization and inference performance.

DetailsMotivation: Existing video inference enhancement methods focus on scaling up models and complex architectures, but overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal performance.

Method: Develops a fuzzy controller (FC-r) based on key system parameters and inference metrics, then proposes a video inference enhancement framework that leverages spatiotemporal correlation of targets across frames and dynamically switches between models of varying scales guided by the FC-r.

Result: Experimental results demonstrate that the proposed method effectively achieves a balance between resource utilization and inference performance.

Conclusion: The fuzzy controller-guided framework successfully addresses the resource-efficiency trade-off in video inference enhancement by enabling dynamic model switching based on real-time device conditions.

Abstract: Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches demonstrate state-of-the-art performance, they often overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed, where the spatiotemporal correlation of targets across adjacent video frames is leveraged. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively achieves a balance between resource utilization and inference performance.
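
A heavily simplified sketch of fuzzy, resource-aware model switching. The real FC-r operates over several system and inference metrics; here triangular memberships over a single GPU-load signal pick among three hypothetical model scales.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular fuzzy membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def select_model(gpu_load: float) -> str:
    # Membership of the current load in three fuzzy sets, one per model scale.
    scores = {
        "small":  triangular(gpu_load, 0.5, 1.0, 1.5),   # heavily loaded
        "medium": triangular(gpu_load, 0.2, 0.5, 0.8),
        "large":  triangular(gpu_load, -0.5, 0.0, 0.5),  # mostly idle
    }
    return max(scores, key=scores.get)

print(select_model(0.85))  # -> "small": fall back to the lightweight model
```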

[108] Anatomically Guided Latent Diffusion for Brain MRI Progression Modeling

Cheng Wan, Bahram Jafrasteh, Ehsan Adeli, Miaomiao Zhang, Qingyu Zhao

Main category: cs.CV

TL;DR: AG-LDM is a segmentation-guided latent diffusion model that simplifies brain MRI progression modeling while enforcing anatomical consistency, outperforming complex multi-stage approaches with better image quality and clinical covariate utilization.

DetailsMotivation: Existing brain MRI progression models like BrLP have architectural complexity, suboptimal use of clinical covariates, and limited anatomical consistency guarantees, creating a need for simpler, more anatomically grounded approaches.

Method: AG-LDM uses a segmentation-guided latent diffusion framework that directly fuses baseline anatomy, noisy follow-up states, and clinical covariates at input level, avoiding auxiliary control networks. It incorporates a lightweight 3D tissue segmentation model (WarpSeg) for anatomical supervision during both autoencoder fine-tuning and diffusion training.

Result: On 31,713 ADNI longitudinal pairs and OASIS-3 zero-shot evaluation, AG-LDM matches or surpasses complex diffusion models with state-of-the-art image quality, 15-20% reduction in volumetric errors, and 31.5x higher sensitivity to clinical covariates than BrLP. It generates biologically plausible Alzheimer’s progression trajectories.

Conclusion: AG-LDM provides an efficient, anatomically grounded framework for reliable brain MRI progression modeling that simplifies training while improving anatomical consistency and clinical covariate utilization.

Abstract: Accurately modeling longitudinal brain MRI progression is crucial for understanding neurodegenerative diseases and predicting individualized structural changes. Existing state-of-the-art approaches, such as Brain Latent Progression (BrLP), often use multi-stage training pipelines with auxiliary conditioning modules but suffer from architectural complexity, suboptimal use of conditional clinical covariates, and limited guarantees of anatomical consistency. We propose Anatomically Guided Latent Diffusion Model (AG-LDM), a segmentation-guided framework that enforces anatomically consistent progression while substantially simplifying the training pipeline. AG-LDM conditions latent diffusion by directly fusing baseline anatomy, noisy follow-up states, and clinical covariates at the input level, a strategy that avoids auxiliary control networks by learning a unified, end-to-end model that represents both anatomy and progression. A lightweight 3D tissue segmentation model (WarpSeg) provides explicit anatomical supervision during both autoencoder fine-tuning and diffusion model training, ensuring consistent brain tissue boundaries and morphometric fidelity. Experiments on 31,713 ADNI longitudinal pairs and zero-shot evaluation on OASIS-3 demonstrate that AG-LDM matches or surpasses more complex diffusion models, achieving state-of-the-art image quality and 15-20% reduction in volumetric errors in generated images. AG-LDM also exhibits markedly stronger utilization of temporal and clinical covariates (up to 31.5x higher sensitivity than BrLP) and generates biologically plausible counterfactual trajectories, accurately capturing hallmarks of Alzheimer’s progression such as limbic atrophy and ventricular expansion. These results highlight AG-LDM as an efficient, anatomically grounded framework for reliable brain MRI progression modeling.
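A minimal sketch of the input-level conditioning idea: rather than an auxiliary control network, baseline latents, noisy follow-up latents, and broadcast covariates are concatenated along the channel axis before the denoiser. The latent shapes and the covariate encoding are assumptions.

```python
import torch

def fuse_inputs(baseline: torch.Tensor, noisy_followup: torch.Tensor,
                covariates: torch.Tensor) -> torch.Tensor:
    # baseline, noisy_followup: (B, C, D, H, W) latent volumes
    # covariates: (B, K), e.g. age, inter-scan interval, diagnosis encoding
    b, k = covariates.shape
    spatial = baseline.shape[2:]                          # (D, H, W)
    cov_maps = covariates.view(b, k, 1, 1, 1).expand(b, k, *spatial)
    return torch.cat([baseline, noisy_followup, cov_maps], dim=1)

x = fuse_inputs(torch.randn(2, 4, 8, 8, 8), torch.randn(2, 4, 8, 8, 8),
                torch.randn(2, 3))
print(x.shape)  # torch.Size([2, 11, 8, 8, 8]) -> single denoiser input
```
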

[109] From Volumes to Slices: Computationally Efficient Contrastive Learning for Sequential Abdominal CT Analysis

Po-Kai Chiu, Hung-Hsuan Chen

Main category: cs.CV

TL;DR: 2D-VoCo is an efficient 2D adaptation of volume contrast learning for CT slice pre-training that reduces computational costs while improving multi-organ injury classification performance.

DetailsMotivation: Deep learning for medical image analysis is limited by the need for expert annotations. While 3D self-supervised methods like VoCo help with label scarcity, they suffer from high computational costs and memory consumption.

Method: Propose 2D-VoCo, an efficient adaptation of VoCo for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture for multi-organ injury classification.

Result: On the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score compared to training from scratch.

Conclusion: 2D-VoCo provides a practical method to reduce dependency on labeled data and enhance model performance in clinical CT analysis, with code released for reproducibility.

Abstract: The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self-supervised methods like volume contrast learning (VoCo) are powerful and partially address the labeling scarcity issue, their high computational cost and memory consumption are barriers. We propose 2D-VoCo, an efficient adaptation of the VoCo framework for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture to classify multi-organ injuries. In the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility. https://github.com/tkz05/2D-VoCo-CT-Classifier
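A minimal sketch of the downstream architecture: a 2D backbone (as pre-trained with 2D-VoCo) encodes each CT slice and an LSTM aggregates the slice sequence into study-level injury logits. The backbone, hidden size, and label count are placeholders.

```python
import torch
import torch.nn as nn

class SliceSequenceClassifier(nn.Module):
    """CNN-LSTM over a CT slice sequence for multi-organ injury logits."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_labels: int):
        super().__init__()
        self.backbone = backbone                        # 2D-VoCo pre-trained CNN
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_labels)

    def forward(self, slices: torch.Tensor) -> torch.Tensor:
        b, t = slices.shape[:2]                         # (B, T, 1, H, W)
        feats = self.backbone(slices.flatten(0, 1))     # (B*T, feat_dim)
        _, (h, _) = self.lstm(feats.view(b, t, -1))     # aggregate over slices
        return self.head(h[-1])                         # (B, num_labels)

backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = SliceSequenceClassifier(backbone, feat_dim=16, num_labels=5)
print(model(torch.randn(2, 10, 1, 64, 64)).shape)  # torch.Size([2, 5])
```
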

[110] LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren, Dingcheng Shan, Jing-cheng Pang, Sijie Wu, Xubin Li, Kai Zhang

Main category: cs.CV

TL;DR: LFS learns to select diverse, event-relevant frames for video captioning using caption feedback from frozen video-LLMs, improving caption quality and downstream tasks like VQA.

DetailsMotivation: Uniform frame sampling for video captioning is inefficient because it ignores uneven event distribution across time, leading to suboptimal temporal coverage and relevance.

Method: Learnable Frame Selector (LFS) models temporal importance to balance diversity and relevance, uses stratified strategy to avoid clustering, and leverages caption feedback from frozen video-LLMs to optimize selection.

Result: LFS improves detailed video captioning by up to 2.0% on VDC and over 4% on new ICH-CC benchmark, and enhances video question answering performance.

Conclusion: LFS provides an effective, easy-to-integrate solution for detailed video captioning that better aligns with human cognition through improved frame selection.

Abstract: Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven event distribution. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify a gap between existing benchmarks and human cognition. Thus, we introduce ICH-CC, built from carefully designed annotator questions that reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that captions enhanced with LFS lead to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
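
A minimal sketch of stratified, importance-weighted frame selection: split the video into equal temporal strata and take the highest-scoring frame in each, which preserves coverage while avoiding clustered picks. The per-frame scores are assumed to come from a learned temporal-importance head.

```python
import numpy as np

def stratified_select(scores: np.ndarray, k: int) -> list:
    """Pick k frames: one top-scoring frame per equal-width temporal stratum."""
    n = len(scores)
    picks = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k      # one stratum per slot
        picks.append(lo + int(np.argmax(scores[lo:hi])))
    return picks

scores = np.random.rand(128)          # per-frame event-relevance scores
print(stratified_select(scores, 8))   # 8 diverse, high-relevance frame indices
```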

[111] 3D Space as a Scratchpad for Editable Text-to-Image Generation

Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha, Kevin Blackburn-Matzen

Main category: cs.CV

TL;DR: Spatial Scratchpad: A 3D reasoning framework for visual language models that improves spatial consistency and text alignment in image generation by using explicit 3D scene planning.

DetailsMotivation: Visual language models lack explicit spatial reasoning mechanisms analogous to chain-of-thought in LLMs, limiting their ability to generate images with accurate geometric relations, object identities, and compositional intent.

Method: Introduces a spatial scratchpad - a 3D reasoning substrate that parses text prompts into subjects/background elements, instantiates them as editable 3D meshes, performs agentic scene planning for placement/orientation/viewpoint selection, then renders back to image domain with identity-preserving cues.

Result: Achieves 32% improvement in text alignment on GenAI-Bench, enables intuitive 3D edits that propagate reliably to final images, and generates spatially consistent, visually coherent outputs.

Conclusion: Demonstrates the benefit of explicit 3D reasoning for precise, controllable image generation and introduces a new paradigm for vision-language models that deliberate in both language and space.

Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad – a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/

[112] U-Harmony: Enhancing Joint Training for Segmentation Models with Universal Harmonization

Weiwei Ma, Xiaobing Yu, Peijie Qiu, Jin Yang, Pan Xiao, Xiaoqi Zhao, Xiaofeng Liu, Tomo Miyazaki, Shinichiro Omachi, Yongsong Huang

Main category: cs.CV

TL;DR: U-Harmony is a joint training method that enables single segmentation models to learn from heterogeneous medical datasets by normalizing and denormalizing feature distributions to mitigate domain variations while preserving dataset-specific knowledge.

DetailsMotivation: Medical segmentation datasets in clinical practice are often limited and heterogeneous, with variations across institutions in modalities, protocols, and anatomical targets. Existing deep learning models struggle to jointly learn from such diverse data, often sacrificing either generalization or domain-specific knowledge.

Method: Proposes Universal Harmonization (U-Harmony) method that integrates into deep learning architectures with a domain-gated head. The approach sequentially normalizes and then denormalizes feature distributions to mitigate domain-specific variations while preserving original dataset-specific knowledge. Supports universal modality adaptation for learning new imaging modalities and anatomical classes.

Result: Extensive experiments on cross-institutional brain lesion datasets demonstrate the effectiveness of the approach, establishing a new benchmark for robust and adaptable 3D medical image segmentation models in real-world clinical settings.

Conclusion: U-Harmony provides a solution for training single segmentation models on heterogeneous medical datasets, addressing domain variations while preserving dataset-specific knowledge, and enabling universal modality adaptation for practical clinical applications.

Abstract: In clinical practice, medical segmentation datasets are often limited and heterogeneous, with variations in modalities, protocols, and anatomical targets across institutions. Existing deep learning models struggle to jointly learn from such diverse data, often sacrificing either generalization or domain-specific knowledge. To overcome these challenges, we propose a joint training method called Universal Harmonization (U-Harmony), which can be integrated into deep learning-based architectures with a domain-gated head, enabling a single segmentation model to learn from heterogeneous datasets simultaneously. By integrating U-Harmony, our approach sequentially normalizes and then denormalizes feature distributions to mitigate domain-specific variations while preserving original dataset-specific knowledge. More appealingly, our framework also supports universal modality adaptation, allowing the seamless learning of new imaging modalities and anatomical classes. Extensive experiments on cross-institutional brain lesion datasets demonstrate the effectiveness of our approach, establishing a new benchmark for robust and adaptable 3D medical image segmentation models in real-world clinical settings.
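A minimal sketch of one plausible reading of the normalize-then-denormalize step: instance normalization strips domain-specific feature statistics, and per-domain learnable affine parameters restore them so dataset-specific knowledge is preserved. The per-domain embedding table is an assumption for illustration.

```python
import torch
import torch.nn as nn

class HarmonizeBlock(nn.Module):
    """Normalize away domain statistics, then restore per-domain style."""
    def __init__(self, channels: int, num_domains: int):
        super().__init__()
        self.norm = nn.InstanceNorm3d(channels, affine=False)
        self.gamma = nn.Embedding(num_domains, channels)
        self.beta = nn.Embedding(num_domains, channels)

    def forward(self, x: torch.Tensor, domain_id: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W); domain_id: (B,) integer dataset indices
        h = self.norm(x)                                   # remove domain stats
        g = self.gamma(domain_id)[:, :, None, None, None]
        b = self.beta(domain_id)[:, :, None, None, None]
        return g * h + b                                   # restore per-domain style

blk = HarmonizeBlock(channels=8, num_domains=3)
print(blk(torch.randn(2, 8, 4, 4, 4), torch.tensor([0, 2])).shape)
```
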

[113] Learning Consistent Taxonomic Classification through Hierarchical Reasoning

Zhenghong Li, Kecheng Zheng, Haibin Ling

Main category: cs.CV

TL;DR: VL-Taxon improves hierarchical reasoning in Vision-Language Models for taxonomic classification, achieving significant accuracy gains with minimal fine-tuning.

DetailsMotivation: VLMs often fail to grasp hierarchical knowledge, leading to errors where they misclassify coarser taxonomic levels even when correctly identifying specific leaf levels. Existing approaches overlook hierarchical reasoning.

Method: Two-stage hierarchy-based reasoning framework: 1) Top-down process to enhance leaf-level classification accuracy, 2) Leverages accurate leaf-level output to ensure hierarchical consistency. Each stage uses supervised fine-tuning followed by reinforcement learning.

Result: VL-Taxon on Qwen2.5-VL-7B outperforms original 72B model by over 10% in both leaf-level and hierarchical consistency accuracy on iNaturalist-2021 dataset, achieved with minimal fine-tuning on small data subset without using examples from other VLMs.

Conclusion: The proposed hierarchical reasoning framework effectively addresses VLMs’ limitations in taxonomic classification, demonstrating that proper hierarchical modeling can yield substantial performance improvements even with smaller models and limited training data.

Abstract: While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model’s reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
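The second stage's consistency guarantee can be illustrated with a tiny example: once the leaf label is fixed, the full taxonomic path is read off a parent map, so coarser levels can never contradict the leaf prediction. The toy taxonomy below is purely illustrative.

```python
# Toy parent map: child -> parent. A real taxonomy would span all ranks.
parents = {"red fox": "Vulpes", "Vulpes": "Canidae", "Canidae": "Carnivora"}

def full_path(leaf: str) -> list:
    """Walk from the predicted leaf up to the root of the taxonomy."""
    path = [leaf]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

print(full_path("red fox"))  # ['red fox', 'Vulpes', 'Canidae', 'Carnivora']
```
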

[114] Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection

Yingsong Huang, Hui Guo, Jing Huang, Bing Bai, Qi Xiong

Main category: cs.CV

TL;DR: DEUA framework uses diffusion epistemic uncertainty estimation and asymmetric learning to detect AI-generated images, achieving SOTA performance by distinguishing aleatoric vs epistemic uncertainty in reconstruction errors.

DetailsMotivation: Current diffusion-generated image detectors fail to distinguish between aleatoric uncertainty (inherent data noise) and epistemic uncertainty (model's lack of knowledge). Aleatoric uncertainty creates ambiguity that hinders detection, while epistemic uncertainty actually helps identify generated images. This distinction is crucial for improving detection performance.

Method: Proposes DEUA framework with two key components: 1) Diffusion Epistemic Uncertainty (DEU) estimation using Laplace approximation to measure data proximity to diffusion-generated manifold, and 2) Asymmetric loss function to train balanced classifier with larger margins for better generalizability.

Result: Extensive experiments on large-scale benchmarks demonstrate state-of-the-art performance in detecting diffusion-generated images, validating the effectiveness of distinguishing epistemic from aleatoric uncertainty.

Conclusion: The DEUA framework successfully addresses the limitation of previous methods by explicitly modeling epistemic uncertainty and using asymmetric learning, leading to superior detection performance for diffusion-generated images.

Abstract: The rapid progress of diffusion models highlights the growing need for detecting generated images. Previous research demonstrates that incorporating diffusion-based measurements, such as reconstruction error, can enhance the generalizability of detectors. However, ignoring the differing impacts of aleatoric and epistemic uncertainty on reconstruction error can undermine detection performance. Aleatoric uncertainty, arising from inherent data noise, creates ambiguity that impedes accurate detection of generated images. As it reflects random variations within the data (e.g., noise in natural textures), it does not help distinguish generated images. In contrast, epistemic uncertainty, which represents the model’s lack of knowledge about unfamiliar patterns, supports detection. In this paper, we propose a novel framework, Diffusion Epistemic Uncertainty with Asymmetric Learning (DEUA), for detecting diffusion-generated images. We introduce Diffusion Epistemic Uncertainty (DEU) estimation via the Laplace approximation to assess the proximity of data to the manifold of diffusion-generated samples. Additionally, an asymmetric loss function is introduced to train a balanced classifier with larger margins, further enhancing generalizability. Extensive experiments on large-scale benchmarks validate the state-of-the-art performance of our method.
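
A minimal sketch of an asymmetric margin loss in the spirit of DEUA: the generated class is penalized with a larger margin than the real class, leaving extra slack around generated images. The margin values and the margin-shifted cross-entropy form are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def asymmetric_margin_loss(logits: torch.Tensor, labels: torch.Tensor,
                           m_real: float = 0.0, m_gen: float = 0.5) -> torch.Tensor:
    # logits: (B, 2) for [real, generated]; labels: (B,) in {0, 1}
    margins = torch.full((len(labels),), m_real)
    margins[labels == 1] = m_gen                 # larger margin for "generated"
    shifted = logits.clone()
    # Subtract the margin from the true-class logit before cross-entropy,
    # which forces that logit to win by at least the margin.
    shifted[torch.arange(len(labels)), labels] -= margins
    return F.cross_entropy(shifted, labels)

loss = asymmetric_margin_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)))
```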

[115] Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

James Brock, Ce Zhang, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Forest-Chat: An LLM-driven agent for integrated forest change analysis using satellite imagery, supporting multiple remote sensing interpretation tasks through natural language queries.

DetailsMotivation: Despite advances in satellite imagery and deep learning, there's limited exploration of integrating LLMs with vision-language models for remote sensing image change interpretation in forest environments beyond urban areas. Current approaches lack comprehensive frameworks for natural language querying and multi-task forest change analysis.

Method: Forest-Chat uses an LLM-driven agent with multi-level change interpretation vision-language backbone, zero-shot change detection via foundation model, and interactive point-prompt interface. It introduces the Forest-Change dataset with bi-temporal satellite imagery, change masks, and semantic captions generated through human annotation and rule-based methods.

Result: Forest-Chat achieves strong performance on Forest-Change dataset and LEVIR-MCI-Trees subset for joint change detection and captioning, demonstrating the effectiveness of interactive, LLM-driven systems for forest change analysis.

Conclusion: The proposed LLM-driven RSICI system improves accessibility, interpretability, and analytical efficiency in forest change analysis, showing potential for enhancing forest monitoring workflows through natural language interaction and multi-task support.

Abstract: The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.

[116] Mirai: Autoregressive Visual Generation Needs Foresight

Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: Mirai framework injects future information (foresight) into autoregressive visual generators to improve global coherence and accelerate convergence without architecture changes or inference overhead.

DetailsMotivation: Autoregressive visual generators trained with strict next-token causality supervision suffer from diminished global coherence and slow convergence because each step is only optimized by its immediate next token, lacking foresight from later tokens.

Method: Proposes Mirai framework with two variants: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, while Mirai-I leverages implicit foresight from matched bidirectional representations. Both inject future information into AR training without architecture changes or extra inference overhead.

Result: Mirai significantly accelerates convergence (up to 10× faster for LlamaGen-B) and improves generation quality (reduces FID from 5.34 to 4.34 on ImageNet class-conditioned image generation benchmark).

Conclusion: Visual autoregressive models need foresight; aligning foresight to AR models’ internal representation on 2D image grids improves causality modeling, leading to faster convergence and better generation quality.

Abstract: Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models’ internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning “future” in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B’s convergence by up to 10× and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-conditional image generation benchmark. Our study highlights that visual autoregressive models need foresight.
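
A minimal sketch of explicit foresight supervision (the Mirai-E flavor): alongside next-token cross-entropy, each position also predicts tokens a few steps ahead through a separate head. The offsets, the 0.1 weight, and the linear heads are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def foresight_loss(hidden: torch.Tensor, next_head, future_head,
                   tokens: torch.Tensor, offsets=(2, 4)) -> torch.Tensor:
    # hidden: (B, T, D) AR features; tokens: (B, T) discrete image token ids.
    # next_head / future_head: hypothetical nn.Linear(D, vocab_size) heads.
    loss = F.cross_entropy(next_head(hidden[:, :-1]).flatten(0, 1),
                           tokens[:, 1:].flatten())        # standard AR loss
    for d in offsets:                                      # foresight terms
        logits = future_head(hidden[:, :-d])               # predict token t+d
        loss = loss + 0.1 * F.cross_entropy(logits.flatten(0, 1),
                                            tokens[:, d:].flatten())
    return loss
```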

[117] LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models

Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhingra, Seonghyeon Nam, Haitao Yang, Zhuo Hui, Christopher Metzler, Andrea Vedaldi, Hamed Pirsiavash, Lei Luo

Main category: cs.CV

TL;DR: LaVR uses implicit geometric knowledge from a 4D reconstruction model’s latent space to condition video generation, achieving SOTA results in video re-rendering without explicit reconstruction.

DetailsMotivation: Existing video re-rendering methods have two main problems: geometrically unconditioned models lack spatial awareness and suffer from drift/deformation, while geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them vulnerable to depth inaccuracies and calibration errors.

Method: The proposed LaVR method uses implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition video generation. These latents capture scene structure continuously without explicit reconstruction, providing a flexible representation that allows pretrained diffusion priors to regularize errors more effectively. The model jointly conditions on these latents and source camera poses.

Result: The model achieves state-of-the-art results on the video re-rendering task, demonstrating superior performance compared to existing methods.

Conclusion: Using implicit geometric knowledge from 4D reconstruction model latents provides an effective solution for video re-rendering, overcoming limitations of both geometrically unconditioned and conditioned approaches by offering spatial awareness without explicit reconstruction dependencies.

Abstract: Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/

[118] A comprehensive overview of deep learning models for object detection from videos/images

Sukana Zulfqar, Sadia Saeed, M. Azam Zia, Anjum Ali, Faisal Mehmood, Abid Ali

Main category: cs.CV

TL;DR: A comprehensive review of modern object detection techniques in video/image surveillance, covering architectural innovations, generative model integration, temporal information usage, and surveillance-specific challenges like dynamic environments and real-time requirements.

DetailsMotivation: To provide an updated survey of object detection in surveillance systems, addressing the rapid evolution driven by deep learning advancements and focusing on surveillance-specific challenges that earlier surveys may have overlooked.

Method: Classifies methods based on core architectures (CNN-based detectors), data processing strategies, and surveillance-specific challenges. Covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, examining preprocessing pipelines, feature extraction, and benchmarking datasets.

Result: The review evaluates current effectiveness of semantic object detection, analyzes deep learning models and their practical applications, and highlights how generative models support tasks like reconstructing missing frames, reducing occlusions, and normalizing illumination.

Conclusion: Identifies emerging trends in low-latency, efficient, and spatiotemporal learning approaches for future research, providing a comprehensive roadmap for advancing object detection in surveillance applications.

Abstract: Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.

[119] Transfer Learning from One Cancer to Another via Deep Learning Domain Adaptation

Justin Cheung, Samuel Savine, Calvin Nguyen, Lin Lu, Alhassan S. Yasin

Main category: cs.CV

TL;DR: Domain adversarial neural networks (DANNs) significantly improve cross-domain generalization for cancer histopathology classification compared to supervised models, with stain normalization effects varying by target domain.

DetailsMotivation: Supervised deep learning models in cancer histopathology fail to generalize to unseen cancer types despite shared morphological features, necessitating domain adaptation to address domain shift and annotation scarcity.

Method: Evaluated cross-domain classification among lung, colon, breast, and kidney adenocarcinomas using ResNet50, ensemble models, and domain adversarial neural networks (DANNs). Assessed impact of stain normalization and used Integrated Gradients for interpretability.

Result: Single-domain ResNet50 achieves >98% accuracy on own domain but minimal generalization to others. DANN trained on breast/colon and adapted to lung reaches 95.56% accuracy. Stain normalization effects vary: harmful for lung (95.56%→66.60%) but beneficial for breast/colon targets. Integrated Gradients show DANNs focus on biologically meaningful regions.

Conclusion: DANNs effectively transfer knowledge across cancer types, outperforming supervised models. Stain normalization impact is domain-dependent. Models learn clinically relevant features applicable to unlabeled cancer types.

Abstract: Supervised deep learning models often achieve excellent performance within their training distribution but struggle to generalize beyond it. In cancer histopathology, for example, a convolutional neural network (CNN) may classify cancer severity accurately for cancer types represented in its training data, yet fail on related but unseen types. Although adenocarcinomas from different organs share morphological features that might support limited cross-domain generalization, addressing domain shift directly is necessary for robust performance. Domain adaptation offers a way to transfer knowledge from labeled data in one cancer type to unlabeled data in another, helping mitigate the scarcity of annotated medical images. This work evaluates cross-domain classification performance among lung, colon, breast, and kidney adenocarcinomas. A ResNet50 trained on any single adenocarcinoma achieves over 98% accuracy on its own domain but shows minimal generalization to others. Ensembling multiple supervised models does not resolve this limitation. In contrast, converting the ResNet50 into a domain adversarial neural network (DANN) substantially improves performance on unlabeled target domains. A DANN trained on labeled breast and colon data and adapted to unlabeled lung data reaches 95.56% accuracy. We also examine the impact of stain normalization on domain adaptation. Its effects vary by target domain: for lung, accuracy drops from 95.56% to 66.60%, while for breast and colon targets, stain normalization boosts accuracy from 49.22% to 81.29% and from 78.48% to 83.36%, respectively. Finally, using Integrated Gradients reveals that DANNs consistently attribute importance to biologically meaningful regions such as densely packed nuclei, indicating that the model learns clinically relevant features and can apply them to unlabeled cancer types.
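
The core of a DANN is the gradient reversal layer (GRL): features flow forward unchanged, but gradients from the domain classifier are negated, pushing the encoder toward domain-invariant features. This is the standard construction shown in PyTorch; the lambda weighting follows common practice rather than this paper's specific schedule.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed gradient to the encoder

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)

# Usage: the encoder features feed two heads; only the domain head sees the GRL.
#   class_logits  = label_classifier(features)
#   domain_logits = domain_classifier(grad_reverse(features, lam))
```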

[120] FeedbackSTS-Det: Sparse Frames-Based Spatio-Temporal Semantic Feedback Network for Infrared Small Target Detection

Yian Huang, Qing Qin, Aji Mao, Xiangyu Qiu, Liang Xu, Xian Zhang, Zhenming Peng

Main category: cs.CV

TL;DR: FeedbackSTS-Det: A sparse frames-based spatio-temporal semantic feedback network for infrared small target detection that uses paired forward/backward refinement modules with structured sparse temporal modeling to capture long-range dependencies efficiently.

DetailsMotivation: Infrared small target detection under complex backgrounds is challenging due to low signal-to-clutter ratio, dynamic interference, and lack of distinct features. Existing multi-frame methods struggle with inefficient long-range dependency modeling and insufficient robustness.

Method: Proposes FeedbackSTS-Det with spatio-temporal semantic feedback strategy using paired forward/backward refinement modules across encoder-decoder. Includes embedded sparse semantic module (SSM) for structured sparse temporal modeling to capture long-range dependencies with low computational cost. Maintains consistent training-inference pipeline.

Result: Extensive experiments on multiple benchmark datasets confirm effectiveness. The method facilitates robust implicit inter-frame registration and continuous semantic refinement, effectively suppressing false alarms.

Conclusion: FeedbackSTS-Det addresses challenges in infrared small target detection through efficient spatio-temporal modeling with semantic feedback, achieving improved performance while maintaining computational efficiency.

Abstract: Infrared small target detection (ISTD) under complex backgrounds remains a critical yet challenging task, primarily due to the extremely low signal-to-clutter ratio, persistent dynamic interference, and the lack of distinct target features. While multi-frame detection methods leverage temporal cues to improve upon single-frame approaches, existing methods still struggle with inefficient long-range dependency modeling and insufficient robustness. To overcome these issues, we propose a novel scheme for ISTD, realized through a sparse frames-based spatio-temporal semantic feedback network named FeedbackSTS-Det. The core of our approach is a novel spatio-temporal semantic feedback strategy with a closed-loop semantic association mechanism, which consists of paired forward and backward refinement modules that work cooperatively across the encoder and decoder. Moreover, both modules incorporate an embedded sparse semantic module (SSM), which performs structured sparse temporal modeling to capture long-range dependencies with low computational cost. This integrated design facilitates robust implicit inter-frame registration and continuous semantic refinement, effectively suppressing false alarms. Furthermore, our overall procedure maintains a consistent training-inference pipeline, which ensures reliable performance transfer and increases model robustness. Extensive experiments on multiple benchmark datasets confirm the effectiveness of FeedbackSTS-Det. Code and models are available at: https://github.com/IDIP-Lab/FeedbackSTS-Det.

[121] RegFreeNet: A Registration-Free Network for CBCT-based 3D Dental Implant Planning

Xinquan Yang, Xuguang Li, Mianjie Zheng, Xuefen Liu, Kun Tang, Kian Ming Lim, He Meng, Jianfeng Ren, Linlin Shen

Main category: cs.CV

TL;DR: Proposes ImplantFairy dataset and RegFreeNet model for dental implant position prediction without requiring registration between pre- and post-implantation CBCT scans.

DetailsMotivation: Commercial surgical guide software doesn't export implant positions, requiring time-consuming registration of post-implantation data to pre-implantation space. This limits multi-center dataset construction and relies heavily on registration accuracy.

Method: Masks implants in post-implantation CBCT data to use any implant-containing scan as training data. Proposes RegFreeNet with neighboring distance perception (NDP) module for tooth area variation features and implant slope prediction branch for additional supervision.

Result: Creates ImplantFairy dataset with 1622 CBCT scans and voxel-level 3D annotations. RegFreeNet achieves state-of-the-art performance on ImplantFairy and two public datasets.

Conclusion: The registration-free approach enables large-scale multi-center dataset construction and achieves superior implant position prediction performance through innovative network design with slope-aware features.

Abstract: As commercial surgical guide design software usually does not support exporting the implant position for pre-implantation data, existing methods have to scan the post-implantation data and map the implant to pre-implantation space to get the label of implant position for training. Such a process is time-consuming and heavily relies on the accuracy of the registration algorithm. Moreover, not all hospitals have paired CBCT data, limiting the construction of multi-center datasets. Inspired by the way dentists determine the implant position based on the neighboring tooth texture, we found that even if the implant area is masked, it does not affect the determination of the implant position. Therefore, we propose to mask the implants in the post-implantation data so that any CBCT containing implants can be used as training data. This paradigm enables us to discard the registration process and makes it possible to construct a large-scale multi-center implant dataset. On this basis, we propose ImplantFairy, a comprehensive, publicly accessible dental implant dataset with voxel-level 3D annotations for 1622 CBCT scans. Furthermore, according to the area variation characteristics of the tooth’s spatial structure and the slope information of the implant, we designed a slope-aware implant position prediction network. Specifically, a neighboring distance perception (NDP) module is designed to adaptively extract tooth area variation features, and an implant slope prediction branch assists the network in learning more robust features through additional implant supervision information. Extensive experiments conducted on ImplantFairy and two public datasets demonstrate that the proposed RegFreeNet achieves state-of-the-art performance.
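
A minimal sketch of the registration-free labeling trick: mask the implant voxels out of a post-implantation CBCT so the network must infer position from the neighboring tooth texture alone, making any implant-bearing scan usable for training. The fill value and dilation margin are assumptions.

```python
import numpy as np
from scipy import ndimage

def mask_implant(volume: np.ndarray, implant_mask: np.ndarray,
                 dilate: int = 2, fill: float = 0.0) -> np.ndarray:
    # volume: (D, H, W) CBCT intensities; implant_mask: binary implant voxels.
    grown = ndimage.binary_dilation(implant_mask, iterations=dilate)
    out = volume.copy()
    out[grown] = fill   # erase the implant plus a small safety margin
    return out          # the implant's known position becomes the training label
```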

[122] LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

Chao Gao, Siqiao Xue, Yimin Peng, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou

Main category: cs.CV

TL;DR: LookBench is a live, holistic benchmark for fashion image retrieval in e-commerce that combines real product images and AI-generated fashion images with time-stamped test samples for contamination-aware evaluation.

DetailsMotivation: Current fashion image retrieval benchmarks are static and don't reflect real e-commerce dynamics, where trends change rapidly and retrieval needs vary from finding exact items to visual alternatives.

Method: Created LookBench with recent product images from live websites and AI-generated fashion images, time-stamped test samples, fine-grained attribute taxonomy, covering single-item and outfit-level retrieval, with planned semi-annual updates.

Result: LookBench is challenging - many strong baselines achieve below 60% Recall@1. Their proprietary model achieves best performance, with open-source counterpart ranking second, both achieving SOTA on legacy Fashion200K evaluations.

Conclusion: LookBench provides a durable, contamination-aware benchmark for fashion image retrieval that reflects real e-commerce needs, will be updated semi-annually to track progress, and is publicly released with leaderboard, dataset, code, and models.

Abstract: In this paper, we present LookBench (We use the term “look” to reflect retrieval that mirrors how people shop – finding the exact item, a close substitute, or a visually consistent alternative.), a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below 60% Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
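
A minimal sketch of contamination-aware evaluation: because every sample is time-stamped, a model with a declared training cutoff is scored only on samples created after that cutoff. The field names are assumptions, not LookBench's actual schema.

```python
from datetime import date

def uncontaminated(samples: list, cutoff: date) -> list:
    """Keep only test samples that postdate the model's training data."""
    return [s for s in samples if s["timestamp"] > cutoff]

samples = [{"id": 1, "timestamp": date(2025, 3, 1)},
           {"id": 2, "timestamp": date(2025, 9, 15)}]
print(uncontaminated(samples, date(2025, 6, 1)))  # only sample 2 survives
```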

[123] Context Patch Fusion With Class Token Enhancement for Weakly Supervised Semantic Segmentation

Yiyang Fu, Hui Li, Wangyu Wu

Main category: cs.CV

TL;DR: CPF-CTE framework improves WSSS by capturing contextual patch dependencies and using learnable class tokens to enhance feature representations and segmentation accuracy.

DetailsMotivation: Existing WSSS methods focus on inter-class distinctions and data augmentation but neglect complex contextual dependencies among image patches, leading to incomplete local representations and limited segmentation accuracy.

Method: Proposes CPF-CTE framework with two key components: 1) CF-BiLSTM module that captures spatial dependencies between patches through bidirectional information flow, and 2) learnable class tokens that dynamically encode and refine class-specific semantics.

Result: Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 show CPF-CTE consistently surpasses prior WSSS methods.

Conclusion: CPF-CTE effectively integrates spatial and semantic cues to produce richer and more accurate representations of image content, addressing the limitations of existing WSSS approaches.

Abstract: Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.
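
To make the CF-BiLSTM idea concrete, here is a minimal sketch (not the authors' code): patch tokens from a ViT-style encoder are treated as a sequence for a bidirectional LSTM, and learnable class tokens pool the context-fused patches per class. Feature dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class ContextPatchFusion(nn.Module):
    """Sketch of the CF-BiLSTM idea: a bidirectional LSTM propagates context
    between patch tokens; learnable class tokens attend over the fused patches."""
    def __init__(self, dim=384, num_classes=20):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, patches):                 # patches: (B, N, dim)
        ctx, _ = self.bilstm(patches)           # (B, N, dim) context-fused patches
        attn = torch.einsum('cd,bnd->bcn', self.class_tokens, ctx).softmax(-1)
        class_feats = torch.einsum('bcn,bnd->bcd', attn, ctx)  # per-class features
        return ctx, class_feats
```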

[124] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu

Main category: cs.CV

TL;DR: HERMES is a training-free architecture for real-time streaming video understanding that uses hierarchical KV cache memory to achieve efficient processing with minimal GPU overhead.

DetailsMotivation: Existing MLLMs struggle with streaming video inputs due to challenges in maintaining stable understanding performance, real-time responses, and low GPU memory overhead simultaneously.

Method: Proposes HERMES architecture that conceptualizes KV cache as hierarchical memory framework for video information across multiple granularities, reuses compact KV cache during inference, and requires no auxiliary computations for user queries.

Result: Achieves 10× faster TTFT (Time To First Token) compared to prior SOTA, reduces video tokens by up to 68% compared to uniform sampling while maintaining superior or comparable accuracy across benchmarks, with up to 11.4% gains on streaming datasets.

Conclusion: HERMES enables efficient real-time streaming video understanding with minimal resource requirements, addressing key limitations of existing MLLMs for continuous video stream interactions.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, and achieves 10$\times$ faster TTFT than the prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
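
The hierarchical-memory view of the KV cache can be illustrated with a toy two-level structure (a simplification; HERMES's actual granularities and merge rules are not specified here): recent frames keep fine-grained entries, while older ones are pooled into a compact long-term level so the cache stays bounded.

```python
import numpy as np
from collections import deque

class HierarchicalKVMemory:
    """Toy two-level KV memory: fine-grained entries for recent frames,
    mean-pooled coarse entries for older ones."""
    def __init__(self, short_cap=16, pool=4):
        self.short = deque()            # fine-grained (k, v) per recent frame
        self.long = []                  # coarse pooled entries
        self.short_cap, self.pool = short_cap, pool

    def add(self, k, v):
        self.short.append((k, v))
        if len(self.short) > self.short_cap:
            old = [self.short.popleft() for _ in range(self.pool)]
            k_pool = np.mean([k for k, _ in old], axis=0)
            v_pool = np.mean([v for _, v in old], axis=0)
            self.long.append((k_pool, v_pool))   # compact long-term memory

    def kv_for_attention(self):
        entries = self.long + list(self.short)
        return (np.stack([k for k, _ in entries]),
                np.stack([v for _, v in entries]))
```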

[125] Safeguarding Facial Identity against Diffusion-based Face Swapping via Cascading Pathway Disruption

Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo

Main category: cs.CV

TL;DR: VoidFace is a novel defense method against diffusion-based face swapping attacks that injects imperceptible perturbations to disrupt identity pathways, outperforming existing defenses while maintaining visual quality.

DetailsMotivation: The democratization of diffusion models for face swapping raises serious privacy and identity security concerns. Existing proactive defenses adapted from image editing attacks are ineffective against diffusion-based face swapping due to overlooking the structural resilience and unique static conditional guidance mechanisms in these systems.

Method: VoidFace treats face swapping as a coupled identity pathway and injects perturbations at critical bottlenecks. It uses: 1) localization disruption and identity erasure to degrade physical regression and semantic embeddings, 2) attention mechanism decoupling to sever identity injection, and 3) intermediate diffusion feature corruption to prevent source identity reconstruction. The method performs adversarial search in the latent manifold with perceptual adaptive strategy to balance attack potency and image quality.

Result: Extensive experiments show VoidFace outperforms existing defense methods across various diffusion-based face swapping models while producing adversarial faces with superior visual quality.

Conclusion: VoidFace provides an effective systemic defense against diffusion-based face swapping by targeting the unique structural characteristics of these systems, offering better protection than existing approaches while maintaining imperceptible perturbations.

Abstract: The rapid evolution of diffusion models has democratized face swapping but also raises concerns about privacy and identity security. Existing proactive defenses, often adapted from image editing attacks, prove ineffective in this context. We attribute this failure to an oversight of the structural resilience and the unique static conditional guidance mechanism inherent in face swapping systems. To address this, we propose VoidFace, a systemic defense method that views face swapping as a coupled identity pathway. By injecting perturbations at critical bottlenecks, VoidFace induces cascading disruption throughout the pipeline. Specifically, we first introduce localization disruption and identity erasure to degrade physical regression and semantic embeddings, thereby impairing the accurate modeling of the source face. We then intervene in the generative domain by decoupling attention mechanisms to sever identity injection, and corrupting intermediate diffusion features to prevent the reconstruction of source identity. To ensure visual imperceptibility, we perform adversarial search in the latent manifold, guided by a perceptual adaptive strategy to balance attack potency with image quality. Extensive experiments show that VoidFace outperforms existing defenses across various diffusion-based swapping models, while producing adversarial faces with superior visual quality.
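
While VoidFace's cascading design targets several specific bottlenecks, the underlying adversarial-search idea resembles a PGD loop against an identity encoder. The sketch below shows only that generic core, with `id_encoder` standing in for any differentiable identity network; it is not the paper's full method.

```python
import torch
import torch.nn.functional as F

def protect_face(image, id_encoder, steps=50, eps=8/255, alpha=1/255):
    """PGD-style sketch: learn an imperceptible perturbation that pushes the
    image's identity embedding away from its clean embedding, degrading the
    conditional guidance a face-swapping pipeline would extract."""
    clean_id = id_encoder(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.cosine_similarity(id_encoder(image + delta), clean_id).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # minimize identity similarity
            delta.clamp_(-eps, eps)              # keep perturbation imperceptible
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```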

[126] Enhancing Text-to-Image Generation via End-Edge Collaborative Hybrid Super-Resolution

Chongbin Yi, Yuxin Liang, Ziqi Zhou, Peng Yang

Main category: cs.CV

TL;DR: Proposes an end-edge collaborative framework that combines low-resolution text-to-image generation at the edge with region-aware hybrid super-resolution to reduce latency while maintaining image quality.

DetailsMotivation: High-resolution text-to-image generation faces a latency-fidelity tradeoff: lightweight super-resolution struggles with fine details while diffusion-based SR is computationally expensive. Need to balance Quality of Experience (QoE) with service latency.

Method: End-edge collaborative framework: 1) Generate low-resolution image at edge with adaptive denoising steps and SR scales, 2) Partition into patches, 3) Apply region-aware hybrid SR policy: diffusion-based SR for foreground patches (detail recovery) and lightweight learning-based SR for background patches (efficient upscaling), 4) Stitch enhanced patches into high-resolution image.

Result: System reduces service latency by 33% compared to baselines while maintaining competitive image quality.

Conclusion: The proposed end-edge collaborative generation-enhancement framework effectively addresses the latency-fidelity tradeoff in high-resolution text-to-image generation by intelligently allocating computational resources based on image region importance.

Abstract: Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users’ Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generations, achieving high-resolution output still faces the challenge of ensuring image fidelity at the cost of latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off: lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image based on adaptively selected denoising steps and super-resolution scales at the edge side, which is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced patches into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
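
The region-aware routing policy can be sketched as a simple dispatcher: patches whose saliency exceeds a threshold go to the expensive diffusion SR model, the rest to the lightweight one. The saliency-threshold rule and all names here are illustrative assumptions, not the paper's exact policy.

```python
def hybrid_super_resolve(lowres, saliency, diffusion_sr, light_sr,
                         patch=64, thresh=0.5):
    """Route foreground (high-saliency) patches to a costly diffusion SR model
    and background patches to a lightweight SR model. `lowres` and `saliency`
    are 2D numpy arrays; the two SR callables are placeholders."""
    h, w = lowres.shape[:2]
    out_patches = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = lowres[y:y+patch, x:x+patch]
            fg = saliency[y:y+patch, x:x+patch].mean() > thresh
            out_patches.append(((y, x), (diffusion_sr if fg else light_sr)(tile)))
    return out_patches   # caller stitches tiles into the high-res image
```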

[127] SimD3: A Synthetic drone Dataset with Payload and Bird Distractor Modeling for Robust Detection

Ami Pandat, Kanyala Muvva, Punna Rajasekhar, Gopika Vinod, Rohit Shukla

Main category: cs.CV

TL;DR: SimD3 is a large-scale synthetic drone detection dataset with realistic distractors and diverse environments, used to train improved YOLOv5 models that outperform baselines in cross-dataset evaluations.

DetailsMotivation: Drone detection faces challenges due to limited real-world annotated data, large appearance variability, and visually similar distractors like birds. Existing synthetic datasets lack realistic modeling of these complexities.

Method: Created SimD3 dataset using Unreal Engine 5 with heterogeneous drone payloads, multiple bird species as distractors, and diverse environments with controlled weather/lighting. Developed attention-enhanced YOLOv5 variant (Yolov5m+C3b) by replacing standard C3 blocks with C3b modules.

Result: SimD3 provides effective supervision for small-object drone detection. Yolov5m+C3b consistently outperforms baseline across in-domain and cross-dataset evaluations, showing improved robustness and generalization to unseen real-world benchmarks.

Conclusion: SimD3 is a valuable resource for training and benchmarking robust drone detection models under diverse challenging conditions, demonstrating the utility of high-fidelity synthetic data with realistic distractors for improving detection performance.

Abstract: Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360 six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed Yolov5m+C3b, where standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that Yolov5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.

[128] ReinPath: A Multimodal Reinforcement Learning Approach for Pathology

Kangcheng Zhou, Jun Jiang, Qing Zhang, Shuang Zheng, Qingli Li, Shugong Xu

Main category: cs.CV

TL;DR: A novel multimodal pathology LLM with strong reasoning capabilities, using semantic reward strategy and group relative policy optimization, outperforms SOTA methods on a new high-quality VQA dataset even with only 20% training data.

DetailsMotivation: Existing multimodal methods in computational pathology have limited interpretability, both because high-quality datasets supporting explicit reasoning are lacking and because their reasoning processes are overly simple.

Method: Developed a multimodal pathology large language model with semantic reward strategy integrated with group relative policy optimization, and constructed a high-quality pathology VQA dataset for complex reasoning tasks.

Result: Outperforms state-of-the-art methods on the new VQA dataset, even when trained with only 20% of the data. Achieves comparable performance to CLIP on downstream zero-shot image classification.

Conclusion: The proposed multimodal pathology LLM with enhanced reasoning capabilities and semantic reward strategy effectively addresses interpretability limitations in computational pathology.

Abstract: Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological images and corresponding text data. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference, and due to overly simple reasoning processes. To address these problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We construct a high-quality pathology visual question answering (VQA) dataset, specifically designed to support complex reasoning tasks. Comprehensive experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the data. Our method also achieves performance comparable to CLIP on the downstream zero-shot image classification task.
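
Group relative policy optimization, which the paper pairs with a semantic reward, standardizes each sampled answer's reward within its own group to obtain advantages. A minimal sketch of that advantage computation (the reward values are made up):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: several sampled answers to one prompt are
    scored (here, by a semantic-similarity reward), and each sample's
    advantage is its reward standardized within its own group."""
    r = np.asarray(rewards, dtype=float)          # shape (group_size,)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. semantic rewards for 4 sampled answers to one VQA prompt
adv = grpo_advantages([0.82, 0.41, 0.77, 0.30])   # positive for above-average answers
```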

[129] Using Multi-Instance Learning to Identify Unique Polyps in Colon Capsule Endoscopy Images

Puneet Sharma, Kristian Dalsbø Hindberg, Eibe Frank, Benedicte Schelde-Olesen, Ulrik Deding

Main category: cs.CV

TL;DR: This paper proposes a multi-instance learning approach with attention mechanisms for identifying unique polyps in colon capsule endoscopy images, achieving 86.26% accuracy.

DetailsMotivation: Identifying unique polyps in CCE images is challenging due to large image volumes, cognitive load on clinicians, and labeling ambiguity. Automated solutions are needed to assist medical personnel.

Method: Formulates polyp uniqueness as a multi-instance learning task using a multi-instance verification framework with attention mechanisms (VEMA and DBA) and self-supervised learning via SimCLR for robust embeddings.

Result: Attention mechanisms significantly improve performance, with DBA L1 achieving highest test accuracy of 86.26% and AUC of 0.928 using ConvNeXt backbone with SimCLR pretraining on 1912 polyps from 754 patients.

Conclusion: The study demonstrates the potential of MIL and self-supervised learning for automated CCE image analysis, with broader implications for medical imaging applications.

Abstract: Identifying unique polyps in colon capsule endoscopy (CCE) images is a critical yet challenging task for medical personnel due to the large volume of images, the cognitive load it creates for clinicians, and the ambiguity in labeling specific frames. This paper formulates this problem as a multi-instance learning (MIL) task, where a query polyp image is compared with a target bag of images to determine uniqueness. We employ a multi-instance verification (MIV) framework that incorporates attention mechanisms, such as variance-excited multi-head attention (VEMA) and distance-based attention (DBA), to enhance the model’s ability to extract meaningful representations. Additionally, we investigate the impact of self-supervised learning using SimCLR to generate robust embeddings. Experimental results on a dataset of 1912 polyps from 754 patients demonstrate that attention mechanisms significantly improve performance, with DBA L1 achieving the highest test accuracy of 86.26% and a test AUC of 0.928 using a ConvNeXt backbone with SimCLR pretraining. This study underscores the potential of MIL and self-supervised learning in advancing automated analysis of Colon Capsule Endoscopy images, with implications for broader medical imaging applications.
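
A minimal sketch of distance-based attention (DBA-style) pooling for the verification setup: the query polyp embedding attends to the bag with weights derived from negative L1 distances, and the pooled bag feature is compared back to the query. The exact distances, norms, and heads in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distance_based_attention(query, bag):
    """query: (D,) embedding of the query polyp; bag: (N, D) embeddings of
    candidate frames. Returns a similarity score; a high value suggests the
    polyp already appears in the bag (i.e., it is not unique)."""
    d = (bag - query.unsqueeze(0)).abs().sum(dim=1)      # L1 distance per instance
    w = F.softmax(-d, dim=0)                             # closer instances weigh more
    pooled = (w.unsqueeze(1) * bag).sum(dim=0)
    return F.cosine_similarity(pooled, query, dim=0)
```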

[130] Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis

Keita Takeda, Tomoya Sakai

Main category: cs.CV

TL;DR: Medical VLMs learn diagnostically relevant features, but non-medical VLMs with enhanced text encoders (like LLM2CLIP) produce more refined representations, suggesting text encoder improvement is more crucial than intensive medical image training.

DetailsMotivation: Medical VLMs are expected to capture diagnostically relevant features, but their learned representations remain underexplored. Standard evaluations like classification accuracy don't fully reveal if they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis.

Method: Analyzed feature distributions of multiple image modalities extracted by representative medical VLMs across lesion classification datasets on multiple modalities. Compared these distributions with non-medical VLMs to assess domain-specific medical training impact.

Result: Medical VLMs can extract discriminative features effective for medical classification tasks. However, non-medical VLMs with recent contextual enrichment improvements (like LLM2CLIP) produce more refined feature representations. Non-medical models are particularly vulnerable to biases from overlaid text strings on images.

Conclusion: Enhancing text encoder is more crucial than training intensively on medical images when developing medical VLMs. Careful model selection is needed based on downstream tasks, with attention to potential risks from background biases like textual information in images.

Abstract: This study investigates the feature representations produced by publicly available open-source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distributions of multiple image modalities extracted by representative medical VLMs across lesion classification datasets spanning multiple modalities. These distributions were compared with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, we found that non-medical VLMs with recent contextual-enrichment improvements, such as LLM2CLIP, produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need for careful model selection according to downstream tasks, as well as attention to potential inference risks from background biases such as textual information in images.
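
Feature-distribution analyses of frozen VLM embeddings typically reduce to measures like cluster separability and linear-probe accuracy; the helper below sketches both with scikit-learn, as one plausible way to operationalize the paper's comparison (not its exact protocol).

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_features(features, labels):
    """Two simple views on how discriminative frozen VLM features are:
    cluster separability (silhouette) and 5-fold linear-probe accuracy."""
    sil = silhouette_score(features, labels)
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          features, labels, cv=5).mean()
    return sil, acc
```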

[131] M2I2HA: A Multi-modal Object Detection Method Based on Intra- and Inter-Modal Hypergraph Attention

Xiaofan Yang, Yubin Liu, Wei Pan, Guoqing Chu, Junming Zhang, Jie Zhao, Zhuoqi Man, Xuanming Cao

Main category: cs.CV

TL;DR: M2I2HA: A hypergraph-based multi-modal perception network that addresses limitations of CNNs, Transformers, and SSMs for object detection by capturing high-order relationships within and across modalities.

DetailsMotivation: Current multi-modal detection methods face challenges in extracting task-relevant information and achieving precise cross-modal alignment. CNNs have limited receptive fields, Transformers have quadratic complexity, and SSMs disrupt spatial structures when flattening 2D to 1D sequences.

Method: Proposes M2I2HA with three key modules: 1) Intra-Hypergraph Enhancement to capture global many-to-many high-order relationships within each modality, 2) Inter-Hypergraph Fusion to align and fuse cross-modal features by bridging configuration and spatial gaps, and 3) M2-FullPAD for adaptive multi-level fusion of enhanced features while improving data distribution and flow.

Result: Extensive object detection experiments on multiple public datasets demonstrate state-of-the-art performance in multi-modal object detection tasks compared to baselines.

Conclusion: M2I2HA effectively addresses limitations of existing architectures by leveraging hypergraph theory to model complex high-order dependencies within and across modalities, achieving superior multi-modal object detection performance.

Abstract: Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose a multi-modal perception network based on hypergraph theory called M2I2HA. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module to enable adaptive multi-level fusion of multi-modal enhanced features within the network, while enhancing data distribution and flow across the architecture. Extensive object detection experiments on multiple public datasets against baselines demonstrate that M2I2HA achieves state-of-the-art performance in multi-modal object detection tasks.
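
The node→hyperedge→node pattern behind hypergraph attention can be shown in a few lines: each node's k nearest neighbours form a hyperedge, hyperedge features are pooled from their members, and nodes attend over hyperedges to receive many-to-many context. This is a generic sketch, not the paper's exact modules.

```python
import torch

def hypergraph_attention(x, k=8):
    """x: (N, D) node features. Each node's k nearest neighbours form one
    hyperedge; features flow node -> hyperedge -> node, letting each node
    aggregate many-to-many (not just pairwise) context."""
    d = torch.cdist(x, x)                         # (N, N) pairwise distances
    idx = d.topk(k, largest=False).indices        # (N, k) members per hyperedge
    edge_feat = x[idx].mean(dim=1)                # hyperedge = mean of its nodes
    # node update: attend over all hyperedges (a simplification)
    attn = torch.softmax(x @ edge_feat.T / x.shape[1] ** 0.5, dim=1)
    return x + attn @ edge_feat                   # residual high-order update
```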

[132] FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Zhenhua Ling

Main category: cs.CV

TL;DR: FunCineForge: An end-to-end pipeline for creating large-scale dubbing datasets and an MLLM-based dubbing model that outperforms SOTA methods across various cinematic scenes.

DetailsMotivation: Existing movie dubbing methods face two major limitations: (1) limited high-quality multimodal dubbing datasets with issues like high word error rates, sparse annotations, costly manual labeling, and restriction to monologue scenes; (2) existing models rely solely on lip region for audio-visual alignment, limiting applicability to complex live-action scenes and showing suboptimal performance in lip sync, speech quality, and emotional expressiveness.

Method: Proposes FunCineForge with two components: (1) an end-to-end production pipeline for creating large-scale dubbing datasets with rich annotations, and (2) an MLLM-based dubbing model designed for diverse cinematic scenes. The pipeline constructs the first Chinese television dubbing dataset with comprehensive annotations.

Result: Experiments across monologue, narration, dialogue, and multi-speaker scenes show that the dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. The dataset quality is demonstrated to be high.

Conclusion: FunCineForge addresses key limitations in movie dubbing by providing a scalable dataset creation pipeline and an advanced MLLM-based model that achieves superior performance across diverse cinematic scenarios, advancing the state of multimodal dubbing technology.

Abstract: Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.

[133] Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation

Yifei Liu, Changxing Ding, Ling Guo, Huaiguang Jiang, Qiong Cao

Main category: cs.CV

TL;DR: RAM introduces a motion latent space with reconstruction supervision and error guidance to improve text-to-motion diffusion models by addressing representational gaps and error propagation issues.

DetailsMotivation: Current motion diffusion models suffer from two key limitations: 1) representational gap due to pre-trained text encoders lacking motion-specific information, and 2) error propagation during iterative denoising processes.

Method: RAM uses a motion latent space as intermediate supervision with co-training of a motion reconstruction branch featuring self-regularization and motion-centric latent alignment. It also introduces Reconstructive Error Guidance (REG) that exploits the diffusion model’s self-correction ability by reconstructing previous estimates and amplifying residuals to highlight improvements.

Result: Extensive experiments demonstrate significant improvements and state-of-the-art performance in text-to-motion generation tasks.

Conclusion: RAM effectively addresses representational gaps and error propagation in motion diffusion models through motion latent space supervision and reconstructive error guidance, achieving superior performance.

Abstract: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model’s inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
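
Reconstructive Error Guidance reduces, at its core, to amplifying the residual between the current prediction and a reconstruction of the previous estimate. A one-function sketch, with `reconstruct` standing in for the co-trained reconstruction branch and `w` an assumed guidance weight:

```python
def reg_guidance(x_pred, prev_estimate, reconstruct, w=0.3):
    """Reconstructive Error Guidance sketch: the reconstruction branch replays
    the previous estimate's error pattern; amplifying the residual between the
    current prediction and that reconstruction pushes the sample further along
    the direction in which it just improved."""
    recon_prev = reconstruct(prev_estimate)     # reproduces prior error pattern
    return x_pred + w * (x_pred - recon_prev)   # amplified residual as guidance
```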

[134] Synthetic Data Augmentation for Multi-Task Chinese Porcelain Classification: A Stable Diffusion Approach

Ziyao Ling, Silvia Mirri, Paola Salomoni, Giovanni Delnevo

Main category: cs.CV

TL;DR: Synthetic images from Stable Diffusion with LoRA can effectively augment limited real datasets for Chinese porcelain classification, with task-specific improvements ranging from 3-5.5% F1-macro scores depending on classification task.

DetailsMotivation: Deep learning faces fundamental challenges in archaeological artifact classification due to scarcity of training data, especially for rare Chinese porcelain types. There's a need to explore whether synthetic data can effectively augment limited real datasets.

Method: Used Stable Diffusion with Low-Rank Adaptation (LoRA) to generate synthetic images, then trained MobileNetV3 with transfer learning. Conducted controlled experiments comparing pure real data vs. mixed real-synthetic datasets (95:5 and 90:10 ratios) across four classification tasks: dynasty, glaze, kiln, and type identification.

Result: Task-specific benefits observed: type classification showed most substantial improvement (5.5% F1-macro increase with 90:10 ratio), while dynasty and kiln tasks exhibited modest gains (3-4%). Synthetic augmentation effectiveness depends on alignment between generated features and task-relevant visual signatures.

Conclusion: Synthetic data augmentation using generative AI has practical potential for archaeological research, but effectiveness varies by task and must balance archaeological authenticity with data diversity. Provides guidelines for deploying generative AI in archaeological contexts.

Abstract: The scarcity of training data presents a fundamental challenge in applying deep learning to archaeological artifact classification, particularly for the rare types of Chinese porcelain. This study investigates whether synthetic images generated through Stable Diffusion with Low-Rank Adaptation (LoRA) can effectively augment limited real datasets for multi-task CNN-based porcelain classification. Using MobileNetV3 with transfer learning, we conducted controlled experiments comparing models trained on pure real data against those trained on mixed real-synthetic datasets (95:5 and 90:10 ratios) across four classification tasks: dynasty, glaze, kiln and type identification. Results demonstrate task-specific benefits: type classification showed the most substantial improvement (5.5% F1-macro increase with 90:10 ratio), while dynasty and kiln tasks exhibited modest gains (3-4%), suggesting that synthetic augmentation effectiveness depends on the alignment between generated features and task-relevant visual signatures. Our work contributes practical guidelines for deploying generative AI in archaeological research, demonstrating both the potential and limitations of synthetic data when archaeological authenticity must be balanced with data diversity.
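
Generating the synthetic split with diffusers would look roughly like this; the LoRA weight path, prompt, and sample count are placeholders, and the 90:10 real:synthetic mixing the paper reports as most effective for type labels happens downstream in the classifier's data loader.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical LoRA weights fine-tuned on porcelain photos; the path is a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/porcelain-lora")

prompt = "a Qing dynasty blue-and-white porcelain vase, museum photograph"
synthetic = [pipe(prompt).images[0] for _ in range(8)]   # synthetic augmentation pool
```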

[135] UniRoute: Unified Routing Mixture-of-Experts for Modality-Adaptive Remote Sensing Change Detection

Qingling Shu, Sibao Chen, Wei Lu, Zhihui You, Chengzhuang Liu

Main category: cs.CV

TL;DR: UniRoute is a unified framework for modality-adaptive remote sensing change detection that uses conditional routing to handle both homogeneous and heterogeneous image pairs, achieving strong performance across diverse datasets.

DetailsMotivation: Current change detection methods rely on specialized models that don't scale well across different modalities. Homogeneous CD needs fine spatial details while heterogeneous CD requires broader context, and existing difference operators work poorly with cross-modal or misaligned images.

Method: Proposes UniRoute with: 1) AR2-MoE module for adaptive receptive field routing to separate local details from global context, 2) MDR-MoE module for modality-aware difference routing to select optimal fusion operations per pixel, and 3) CASD strategy for consistency-aware self-distillation to stabilize training with scarce heterogeneous data.

Result: Extensive experiments on five public datasets show UniRoute achieves strong overall performance with favorable accuracy-efficiency trade-off in unified deployment settings.

Conclusion: UniRoute provides a unified, modality-adaptive framework that effectively handles both homogeneous and heterogeneous change detection through conditional routing and self-distillation, overcoming limitations of specialized models.

Abstract: Current remote sensing change detection (CD) methods mainly rely on specialized models, which limits the scalability toward modality-adaptive Earth observation. For homogeneous CD, precise boundary delineation relies on fine-grained spatial cues and local pixel interactions, whereas heterogeneous CD instead requires broader contextual information to suppress speckle noise and geometric distortions. Moreover, a difference operator (e.g., subtraction) works well for aligned homogeneous images but introduces artifacts in cross-modal or geometrically misaligned scenarios. Across different modality settings, specialized models based on static backbones or fixed difference operations often prove insufficient. To address this challenge, we propose UniRoute, a unified framework for modality-adaptive learning by reformulating feature extraction and fusion as conditional routing problems. We introduce an Adaptive Receptive Field Routing MoE (AR2-MoE) module to disentangle local spatial details from global semantic context, and a Modality-Aware Difference Routing MoE (MDR-MoE) module to adaptively select the most suitable fusion primitive at each pixel. In addition, we propose a Consistency-Aware Self-Distillation (CASD) strategy that stabilizes unified training under data-scarce heterogeneous settings by enforcing multi-level consistency. Extensive experiments on five public datasets demonstrate that UniRoute achieves strong overall performance, with a favorable accuracy-efficiency trade-off under a unified deployment setting.
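
The modality-aware difference routing idea, reduced to essentials: a 1×1-conv gate predicts per-pixel weights over a few fusion primitives, and the output is their soft mixture. The three primitives chosen below are an assumption; the paper's expert set may differ.

```python
import torch
import torch.nn as nn

class DifferenceRouter(nn.Module):
    """Per-pixel soft routing over fusion primitives (subtraction, addition,
    element-wise product) instead of a single fixed difference operator."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(2 * dim, 3, kernel_size=1)   # 3 primitives

    def forward(self, a, b):                     # a, b: (B, C, H, W)
        w = self.gate(torch.cat([a, b], dim=1)).softmax(dim=1)   # (B, 3, H, W)
        prims = torch.stack([a - b, a + b, a * b], dim=1)        # (B, 3, C, H, W)
        return (w.unsqueeze(2) * prims).sum(dim=1)               # (B, C, H, W)
```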

[136] UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking

Qihua Liang, Liang Chen, Yaozong Zheng, Jian Nong, Zhiyi Mo, Bineng Zhong

Main category: cs.CV

TL;DR: UBATrack is a novel multi-modal tracking framework using Mamba-style state space models with two key modules: Spatio-temporal Mamba Adapter for cross-modal dependencies and spatio-temporal cues, and Dynamic Multi-modal Feature Mixer for enhanced representation, achieving SOTA results without full fine-tuning.

DetailsMotivation: Current multi-modal trackers using prompt learning overlook effective capture of spatio-temporal cues, limiting their performance despite integrating multiple complementary inputs like thermal, depth, and event data.

Method: UBATrack framework with two modules: 1) Spatio-temporal Mamba Adapter (STMA) leverages Mamba’s long-sequence modeling to jointly model cross-modal dependencies and spatio-temporal cues via adapter-tuning, 2) Dynamic Multi-modal Feature Mixer enhances multi-modal representation across feature dimensions to improve tracking robustness.

Result: Outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.

Conclusion: UBATrack effectively captures spatio-temporal cues while eliminating costly full-parameter fine-tuning, improving training efficiency and achieving superior multi-modal tracking performance across various modalities.

Abstract: Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba’s long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.

[137] LocBAM: Advancing 3D Patch-Based Image Segmentation by Integrating Location Context

Donnate Hooft, Stefan M. Fischer, Cosmin Bercea, Jan C. Peeken, Julia A. Schnabel

Main category: cs.CV

TL;DR: LocBAM: A novel attention mechanism that incorporates location context into patch-based 3D medical image segmentation to improve performance when anatomical context is important.

DetailsMotivation: Patch-based methods for 3D medical image segmentation often ignore patch location within the global volume, limiting performance when anatomical context matters. This location information is crucial for accurate segmentation.

Method: Proposes LocBAM, a novel attention mechanism that explicitly processes spatial information and location context in patch-based segmentation. It incorporates patch location awareness into the model architecture.

Result: Experiments on BTCV, AMOS22, and KiTS23 datasets show that location context stabilizes training and improves segmentation performance, especially under low patch-to-volume coverage where global context is missing. LocBAM consistently outperforms classical coordinate encoding via CoordConv.

Conclusion: Incorporating location context through LocBAM enhances patch-based 3D medical image segmentation, addressing limitations of traditional patch-based methods that neglect spatial positioning within the global volume.

Abstract: Patch-based methods are widely used in 3D medical image segmentation to address memory constraints in processing high-resolution volumetric data. However, these approaches often neglect the patch’s location within the global volume, which can limit segmentation performance when anatomical context is important. In this paper, we investigate the role of location context in patch-based 3D segmentation and propose a novel attention mechanism, LocBAM, that explicitly processes spatial information. Experiments on BTCV, AMOS22, and KiTS23 demonstrate that incorporating location context stabilizes training and improves segmentation performance, particularly under low patch-to-volume coverage where global context is missing. Furthermore, LocBAM consistently outperforms classical coordinate encoding via CoordConv. Code is publicly available at https://github.com/compai-lab/2026-ISBI-hooft
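
One simple way to inject patch location, in the spirit of (but not identical to) LocBAM: map the patch's normalized origin in the full volume through a small MLP and use it to gate features channel-wise, in contrast to CoordConv's appended coordinate channels. Module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LocationContext(nn.Module):
    """Gate 3D patch features with the patch's normalized origin in the full
    volume, so the network knows where the patch sits anatomically."""
    def __init__(self, channels):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, channels), nn.ReLU(),
                                 nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feats, patch_origin):   # feats: (B,C,D,H,W), origin in [0,1]^3
        gate = self.mlp(patch_origin)         # (B, C) channel-wise gate
        return feats * gate[:, :, None, None, None]
```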

[138] Symmetry Informative and Agnostic Feature Disentanglement for 3D Shapes

Tobias Weißberg, Weikang Wang, Paul Roetzer, Nafie El Amrani, Florian Bernard

Main category: cs.CV

TL;DR: Proposes a feature disentanglement approach for 3D shapes that creates both symmetry-informative and symmetry-agnostic descriptors, with refinement to improve robustness, outperforming state-of-the-art methods in symmetry-related tasks.

DetailsMotivation: Existing symmetry-aware shape descriptors have limitations: recent method χ extracts only 1D symmetry-informative features, missing other semantic information, and produces noisy results with misclassified patches. Need for more comprehensive and robust symmetry-aware descriptors.

Method: Feature disentanglement approach that simultaneously extracts symmetry-informative and symmetry-agnostic features from shape descriptors. Includes feature refinement technique to improve robustness of predicted symmetry-informative features.

Result: Extensive experiments show effectiveness in intrinsic symmetry detection, left/right classification, and shape matching. Outperforms various state-of-the-art methods both qualitatively and quantitatively.

Conclusion: Proposed framework successfully addresses limitations of previous symmetry-aware descriptors by providing comprehensive feature disentanglement and refinement, demonstrating superior performance in symmetry-related shape analysis tasks.

Abstract: Shape descriptors, i.e., per-vertex features of 3D meshes or point clouds, are fundamental to shape analysis. Historically, various handcrafted geometry-aware descriptors and feature refinement techniques have been proposed. Recently, several studies have initiated a new research direction by leveraging features from image foundation models to create semantics-aware descriptors, demonstrating advantages across tasks like shape matching, editing, and segmentation. Symmetry, another key concept in shape analysis, has also attracted increasing attention. Consequently, constructing symmetry-aware shape descriptors is a natural progression. Although the recent method $\chi$ (Wang et al., 2025) successfully extracted symmetry-informative features from semantics-aware descriptors, its features are only one-dimensional, neglecting other valuable semantic information. Furthermore, the extracted symmetry-informative feature is usually noisy and yields small misclassified patches. To address these gaps, we propose a feature disentanglement approach that is simultaneously symmetry-informative and symmetry-agnostic. Further, we propose a feature refinement technique to improve the robustness of predicted symmetry-informative features. Extensive experiments, including intrinsic symmetry detection, left/right classification, and shape matching, demonstrate the effectiveness of our proposed framework compared to various state-of-the-art methods, both qualitatively and quantitatively.

[139] POTR: Post-Training 3DGS Compression

Bert Ramlot, Martijn Courteaux, Peter Lambert, Glenn Van Wallendael

Main category: cs.CV

TL;DR: POTR is a post-training compression codec for 3D Gaussian Splatting that reduces storage requirements and accelerates inference through novel pruning and lighting coefficient optimization techniques.

DetailsMotivation: 3D Gaussian Splatting (3DGS) outperforms NeRF in speed but has substantially higher storage requirements, creating a need for efficient compression methods that maintain performance while reducing storage needs.

Method: POTR introduces two novel techniques: 1) A pruning approach using a modified 3DGS rasterizer to efficiently calculate each splat’s removal effect simultaneously, and 2) A method to recompute lighting coefficients to reduce entropy without training. Also includes optional fine-tuning for further enhancement.

Result: POTR achieves 2-4x fewer splats than other post-training pruning techniques, 1.5-2x faster inference than other compressed models, increases AC lighting coefficient sparsity from 70% to 97% with minimal quality loss, and consistently outperforms all other post-training compression techniques in rate-distortion performance and inference speed.

Conclusion: POTR provides an effective post-training compression solution for 3DGS that significantly reduces storage requirements while maintaining quality and accelerating inference, making 3DGS more practical for real-world applications.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel pruning approach that uses a modified 3DGS rasterizer to efficiently calculate every splat’s individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training pruning techniques and as a result also significantly accelerates inference with experiments demonstrating 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient sparsity, with experiments demonstrating increases from 70% to 97%, with minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance pruning, inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training compression techniques in both rate-distortion performance and inference speed.
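
The sparsity gain on AC lighting coefficients can be illustrated with a deliberately simplified stand-in for POTR's approach: thresholding near-zero AC spherical-harmonic coefficients per splat while leaving the DC (base color) term intact. The actual method recomputes the coefficients rather than merely thresholding them.

```python
import numpy as np

def sparsify_ac_coeffs(sh_coeffs, thresh=1e-3):
    """sh_coeffs: (N, K+1, 3) spherical-harmonic color coefficients per splat.
    Zero out near-zero AC coefficients to raise sparsity (and hence
    compressibility) while keeping the DC term, i.e. the base color."""
    dc, ac = sh_coeffs[:, :1], sh_coeffs[:, 1:].copy()
    ac[np.abs(ac) < thresh] = 0.0
    sparsity = (ac == 0).mean()
    return np.concatenate([dc, ac], axis=1), sparsity
```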

[140] Multimodal system for skin cancer detection

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

Main category: cs.CV

TL;DR: Multi-modal melanoma detection system using conventional photos and metadata achieves high accuracy without specialized equipment, making detection more accessible.

DetailsMotivation: Current deep learning models for melanoma detection rely on dermoscopic images requiring specialized equipment, limiting clinical accessibility. Need for more versatile, equipment-independent solutions suitable for broader healthcare settings.

Method: Multi-modal neural network combining conventional photo images with tabular metadata (demographics, lesion characteristics). Two-step model supports cases with/without metadata. Three-stage pipeline with boosting algorithms. Techniques for handling imbalanced datasets. Ablation study of vision architectures, boosting algorithms, and loss functions.

Result: Achieved Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Integration of photo images with metadata in structured pipeline yields significant performance improvements.

Conclusion: System provides scalable, equipment-independent melanoma detection solution suitable for diverse healthcare environments, bridging gap between specialized and general clinical practices.

Abstract: Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions with boosting algorithms, enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
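
The two-branch fusion at the heart of the system can be sketched as a CNN image encoder concatenated with a metadata MLP; the backbone choice, metadata width, and zeroing convention for metadata-free cases are all assumptions here, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class SkinLesionNet(nn.Module):
    """Two-branch sketch: a CNN encodes the conventional photo, an MLP encodes
    tabular metadata (age, sex, lesion site, ...), and a head scores the fused
    vector. Metadata-free cases can pass a zero vector."""
    def __init__(self, n_meta=16):
        super().__init__()
        self.cnn = tvm.resnet18(weights=None)
        self.cnn.fc = nn.Identity()                       # 512-d image feature
        self.meta = nn.Sequential(nn.Linear(n_meta, 64), nn.ReLU())
        self.head = nn.Linear(512 + 64, 1)

    def forward(self, image, meta):
        z = torch.cat([self.cnn(image), self.meta(meta)], dim=1)
        return self.head(z)                               # melanoma logit
```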

[141] MTFlow: Time-Conditioned Flow Matching for Microtubule Segmentation in Noisy Microscopy Images

Sidi Mohamed Sid El Moctar, Achraf Ait Laydi, Yousef El Mourabit, Hélène Bouvrais

Main category: cs.CV

TL;DR: MTFlow: A time-conditioned flow-matching model for microtubule segmentation that learns vector fields to iteratively refine noisy masks toward ground truth, achieving competitive accuracy on microtubule and other curvilinear biomedical structures.

DetailsMotivation: Microtubules are essential cytoskeletal filaments and therapeutic targets, but their segmentation is challenging due to filament curvature, dense crossings, and image noise. Current methods struggle with these complexities, creating a need for more accurate segmentation tools.

Method: MTFlow uses a time-conditioned flow-matching model that learns vector fields to iteratively transport noisy masks toward ground truth. It combines a U-Net backbone with temporal embeddings to capture dynamics of uncertainty resolution along filament boundaries.

Result: MTFlow achieves competitive segmentation accuracy comparable to state-of-the-art models on synthetic and real microtubule datasets. It also generalizes well to other curvilinear biomedical structures like retinal blood vessels and nerves.

Conclusion: MTFlow provides a powerful, time-efficient tool for filamentous structure analysis with more precise annotations than manual or semi-automatic approaches, offering interpretable, trajectory-based refinement for microtubule segmentation.

Abstract: Microtubules are cytoskeletal filaments that play essential roles in many cellular processes and are key therapeutic targets in several diseases. Accurate segmentation of microtubule networks is critical for studying their organization and dynamics but remains challenging due to filament curvature, dense crossings, and image noise. We present MTFlow, a novel time-conditioned flow-matching model for microtubule segmentation. Unlike conventional U-Net variants that predict masks in a single pass, MTFlow learns vector fields that iteratively transport noisy masks toward the ground truth, enabling interpretable, trajectory-based refinement. Our architecture combines a U-Net backbone with temporal embeddings, allowing the model to capture the dynamics of uncertainty resolution along filament boundaries. We trained and evaluated MTFlow on synthetic and real microtubule datasets and assessed its generalization capability on public biomedical datasets of curvilinear structures such as retinal blood vessels and nerves. MTFlow achieves competitive segmentation accuracy comparable to state-of-the-art models, offering a powerful and time-efficient tool for filamentous structure analysis with more precise annotations than manual or semi-automatic approaches.
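
A generic conditional flow-matching training step of the kind MTFlow builds on: sample t, form the linear interpolant between a noisy mask and the ground truth, and regress the constant velocity x1 − x0 with a time-conditioned U-Net. A minimal sketch under those assumptions, not the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, image, gt_mask):
    """One flow-matching step. `model` is a time-conditioned U-Net taking
    (image, x_t, t); `gt_mask` is a float tensor of shape (B, 1, H, W)."""
    x0 = torch.randn_like(gt_mask)                 # noisy initial mask
    t = torch.rand(gt_mask.shape[0], device=gt_mask.device)
    tb = t.view(-1, 1, 1, 1)
    x_t = (1 - tb) * x0 + tb * gt_mask             # linear interpolant
    v_target = gt_mask - x0                        # constant velocity field
    v_pred = model(image, x_t, t)
    return F.mse_loss(v_pred, v_target)
```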

[142] GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars

Zhe Chang, Haodong Jin, Ying Sun, Yan Song, Hui Yu

Main category: cs.CV

TL;DR: GAT-NeRF: A hybrid neural radiance field framework using Geometry-Aware-Transformer to enhance high-fidelity 4D dynamic facial avatar reconstruction from monocular video, improving capture of fine details like wrinkles and textures.

DetailsMotivation: High-fidelity 4D dynamic facial avatar reconstruction from monocular video is challenging but critical for immersive virtual human applications. Current NeRF methods struggle to capture high-frequency facial details (dynamic wrinkles, subtle textures) from information-constrained monocular streams.

Method: Proposes GAT-NeRF (Geometry-Aware-Transformer Enhanced NeRF) that integrates Transformer mechanism into NeRF pipeline. Combines coordinate-aligned MLP with lightweight Geometry-Aware-Transformer (GAT) module that processes multi-modal inputs: 3D spatial coordinates, 3DMM expression parameters, and learnable latent codes to enhance feature representations for fine-grained geometry.

Result: Comprehensive experiments demonstrate state-of-the-art performance in visual fidelity and high-frequency detail recovery (dynamic wrinkles, acne scars).

Conclusion: GAT-NeRF forges new pathways for creating realistic dynamic digital humans for multimedia applications by significantly improving modeling of complex local facial patterns.

Abstract: High-fidelity 4D dynamic facial avatar reconstruction from monocular video is a critical yet challenging task, driven by increasing demands for immersive virtual human applications. While Neural Radiance Fields (NeRF) have advanced scene representation, their capacity to capture high-frequency facial details, such as dynamic wrinkles and subtle textures from information-constrained monocular streams, requires significant enhancement. To tackle this challenge, we propose a novel hybrid neural radiance field framework, called Geometry-Aware-Transformer Enhanced NeRF (GAT-NeRF) for high-fidelity and controllable 4D facial avatar reconstruction, which integrates the Transformer mechanism into the NeRF pipeline. GAT-NeRF synergistically combines a coordinate-aligned Multilayer Perceptron (MLP) with a lightweight Transformer module, termed as Geometry-Aware-Transformer (GAT) due to its processing of multi-modal inputs containing explicit geometric priors. The GAT module is enabled by fusing multi-modal input features, including 3D spatial coordinates, 3D Morphable Model (3DMM) expression parameters, and learnable latent codes to effectively learn and enhance feature representations pertinent to fine-grained geometry. The Transformer’s effective feature learning capabilities are leveraged to significantly augment the modeling of complex local facial patterns like dynamic wrinkles and acne scars. Comprehensive experiments unequivocally demonstrate GAT-NeRF’s state-of-the-art performance in visual fidelity and high-frequency detail recovery, forging new pathways for creating realistic dynamic digital humans for multimedia applications.

[143] SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen

Main category: cs.CV

TL;DR: SpatialMem is a memory-centric system that unifies 3D geometry, semantics, and language from egocentric RGB video into a queryable representation for spatial reasoning tasks.

DetailsMotivation: The paper aims to create a unified representation that combines spatial, semantic, and linguistic information for embodied AI tasks, enabling interpretable reasoning over spatial relations without requiring specialized sensors.

Method: The system reconstructs metrically scaled indoor environments from RGB video, detects structural 3D anchors (walls, doors, windows) as a scaffold, and populates a hierarchical memory with open-vocabulary object nodes linking visual evidence, embeddings, and textual descriptions to 3D coordinates.
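
As a rough illustration of the memory layout this describes, the sketch below assumes a two-level structure (structural anchors holding object nodes) and plain cosine-similarity retrieval; all names and fields are hypothetical, not the authors' API.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    label: str             # open-vocabulary text description
    position: np.ndarray   # metric 3D coordinates
    embedding: np.ndarray  # visual/text embedding used for retrieval

@dataclass
class Anchor:
    kind: str              # "wall" | "door" | "window"
    position: np.ndarray
    objects: list = field(default_factory=list)

class SpatialMemory:
    def __init__(self):
        self.anchors = []

    def retrieve(self, query_emb):
        """Return the best-matching (anchor, object) pair by cosine similarity."""
        best, best_score = None, -1.0
        for a in self.anchors:
            for o in a.objects:
                s = float(o.embedding @ query_emb
                          / (np.linalg.norm(o.embedding) * np.linalg.norm(query_emb)))
                if s > best_score:
                    best, best_score = (a, o), s
        return best, best_score

mem = SpatialMemory()
door = Anchor("door", np.array([2.0, 0.0, 1.0]))
door.objects.append(ObjectNode("red backpack", np.array([2.3, 0.1, 0.4]),
                               np.array([0.9, 0.1, 0.0])))
mem.anchors.append(door)
(anchor, obj), score = mem.retrieve(np.array([1.0, 0.0, 0.0]))
print(anchor.kind, obj.label, round(score, 3))
```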

Result: Experiments across three real-life indoor scenes show SpatialMem maintains strong navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, supporting language-guided navigation and object retrieval tasks.

Conclusion: SpatialMem offers an efficient and extensible framework for embodied spatial intelligence by creating a unified, queryable representation that enables interpretable spatial reasoning from casually captured video.

Abstract: We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes – linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates – for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.

[144] TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models

Carolin Holtermann, Nina Krebs, Anne Lauscher

Main category: cs.CV

TL;DR: TempViz is the first dataset to evaluate temporal knowledge in text-to-image models, revealing weak temporal competence across models and unreliable automated evaluation methods.

DetailsMotivation: Time significantly affects visual appearance of entities, making temporal knowledge crucial for generating contextually-relevant images. Despite extensive work on temporal understanding in NLP, research on temporal phenomena in text-to-image models remains scarce.

Method: Created TempViz dataset with 7.9k prompts and 600+ reference images to holistically evaluate temporal knowledge. Studied five T2I models across five temporal knowledge categories using human evaluation and compared automated evaluation methods against human judgments.

Result: Temporal competence in T2I models is generally weak, with no model exceeding 75% accuracy across categories. Automated evaluation methods fail to provide reliable assessment of temporal cues, showing poor correlation with human judgments.

Conclusion: There is a pressing need for future research on temporal knowledge in text-to-image models, as current models show limited temporal understanding and existing evaluation methods are inadequate for assessing temporal cues.

Abstract: Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually-relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first data set to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues - further indicating the pressing need for future research on temporal knowledge in T2I.

[145] Erosion Attack for Adversarial Training to Enhance Semantic Segmentation Robustness

Yufei Song, Ziqi Zhou, Menghao Deng, Yifan Hu, Shengshan Hu, Minghui Li, Leo Yu Zhang

Main category: cs.CV

TL;DR: EroSeg-AT: A vulnerability-aware adversarial training framework that uses EroSeg to generate adversarial examples by targeting sensitive pixels and disrupting semantic consistency, improving both attack effectiveness and model robustness.

DetailsMotivation: Existing segmentation models are vulnerable to adversarial attacks, and current adversarial training methods are limited because they only consider global semantic information while ignoring contextual semantic relationships within samples, reducing their effectiveness.

Method: Proposes EroSeg-AT framework with EroSeg attack method: 1) Selects sensitive pixels based on pixel-level confidence, 2) Progressively propagates perturbations to higher-confidence pixels, 3) Effectively disrupts semantic consistency of samples for adversarial training.
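
A hedged sketch of this erosion schedule, assuming an FGSM-style inner step, a rising confidence threshold that widens the attacked region over iterations, and placeholder hyperparameters; the paper's exact attack may differ.

```python
import torch
import torch.nn.functional as F

def erosion_attack(model, x, y, steps=5, eps=8 / 255, alpha=2 / 255):
    """PGD-like attack that starts on low-confidence pixels and erodes outward."""
    delta = torch.zeros_like(x, requires_grad=True)
    for t in range(steps):
        logits = model(x + delta)                    # (B, C, H, W)
        conf = logits.softmax(dim=1).amax(dim=1)     # per-pixel confidence
        thresh = 0.5 + 0.5 * (t + 1) / steps         # rising confidence threshold
        mask = (conf < thresh).float().unsqueeze(1)  # currently attacked pixels
        loss = (F.cross_entropy(logits, y, reduction="none") * mask.squeeze(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign() * mask
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

seg_net = torch.nn.Conv2d(3, 4, 1)                   # stand-in segmentation head
x = torch.rand(2, 3, 32, 32)
y = torch.randint(0, 4, (2, 32, 32))
x_adv = erosion_attack(seg_net, x, y)
print((x_adv - x).abs().max().item())
```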

Result: Experimental results show the approach significantly improves attack effectiveness compared to existing methods and enhances model robustness under adversarial training.

Conclusion: EroSeg-AT addresses limitations of existing adversarial training by considering contextual semantic relationships, making segmentation models more robust through vulnerability-aware adversarial training.

Abstract: Existing segmentation models exhibit significant vulnerability to adversarial attacks. To improve robustness, adversarial training incorporates adversarial examples into model training. However, existing attack methods consider only global semantic information and ignore contextual semantic relationships within the samples, limiting the effectiveness of adversarial training. To address this issue, we propose EroSeg-AT, a vulnerability-aware adversarial training framework that leverages EroSeg to generate adversarial examples. EroSeg first selects sensitive pixels based on pixel-level confidence and then progressively propagates perturbations to higher-confidence pixels, effectively disrupting the semantic consistency of the samples. Experimental results show that, compared to existing methods, our approach significantly improves attack effectiveness and enhances model robustness under adversarial training.

[146] PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

Main category: cs.CV

TL;DR: VLMs struggle with task progress estimation from partial observations; Progress-Bench benchmark reveals models are sensitive to demonstration modality/viewpoint changes and poor at handling unanswerable cases; training-based ProgressLM-3B shows consistent improvements despite disjoint training/evaluation tasks.

DetailsMotivation: While VLMs excel at describing visible content, it's unclear whether they can infer task progress from partial observations, which requires reasoning over long-horizon dynamics rather than static visual recognition.

Method: Introduced Progress-Bench benchmark for evaluating progress reasoning; explored human-inspired two-stage progress reasoning through both training-free prompting and training-based approach using curated ProgressLM-45K dataset.

Result: Most of 14 tested VLMs are not ready for task progress estimation, showing sensitivity to demonstration modality/viewpoint changes and poor handling of unanswerable cases. Training-free prompting yields limited gains, but training-based ProgressLM-3B achieves consistent improvements even with disjoint training/evaluation tasks.

Conclusion: Current VLMs struggle with progress reasoning, but training-based approaches show promise; analysis reveals characteristic error patterns and clarifies when/why progress reasoning succeeds or fails.

Abstract: Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.

[147] Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang, Xin Chen, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong

Main category: cs.CV

TL;DR: LDF-VFI introduces a video-centric diffusion transformer framework for frame interpolation that ensures long-range temporal coherence through auto-regressive modeling with skip-concatenate sampling, enabling efficient processing of arbitrary resolutions.

DetailsMotivation: Existing VFI methods use frame-centric approaches that process videos as independent short segments, leading to temporal inconsistencies and motion artifacts. The authors aim to overcome these limitations with a holistic, video-centric paradigm.

Method: Uses an auto-regressive diffusion transformer to model entire video sequences, with skip-concatenate sampling to mitigate error accumulation. Incorporates sparse local attention and tiled VAE encoding for efficient long-sequence processing and arbitrary resolution generalization. Features an enhanced conditional VAE decoder with multi-scale input features.

Result: Achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion.

Conclusion: LDF-VFI successfully addresses temporal inconsistency in video frame interpolation through a video-centric diffusion approach, enabling coherent long-range interpolation with efficient processing of arbitrary resolutions.

Abstract: Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at https://github.com/xypeng9903/LDF-VFI.

[148] Unified Multi-Dataset Training for TBPS

Nilanjana Chatterjee, Sidharatha Garg, A V Subramanyam, Brejesh Lall

Main category: cs.CV

TL;DR: Scale-TBPS trains a single unified model for text-based person search across multiple datasets, overcoming limitations of dataset-specific fine-tuning through noise-aware dataset curation and scalable identity learning.

DetailsMotivation: Current TBPS methods require dataset-specific fine-tuning due to limited training data and VLMs not being pre-trained for pedestrian recognition. Synthetic data helps but doesn't eliminate dataset-specific adaptation. The paper aims to create a single unified model that works across multiple datasets.

Method: Proposes Scale-TBPS with two key components: (1) noise-aware unified dataset curation strategy that merges diverse TBPS datasets cohesively, and (2) scalable discriminative identity learning framework effective with large numbers of unique identities.

Result: Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 show that a single Scale-TBPS model outperforms both dataset-centric optimized models and naive joint training approaches.

Conclusion: Scale-TBPS successfully demonstrates that a single unified model can be trained across multiple TBPS datasets, overcoming challenges of identity scaling and noisy data, achieving superior performance compared to dataset-specific approaches.

Abstract: Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.

[149] LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding

Xiaodong Wang, Langling Huang, Zhirong Wu, Xu Zhao, Teng Xu, Xuhong Xia, Peixi Peng

Main category: cs.CV

TL;DR: LiViBench is the first omnimodal benchmark for interactive livestream videos with 24 diverse tasks, featuring a semi-automatic annotation workflow and LiVi-LLM-7B model that outperforms larger models.

DetailsMotivation: Existing video evaluation benchmarks focus on non-interactive videos (movies, recordings), creating a gap for interactive livestream understanding. Interactive livestreams present unique challenges with real-time comments, audio, and speech modalities that require specialized evaluation.

Method: 1) Created LiViBench with 24 diverse tasks covering perceptual, reasoning, and livestream-specific challenges. 2) Developed standardized semi-automatic annotation workflow with human-in-the-loop, using multi-agent MLLM system for video description and seed-question-driven annotation. 3) Designed two-stage instruction-tuning and Video-to-Comment Retrieval (VCR) module for better comment utilization. 4) Built LiVi-LLM-7B model with enhanced interactive livestream knowledge.

Result: LiVi-LLM-7B outperforms larger open-source models (up to 72B parameters), narrows gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks (VideoMME, LongVideoBench, MLVU, VideoEval-Pro).

Conclusion: The paper successfully addresses the gap in interactive livestream video evaluation with LiViBench, demonstrates effective semi-automatic annotation methods, and shows that specialized models like LiVi-LLM-7B can outperform larger general models on interactive video tasks while maintaining strong general video understanding capabilities.

Abstract: The development of multimodal large language models (MLLMs) has advanced general video understanding. However, existing video evaluation benchmarks primarily focus on non-interactive videos, such as movies and recordings. To fill this gap, this paper proposes the first omnimodal benchmark for interactive livestream videos, LiViBench. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. To efficiently construct the dataset, we design a standardized semi-automatic annotation workflow that incorporates the human-in-the-loop at multiple stages. The workflow leverages multiple MLLMs to form a multi-agent system for comprehensive video description and uses a seed-question-driven method to construct high-quality annotations. All interactive videos in the benchmark include audio, speech, and real-time comments modalities. To enhance models’ understanding of interactive videos, we design tailored two-stage instruction-tuning and propose a Video-to-Comment Retrieval (VCR) module to improve the model’s ability to utilize real-time comments. Based on these advancements, we develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams. Experiments show that our model outperforms larger open-source models with up to 72B parameters, narrows the gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks, including VideoMME, LongVideoBench, MLVU, and VideoEval-Pro.

[150] SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

Main category: cs.CV

TL;DR: The paper introduces BinauralVGGSound, the first large-scale video-binaural audio dataset, and a visual-guided spatial audio generation framework that produces immersive binaural audio from video while maintaining semantic and temporal alignment.

DetailsMotivation: Current video-to-audio generation models focus mainly on semantic and temporal alignment but neglect spatial perception and immersive quality, largely due to reliance on mono audio datasets that lack binaural spatial information needed for visual-to-spatial audio mappings.

Method: Two key contributions: 1) Construction of BinauralVGGSound dataset (first large-scale video-binaural audio dataset), and 2) An end-to-end spatial audio generation framework with visual-guided audio spatialization module that explicitly models spatial features while maintaining semantic and temporal alignment.

Result: The approach substantially outperforms state-of-the-art models in spatial fidelity and delivers more immersive auditory experience without sacrificing temporal or semantic consistency.

Conclusion: The work addresses the spatial perception gap in video-to-audio generation by providing both dataset and framework for spatially aware audio synthesis, with all resources to be publicly released to facilitate future research.

Abstract: While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models’ reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.

[151] Federated Transformer-GNN for Privacy-Preserving Brain Tumor Localization with Modality-Level Explainability

Andrea Protani, Riccardo Taiello, Marc Molina Van Den Bosch, Luigi Serio

Main category: cs.CV

TL;DR: Federated learning framework for brain tumor localization enables multi-institutional collaboration without sharing patient data, matching centralized performance while providing explainable attention mechanisms that align with clinical practice.

DetailsMotivation: Brain tumor analysis requires large datasets that are often siloed across healthcare institutions due to privacy regulations, creating a need for collaborative learning without sharing sensitive patient data.

Method: Federated learning framework using hybrid Transformer-Graph Neural Network architecture deployed within CAFEIN® platform, with explainability analysis through Transformer attention mechanisms to reveal which MRI modalities drive predictions.
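
The summary does not detail the aggregation rule, so as background, a generic FedAvg-style weighted averaging step (not the CAFEIN® platform's actual protocol) looks like this:

```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client state_dicts by local dataset size."""
    total = float(sum(client_sizes))
    return {k: sum(sd[k] * (n / total) for sd, n in zip(client_states, client_sizes))
            for k in client_states[0]}

# toy usage: two "institutions" with differently sized local datasets
net = torch.nn.Linear(4, 2)
sd_a = {k: v + 1.0 for k, v in net.state_dict().items()}  # institution A's weights
sd_b = {k: v - 1.0 for k, v in net.state_dict().items()}  # institution B's weights
net.load_state_dict(fedavg([sd_a, sd_b], client_sizes=[300, 100]))
```

Only these state dicts (or their updates) cross institutional boundaries, which is what keeps patient data local.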

Result: Federated learning enables continued model improvement by leveraging distributed data, matching centralized performance. Isolated training triggers early stopping, but federated learning aggregates knowledge from multiple institutions. Explainability analysis shows deeper network layers significantly increase attention to T2 and FLAIR modalities (p<0.001, Cohen’s d=1.50), aligning with clinical practice.

Conclusion: Federated learning provides strong justification for complex tasks with high-dimensional data, as aggregating knowledge from multiple institutions significantly benefits learning while maintaining privacy. The explainability analysis validates that model attention aligns with clinical expertise.

Abstract: Deep learning models for brain tumor analysis require large and diverse datasets that are often siloed across healthcare institutions due to privacy regulations. We present a federated learning framework for brain tumor localization that enables multi-institutional collaboration without sharing sensitive patient data. Our method extends a hybrid Transformer-Graph Neural Network architecture derived from prior decoder-free supervoxel GNNs and is deployed within CAFEIN®, CERN’s federated learning platform designed for healthcare environments. We provide an explainability analysis through Transformer attention mechanisms that reveals which MRI modalities drive the model predictions. Experiments on the BraTS dataset demonstrate a key finding: while isolated training on individual client data triggers early stopping well before reaching full training capacity, federated learning enables continued model improvement by leveraging distributed data, ultimately matching centralized performance. This result provides strong justification for federated learning when dealing with complex tasks and high-dimensional input data, as aggregating knowledge from multiple institutions significantly benefits the learning process. Our explainability analysis, validated through rigorous statistical testing on the full test set (paired t-tests with Bonferroni correction), reveals that deeper network layers significantly increase attention to T2 and FLAIR modalities (p<0.001, Cohen’s d=1.50), aligning with clinical practice.

[152] Deep Leakage with Generative Flow Matching Denoiser

Isaac Baglin, Xiatian Zhu, Simon Hadfield

Main category: cs.CV

TL;DR: New federated learning attack uses flow matching generative prior to reconstruct private client data from model updates, outperforming existing methods across various metrics and defenses.

DetailsMotivation: Federated learning is vulnerable to deep leakage attacks that can reconstruct private client data from shared model updates. Existing methods have limitations in stability, fidelity, and robustness under realistic FL settings.

Method: Integrates a generative Flow Matching (FM) prior into the reconstruction process, guiding optimization toward realistic image distributions using a flow matching foundation model without requiring knowledge of private data.
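
A toy gradient-inversion loop illustrating the structure of such an attack, under loud assumptions: total variation stands in for the flow-matching prior, the tiny linear model is purely for demonstration, and the true label is assumed known.

```python
import torch
import torch.nn.functional as F

def total_variation(x):   # stand-in image prior; the attack would use a FM model
    return ((x[..., 1:, :] - x[..., :-1, :]).abs().mean()
            + (x[..., :, 1:] - x[..., :, :-1]).abs().mean())

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 5))
x_true, y = torch.rand(1, 3, 8, 8), torch.tensor([2])
observed = torch.autograd.grad(F.cross_entropy(model(x_true), y),
                               list(model.parameters()))   # the "leaked" update

x_hat = torch.rand(1, 3, 8, 8, requires_grad=True)         # reconstruction variable
opt = torch.optim.Adam([x_hat], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    grads = torch.autograd.grad(F.cross_entropy(model(x_hat), y),
                                list(model.parameters()), create_graph=True)
    match = sum(((g - og) ** 2).sum() for g, og in zip(grads, observed))
    (match + 0.1 * total_variation(x_hat)).backward()       # prior-regularized loss
    opt.step()
print(F.mse_loss(x_hat.detach(), x_true).item())            # reconstruction error
```

The paper's contribution is, in effect, replacing the weak hand-crafted prior above with guidance from a pretrained flow-matching foundation model.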

Result: Consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics. Remains effective across different training epochs, larger client batch sizes, and under common defenses like noise injection, clipping, and sparsification.

Conclusion: The findings highlight the need for new defense strategies that explicitly account for adversaries equipped with powerful generative priors, as current FL defenses are insufficient against such advanced attacks.

Abstract: Federated Learning (FL) has emerged as a powerful paradigm for decentralized model training, yet it remains vulnerable to deep leakage (DL) attacks that reconstruct private client data from shared model updates. While prior DL methods have demonstrated varying levels of success, they often suffer from instability, limited fidelity, or poor robustness under realistic FL settings. We introduce a new DL attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. By guiding optimization toward the distribution of realistic images (represented by a flow matching foundation model), our method enhances reconstruction fidelity without requiring knowledge of the private data. Extensive experiments on multiple datasets and target models demonstrate that our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics. Crucially, the method remains effective across different training epochs, larger client batch sizes, and under common defenses such as noise injection, clipping, and sparsification. Our findings call for the development of new defense strategies that explicitly account for adversaries equipped with powerful generative priors.

[153] Differential Privacy Image Generation with Reconstruction Loss and Noise Injection Using an Error Feedback SGD

Qiwei Ma, Jun Zhang

Main category: cs.CV

TL;DR: A novel differential privacy framework using Error Feedback SGD with reconstruction loss and noise injection achieves better image quality under same privacy budget than existing methods.

DetailsMotivation: Traditional anonymization techniques fail to balance privacy protection and data utility for privacy-preserving ML. Synthetic data helps but existing methods suffer from repeated privacy-utility trade-offs.

Method: Proposes a differential privacy generation framework using Error Feedback Stochastic Gradient Descent (EFSGD) with reconstruction loss and noise injection mechanism during training.
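
One plausible reading of an error-feedback step with noise injection, sketched under our own assumptions about norm clipping and the noise scale (the paper's exact EFSGD variant may differ): the error buffer re-injects what clipping discarded in the previous step.

```python
import torch

def efsgd_step(param, grad, err, lr=0.1, clip=1.0, sigma=0.5):
    corrected = grad + err                                 # re-inject past residual
    scale = (clip / corrected.norm().clamp(min=1e-12)).clamp(max=1.0)
    clipped = corrected * scale                            # norm-clipped update
    noisy = clipped + sigma * clip * torch.randn_like(clipped)  # DP-style noise
    return param - lr * noisy, corrected - clipped         # new param, new residual

w, e = torch.zeros(10), torch.zeros(10)
for _ in range(3):
    w, e = efsgd_step(w, torch.randn(10), e)               # stand-in gradients
print(w.norm().item())
```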

Result: Generates higher quality images under same privacy budget than related work. Achieves SOTA results on MNIST, Fashion-MNIST, and CelebA benchmarks across almost all metrics.

Conclusion: The proposed framework effectively balances privacy and utility, demonstrating strong generalization for both grayscale and RGB images in privacy-preserving synthetic data generation.

Abstract: Traditional data masking techniques such as anonymization cannot achieve the expected privacy protection while ensuring data utility for privacy-preserving machine learning. Synthetic data plays an increasingly important role as it generates a large number of training samples and prevents information leakage in real data. The existing methods suffer from the repeating trade-off processes between privacy and utility. We propose a novel framework for differential privacy generation, which employs an Error Feedback Stochastic Gradient Descent (EFSGD) method and introduces a reconstruction loss and noise injection mechanism into the training process. We generate images with higher quality and usability under the same privacy budget as the related work. Extensive experiments demonstrate the effectiveness and generalization of our proposed framework for both grayscale and RGB images. We achieve state-of-the-art results over almost all metrics on three benchmarks: MNIST, Fashion-MNIST, and CelebA.

[154] Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background

Tianyu Li, Songyue Cai, Zongqian Wu, Ping Hu, Xiaofeng Zhu

Main category: cs.CV

TL;DR: A plug-and-play framework that improves CLIP-based foreground-background decomposition for few-shot OOD detection by adaptively suppressing background patches and rectifying confusable foreground patches.

DetailsMotivation: Existing FG-BG decomposition methods have limitations: uniform background suppression ignores varying patch contributions, and foreground methods don't address patches that resemble other classes, misleading training.

Method: Three components: 1) FG-BG Decomposition module (similar to previous methods), 2) Adaptive Background Suppression module (weights patch classification entropy adaptively), 3) Confusable Foreground Rectification module (identifies and rectifies confusable foreground patches).
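
A minimal sketch of what entropy-based adaptive weighting of background patches could look like; the confidence-based weighting below is one plausible instantiation, not the paper's published formula.

```python
import torch

def adaptive_bg_suppression(patch_logits, bg_mask):
    """patch_logits: (N, C) patch-class logits; bg_mask: (N,) bool, True = background."""
    p = patch_logits.softmax(dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)   # per-patch entropy
    weight = p.amax(dim=-1)    # confident patches weigh more (one possible choice)
    return (weight * entropy)[bg_mask].mean()

logits = torch.randn(16, 5)
bg = torch.zeros(16, dtype=torch.bool)
bg[8:] = True                                              # mark background patches
print(adaptive_bg_suppression(logits, bg).item())
```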

Result: Extensive experiments show the plug-and-play framework significantly improves performance of existing FG-BG decomposition methods.

Conclusion: The proposed framework effectively addresses limitations of current FG-BG methods through adaptive background suppression and confusable foreground rectification, enhancing few-shot OOD detection performance.

Abstract: CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: https://github.com/lounwb/FoBoR.

[155] The Pictorial Cortex: Zero-Shot Cross-Subject fMRI-to-Image Reconstruction via Compositional Latent Modeling

Jingyang Huo, Yikai Wang, Yanwei Fu, Jianfeng Feng

Main category: cs.CV

TL;DR: PictorialCortex enables zero-shot cross-subject fMRI-to-image reconstruction by modeling brain activity with compositional latents and using multi-dataset training on standardized cortical-surface data.

DetailsMotivation: Current fMRI-to-image reconstruction faces challenges due to neural variability across individuals and trials, making it non-injective. The paper aims to solve zero-shot cross-subject reconstruction where visual experiences of unseen individuals must be reconstructed without subject-specific training.

Method: 1) Created UniCortex-fMRI dataset with standardized cortical-surface data from multiple visual-stimulus fMRI datasets. 2) Proposed PictorialCortex model with compositional latent formulation that structures stimulus-driven representations under subject-, dataset-, and trial-related variability. 3) Uses latent factorization-composition module with consistency regularization. 4) During inference, aggregates surrogate latents from multiple seen subjects to guide diffusion-based image synthesis for unseen subjects.

Result: Extensive experiments show that PictorialCortex improves zero-shot cross-subject visual reconstruction, demonstrating benefits of compositional latent modeling and multi-dataset training.

Conclusion: The compositional latent modeling approach enables effective zero-shot cross-subject fMRI-to-image reconstruction, addressing the challenge of neural variability through structured representations and multi-dataset training on standardized cortical data.

Abstract: Decoding visual experiences from human brain activity remains a central challenge at the intersection of neuroscience, neuroimaging, and artificial intelligence. A critical obstacle is the inherent variability of cortical responses: neural activity elicited by the same visual stimulus differs across individuals and trials due to anatomical, functional, cognitive, and experimental factors, making fMRI-to-image reconstruction non-injective. In this paper, we tackle a challenging yet practically meaningful problem: zero-shot cross-subject fMRI-to-image reconstruction, where the visual experience of a previously unseen individual must be reconstructed without subject-specific training. To enable principled evaluation, we present a unified cortical-surface dataset – UniCortex-fMRI, assembled from multiple visual-stimulus fMRI datasets to provide broad coverage of subjects and stimuli. UniCortex-fMRI is processed into standardized data formats, making it possible to study cross-subject fMRI-to-image reconstruction in the zero-shot scenario. To tackle the modeling challenge, we propose PictorialCortex, which models fMRI activity using a compositional latent formulation that structures stimulus-driven representations under subject-, dataset-, and trial-related variability. PictorialCortex operates in a universal cortical latent space and implements this formulation through a latent factorization-composition module, reinforced by paired factorization and re-factorizing consistency regularization. During inference, surrogate latents synthesized under multiple seen-subject conditions are aggregated to guide diffusion-based image synthesis for unseen subjects. Extensive experiments show that PictorialCortex improves zero-shot cross-subject visual reconstruction, highlighting the benefits of compositional latent modeling and multi-dataset training.

[156] Three-dimensional visualization of X-ray micro-CT with large-scale datasets: Efficiency and accuracy for real-time interaction

Yipeng Yin, Rao Yao, Qingying Li, Dazhong Wang, Hong Zhou, Zhijun Fang, Jianing Chen, Longjie Qian, Mingyue Wu

Main category: cs.CV

TL;DR: Review paper on Micro-CT 3D visualization advances, focusing on balancing accuracy vs efficiency in defect characterization, covering reconstruction algorithms and volume rendering techniques.

DetailsMotivation: Industrial CT ultra-precision inspection generates massive datasets, creating a need to solve the trade-off between accuracy and efficiency in 3D defect characterization during ultra-precise detection.

Method: Selective review and analysis of CT reconstruction and volume rendering methods that balance accuracy and efficiency. Examines evolution from analytical methods to deep learning techniques, volume rendering algorithm improvements, acceleration, data reduction, and advanced lighting models.

Result: Provides comprehensive analysis to help researchers quickly grasp efficient and accurate 3D reconstruction methods for microscopic features. Compares principles of computed tomography with microstructural technology advancements.

Conclusion: Envisions future directions in CT reconstruction and volume rendering, aiming to guide research in selecting efficient methods and developing approaches for real-time online monitoring of internal material defects through virtual-physical interaction and digital twin applications for structural health monitoring.

Abstract: As Micro-CT technology continues to refine its characterization of material microstructures, industrial CT ultra-precision inspection is generating increasingly large datasets, necessitating solutions to the trade-off between accuracy and efficiency in the 3D characterization of defects during ultra-precise detection. This article provides a unique perspective on recent advances in accurate and efficient 3D visualization using Micro-CT, tracing its evolution from medical imaging to industrial non-destructive testing (NDT). Among the numerous CT reconstruction and volume rendering methods, this article selectively reviews and analyzes approaches that balance accuracy and efficiency, offering a comprehensive analysis to help researchers quickly grasp highly efficient and accurate 3D reconstruction methods for microscopic features. By comparing the principles of computed tomography with advancements in microstructural technology, this article examines the evolution of CT reconstruction algorithms from analytical methods to deep learning techniques, as well as improvements in volume rendering algorithms, acceleration, and data reduction. Additionally, it explores advanced lighting models for high-accuracy, photorealistic, and efficient volume rendering. Furthermore, this article envisions potential directions in CT reconstruction and volume rendering. It aims to guide future research in quickly selecting efficient and precise methods and developing new ideas and approaches for real-time online monitoring of internal material defects through virtual-physical interaction, and for applying digital twin models to structural health monitoring (SHM).

[157] Pb4U-GNet: Resolution-Adaptive Garment Simulation via Propagation-before-Update Graph Network

Aoran Liu, Kun Hu, Clinton Ansun Mo, Qiuxia Wu, Wenxiong Kang, Zhiyong Wang

Main category: cs.CV

TL;DR: Pb4U-GNet is a resolution-adaptive graph neural network for garment simulation that addresses poor cross-resolution generalization by decoupling message propagation from feature updates with dynamic depth control and geometry-aware scaling.

DetailsMotivation: Conventional physics-based garment simulation is computationally expensive, while existing GNN-based approaches suffer from poor cross-resolution generalization, with significant performance degradation on higher-resolution meshes beyond training distribution.

Method: Introduces Propagation-before-Update Graph Network (Pb4U-GNet) with two key mechanisms: (1) dynamic propagation depth control that adjusts message-passing iterations based on mesh resolution, and (2) geometry-aware update scaling that scales predictions according to local mesh characteristics.
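
The sketch below illustrates this propagation-before-update pattern under stated assumptions: mean-aggregation message passing, depth derived from mean edge length, and a linear scaling of the predicted displacement; every constant is illustrative.

```python
import torch

def propagate_then_update(x, edge_index, edge_len, update_mlp, base_edge=0.05):
    src, dst = edge_index                            # (2, E) mesh connectivity
    # dynamic depth: denser meshes (shorter edges) get more message-passing hops
    depth = max(1, int(round(base_edge / edge_len.mean().item())))
    h = x
    for _ in range(depth):                           # propagation only, no update yet
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])
        deg = torch.zeros(x.shape[0], 1).index_add_(0, dst,
                                                    torch.ones(src.shape[0], 1))
        h = agg / deg.clamp_min(1.0)                 # mean over neighbors
    # geometry-aware scaling: displacement magnitude tracks local edge length
    return update_mlp(h) * (edge_len.mean() / base_edge)

mlp = torch.nn.Linear(3, 3)
x = torch.rand(6, 3)                                 # vertex features
ei = torch.tensor([[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 0]])
el = torch.full((6,), 0.02)                          # per-edge lengths (metres)
print(propagate_then_update(x, ei, el, mlp).shape)   # torch.Size([6, 3])
```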

Result: Extensive experiments show that even when trained solely on low-resolution meshes, Pb4U-GNet exhibits strong generalizability across diverse mesh resolutions, addressing a fundamental challenge in neural garment simulation.

Conclusion: Pb4U-GNet provides a resolution-adaptive framework that enables effective neural garment simulation with good cross-resolution generalization, overcoming limitations of existing GNN approaches.

Abstract: Garment simulation is fundamental to various applications in computer vision and graphics, from virtual try-on to digital human modelling. However, conventional physics-based methods remain computationally expensive, hindering their application in time-sensitive scenarios. While graph neural networks (GNNs) offer promising acceleration, existing approaches exhibit poor cross-resolution generalisation, demonstrating significant performance degradation on higher-resolution meshes beyond the training distribution. This stems from two key factors: (1) existing GNNs employ fixed message-passing depth that fails to adapt information aggregation to mesh density variation, and (2) vertex-wise displacement magnitudes are inherently resolution-dependent in garment simulation. To address these issues, we introduce Propagation-before-Update Graph Network (Pb4U-GNet), a resolution-adaptive framework that decouples message propagation from feature updates. Pb4U-GNet incorporates two key mechanisms: (1) dynamic propagation depth control, adjusting message-passing iterations based on mesh resolution, and (2) geometry-aware update scaling, which scales predictions according to local mesh characteristics. Extensive experiments show that even when trained solely on low-resolution meshes, Pb4U-GNet exhibits strong generalisability across diverse mesh resolutions, addressing a fundamental challenge in neural garment simulation.

[158] Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning

Shuonan Yang, Yuchen Zhang, Zeyu Fu

Main category: cs.CV

TL;DR: MARS is a training-free multi-stage adversarial reasoning framework for hateful video detection that uses objective description, evidence-based reasoning, and counter-evidence reasoning to produce reliable and interpretable decisions.

DetailsMotivation: Existing hateful video detection methods have limitations: training-based approaches suffer from limited training data and lack interpretability, while directly prompting large vision-language models often fails to deliver reliable hate detection. Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety.

Method: MARS is a training-free Multi-stage Adversarial ReaSoning framework with four stages: 1) Objective description of video content to establish neutral foundation, 2) Evidence-based reasoning supporting potential hateful interpretations, 3) Counter-evidence reasoning capturing plausible non-hateful perspectives, 4) Synthesis of these perspectives into conclusive and explainable decisions.
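
Since MARS is training-free, the whole framework reduces to a staged prompting flow. The sketch below paraphrases the four stages with a stub model; `vlm` and the prompt wording are placeholders, not the paper's prompts.

```python
def mars_pipeline(vlm, video):
    description = vlm(video, "Objectively describe the video content.")
    pro = vlm(video, f"Given: {description}\n"
                     "Argue, citing evidence, that the video could be hateful.")
    con = vlm(video, f"Given: {description}\n"
                     "Argue, citing evidence, that the video is plausibly not hateful.")
    return vlm(video, "Weigh both analyses and answer HATEFUL or NOT-HATEFUL "
                      f"with a justification.\nPro: {pro}\nCon: {con}")

# stub model so the sketch runs end-to-end
print(mars_pipeline(lambda video, prompt: f"[stub reply to: {prompt[:38]}...]",
                    video="video.mp4"))
```

Because the final verdict quotes both the pro and con analyses, the decision carries its own human-readable justification, which is the interpretability claim above.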

Result: Extensive evaluation on two real-world datasets shows MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches, and outperforms state-of-the-art training-based methods on one dataset. MARS also produces human-understandable justifications.

Conclusion: MARS provides a reliable and interpretable solution for hateful video detection without requiring training, addressing limitations of existing methods while supporting compliance oversight and enhancing transparency in content moderation workflows through human-understandable justifications.

Abstract: Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.

[159] BREPS: Bounding-Box Robustness Evaluation of Promptable Segmentation

Andrey Moskalenko, Danil Kuznetsov, Irina Dudko, Anastasiia Iasakova, Nikita Boldyrev, Denis Shepelev, Andrei Spiridonov, Andrey Kuznetsov, Vlad Shakhuro

Main category: cs.CV

TL;DR: BREPS introduces a method to evaluate robustness of promptable segmentation models (like SAM) to natural variations in bounding box prompts by generating adversarial boxes that minimize/maximize segmentation error while maintaining naturalness.

DetailsMotivation: Current promptable segmentation models are trained/evaluated with synthetic prompts from simple heuristics, lacking insight into real-world robustness to natural variations in user-provided bounding boxes.

Method: 1) Conduct user study to collect real bounding box annotations revealing model sensitivity; 2) Reformulate robustness evaluation as white-box optimization over bounding box space; 3) Introduce BREPS method to generate adversarial bounding boxes with naturalness constraints.
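
A toy version of the white-box search in step 2, assuming a differentiable segmentation-error surrogate and a quadratic naturalness penalty (both stand-ins for the paper's actual objectives).

```python
import torch

def breps_search(seg_error, gt_box, steps=200, lr=0.5, lam=0.01, maximize=True):
    box = gt_box.clone().requires_grad_(True)     # (x1, y1, x2, y2)
    opt = torch.optim.Adam([box], lr=lr)
    sign = -1.0 if maximize else 1.0              # Adam minimizes, so flip the sign
    for _ in range(steps):
        opt.zero_grad()
        natural = ((box - gt_box) ** 2).sum()     # keep the box human-plausible
        (sign * seg_error(box) + lam * natural).backward()
        opt.step()
    return box.detach()

# toy differentiable "error" surface that grows as the box drifts off-target
center = torch.tensor([10.0, 10.0, 50.0, 50.0])
err = lambda b: 1 - torch.exp(-((b - center) ** 2).sum() / 500)
worst_box = breps_search(err, gt_box=torch.tensor([12.0, 11.0, 48.0, 52.0]))
print(worst_box)
```

Running the same search with `maximize=False` would instead yield the most favorable plausible prompt, giving the best-case/worst-case robustness interval the benchmark reports.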

Result: Analysis shows substantial variability in segmentation quality across users for same model/instance, indicating high sensitivity to natural prompt noise. BREPS enables systematic robustness benchmarking across 10 datasets spanning everyday scenes to medical imaging.

Conclusion: Promptable segmentation models are highly sensitive to natural variations in bounding box prompts, highlighting need for robustness evaluation beyond synthetic prompts. BREPS provides effective framework for assessing and improving model robustness to real-world user inputs.

Abstract: Promptable segmentation models such as SAM have established a powerful paradigm, enabling strong generalization to unseen objects and domains with minimal user input, including points, bounding boxes, and text prompts. Among these, bounding boxes stand out as particularly effective, often outperforming points while significantly reducing annotation costs. However, current training and evaluation protocols typically rely on synthetic prompts generated through simple heuristics, offering limited insight into real-world robustness. In this paper, we investigate the robustness of promptable segmentation models to natural variations in bounding box prompts. First, we conduct a controlled user study and collect thousands of real bounding box annotations. Our analysis reveals substantial variability in segmentation quality across users for the same model and instance, indicating that SAM-like models are highly sensitive to natural prompt noise. Then, since exhaustive testing of all possible user inputs is computationally prohibitive, we reformulate robustness evaluation as a white-box optimization problem over the bounding box prompt space. We introduce BREPS, a method for generating adversarial bounding boxes that minimize or maximize segmentation error while adhering to naturalness constraints. Finally, we benchmark state-of-the-art models across 10 datasets, spanning everyday scenes to medical imaging. Code - https://github.com/emb-ai/BREPS.

[160] Graph Recognition via Subgraph Prediction

André Eberhard, Gerhard Neumann, Pascal Friederich

Main category: cs.CV

TL;DR: GraSP is a unified framework for visual graph recognition that works across diverse graph types and drawings without task-specific modifications.

DetailsMotivation: Visual graph recognition remains challenging due to lack of canonical approaches; existing solutions are problem-specific and not transferable between contexts despite solving the same conceptual problem.

Method: Graph Recognition via Subgraph Prediction (GraSP) - a method designed for broad applicability and simplicity that recognizes graphs in images.

Result: GraSP works across several synthetic benchmarks and one real-world application with diverse graph types and drawings, and can be transferred between tasks without task-specific modifications.

Conclusion: GraSP paves the way for a more unified framework for visual graph recognition by providing a transferable approach that works across different contexts and graph types.

Abstract: Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, Graph Recognition via Subgraph Prediction (GraSP), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.

[161] Large-Scale Multidimensional Knowledge Profiling of Scientific Literature

Zhucun Xue, Jiangning Zhang, Juntao Jiang, Jinzhuo Liu, Haoyang He, Teng Hu, Xiaobin Hu, Guangming Yao, Yi Yuan, Yong Liu

Main category: cs.CV

TL;DR: The paper presents a comprehensive analysis of 100,000+ AI papers from 22 conferences (2020-2025) using topic clustering, LLM-assisted parsing, and structured retrieval to track research evolution, revealing shifts toward safety, multimodal reasoning, and agent-oriented studies.

DetailsMotivation: The rapid expansion of AI research across ML, vision, and language has created too many publications to synthesize manually. Traditional bibliometric tools rely on metadata and offer limited semantic analysis, making it difficult to track research theme evolution and cross-area influences over time.

Method: Compiled a unified corpus of 100,000+ papers from 22 major conferences (2020-2025). Constructed a multidimensional profiling pipeline combining topic clustering, LLM-assisted parsing, and structured retrieval to organize and analyze textual content. This creates a comprehensive representation supporting study of topic lifecycles, methodological transitions, dataset/model usage patterns, and institutional research directions.
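
As a toy of the clustering stage only, assuming generic text embeddings and k-means (the pipeline's actual components, including the LLM-assisted parsing and structured retrieval, are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

abstracts = ["vision transformers for object detection",
             "policy gradients for agent planning",
             "latent diffusion for image synthesis"]
# stand-in embeddings; the real pipeline would use a text encoder here
emb = np.random.default_rng(0).normal(size=(len(abstracts), 32))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(dict(zip(abstracts, labels.tolist())))
```

Tracking how cluster membership counts shift across years is what yields the topic-lifecycle trends described in the results.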

Result: Analysis reveals notable shifts in AI research: growth of safety, multimodal reasoning, and agent-oriented studies, along with gradual stabilization of areas like neural machine translation and graph-based methods. Provides evidence-based view of AI research evolution and identifies emerging directions.

Conclusion: The study offers a comprehensive resource for understanding broader AI research trends and identifying emerging directions through systematic analysis of large-scale scientific literature. The code and dataset are publicly available for further research.

Abstract: The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: https://github.com/xzc-zju/Profiling_Scientific_Literature

[162] BBoxMaskPose v2: Expanding Mutual Conditioning to 3D

Miroslav Purkrabek, Constantin Kolomiiets, Jiri Matas

Main category: cs.CV

TL;DR: PMPose improves crowded scene pose estimation via probabilistic formulation and mask-conditioning, while BMPv2 combines PMPose with enhanced SAM-based mask refinement to achieve state-of-the-art results on COCO and OCHuman benchmarks.

DetailsMotivation: 2D human pose estimation benchmarks are nearly saturated except for crowded scenes, creating a need for better methods to handle occlusion and complex multi-person scenarios.

Method: PMPose introduces probabilistic formulation and mask-conditioning for top-down 2D pose estimation. BMPv2 integrates PMPose with an enhanced SAM-based mask refinement module for improved accuracy.

Result: BMPv2 surpasses state-of-the-art by 1.5 AP points on COCO and 6 AP points on OCHuman, becoming first method to exceed 50 AP on OCHuman. Also shows 2D prompting improves 3D pose estimation in crowded scenes.

Conclusion: Advances in 2D pose quality directly benefit 3D estimation, and multi-person performance is more affected by pose prediction accuracy than detection. The method effectively addresses crowded scene challenges.

Abstract: Most 2D human pose estimation benchmarks are nearly saturated, with the exception of crowded scenes. We introduce PMPose, a top-down 2D pose estimator that incorporates the probabilistic formulation and the mask-conditioning. PMPose improves crowded pose estimation without sacrificing performance on standard scenes. Building on this, we present BBoxMaskPose v2 (BMPv2) integrating PMPose and an enhanced SAM-based mask refinement module. BMPv2 surpasses state-of-the-art by 1.5 average precision (AP) points on COCO and 6 AP points on OCHuman, becoming the first method to exceed 50 AP on OCHuman. We demonstrate that BMP’s 2D prompting of a 3D model improves 3D pose estimation in crowded scenes and that advances in 2D pose quality directly benefit 3D estimation. Results on the new OCHuman-Pose dataset show that multi-person performance is more affected by pose prediction accuracy than by detection. The code, models, and data are available on https://MiraPurkrabek.github.io/BBox-Mask-Pose/.

[163] A Computer Vision Hybrid Approach: CNN and Transformer Models for Accurate Alzheimer’s Detection from Brain MRI Scans

Md Mahmudul Hoque, Shuvo Karmaker, Md. Hadi Al-Amin, Md Modabberul Islam, Jisun Junayed, Farha Ulfat Mahi

Main category: cs.CV

TL;DR: Hybrid ensemble model Evan_V2 achieves near-perfect 99.99% accuracy for 4-class Alzheimer’s disease classification, outperforming standalone CNN and Transformer models by integrating features from 10 architectures.

DetailsMotivation: Early and accurate classification of Alzheimer's disease from brain MRI scans is essential for timely clinical intervention and improved patient outcomes. The study aims to develop a highly reliable diagnostic tool.

Method: Comparative analysis of 5 CNN architectures (EfficientNetB0, ResNet50, DenseNet201, MobileNetV3, VGG16), 5 Transformer-based models (ViT, ConvTransformer, PatchTransformer, MLP-Mixer, SimpleTransformer), and a proposed hybrid model Evan_V2 that integrates outputs from all 10 architectures through feature-level fusion.
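
Feature-level fusion of heterogeneous backbones can be sketched as below with two stand-in extractors; Evan_V2 fuses ten real CNN/Transformer backbones, which this toy does not reproduce.

```python
import torch
import torch.nn as nn

class FusionEnsemble(nn.Module):
    """Concatenate per-backbone feature vectors, then classify with a shared head."""
    def __init__(self, backbones, feat_dims, n_classes=4):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        self.head = nn.Linear(sum(feat_dims), n_classes)

    def forward(self, x):
        feats = [b(x) for b in self.backbones]   # one feature vector per backbone
        return self.head(torch.cat(feats, dim=1))

# stand-ins for CNN / Transformer feature extractors
b1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
b2 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
model = FusionEnsemble([b1, b2], feat_dims=[128, 64], n_classes=4)
print(model(torch.rand(2, 3, 32, 32)).shape)     # torch.Size([2, 4])
```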

Result: CNN models performed strongly with ResNet50 achieving 98.83% accuracy. Transformers showed competitive generalization with ViT at 95.38% accuracy but exhibited class-specific instability. Evan_V2 hybrid model achieved best performance: 99.99% accuracy, 0.9989 F1-score, and 0.9968 ROC AUC, substantially reducing misclassification across all dementia stages.

Conclusion: Hybrid ensemble strategies have significant potential for producing highly reliable and clinically meaningful diagnostic tools for Alzheimer’s disease classification, with Evan_V2 demonstrating superior performance over standalone models.

Abstract: Early and accurate classification of Alzheimer’s disease (AD) from brain MRI scans is essential for timely clinical intervention and improved patient outcomes. This study presents a comprehensive comparative analysis of five CNN architectures (EfficientNetB0, ResNet50, DenseNet201, MobileNetV3, VGG16), five Transformer-based models (ViT, ConvTransformer, PatchTransformer, MLP-Mixer, SimpleTransformer), and a proposed hybrid model named Evan_V2. All models were evaluated on a four-class AD classification task comprising Mild Dementia, Moderate Dementia, Non-Demented, and Very Mild Dementia categories. Experimental findings show that CNN architectures consistently achieved strong performance, with ResNet50 attaining 98.83% accuracy. Transformer models demonstrated competitive generalization capabilities, with ViT achieving the highest accuracy among them at 95.38%. However, individual Transformer variants exhibited greater class-specific instability. The proposed Evan_V2 hybrid model, which integrates outputs from ten CNN and Transformer architectures through feature-level fusion, achieved the best overall performance with 99.99% accuracy, 0.9989 F1-score, and 0.9968 ROC AUC. Confusion matrix analysis further confirmed that Evan_V2 substantially reduced misclassification across all dementia stages, outperforming every standalone model. These findings highlight the potential of hybrid ensemble strategies in producing highly reliable and clinically meaningful diagnostic tools for Alzheimer’s disease classification.
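
Feature-level fusion of heterogeneous backbones, as described for Evan_V2, can be sketched as concatenating per-model embeddings before a shared classifier. The stub backbones, feature dimensions, and head below are illustrative assumptions, not the authors' design; a real instantiation would plug in the ten pretrained CNN and Transformer feature extractors.

```python
import torch
import torch.nn as nn

class FeatureFusionEnsemble(nn.Module):
    """Toy feature-level fusion: concatenate per-backbone embeddings,
    then classify with a shared head (4 classes, as in the AD task)."""
    def __init__(self, backbones, feat_dims, num_classes=4):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        self.head = nn.Sequential(
            nn.Linear(sum(feat_dims), 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        feats = [b(x) for b in self.backbones]   # one embedding per model
        return self.head(torch.cat(feats, dim=1))

# Two stand-in "backbones" producing flat feature vectors.
cnn_like = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
vit_like = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 64))
model = FeatureFusionEnsemble([cnn_like, vit_like], feat_dims=[128, 64])

logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 4])
```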

[164] ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao, Yujun Shen, Yiyi Liao

Main category: cs.CV

TL;DR: ScenDi integrates 3D and 2D diffusion models for realistic urban scene generation, addressing limitations of single-model approaches by combining 3DGS generation with 2D video enhancement for better appearance details and camera controllability.

DetailsMotivation: Existing 3D object generation methods struggle with urban scenes: 3D diffusion models degrade appearance details, while 2D diffusion models compromise camera controllability. There's a need for a method that combines both approaches for realistic urban scene generation.

Method: Two-stage approach: 1) Train 3D latent diffusion model to generate 3D Gaussians (3DGS) at low resolution with optional conditioning (3D bounding boxes, road maps, text prompts). 2) Train 2D video diffusion model to enhance appearance details using rendered images from 3D Gaussians as guidance, ensuring adherence to camera trajectories.

Result: Experiments on Waymo and KITTI-360 datasets demonstrate ScenDi’s effectiveness in generating realistic urban scenes that maintain both appearance details and camera controllability, outperforming single-model approaches.

Conclusion: ScenDi successfully integrates 3D and 2D diffusion models to overcome limitations of existing methods, enabling realistic urban scene generation with both detailed appearance and accurate camera control through a coarse-to-fine approach.

Abstract: Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer from a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome these limitations, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.

[165] Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification

Fabi Nahian Madhurja, Rusab Sarmun, Muhammad E. H. Chowdhury, Adam Mushtak, Israa Al-Hashimi, Sohaib Bassam Zoghoul

Main category: cs.CV

TL;DR: 2D projection-based pipeline for cervical spine fracture detection in 3D CT volumes using optimized projections for localization, segmentation, and ensemble fracture classification with competitive performance.

DetailsMotivation: Cervical spine fractures require precise detection for clinical management, but traditional 3D segmentation methods are computationally complex, motivating an efficient automated analysis pipeline.

Method: End-to-end pipeline: 1) 2D projection-based vertebra localization using YOLOv8 on axial/sagittal/coronal views, 2) DenseNet121-Unet segmentation with variance/energy projections, 3) 2.5D Spatio-Sequential ensemble for fracture detection using raw slices and projections.

Result: 3D mIoU 94.45%, Dice score 87.86%, vertebra-level F1 68.15% (ROC-AUC 91.62%), patient-level F1 82.26% (ROC-AUC 83.04%). Competitive with expert radiologists in interobserver analysis.

Conclusion: 2D projection-based approach effectively reduces computational complexity while maintaining high performance for cervical spine fracture detection, validated through explainability and clinical comparison studies.

Abstract: Cervical spine fractures are critical medical conditions requiring precise and efficient detection for effective clinical management. This study explores the viability of 2D projection-based vertebra segmentation for vertebra-level fracture detection in 3D CT volumes, presenting an end-to-end pipeline for automated analysis of cervical vertebrae (C1-C7). By approximating a 3D volume through optimized 2D axial, sagittal, and coronal projections, regions of interest are identified using the YOLOv8 model from all views and combined to approximate the 3D cervical spine area, achieving a 3D mIoU of 94.45 percent. This projection-based localization strategy reduces computational complexity compared to traditional 3D segmentation methods while maintaining high performance. It is followed by a DenseNet121-Unet-based multi-label segmentation leveraging variance- and energy-based projections, achieving a Dice score of 87.86 percent. Strategic approximation of 3D vertebral masks from these 2D segmentation masks enables the extraction of individual vertebra volumes. The volumes are analyzed for fractures using an ensemble of 2.5D Spatio-Sequential models incorporating both raw slices and projections per vertebra for complementary evaluation. This ensemble achieves vertebra-level and patient-level F1 scores of 68.15 and 82.26, and ROC-AUC scores of 91.62 and 83.04, respectively. We further validate our approach through an explainability study that provides saliency map visualizations highlighting anatomical regions relevant for diagnosis, and an interobserver variability analysis comparing our model’s performance with expert radiologists, demonstrating competitive results.
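
The variance- and energy-based projections that the segmentation stage relies on collapse a 3D volume to 2D along an axis. A minimal sketch under common definitions of these projections (the paper's exact formulations may differ):

```python
import numpy as np

def variance_projection(volume: np.ndarray, axis: int = 0) -> np.ndarray:
    """Collapse a 3D volume to 2D by taking per-ray intensity variance."""
    return volume.var(axis=axis)

def energy_projection(volume: np.ndarray, axis: int = 0) -> np.ndarray:
    """Collapse a 3D volume to 2D by summing squared intensities per ray."""
    return (volume.astype(np.float64) ** 2).sum(axis=axis)

ct = np.random.rand(64, 128, 128)                # stand-in CT volume (D, H, W)
axial_var = variance_projection(ct, axis=0)      # (H, W) axial view
sagittal_energy = energy_projection(ct, axis=2)  # (D, H) sagittal view
print(axial_var.shape, sagittal_energy.shape)
```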

[166] FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng, Levent Burak Kara, Joaquim Jorge, Qun-Ce Xu

Main category: cs.CV

TL;DR: FlowSSC is a generative framework for monocular semantic scene completion that uses shortcut flow-matching in triplane latent space to achieve real-time, high-fidelity 3D scene completion from single RGB images.

DetailsMotivation: Existing feed-forward methods for semantic scene completion from monocular images struggle with generating plausible details in occluded regions and preserving spatial relationships, which is critical for real-world applications like autonomous systems.

Method: FlowSSC treats SSC as a conditional generation problem and introduces Shortcut Flow-matching, which operates in a compact triplane latent space, enabling high-fidelity generation in a single step instead of hundreds of diffusion steps.

Result: Extensive experiments on SemanticKITTI show FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines while enabling real-time inference.

Conclusion: FlowSSC is the first generative framework for monocular semantic scene completion that can integrate with existing methods to boost performance, achieving practical deployment capability through efficient single-step generation.

Abstract: Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.
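
The shortcut idea is to condition the velocity field on the desired step size, so sampling can take one large Euler step across the whole interval instead of hundreds of small ones. A schematic sketch with a stub network and a toy latent dimension, not the FlowSSC architecture:

```python
import torch
import torch.nn as nn

class ShortcutVelocityNet(nn.Module):
    """Stub velocity field v(x, t, d), conditioned on time t and step size d,
    so the model can be queried with d = 1.0 for one-step generation."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x, t, d):
        # t, d: (batch,) scalars appended as extra input features
        return self.net(torch.cat([x, t[:, None], d[:, None]], dim=1))

dim = 32                     # toy stand-in for the triplane latent dimension
v = ShortcutVelocityNet(dim)

x0 = torch.randn(4, dim)     # noise sample at t = 0
t = torch.zeros(4)
d = torch.ones(4)            # step size 1.0 spans the whole [0, 1] interval
x1 = x0 + d[:, None] * v(x0, t, d)   # single Euler step = one-step sampling
print(x1.shape)
```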

[167] DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration

Dominik Rößle, Xujun Xie, Adithya Mohan, Venkatesh Thirugnana Sambandham, Daniel Cremers, Torsten Schön

Main category: cs.CV

TL;DR: DrivIng is a large-scale multimodal autonomous driving dataset with a complete digital twin of an 18 km route, enabling realistic simulation and testing.

DetailsMotivation: Existing autonomous driving datasets lack high-fidelity digital twins, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations.

Method: Created a dataset with continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization across urban, suburban, and highway segments. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes.

Result: ~1.2 million annotated instances covering day, dusk, and night conditions. The dataset enables 1-to-1 transfer of real traffic into simulation while preserving agent interactions.

Conclusion: DrivIng addresses the gap in existing datasets by providing both real-world data and a complete digital twin, supporting reproducible research, robust validation, and flexible scenario testing for autonomous driving perception.

Abstract: Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.

[168] RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani

Main category: cs.CV

TL;DR: RayRoPE is a novel positional encoding method for multi-view transformers that uses ray-based encoding with predicted 3D points to achieve SE(3) invariance and geometry-aware attention, outperforming prior methods on novel-view synthesis and depth estimation tasks.

DetailsMotivation: Current positional encoding schemes for multi-view transformers fail to uniquely encode patches, achieve SE(3)-invariant attention with multi-frequency similarity, and adapt to scene geometry. The authors seek a mechanism that addresses all these limitations simultaneously.

Method: RayRoPE represents patch positions using associated rays but leverages predicted 3D points along rays instead of ray directions. It computes query-frame projective coordinates for SE(3) invariance and analytically computes expected position encoding under uncertainty when predicted 3D points are imprecise.

Result: RayRoPE consistently outperforms alternative position encoding schemes, achieving 15% relative improvement on LPIPS in CO3D for novel-view synthesis. It also enables seamless incorporation of RGB-D input, resulting in even larger gains over methods that cannot encode this information.

Conclusion: RayRoPE successfully addresses the limitations of prior multi-view positional encoding schemes by providing a geometry-aware, SE(3)-invariant encoding that handles uncertainty and improves performance on 3D vision tasks while enabling RGB-D integration.

Abstract: We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the ‘predicted’ 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
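
Two of RayRoPE's ingredients, query-frame coordinates for SE(3) invariance and multi-frequency similarity, can be illustrated in a few lines. This is a simplified toy that uses camera-relative rather than true projective coordinates and omits the uncertainty handling, so it is not the paper's formulation:

```python
import torch

def to_query_frame(points_w, T_wq):
    """Map world-space points into a query camera frame (T_wq: 4x4 world->query).
    Encoding camera-relative coordinates is what buys SE(3) invariance."""
    homog = torch.cat([points_w, torch.ones(points_w.shape[0], 1)], dim=1)
    return (homog @ T_wq.T)[:, :3]

def multifreq_encoding(p, num_freqs=4):
    """Multi-frequency sin/cos features of a 3D coordinate (RoPE-style bands)."""
    freqs = 2.0 ** torch.arange(num_freqs)   # 1, 2, 4, 8
    angles = p[:, :, None] * freqs           # (N, 3, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

points = torch.randn(5, 3)   # stand-ins for predicted 3D points along patch rays
T = torch.eye(4)             # identity world->query transform for the demo
enc = multifreq_encoding(to_query_frame(points, T))
sim = enc @ enc.T            # multi-frequency positional similarity matrix
print(enc.shape, sim.shape)
```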

[169] StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, Chenyang Si

Main category: cs.CV

TL;DR: StableWorld introduces a Dynamic Frame Eviction Mechanism to address instability and temporal inconsistency in interactive video generation by filtering out degraded frames while retaining geometrically consistent ones.

DetailsMotivation: Current interactive video generation methods suffer from severe instability and temporal degradation, leading to spatial drift and scene collapse during long-horizon interactions. The authors identify that error accumulation originates from generated frames gradually deviating from initial clean states and propagating errors to subsequent frames.

Method: Proposes StableWorld, a Dynamic Frame Eviction Mechanism that continuously filters out degraded frames while retaining geometrically consistent ones to prevent cumulative drift at its source.

Result: Demonstrates promising results on multiple interactive video models (Matrix-Game, Open-Oasis, Hunyuan-GameCraft), showing that StableWorld is model-agnostic and can substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.

Conclusion: StableWorld provides a simple yet effective solution to the overlooked challenge of stability in interactive video generation, preventing error accumulation and enabling more stable, temporally consistent interactive video worlds.

Abstract: In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we initially investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
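
Dynamic frame eviction reduces to filtering the generation context by a per-frame quality score. A minimal sketch; the scoring function here is a stand-in, since the paper's geometric-consistency criterion is not detailed in this summary:

```python
from collections import deque

def evict_degraded(frames, scores, keep=8, threshold=0.5):
    """Keep the most recent `keep` frames whose quality score clears the
    threshold; degraded frames are evicted from the generation context
    before their errors can propagate to subsequent frames."""
    kept = [(f, s) for f, s in zip(frames, scores) if s >= threshold]
    return deque(kept[-keep:], maxlen=keep)

frames = [f"frame_{i}" for i in range(12)]
scores = [0.9, 0.8, 0.4, 0.95, 0.3, 0.85, 0.9, 0.2, 0.8, 0.88, 0.7, 0.91]
context = evict_degraded(frames, scores, keep=4)
print([f for f, _ in context])  # the last 4 frames that passed the filter
```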

[170] Rethinking Video Generation Model for the Embodied World

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

Main category: cs.CV

TL;DR: RBench is a comprehensive robotics benchmark for evaluating robot-oriented video generation across multiple task domains and embodiments, assessing both task correctness and visual fidelity. The paper also introduces RoVid-X, a large open-source robotic dataset with 4M annotated video clips to address data scarcity.

DetailsMotivation: There's a lack of standardized benchmarks for evaluating robot-oriented video generation, making fair comparisons difficult. Current models struggle to generate physically realistic robot behaviors, and there's a critical shortage of high-quality training data for robotic video generation.

Method: 1) Introduces RBench benchmark with five task domains and four distinct embodiments, evaluating task-level correctness and visual fidelity through reproducible sub-metrics (structural consistency, physical plausibility, action completeness). 2) Develops a refined four-stage data pipeline to create RoVid-X, the largest open-source robotic dataset with 4 million annotated video clips covering thousands of tasks with comprehensive physical property annotations.

Result: Evaluation of 25 representative models reveals significant deficiencies in generating physically realistic robot behaviors. RBench achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. The benchmark and dataset create a synergistic ecosystem for rigorous assessment and scalable training.

Conclusion: The paper establishes a robust foundation for advancing embodied AI through standardized evaluation (RBench) and large-scale training data (RoVid-X). This synergistic ecosystem addresses both assessment and data scarcity challenges, accelerating progress toward physically realistic robot video generation and general embodied intelligence.

Abstract: Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
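
The reported benchmark-vs-human agreement is a plain Spearman rank correlation, which scipy computes directly. A toy example with hypothetical scores (not the paper's data):

```python
from scipy.stats import spearmanr

# Hypothetical benchmark scores and human ratings for six models.
bench = [0.72, 0.65, 0.81, 0.40, 0.55, 0.90]
human = [3.8, 3.5, 4.2, 2.1, 3.0, 4.6]

rho, pval = spearmanr(bench, human)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```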

[171] LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt

Main category: cs.CV

TL;DR: Novel method for interactive light editing in indoor scenes from multi-view capture using generative image-based light decomposition and 3D Gaussian splatting for real-time control.

DetailsMotivation: Enabling interactive editing of complex indoor lighting from a single multi-view capture, allowing independent manipulation of individual light sources for realistic scene relighting.

Method: Uses generative image-based light decomposition to factorize indoor illumination into constituent light sources, with multi-view lighting harmonization for consistency, integrated into relightable 3D Gaussian splatting representation.

Result: Demonstrates highly photorealistic lighting decomposition and relighting across diverse indoor scenes, evaluated on synthetic and real-world datasets with quantitative/qualitative comparisons to state-of-the-art.

Conclusion: Presents an effective approach for interactive light editing with real-time control over individual light sources, enabling practical indoor scene relighting applications.

Abstract: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.
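
Once a scene is factorized into per-light components, remixing becomes a weighted recombination of those components. A toy sketch assuming additive light transport in linear RGB; the paper's relightable 3DGS pipeline is far richer:

```python
import numpy as np

def remix(components, gains, tints):
    """Recombine per-light-source renderings: scale each component by a
    scalar intensity gain and an RGB tint, then sum (additive lighting)."""
    out = np.zeros_like(components[0])
    for comp, gain, tint in zip(components, gains, tints):
        out += gain * comp * np.asarray(tint)[None, None, :]
    return np.clip(out, 0.0, 1.0)

lamp = np.random.rand(4, 4, 3)     # stand-in per-light renderings (H, W, 3)
window = np.random.rand(4, 4, 3)
image = remix([lamp, window],
              gains=[0.0, 1.5],                    # lamp off, window brightened
              tints=[(1, 1, 1), (1.0, 0.9, 0.8)])  # warm window chromaticity
print(image.shape)
```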

[172] Iterative Refinement Improves Compositional Image Generation

Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

Main category: cs.CV

TL;DR: Iterative test-time refinement strategy for text-to-image models using vision-language model feedback to improve compositional prompt alignment.

DetailsMotivation: Current T2I models struggle with complex compositional prompts requiring multiple objects, relations, and attributes. Existing inference-time strategies like parallel sampling or increasing denoising steps remain inadequate for richly compositional settings.

Method: Proposes iterative refinement where T2I model progressively refines generations across multiple steps, guided by feedback from a vision-language model as critic. Simple, requires no external tools/priors, flexible for various image generators and VLMs.

Result: 16.9% improvement in all-correct rate on ConceptMix (k=7), 13.8% improvement on T2I-CompBench (3D-Spatial), 12.5% improvement on Visual Jenga scene decomposition vs compute-matched parallel sampling. Human evaluators prefer method 58.7% vs 41.3% baseline.

Conclusion: Iterative self-correction is a broadly applicable principle for compositional image generation, producing more faithful generations by decomposing complex prompts into sequential corrections.

Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
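
The test-time loop itself is simple to express: generate, ask a VLM critic whether the compositional constraints hold, and regenerate with the feedback. In this sketch, generate and critique are hypothetical stand-ins for a T2I model call and a VLM call:

```python
def generate(prompt, feedback=None):
    """Hypothetical T2I call; a real system would invoke a diffusion model,
    folding critic feedback into the conditioning or an edited prompt."""
    return {"prompt": prompt, "feedback": feedback}

def critique(image, prompt):
    """Hypothetical VLM critic returning (all_constraints_met, feedback).
    Stub behavior: one round of repair suffices."""
    ok = image.get("feedback") is not None
    return ok, "the red cube should be left of the sphere"

def iterative_refine(prompt, max_steps=4):
    image = generate(prompt)
    for _ in range(max_steps):
        ok, feedback = critique(image, prompt)
        if ok:                    # all constraints satisfied -> stop early
            break
        image = generate(prompt, feedback=feedback)  # targeted correction
    return image

result = iterative_refine("a red cube left of a blue sphere on a desk")
print(result["feedback"])
```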

[173] Walk through Paintings: Egocentric World Models from Internet Priors

Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert

Main category: cs.CV

TL;DR: EgoWM transforms pretrained video diffusion models into action-conditioned world models for accurate future prediction across various embodiments, achieving better physical correctness and faster inference than prior methods.

DetailsMotivation: To create video generation models that can accurately predict correct futures rather than just plausible ones, specifically for action-conditioned world modeling that faithfully follows motor commands while preserving realism and generalization.

Method: Repurpose pretrained video diffusion models by injecting motor commands through lightweight conditioning layers, avoiding training from scratch and leveraging existing world priors. Scales across embodiments from 3-DoF robots to 25-DoF humanoids.

Result: EgoWM improves Structural Consistency Score (SCS) by up to 80% over prior navigation world models, achieves 6x lower inference latency, and demonstrates robust generalization to unseen environments including navigation inside paintings.

Conclusion: EgoWM successfully transforms video diffusion models into effective action-conditioned world models that produce physically correct, controllable future predictions across diverse embodiments with minimal fine-tuning.

Abstract: What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
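
One common way to inject motor commands through lightweight conditioning layers is FiLM-style feature modulation. The sketch below shows that generic mechanism, not EgoWM's specific layer design:

```python
import torch
import torch.nn as nn

class ActionFiLM(nn.Module):
    """Lightweight conditioning: map an action vector to per-channel
    scale/shift and modulate (e.g. frozen) backbone features."""
    def __init__(self, action_dim, channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(action_dim, 2 * channels)

    def forward(self, feats, action):
        # feats: (B, C, H, W); action: (B, action_dim)
        scale, shift = self.to_scale_shift(action).chunk(2, dim=1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

film = ActionFiLM(action_dim=3, channels=16)  # e.g. a 3-DoF mobile robot
feats = torch.randn(2, 16, 8, 8)              # video-model features
action = torch.randn(2, 3)                    # motor command
print(film(feats, action).shape)              # torch.Size([2, 16, 8, 8])
```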

[174] Towards Understanding Best Practices for Quantization of Vision-Language Models

Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

Main category: cs.CV

TL;DR: This paper investigates quantization methods for multimodal pipelines (vision models, language models, connectors) to reduce memory and latency while preserving performance on tasks like captioning, retrieval, and QA.

DetailsMotivation: LLMs require fast GPUs with large memory, so quantization is needed to reduce memory and latency. While research exists on quantization for individual models, there's limited work on applying these methods to multimodal pipelines that combine vision and language components.

Method: The study applies various quantization methods (including state-of-the-art GPTQ and AWQ) to multimodal pipelines. They systematically evaluate how performance on captioning, retrieval, and question answering is affected by bit width, quantization method, and which pipeline component (vision model, language model, or connectors) is quantized.

Result: Results show that ViT and LLM have comparable importance in model performance despite parameter size differences. Lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). The study provides practical insights for efficient MLLM deployment.

Conclusion: The findings highlight the value of exploring component sensitivities in multimodal models and provide practical guidance for efficient deployment of multimodal language models through targeted quantization strategies.

Abstract: Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
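
The bits-per-weight trade-off the paper measures can be illustrated with naive uniform quantization of a single weight tensor. GPTQ and AWQ are considerably more sophisticated; this sketch only shows what reducing bpw does to reconstruction error:

```python
import torch

def quantize_uniform(w: torch.Tensor, bits: int):
    """Naive symmetric per-tensor uniform quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

w = torch.randn(512, 512)   # stand-in weight matrix
for bits in (8, 4, 2):
    wq, _ = quantize_uniform(w, bits)
    err = (w - wq).pow(2).mean().sqrt()
    print(f"{bits} bpw: RMS quantization error = {err:.4f}")
```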

[175] APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping

Jiwon Kang, Yeji Choi, JoungBin Lee, Wooseok Jang, Jinhyeok Choi, Taekeun Kang, Yongjae Park, Myungin Kim, Seungryong Kim

Main category: cs.CV

TL;DR: APPLE is a diffusion-based teacher-student framework for face swapping that uses attribute-aware pseudo-label supervision to better preserve target attributes like lighting, skin tone, and makeup while transferring source identity.

DetailsMotivation: Face swapping faces challenges due to lack of real ground truth data, making it difficult to achieve both accurate identity transfer and high-quality attribute preservation. Existing diffusion-based approaches using conditional inpainting on masked target images remove crucial appearance cues, leading to plausible but misaligned attributes.

Method: APPLE uses a teacher-student framework with attribute-aware pseudo-label supervision. It reformulates face swapping as a conditional deblurring task to preserve target-specific attributes. The method includes an attribute-aware inversion scheme for detailed attribute preservation and elaborate attribute-preserving design for teacher learning to produce high-quality pseudo triplets that provide direct supervision to the student model.

Result: APPLE achieves state-of-the-art performance in both attribute preservation and identity transfer, producing more photorealistic and target-faithful face swapping results compared to existing methods.

Conclusion: The proposed APPLE framework effectively addresses the limitations of current face swapping methods by using attribute-aware pseudo-label supervision and conditional deblurring formulation, resulting in superior preservation of target attributes while maintaining accurate identity transfer.

Abstract: Face swapping aims to transfer the identity of a source face onto a target face while preserving target-specific attributes such as pose, expression, lighting, skin tone, and makeup. However, since real ground truth for face swapping is unavailable, achieving both accurate identity transfer and high-quality attribute preservation remains challenging. In addition, recent diffusion-based approaches attempt to improve visual fidelity through conditional inpainting on masked target images, but the masked condition removes crucial appearance cues of target, resulting in plausible yet misaligned attributes. To address these limitations, we propose APPLE (Attribute-Preserving Pseudo-Labeling), a diffusion-based teacher-student framework that enhances attribute fidelity through attribute-aware pseudo-label supervision. We reformulate face swapping as a conditional deblurring task to more faithfully preserve target-specific attributes such as lighting, skin tone, and makeup. In addition, we introduce an attribute-aware inversion scheme to further improve detailed attribute preservation. Through an elaborate attribute-preserving design for teacher learning, APPLE produces high-quality pseudo triplets that explicitly provide the student with direct face-swapping supervision. Overall, APPLE achieves state-of-the-art performance in terms of attribute preservation and identity transfer, producing more photorealistic and target-faithful results.

[176] Semantic Image Synthesis via Diffusion Models

Wengang Zhou, Weilun Wang, Jianmin Bao, Dongdong Chen, Dong Chen, Lu Yuan, Houqiang Li

Main category: cs.CV

TL;DR: A novel semantic image synthesis framework using Denoising Diffusion Probabilistic Models (DDPMs) with improved semantic layout processing and classifier-free guidance sampling.

DetailsMotivation: GAN-based approaches for semantic image synthesis often lead to unsatisfactory quality or diversity. DDPMs have shown remarkable success in image generation but need better adaptation for semantic image synthesis tasks.

Method: Proposes a DDPM-based framework that processes semantic layout and noisy image differently: noisy image goes to U-Net encoder while semantic layout goes to decoder via multi-layer spatially-adaptive normalization operators. Uses classifier-free guidance sampling strategy.

Result: Achieves state-of-the-art performance on four benchmark datasets in terms of fidelity (FID) and diversity (LPIPS).

Conclusion: The proposed DDPM-based framework effectively addresses limitations of GAN-based approaches for semantic image synthesis, demonstrating superior performance through better semantic layout utilization and improved sampling strategy.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks compared with Generative Adversarial Nets (GANs). Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches, which may lead to unsatisfactory quality or diversity of generated images. In this paper, we propose a novel framework based on DDPM for semantic image synthesis. Unlike previous conditional diffusion models, which directly feed the semantic layout and noisy image as input to a U-Net structure and may not fully leverage the information in the input semantic mask, our framework processes the semantic layout and noisy image differently. It feeds the noisy image to the encoder of the U-Net structure and the semantic layout to the decoder via multi-layer spatially-adaptive normalization operators. To further improve the generation quality and semantic interpretability in semantic image synthesis, we introduce the classifier-free guidance sampling strategy, which incorporates the scores of an unconditional model into the sampling process. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS). Our code and pretrained models are available at https://github.com/WeilunWang/semantic-diffusion-model.
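
Classifier-free guidance blends conditional and unconditional noise predictions at each denoising step. A minimal sketch of that standard rule with a stub predictor over flat latents (shapes and the null-condition convention are illustrative assumptions):

```python
import torch
import torch.nn as nn

eps_net = nn.Linear(16 + 4, 16)   # stub noise predictor over flat latents

def predict_eps(x, layout):
    return eps_net(torch.cat([x, layout], dim=1))

def cfg_step(x, layout, guidance_scale=3.0):
    """Classifier-free guidance: push the conditional prediction away
    from the unconditional one by the guidance scale."""
    eps_cond = predict_eps(x, layout)
    eps_uncond = predict_eps(x, torch.zeros_like(layout))  # null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

x = torch.randn(2, 16)       # noisy latent
layout = torch.randn(2, 4)   # stand-in semantic-layout embedding
print(cfg_step(x, layout).shape)
```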

[177] Benchmarking the Influence of Pre-training on Explanation Performance in MR Image Classification

Marta Oliveira, Rick Wilming, Benedict Clark, Céline Budding, Fabian Eitel, Kerstin Ritter, Stefan Haufe

Main category: cs.CV

TL;DR: This paper proposes a benchmark dataset for quantitatively evaluating XAI methods in medical imaging, specifically MRI classification, and investigates how transfer learning affects explanation quality.

DetailsMotivation: Current CNN models in medical prediction tasks are complex and lack interpretability, motivating XAI research. However, previous studies rarely quantitatively evaluated XAI methods against ground-truth data, and the influence of transfer learning on explanation performance remains unexplored.

Method: The authors propose a benchmark dataset for MRI classification that allows quantitative evaluation of explanation performance. They use this benchmark to systematically study how transfer learning (specifically, pre-training task and number of pre-trained layers) affects the quality of explanations from popular XAI methods.

Result: Results show that different XAI methods applied to the same model vary widely in performance, even for correctly classified examples. Explanation performance strongly depends on the pre-training task and the number of CNN layers pre-trained. These findings hold after correcting for correlation between explanation and classification performance.

Conclusion: The study demonstrates the need for quantitative evaluation of XAI methods in medical imaging and reveals that transfer learning significantly impacts explanation quality. The proposed benchmark enables systematic assessment of XAI methods, and the findings highlight important considerations for developing interpretable medical AI systems.

Abstract: Convolutional Neural Networks (CNNs) are frequently and successfully used in medical prediction tasks. They are often used in combination with transfer learning, leading to improved performance when training data for the task are scarce. The resulting models are highly complex and typically do not provide any insight into their predictive mechanisms, motivating the field of “explainable” artificial intelligence (XAI). However, previous studies have rarely quantitatively evaluated the “explanation performance” of XAI methods against ground-truth data, and the influence of transfer learning on objective measures of explanation performance has not been investigated. Here, we propose a benchmark dataset that allows for quantifying explanation performance in a realistic magnetic resonance imaging (MRI) classification task. We employ this benchmark to understand the influence of transfer learning on the quality of explanations. Experimental results show that popular XAI methods applied to the same underlying model differ vastly in performance, even when considering only correctly classified examples. We further observe that explanation performance strongly depends on the task used for pre-training and the number of CNN layers pre-trained. These results hold after correcting for a substantial correlation between explanation and classification performance.

[178] Image class translation: visual inspection of class-specific hypotheticals and classification based on translation distance

Mikyla K. Bowen, Jesse W. Wilson

Main category: cs.CV

TL;DR: Image-to-image translation networks used for classification by comparing input images to class-specific hypotheticals, achieving interpretable results comparable to conventional CNNs while exposing dataset biases.

DetailsMotivation: Address the lack of explainability and high confidence for incorrect decisions in medical AI, especially with out-of-domain samples, by developing more interpretable alternatives to black-box classifiers.

Method: Train image-to-image networks to translate input images to class-specific hypotheticals, then compare these with the input both visually and quantitatively. Use translation distances as low-dimensional feature vectors for classification.

Result: On melanoma/benign dermoscopy: 80% accuracy with 2D features vs 85% for CNN with ~62,000D features. On bone marrow cytology: 92% vs 89% (3-class) and 90% vs 86% (6-class) accuracy. Visual inspection revealed dataset biases like scalebars and vignetting.

Conclusion: Image-to-image networks can go beyond artistic changes to expose dataset biases, perform dimension reduction and visualization, and potentially outperform conventional CNN classifiers while providing interpretability.

Abstract: Purpose: A major barrier to the implementation of artificial intelligence for medical applications is the lack of explainability and high confidence for incorrect decisions, specifically with out-of-domain samples. We propose a generalization of image translation networks for image classification and demonstrate their potential as a more interpretable alternative to conventional black-box classifiers. Approach: We train an image2image network to translate an input image to class-specific hypotheticals, and then compare these with the input, both visually and quantitatively. Translation distances, i.e., the degree of alteration needed to conform to one class or another, are examined for clusters and trends, and used as simple low-dimensional feature vectors for classification. Results: On melanoma/benign dermoscopy images, a translation distance classifier achieved 80% accuracy using only a 2-dimensional feature space (versus 85% for a conventional CNN using a ~62,000-dimensional feature space). Visual inspection of rendered images revealed dataset biases, such as scalebars, vignetting, and pale background pigmentation in melanomas. Image distributions in translation distance space revealed a natural separation along the lines of dermatologist decision to biopsy, rather than between malignant and benign. On bone marrow cytology images, translation distance classifiers outperformed a conventional CNN in both 3-class (92% accuracy vs 89% for CNN) and 6-class (90% vs 86% for CNN) scenarios. Conclusions: This proof-of-concept shows the potential for image2image networks to go beyond artistic/stylistic changes and to expose dataset biases, perform dimension reduction and dataset visualization, and in some cases, potentially outperform conventional end-to-end CNN classifiers.
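
Translation-distance classification measures how much an input must change to conform to each class and classifies by the smallest change. A toy sketch in which the class-specific "translators" are hypothetical stand-ins for trained image2image networks:

```python
import numpy as np

def translation_distance(x, translated):
    """Degree of alteration needed to conform to a class: mean absolute
    pixel change between the input and its class-specific hypothetical."""
    return np.abs(x - translated).mean()

def classify(x, generators):
    # One distance per class -> a low-dimensional, inspectable feature vector.
    dists = np.array([translation_distance(x, g(x)) for g in generators])
    return dists, int(np.argmin(dists))  # nearest class needs the least change

# Hypothetical "translators": near-identity vs. strong darkening.
to_class_a = lambda x: x * 0.95
to_class_b = lambda x: x * 0.4

img = np.random.rand(64, 64, 3)
features, label = classify(img, [to_class_a, to_class_b])
print(features, "predicted class:", label)
```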

[179] NumGrad-Pull: Numerical Gradient Guided Tri-plane Representation for Surface Reconstruction from Point Clouds

Ruikai Cui, Binzhu Xie, Shi Qiu, Jiawei Liu, Saeed Anwar, Nick Barnes

Main category: cs.CV

TL;DR: NumGrad-Pull improves surface reconstruction from unoriented 3D points using tri-plane structures with numerical gradients and progressive expansion for better detail fidelity and training stability.

DetailsMotivation: Reconstructing continuous surfaces from unoriented, unordered 3D points is a fundamental challenge. While recent methods use neural signed distance functions with analytical gradients, there's room for improvement in learning acceleration, detail fidelity, and training stability.

Method: Leverages tri-plane structures for signed distance functions, replaces analytical gradients with numerical gradients for better training stability, uses progressive plane expansion for faster convergence, and implements data sampling to reduce artifacts.

Result: Extensive experiments across various benchmarks demonstrate the approach’s effectiveness and robustness in surface reconstruction with improved detail fidelity.

Conclusion: NumGrad-Pull successfully enhances surface reconstruction quality through tri-plane structures with numerical gradients and progressive expansion strategies, offering a robust solution for 3D point cloud reconstruction.

Abstract: Reconstructing continuous surfaces from unoriented and unordered 3D points is a fundamental challenge in computer vision and graphics. Recent advancements address this problem by training neural signed distance functions to pull 3D location queries to their closest points on a surface, following the predicted signed distances and the analytical gradients computed by the network. In this paper, we introduce NumGrad-Pull, leveraging the representation capability of tri-plane structures to accelerate the learning of signed distance functions and enhance the fidelity of local details in surface reconstruction. To further improve the training stability of grid-based tri-planes, we propose to exploit numerical gradients, replacing conventional analytical computations. Additionally, we present a progressive plane expansion strategy to facilitate faster signed distance function convergence and design a data sampling strategy to mitigate reconstruction artifacts. Our extensive experiments across a variety of benchmarks demonstrate the effectiveness and robustness of our approach. Code is available at https://github.com/CuiRuikai/NumGrad-Pull
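
The switch from analytical to numerical gradients amounts to central finite differences through the signed distance function, requiring only forward evaluations. A minimal sketch with a stub sphere SDF; the step size is an assumed hyperparameter:

```python
import torch

def sdf(p):
    """Stub signed distance function: a unit sphere at the origin. In
    NumGrad-Pull this would be the tri-plane-backed SDF."""
    return p.norm(dim=-1) - 1.0

def numerical_gradient(p, eps=1e-3):
    """Central finite differences along each axis; unlike autograd, this
    only needs forward evaluations of the SDF."""
    grads = []
    for i in range(3):
        offset = torch.zeros_like(p)
        offset[..., i] = eps
        grads.append((sdf(p + offset) - sdf(p - offset)) / (2 * eps))
    return torch.stack(grads, dim=-1)

queries = torch.randn(8, 3)
g = numerical_gradient(queries)
print(g.shape, g.norm(dim=-1))  # gradient norms ~1 for a true SDF
```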

[180] PanoDreamer: Optimization-Based Single Image to 360 3D Scene With Diffusion

Avinash Paliwal, Xilong Zhou, Andrii Tsarov, Nima Khademi Kalantari

Main category: cs.CV

TL;DR: PanoDreamer: A novel method for generating coherent 360° 3D scenes from single images using panorama/depth estimation and alternating optimization.

DetailsMotivation: Existing methods generate 360° 3D scenes sequentially, which can lead to inconsistencies. There's a need for a more coherent approach to single-image 3D scene reconstruction.

Method: Frames the problem as single-image panorama and depth estimation. Uses alternating minimization strategies to optimize both panorama and depth estimation tasks simultaneously. After obtaining coherent panoramic image and depth, reconstructs scene by inpainting small occluded regions and projecting into 3D space.

Result: Outperforms existing techniques in single-image 360° 3D scene reconstruction in terms of consistency and overall quality.

Conclusion: PanoDreamer provides an effective approach for coherent 360° 3D scene generation from single images through novel formulation of panorama/depth estimation as optimization tasks with alternating minimization.

Abstract: In this paper, we present PanoDreamer, a novel method for producing a coherent 360° 3D scene from a single input image. Unlike existing methods that generate the scene sequentially, we frame the problem as single-image panorama and depth estimation. Once the coherent panoramic image and its corresponding depth are obtained, the scene can be reconstructed by inpainting the small occluded regions and projecting them into 3D space. Our key contribution is formulating single-image panorama and depth estimation as two optimization tasks and introducing alternating minimization strategies to effectively solve their objectives. We demonstrate that our approach outperforms existing techniques in single-image 360° 3D scene reconstruction in terms of consistency and overall quality.
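
Alternating minimization interleaves optimization steps on two coupled variables, holding one fixed while updating the other. A toy skeleton with stand-in quadratic objectives; the paper's panorama and depth losses are far richer:

```python
import torch

# Toy coupled problem: panorama-like variable p and depth-like variable d.
p = torch.randn(10, requires_grad=True)
d = torch.randn(10, requires_grad=True)
opt_p = torch.optim.Adam([p], lr=0.05)
opt_d = torch.optim.Adam([d], lr=0.05)

def loss_p(p, d):  # stands in for the panorama objective, depth held fixed
    return ((p - d.detach()) ** 2).mean() + 0.1 * p.pow(2).mean()

def loss_d(p, d):  # stands in for the depth objective, panorama held fixed
    return ((d - p.detach()) ** 2).mean()

for step in range(100):
    opt_p.zero_grad()
    loss_p(p, d).backward()   # minimize over p with d frozen
    opt_p.step()
    opt_d.zero_grad()
    loss_d(p, d).backward()   # minimize over d with p frozen
    opt_d.step()

print(f"final losses: {loss_p(p, d).item():.4f}, {loss_d(p, d).item():.4f}")
```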

[181] GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

Angelos Zavras, Dimitrios Michail, Xiao Xiang Zhu, Begüm Demir, Ioannis Papoutsis

Main category: cs.CV

TL;DR: GAIA is a novel 201,005 image-text dataset for remote sensing that addresses limitations of existing VLMs by providing scientifically accurate, multi-scale, multi-sensor RS data with high-quality synthetic captions generated via GPT-4o.

DetailsMotivation: Existing Vision-Language Models perform poorly on remote sensing tasks because they're trained on noisy web data lacking scientifically accurate textual descriptions and specialized RS domain knowledge. Current RS datasets often focus only on basic attributes like date/location rather than detailed scientific descriptions.

Method: Two-stage construction: (1) Targeted web-scraping of images and text from reputable RS sources, (2) Generation of five high-quality synthetic captions per image using carefully crafted prompts with GPT-4o. The dataset covers diverse RS modalities, spatial resolutions, and spans 25 years with global spatial and balanced temporal distribution.

Result: Fine-tuning CLIP and BLIP2 models on GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks compared to models trained on existing datasets.

Conclusion: GAIA bridges the critical gap in RS vision-language data by providing high-quality, scientifically grounded image-text pairs that enable better performance on RS-specific tasks. The dataset, processing framework, and fine-tuned models are publicly available.

Abstract: Existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially balanced, spanning the globe, and covers the last 25 years with a balanced temporal distribution of observations. GAIA’s construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks. We make our dataset, automated processing framework and fine-tuned model weights publicly available on our project’s GitHub repository: https://github.com/Orion-AI-Lab/GAIA.

[182] OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions

Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin

Main category: cs.CV

TL;DR: OSMa-Bench is a new benchmark for evaluating Open Semantic Mapping algorithms using LLM/LVLM-powered automation, focusing on indoor lighting variations with a novel simulated dataset and scene graph evaluation.

DetailsMotivation: There's a need for systematic evaluation of semantic mapping algorithms under varying indoor lighting conditions, which is a critical challenge for robotic perception in real-world environments.

Method: Developed a dynamically configurable LLM/LVLM-powered pipeline called OSMa-Bench, created a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, and introduced Scene Graph evaluation for semantic structure analysis.

Result: Evaluated state-of-the-art models (ConceptGraphs, BBQ, OpenScene) on semantic fidelity of object recognition/segmentation and semantic structure interpretation, providing insights into model robustness under different lighting conditions.

Conclusion: The benchmark provides valuable insights for developing more resilient and adaptable robotic systems, establishing a foundation for future research in robust semantic mapping.

Abstract: Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ, and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Project page is available at https://be2rlab.github.io/OSMa-Bench/.

[183] RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari

Main category: cs.CV

TL;DR: RI3D: A 3DGS-based approach using two personalized diffusion models (repair and inpainting) for high-quality novel view synthesis from sparse images, with a two-stage optimization strategy and novel Gaussian initialization.

DetailsMotivation: The paper addresses the challenge of reconstructing high-quality novel views from extremely sparse input images, where traditional methods struggle with both visible region reconstruction and hallucination of missing regions.

Method: Separates view synthesis into two tasks: 1) reconstructing visible regions using a ‘repair’ diffusion model that enhances rendered images, and 2) hallucinating missing regions using an ‘inpainting’ diffusion model. Uses two-stage optimization: first stage reconstructs visible areas with repair model, second stage handles missing regions with inpainting model. Introduces novel Gaussian initialization combining 3D-consistent depth with detailed relative depth.

Result: The approach produces results with detailed textures in both visible and missing regions that outperform state-of-the-art methods on diverse scenes with extremely sparse inputs.

Conclusion: By separating the view synthesis process into reconstruction and hallucination tasks and addressing them with specialized diffusion models, RI3D achieves superior novel view synthesis quality from sparse inputs compared to existing approaches.

Abstract: In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model (‘repair’) takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model (‘inpainting’) primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.

[184] A Multi-Stage Augmented Multimodal Interaction Network for Quantifying Fish Feeding Intensity Using Feeding Image, Audio and Water Wave

Shulong Zhang, Mingyuan Yao, Jiayin Zhao, Daoliang Li, Yingyi Chen, Haihua Wang

Main category: cs.CV

TL;DR: Proposes MAINet, a multi-stage augmented multimodal interaction network for quantifying fish feeding intensity using image, audio, and water wave data, achieving over 96% accuracy across all metrics.

DetailsMotivation: Current fish feeding intensity assessment methods have limitations in modality selection, feature extraction/fusion, and co-inference, restricting accuracy, applicability, and reliability of multimodal fusion models in aquaculture systems.

Method: MAINet uses: 1) general feature extraction framework for image, audio, and water wave data; 2) ARPM mechanism (CAFN + DAFN) for inter-modal interaction and enhanced features; 3) Evidence Reasoning rule for modality fusion and decision making.
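
As a rough illustration of the decision-level fusion step, the sketch below combines two hypothetical per-modality belief masses with a basic Dempster-style rule over singleton classes plus an ignorance mass. The Evidence Reasoning rule used in the paper adds reliability and importance weights, so treat this as the general mass-fusion pattern rather than MAINet's exact rule; all numbers are made up.

```python
import numpy as np

def combine_evidence(m1, m2):
    """Dempster-style combination of two basic probability assignments
    over k singleton classes plus a final 'ignorance' mass on the whole
    frame; entries of each input sum to 1."""
    k = len(m1) - 1
    fused = np.zeros(k + 1)
    for i in range(k):
        fused[i] = (m1[i] * m2[i]       # both sources support class i
                    + m1[i] * m2[k]     # source 1 specific, source 2 ignorant
                    + m1[k] * m2[i])    # source 2 specific, source 1 ignorant
    fused[k] = m1[k] * m2[k]            # joint ignorance
    return fused / fused.sum()          # renormalize away conflicting mass

# Hypothetical per-modality masses over four feeding-intensity levels
# ("none", "weak", "medium", "strong") plus a residual ignorance mass.
image = np.array([0.05, 0.10, 0.20, 0.55, 0.10])
audio = np.array([0.10, 0.15, 0.35, 0.30, 0.10])
fused = combine_evidence(image, audio)
print(fused.round(3), "->", int(fused[:-1].argmax()))  # "strong" wins
```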

Result: Achieves 96.76% accuracy, 96.78% precision, 96.79% recall, and 96.79% F1-Score, significantly outperforming single-modality, dual-modality, and other fusion methods. Ablation studies confirm robustness and feature utilization improvements.

Conclusion: MAINet effectively addresses multimodal fusion challenges in fish feeding intensity quantification, demonstrating superior performance and providing a dataset for further research in aquaculture optimization.

Abstract: In recirculating aquaculture systems, accurate and effective assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times. However, current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making, which restrict further improvement in the accuracy, applicability and reliability of multimodal fusion models. To address this problem, this study proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. First, a general feature extraction framework is proposed to efficiently extract feature information from the input image, audio and water wave data. Second, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) is designed for inter-modal interaction and the generation of enhanced features, which consists of a Channel Attention Fusion Network (CAFN) and a Dual-mode Attention Fusion Network (DAFN). Finally, an Evidence Reasoning (ER) rule is introduced to fuse the output results of each modality and make decisions, thereby completing the quantification of fish feeding intensity. The experimental results show that the constructed MAINet reaches 96.76%, 96.78%, 96.79% and 96.79% in accuracy, precision, recall and F1-Score respectively, and its performance is significantly higher than that of the comparison models. Compared with models that adopt single-modality, dual-modality fusion and different decision-making fusion methods, it also has obvious advantages. Meanwhile, the ablation experiments further verified the key role of the proposed improvement strategy in improving the robustness and feature utilization efficiency of the model, which can effectively improve the accuracy of the quantitative results of fish feeding intensity. The dataset is available at: https://huggingface.co/datasets/ShulongZhang/Multimodal_Fish_Feeding_Intensity.

[185] Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS

Xinyu Wang, Muhammad Ibrahim, Haitian Wang, Atif Mansoor, Xiuping Jia, Ajmal Mian

Main category: cs.CV

TL;DR: A post-hoc geo-registration method for LiDAR point clouds using road segmentation and satellite image alignment, achieving 50-57% improvement over existing methods.

DetailsMotivation: GNSS signals are unreliable in dense urban environments, causing localization errors for LiDAR point clouds. Existing methods depend on real-time GNSS/IMU data which often fails in cities.

Method: Uses Point Transformer for road segmentation, extracts road skeletons/intersections from both point cloud and satellite image, performs rigid transformation with intersection correspondences, then non-rigid refinement with RBF interpolation, and corrects elevation using SRTM terrain data.
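
The two alignment stages map onto standard tools. Below is a minimal sketch, assuming matched intersection coordinates have already been extracted from both modalities: a Kabsch least-squares rigid fit, followed by a thin-plate-spline warp via scipy's RBFInterpolator standing in for the paper's RBF refinement. All point coordinates are hypothetical.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def rigid_align(src, dst):
    """Kabsch: least-squares rotation + translation taking src onto
    dst; both are (N, 2) arrays of matched intersection points."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

# Hypothetical matched road intersections (point cloud vs. satellite).
src = np.array([[0., 0.], [10., 0.], [10., 8.], [0., 8.], [5., 4.]])
th = 0.06                              # small ground-truth rotation
R_gt = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
dst = src @ R_gt.T + [2.0, -1.5]

R, t = rigid_align(src, dst)
aligned = src @ R.T + t

# Non-rigid refinement: a thin-plate-spline warp pulls the rigidly
# aligned roads onto their satellite counterparts.
warp = RBFInterpolator(aligned, dst, kernel="thin_plate_spline")
print(np.abs(warp(aligned) - dst).max())   # ~0 on the control points
```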

Result: On KITTI: 0.69m mean planimetric error (50% reduction in bias). On Perth dataset: 2.17m mean error (57.4% improvement over rigid alignment). Elevation correlation improved by 30.5% (KITTI) and 55.8% (Perth).

Conclusion: The proposed structured post-hoc geo-registration method effectively addresses GNSS-denied urban environments by aligning LiDAR with satellite imagery, significantly improving accuracy for city-scale applications.

Abstract: Accurate geo-registration of LiDAR point clouds remains a significant challenge in urban environments where Global Navigation Satellite System (GNSS) signals are denied or degraded. Existing methods typically rely on real-time GNSS and Inertial Measurement Unit (IMU) data, which require pre-calibration and assume stable signals. However, this assumption often fails in dense cities, resulting in localization errors. To address this, we propose a structured post-hoc geo-registration method that accurately aligns LiDAR point clouds with satellite images. The proposed approach targets point cloud datasets where reliable GNSS information is unavailable or degraded, enabling city-scale geo-registration as a post-processing solution. Our method uses a pre-trained Point Transformer to segment road points, then extracts road skeletons and intersections from the point cloud and the satellite image. Global alignment is achieved through rigid transformation using corresponding intersection points, followed by local non-rigid refinement with radial basis function (RBF) interpolation. Elevation discrepancies are corrected using terrain data from the Shuttle Radar Topography Mission (SRTM). To evaluate geo-registration accuracy, we measure the absolute distances between the roads extracted from the two modalities. Our method is validated on the KITTI benchmark and a newly collected dataset of Perth, Western Australia. On KITTI, our method achieves a mean planimetric alignment error of 0.69m, corresponding to a 50% reduction in global geo-registration bias compared to the raw KITTI annotations. On the Perth dataset, it achieves a mean planimetric error of 2.17m relative to GNSS values extracted from Google Maps, corresponding to a 57.4% improvement over rigid alignment. The elevation correlation factor improved by 30.5% (KITTI) and 55.8% (Perth).

[186] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Mohammad Shahab Sepehri, Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi

Main category: cs.CV

TL;DR: The paper introduces Hyperphantasia, a synthetic benchmark to evaluate mental visualization abilities in MLLMs through four procedurally generated puzzles at three difficulty levels, revealing significant performance gaps between humans and current models.

DetailsMotivation: Current MLLM benchmarks focus on passive visual perception but lack assessment of active mental visualization capabilities - the ability to internally construct and manipulate visual representations for problem solving, which is a critical human cognitive skill.

Method: Created Hyperphantasia benchmark with four carefully constructed puzzles that are procedurally generated and presented at three difficulty levels. Evaluated state-of-the-art MLLMs and explored reinforcement learning to improve visual simulation capabilities.

Result: Revealed substantial performance gap between humans and MLLMs. While some models show partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

Conclusion: Mental visualization is a critical but under-evaluated capability in MLLMs. The Hyperphantasia benchmark provides a systematic way to assess this skill, highlighting current limitations and the need for improved visual simulation abilities in multimodal AI systems.

Abstract: Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each puzzle is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

[187] LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

Fei Kong

Main category: cs.CV

TL;DR: The paper introduces LRR-Bench, a synthetic benchmark for evaluating VLMs’ spatial understanding capabilities, revealing significant gaps between current VLMs and human performance on spatial reasoning tasks.

DetailsMotivation: Real-world applications like autonomous driving and robotics require precise spatial perception, but it's unclear how well Vision-Language Models (VLMs) understand spatial relationships and movement. There's a need for systematic evaluation of VLMs' spatial reasoning abilities.

Method: The authors develop a spatial evaluation pipeline and create a synthetic benchmark dataset. They categorize spatial understanding into two types: absolute spatial understanding (object positions like left/right) and 3D spatial understanding (movement and rotation). The synthetic dataset enables low-cost generation of test samples while preventing dataset contamination.
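
Because the benchmark is synthetic, samples can be generated programmatically with exact labels and no contamination risk. The sketch below renders a toy absolute-position (left/right) sample in the spirit of that pipeline; the shapes, question template, and answer format are hypothetical, not the benchmark's actual generator.

```python
import random
from PIL import Image, ImageDraw

def make_left_right_sample(size=224):
    """Render two colored shapes and emit an absolute-position question
    with its ground-truth answer (hypothetical sample format)."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    y = size // 2
    xs = random.sample([size // 4, 3 * size // 4], 2)   # shuffled positions
    draw.ellipse([xs[0] - 20, y - 20, xs[0] + 20, y + 20], fill="red")
    draw.rectangle([xs[1] - 20, y - 20, xs[1] + 20, y + 20], fill="blue")
    answer = "left" if xs[0] < xs[1] else "right"
    question = "Is the red circle on the left or the right of the image?"
    return img, question, answer

img, q, a = make_left_right_sample()
img.save("sample.png")
print(q, "->", a)
```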

Result: Experiments on multiple state-of-the-art VLMs show significant room for improvement in spatial understanding. Humans achieve near-perfect performance on all tasks, while current VLMs only reach human-level performance on the two simplest tasks. For remaining tasks, VLMs perform distinctly lower than humans, with the best models achieving near-zero scores on multiple tasks.

Conclusion: Current VLMs have substantial limitations in spatial understanding compared to humans. The LRR-Bench benchmark provides a valuable tool for evaluating and improving spatial reasoning in VLMs, which is crucial for real-world applications requiring precise spatial perception.

Abstract: Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. In our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available at https://github.com/kong13661/LRR-Bench.

[188] Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

Main category: cs.CV

TL;DR: A unified approach for estimating homographies with radial distortion in three configurations (one-image, identical, independent distortion) that produces faster minimal solvers while maintaining accuracy.

DetailsMotivation: Homography estimation is crucial in computer vision, but real images often have lens distortions (especially radial distortion). Existing methods handle different distortion configurations separately, lacking a unified approach.

Method: Proposes a novel unified mathematical framework to solve all three radial distortion configurations: (i) distortion in only one image, (ii) identical distortion in both images, and (iii) independent distortion in both images. Develops minimal solvers based on this approach.
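
The paper's contribution is algebraic minimal solvers, which do not fit in a short snippet; as a simpler illustration of how radial distortion composes with homography estimation (case (i), distortion in one image only), the sketch below undistorts points under the one-parameter division model and then recovers the homography with a plain DLT. The distortion coefficient, point sets, and ground-truth homography are hypothetical.

```python
import numpy as np

def undistort_division(pts, lam):
    """One-parameter division model: undistorted = p / (1 + lam*|p|^2),
    with p in pixel coordinates centered at the distortion center."""
    r2 = (pts ** 2).sum(axis=1, keepdims=True)
    return pts / (1.0 + lam * r2)

def dlt_homography(src, dst):
    """Direct Linear Transform from >= 4 point correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    return np.linalg.svd(np.asarray(rows))[2][-1].reshape(3, 3)

lam = -1e-6                                   # hypothetical coefficient
distorted = np.array([[100., 50.], [-80., 120.], [30., -90.],
                      [-60., -40.], [150., 150.]])
H_true = np.array([[1.1, 0.02, 5.0], [-0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])

clean = undistort_division(distorted, lam)    # undo the radial distortion
proj = np.c_[clean, np.ones(len(clean))] @ H_true.T
proj = proj[:, :2] / proj[:, 2:]              # points in the second image

H = dlt_homography(clean, proj)
print(H / H[2, 2])                            # recovers H_true up to scale
```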

Result: The proposed solvers are faster than existing state-of-the-art methods while maintaining similar accuracy. They work well on established benchmarks including fisheye camera images.

Conclusion: Provides a unified solution for radially distorted homography estimation with faster, stable, and accurate minimal solvers. Implementation available in HomLib repository.

Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. A reference implementation of the proposed solvers is made available as part of HomLib (https://github.com/marcusvaltonen/HomLib).

[189] Extendable Generalization Self-Supervised Diffusion for Low-Dose CT Reconstruction

Guoquan Wei, Liu Shi, Zekun Zhou, Mohan Li, Cunfeng Wei, Wenzhe Shan, Qiegen Liu

Main category: cs.CV

TL;DR: EGenDiff is a self-supervised diffusion method for low-dose CT reconstruction that achieves dose-extensive generalization using only single-dose projection data for training, outperforming existing methods across various doses and anatomical planes.

DetailsMotivation: Current self-supervised deep learning methods for LDCT reconstruction suffer from poor generalization when trained on single-dose data and applied to other doses, creating a need for methods that can generalize across different dose levels without requiring paired multi-dose training data.

Method: EGenDiff uses a contextual subdata self-enhancing similarity strategy to provide initial priors, combines knowledge distillation with latent diffusion models during training, and employs pixel-wise self-correcting fusion during inference for data fidelity enhancement, enabling generalization to higher, lower, and even unseen doses.

Result: The method demonstrates superior performance across benchmark datasets, clinical data, photon counting CT data, and all three anatomical planes (transverse, coronal, sagittal), consistently outperforming leading existing methods while requiring only LDCT projection data for training and testing.

Conclusion: EGenDiff successfully enables extendable generalization across multiple dose levels using only single-dose projection data, addressing the critical generalization limitation of current self-supervised LDCT reconstruction methods and providing a practical solution for clinical applications.

Abstract: Current deep learning methods for self-supervised low-dose CT (LDCT) reconstruction, while reducing the dependence on paired data, suffer significantly decreased generalization when trained on single-dose data and extended to other doses. To enable dose-extensive generalization using only single-dose projection data for training, this work proposes Extendable GENeraLization self-supervised Diffusion (EGenDiff) for low-dose CT reconstruction. Specifically, a contextual subdata self-enhancing similarity strategy is designed to provide an initial prior for the subsequent stages. During training, this initial prior is used to combine knowledge distillation with latent diffusion models to optimize image details. At the inference stage, a pixel-wise self-correcting fusion technique enhances data fidelity, yielding broad generalization to higher, lower, and even unseen doses. EGenDiff requires only LDCT projection data for training and testing. Comprehensive evaluation on benchmark datasets, clinical data, photon counting CT data, and across all three anatomical planes (transverse, coronal, and sagittal) demonstrates that EGenDiff generalizes across multiple doses, yielding reconstructions that consistently outperform leading existing methods.

[190] Improving Artifact Robustness for CT Deep Learning Models Without Labeled Artifact Images via Domain Adaptation

Justin Cheung, Samuel Savine, Calvin Nguyen, Lin Lu, Alhassan S. Yasin

Main category: cs.CV

TL;DR: Domain adaptation using DANN improves CT image classification on unseen ring artifacts without requiring labeled artifact data, achieving 38.7% higher accuracy than baseline models.

DetailsMotivation: CT scanners can introduce new artifacts not present in training data, causing model misclassification. Labeling new artifact distributions is costly, so domain adaptation offers a more accessible alternative to maintain performance without expensive expert labeling.

Method: Simulated ring artifacts from detector gain error in sinogram space on OrganAMNIST dataset. Evaluated Domain Adversarial Neural Networks (DANN) against baseline and augmentation approaches. Masked loss function to simulate absence of labels from unseen distribution and selectively detached unlabeled instances from computational graph.
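
The two mechanisms named here, gradient reversal and masking unlabeled instances out of the classification loss, are compact to write down. A minimal PyTorch sketch, assuming one feature vector per image and hypothetical linear heads:

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in
    the backward pass so the backbone learns domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def dann_losses(features, class_head, domain_head, labels, is_labeled, lambd=1.0):
    # Classification loss only on labeled clean-domain instances; the
    # unlabeled artifact images are masked out of this term entirely.
    cls_loss = F.cross_entropy(class_head(features[is_labeled]), labels[is_labeled])
    # The domain discriminator sees gradient-reversed features from all
    # instances, pushing the backbone to make the two domains align.
    dom_logits = domain_head(GradReverse.apply(features, lambd))
    dom_loss = F.cross_entropy(dom_logits, (~is_labeled).long())
    return cls_loss + dom_loss

feats = torch.randn(8, 128, requires_grad=True)   # stand-in backbone output
labels = torch.randint(0, 11, (8,))               # OrganAMNIST has 11 classes
is_labeled = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0], dtype=torch.bool)
loss = dann_losses(feats, torch.nn.Linear(128, 11), torch.nn.Linear(128, 2),
                   labels, is_labeled)
loss.backward()
```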

Result: Baseline models trained only on clean images failed to generalize to ring artifacts. Traditional augmentation with other distortion types provided no improvement. DANN approach improved classification accuracy on ring artifact images using only unlabeled artifact data during training, achieving 77.4% accuracy (38.7% higher than baseline).

Conclusion: Domain adaptation effectively addresses distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, demonstrating promise for clinical deployment where novel artifacts may emerge.

Abstract: If a CT scanner introduces a new artifact not present in the training labels, the model may misclassify the images. Although modern CT scanners include design features which mitigate these artifacts, unanticipated or difficult-to-mitigate artifacts can still appear in practice. The direct solution of labeling images from this new distribution can be costly. As a more accessible alternative, this study evaluates domain adaptation as an approach for training models that maintain classification performance despite new artifacts, even without corresponding labels. We simulate ring artifacts from detector gain error in sinogram space and evaluate domain adversarial neural networks (DANN) against baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. We simulate the absence of labels from an unseen distribution via masking in the loss function and selectively detaching unlabeled instances from the computational graph. Our results demonstrate that baseline models trained only on clean images fail to generalize to images with ring artifacts, and traditional augmentation with other distortion types provides no improvement on unseen artifact domains. In contrast, the DANN approach improves classification accuracy on ring artifact images using only unlabeled artifact data during training, demonstrating the viability of domain adaptation for artifact robustness. The domain-adapted model achieved a classification accuracy of 77.4% on ring artifact test data, 38.7% higher than a baseline model trained only on artifact-free images. These findings provide empirical evidence that domain adaptation can effectively address distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, suggesting promise for deployment in clinical settings where novel artifacts may emerge.

[191] Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu

Main category: cs.CV

TL;DR: ICFC is a training-free framework using MLLMs for interpretable image manipulation localization, achieving competitive performance with supervised methods.

DetailsMotivation: Image tampering poses serious security threats, but supervised IML requires costly pixel-level annotations, while existing weakly supervised or training-free alternatives underperform and lack interpretability.

Method: In-Context Forensic Chain (ICFC) integrates objectified rule construction with adaptive filtering to build a reliable knowledge base, and uses a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained results.

Result: Across multiple benchmarks, ICFC surpasses state-of-the-art training-free methods and achieves competitive or superior performance compared to weakly and fully supervised approaches.

Conclusion: ICFC enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability without requiring training or annotations.

Abstract: Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.

[192] KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution

Junzhe Zhang, Huixuan Zhang, Xiaojun Wan

Main category: cs.CV

TL;DR: KBE is a dynamic multimodal evaluation framework that transforms static VQA benchmarks into evolving versions using graph formulation and knowledge integration to address data contamination and saturation issues.

DetailsMotivation: Existing static multimodal benchmarks suffer from data contamination and saturation, leading to inflated or misleading performance evaluations of MLLMs. There's a need for more reliable, dynamic evaluation protocols.

Method: Uses graph formulation to represent VQA samples, then applies Knowledge-enhanced Benchmark Evolution (KBE) to analyze static benchmarks and expand them by integrating multimodal knowledge. KBE reconstructs questions by re-selecting visual information and expands questions with external textual knowledge, enabling difficulty-controllable evaluation.
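
A toy version of the graph formulation makes the evolution step concrete: represent a VQA sample as a small scene graph, then attach an external-knowledge edge and re-derive a harder multi-hop question. The schema, entities, and questions below are hypothetical illustrations, not the paper's actual construction.

```python
import networkx as nx

# A VQA sample as a scene graph: nodes are entities, edges are relations.
g = nx.DiGraph()
g.add_edge("dog", "frisbee", relation="catches")
g.add_node("dog", color="brown")

q0 = "What is the dog catching?"          # original question -> "frisbee"

# Knowledge-enhanced evolution: attach an external textual-knowledge
# edge, then pose a multi-hop question over the expanded graph.
g.add_edge("frisbee", "plastic", relation="made_of")
path = nx.shortest_path(g, "dog", "plastic")
q1 = "What material is the object the dog is catching made of?"
print(path)       # ['dog', 'frisbee', 'plastic'] -> answer: "plastic"
```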

Result: Extensive experiments show KBE alleviates data contamination and saturation risks, provides more comprehensive assessment of MLLM capabilities, and enables controllable difficulty evaluation through question exploration adjustment.

Conclusion: KBE offers a dynamic, evolving evaluation framework that addresses limitations of static benchmarks, providing more reliable assessment of multimodal large language models through knowledge-enhanced benchmark evolution.

Abstract: The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks suffer from the potential risk of data contamination and saturation, leading to inflated or misleading performance evaluations. To address these issues, we first apply a graph formulation to represent a static or dynamic VQA sample. With this formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamically evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting the degree of question exploration. Extensive experiments demonstrate that KBE alleviates the risk of data contamination and data saturation, and provides a more comprehensive assessment of MLLM capabilities.

[193] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

Main category: cs.CV

TL;DR: Vision Transformers naturally learn to bind object parts together during pretraining, with over 90% accuracy in determining whether patches belong to the same object, challenging the view that ViTs lack object binding capabilities.

DetailsMotivation: To investigate whether object binding capabilities naturally emerge in pre-trained Vision Transformers, challenging the common assumption that ViTs lack this fundamental cognitive ability. The researchers want to understand if recognizing which image patches belong to the same object emerges as a useful feature for downstream prediction.

Method: The authors decode “IsSameObject” (whether two patches belong to the same object) from patch embeddings across ViT layers using a quadratic similarity probe. They test this across different pretraining objectives (DINO, CLIP, ImageNet-supervised, MAE) and analyze how object binding is encoded and influences attention mechanisms.
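
A quadratic similarity probe is just a learned bilinear form over patch-embedding pairs. A minimal PyTorch sketch, assuming 768-d ViT-B patch embeddings and binary IsSameObject labels (the paper's exact parametrization may differ):

```python
import torch
from torch import nn

class QuadraticProbe(nn.Module):
    """Scores a patch pair (x, y) with the bilinear form x^T W y; a
    sigmoid of the score estimates P(IsSameObject)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, x, y):                 # x, y: (batch, dim)
        return torch.einsum("bd,de,be->b", x, self.W, y)

probe = QuadraticProbe(dim=768)              # e.g. ViT-B patch embeddings
x, y = torch.randn(32, 768), torch.randn(32, 768)
same = torch.randint(0, 2, (32,)).float()    # IsSameObject labels
loss = nn.functional.binary_cross_entropy_with_logits(probe(x, y), same)
loss.backward()
```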

Result: ViTs achieve over 90% accuracy in determining whether patches belong to the same object. This capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs but is markedly weaker in MAE. The IsSameObject signal is encoded in a low-dimensional subspace on top of object features and actively guides attention. Ablating this signal degrades downstream performance.

Conclusion: Object binding naturally emerges in Vision Transformers through specific pretraining objectives, challenging the view that ViTs lack this ability. This symbolic knowledge of “which parts belong together” emerges naturally in connectionist systems and serves the pretraining objective, suggesting that object binding is not just an architectural artifact but a learned capability.

Abstract: Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a quadratic similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs, but is markedly weaker in MAE, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.

[194] Registration-Free Monitoring of Unstructured Point Cloud Data via Intrinsic Geometrical Properties

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

Main category: cs.CV

TL;DR: Proposes registration-free monitoring of point cloud data using intrinsic geometric features (Laplacian and geodesic distances) with feature selection for defect detection.

DetailsMotivation: Traditional preprocessing steps (registration and mesh reconstruction) for point cloud monitoring are error-prone, time-consuming, and can introduce artifacts that affect monitoring outcomes.

Method: Two alternative feature learning methods using intrinsic geometric properties (Laplacian and geodesic distances), plus a monitoring scheme with thresholding techniques to select the most indicative features for defect detection.
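
The key property is that Laplacian spectra and geodesic distances depend only on the shape, not on its pose, which is what removes the registration step. A minimal sketch with illustrative feature choices (not the paper's exact descriptors): build a kNN graph on the points, take the low end of the normalized Laplacian spectrum, and add simple geodesic-distance statistics; a rigidly moved copy then yields the same features up to rounding.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian, shortest_path

def intrinsic_features(points, k=8, n_eigs=16):
    """Pose-invariant shape descriptors: low normalized-Laplacian
    eigenvalues plus geodesic-distance statistics on a kNN graph."""
    adj = kneighbors_graph(points, k, mode="distance")
    adj = adj.maximum(adj.T)                     # symmetrize the kNN graph
    lap = laplacian(adj, normed=True)
    eigvals = np.linalg.eigvalsh(lap.toarray())[:n_eigs]
    geo = shortest_path(adj, method="D", directed=False)
    finite = geo[np.isfinite(geo)]               # ignore disconnected pairs
    return np.concatenate([eigvals, [finite.mean(), finite.max()]])

pts = np.random.rand(500, 3)                     # stand-in for a scanned PCD
th = 0.3
R = np.array([[np.cos(th), -np.sin(th), 0],
              [np.sin(th),  np.cos(th), 0],
              [0, 0, 1]])
f_ref = intrinsic_features(pts)
f_moved = intrinsic_features(pts @ R.T + 5.0)    # same shape, new pose
print(np.abs(f_ref - f_moved).max())             # ~0 up to rounding
```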

Result: Numerical experiments and case studies demonstrate the approach’s effectiveness in identifying different types of defects in complex-shaped point cloud data.

Conclusion: The registration-free approach eliminates preprocessing steps while effectively monitoring point cloud data for geometric accuracy in manufacturing processes.

Abstract: Modern sensing technologies have enabled the collection of unstructured point cloud data (PCD) of varying sizes, which are used to monitor the geometric accuracy of 3D objects. PCD are widely applied in advanced manufacturing processes, including additive, subtractive, and hybrid manufacturing. To ensure the consistency of analysis and avoid false alarms, preprocessing steps such as registration and mesh reconstruction are commonly applied prior to monitoring. However, these steps are error-prone, time-consuming and may introduce artifacts, potentially affecting monitoring outcomes. In this paper, we present a novel registration-free approach for monitoring PCD of complex shapes, eliminating the need for both registration and mesh reconstruction. Our proposal consists of two alternative feature learning methods and a common monitoring scheme designed to handle hundreds of features. Feature learning methods leverage intrinsic geometric properties of the shape, captured via the Laplacian and geodesic distances. In the monitoring scheme, thresholding techniques are used to further select intrinsic features most indicative of potential out-of-control conditions. Numerical experiments and case studies highlight the effectiveness of the proposed approach in identifying different types of defects.

[195] ConeGS: Error-Guided Densification Using Pixel Cones for Improved Reconstruction With Fewer Primitives

Bartłomiej Baranowski, Stefano Esposito, Patricia Gschoßmann, Anpei Chen, Andreas Geiger

Main category: cs.CV

TL;DR: ConeGS improves 3D Gaussian Splatting by using image-space-informed densification with cone-based primitive placement, achieving better reconstruction quality with fewer primitives.

DetailsMotivation: 3D Gaussian Splatting suffers from suboptimal spatial distribution of primitives due to cloning-based densification that propagates Gaussians along existing geometry, limiting exploration and requiring many primitives to adequately cover scenes.

Method: Uses iNGP as geometric proxy for depth estimation, identifies high-error pixels, inserts new Gaussians along viewing cones at predicted depths with size initialization based on cone diameter, employs pre-activation opacity penalty to remove redundant Gaussians, and uses primitive budgeting strategy.
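
The cone-based insertion reduces to simple pinhole geometry: back-project a high-error pixel to the proxy depth and size the new Gaussian by the pixel cone's diameter at that depth. A minimal camera-frame sketch with hypothetical intrinsics (the actual method additionally handles opacity penalties and primitive budgeting):

```python
import numpy as np

def spawn_gaussian(u, v, depth, K):
    """For a high-error pixel (u, v): back-project to the iNGP-predicted
    depth and initialize the Gaussian scale from the pixel cone's
    diameter there (camera-frame sketch)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    center = depth * ray                 # new Gaussian mean on the ray
    diameter = depth / K[0, 0]           # ~1-pixel footprint at this depth
    return center, diameter

K = np.array([[500.0, 0.0, 320.0],       # hypothetical pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
center, scale = spawn_gaussian(400, 300, depth=2.5, K=K)
print(center, scale)                     # scale = 2.5 / 500 = 0.005 units
```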

Result: Consistently enhances reconstruction quality and rendering performance across Gaussian budgets, with especially strong gains under tight primitive constraints where efficient placement is crucial.

Conclusion: ConeGS provides an effective image-space-informed densification framework that improves 3DGS by enabling more efficient primitive placement independent of existing scene geometry.

Abstract: 3D Gaussian Splatting (3DGS) achieves state-of-the-art image quality and real-time performance in novel view synthesis but often suffers from a suboptimal spatial distribution of primitives. This issue stems from cloning-based densification, which propagates Gaussians along existing geometry, limiting exploration and requiring many primitives to adequately cover the scene. We present ConeGS, an image-space-informed densification framework that is independent of existing scene geometry state. ConeGS first creates a fast Instant Neural Graphics Primitives (iNGP) reconstruction as a geometric proxy to estimate per-pixel depth. During the subsequent 3DGS optimization, it identifies high-error pixels and inserts new Gaussians along the corresponding viewing cones at the predicted depth values, initializing their size according to the cone diameter. A pre-activation opacity penalty rapidly removes redundant Gaussians, while a primitive budgeting strategy controls the total number of primitives, either by a fixed budget or by adapting to scene complexity, ensuring high reconstruction quality. Experiments show that ConeGS consistently enhances reconstruction quality and rendering performance across Gaussian budgets, with especially strong gains under tight primitive constraints where efficient placement is crucial.

[196] Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection

Huizai Yao, Sicheng Zhao, Pengteng Li, Yi Cui, Shuo Lu, Weiyu Guo, Yunfan Lu, Yijie Xu, Hui Xiong

Main category: cs.CV

TL;DR: A novel SFOD framework that leverages Vision Foundation Models as external knowledge to enhance feature alignment and pseudo-label quality, achieving state-of-the-art performance across six benchmarks.

DetailsMotivation: Existing SFOD methods rely too heavily on internal source model knowledge, limiting cross-domain generalization and causing biased pseudo-labels. Vision Foundation Models offer strong perception capabilities and broad generalization that remain untapped in SFOD settings.

Method: Three VFM-based modules: 1) Patch-weighted Global Feature Alignment (PGFA) distills global features using patch-similarity weighting; 2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning with momentum-updated VFM prototypes; 3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via entropy-aware strategy.
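
To make the entropy-aware fusion idea concrete, the sketch below fuses two categorical predictions for the same box proposal by down-weighting the higher-entropy (less certain) source. This is a toy reading of an entropy-aware strategy; DEPF's actual rule is not spelled out in the summary, so treat the weighting scheme as an assumption.

```python
import torch

def entropy_aware_fuse(p_vfm, p_teacher, eps=1e-8):
    """Fuse two class posteriors (B, C) for the same proposal, weighting
    each source by a softmax over its negative entropy."""
    def entropy(p):
        return -(p * (p + eps).log()).sum(dim=-1)
    h = torch.stack([entropy(p_vfm), entropy(p_teacher)], dim=-1)
    w = torch.softmax(-h, dim=-1)                 # low entropy -> high weight
    fused = w[..., 0:1] * p_vfm + w[..., 1:2] * p_teacher
    return fused / fused.sum(dim=-1, keepdim=True)

p_vfm = torch.tensor([[0.7, 0.2, 0.1]])           # confident detection VFM
p_teacher = torch.tensor([[0.4, 0.35, 0.25]])     # uncertain teacher model
print(entropy_aware_fuse(p_vfm, p_teacher))       # pulled toward the VFM
```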

Result: Extensive experiments on six benchmarks demonstrate state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.

Conclusion: The proposed framework successfully leverages Vision Foundation Models as external knowledge sources to overcome limitations of existing SFOD methods, achieving superior performance by enhancing both feature alignment and pseudo-label quality.

Abstract: Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contrast, Vision Foundation Models (VFMs), pretrained on massive and diverse data, exhibit strong perception capabilities and broad generalization, yet their potential remains largely untapped in the SFOD setting. In this paper, we propose a novel SFOD framework that leverages VFMs as external knowledge sources to jointly enhance feature alignment and label quality. Specifically, we design three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA) distills global features from VFMs using patch-similarity-based weighting to enhance global feature transferability; (2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning guided by momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via an entropy-aware strategy to yield more reliable supervision. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.

[197] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu

Main category: cs.CV

TL;DR: T2T-VICL enables cross-task visual in-context learning for vision-language models by generating text prompts that describe differences between distinct low-level vision tasks and using perceptual score-based reasoning.

DetailsMotivation: Current visual in-context learning (VICL) in vision-language models works well when visual prompts and target images come from the same task, but the paper investigates whether VLMs can perform cross-task VICL where prompts and targets originate from different visual tasks.

Method: Proposes T2T-VICL pipeline with: 1) mechanism to generate and select text prompts that implicitly describe differences between distinct low-level vision tasks, 2) construction of first cross-task VICL dataset, 3) novel inference framework combining perceptual score-based reasoning with traditional evaluation metrics.

Result: Achieves top-tier results across twelve cross-task scenarios and second-tier performance in nine additional scenarios, demonstrating successful cross-task VICL capabilities in VLMs.

Conclusion: The approach unlocks the boundaries of cross-task visual in-context learning within vision-language models, showing that VLMs can effectively perform VICL even when visual prompts and target images come from different visual tasks.

Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still perform VICL? In this paper, we propose a fully collaborative pipeline, T2T-VICL, to investigate the potential of cross-task VICL in VLMs. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across twelve cross-task scenarios and second-tier performance in nine additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

[198] Difference Decomposition Networks for Infrared Small Target Detection

Chen Hu, Mingyu Zhou, Shuai Yuan, Hongbo Hu, Zhenming Peng, Tian Pu, Xiyin Li

Main category: cs.CV

TL;DR: Proposed Basis Decomposition Module (BDM) for infrared small target detection, extended to spatial and temporal difference decomposition modules, achieving SOTA performance on both single-frame and multi-frame datasets.

DetailsMotivation: Infrared small target detection faces challenges: lack of discernible target texture and severe background clutter that obscures targets. Need to enhance targets while suppressing backgrounds.

Method: Proposed Basis Decomposition Module (BDM) decomposes complex features into basis features to enhance information and eliminate redundancy. Extended to Spatial Difference Decomposition Module (SD²M), Spatial Difference Decomposition Downsampling Module (SD³M), and Temporal Difference Decomposition Module (TD²M). Built SD²Net for single-frame ISTD using U-shaped architecture with SD²M and SD³M, and STD²Net for multi-frame ISTD by adding TD²M for motion information.
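
One plausible reading of basis decomposition as a lightweight module: expand a feature map into k basis features with a 1x1 convolution, gate each basis with a learned weight, and recombine. The PyTorch sketch below implements that reading; the paper's actual BDM design may differ in detail.

```python
import torch
from torch import nn

class BasisDecomposition(nn.Module):
    """Expand a feature map into k basis features, reweight each basis,
    and recombine (a sketch of the BDM idea as summarized above)."""
    def __init__(self, channels, k=4):
        super().__init__()
        self.k, self.c = k, channels
        self.split = nn.Conv2d(channels, channels * k, kernel_size=1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels * k, k, kernel_size=1),
                                  nn.Softmax(dim=1))

    def forward(self, x):
        b, _, h, w = x.shape
        s = self.split(x)                              # (b, k*c, h, w)
        bases = s.view(b, self.k, self.c, h, w)        # k basis features
        weights = self.gate(s).view(b, self.k, 1, 1, 1)
        return (weights * bases).sum(dim=1)            # same shape as x

out = BasisDecomposition(32)(torch.randn(2, 32, 64, 64))
print(out.shape)                                       # (2, 32, 64, 64)
```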

Result: Extensive experiments show state-of-the-art performance. On SISTD, SD²Net performs well compared to established networks. On MISTD, STD²Net achieves mIoU of 87.68%, significantly outperforming SD²Net’s 64.97%.

Conclusion: The proposed basis decomposition approach effectively addresses infrared small target detection challenges by enhancing targets and suppressing backgrounds through feature decomposition, with spatial and temporal extensions achieving superior performance on both single-frame and multi-frame tasks.

Abstract: Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves a mIoU of 87.68%, outperforming SD$^\mathrm{2}$Net, which achieves a mIoU of 64.97%. Our code is available at: https://github.com/greekinRoma/IRSTD_HC_Platform.

[199] $\mathrm{D}^\mathrm{3}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping

Main category: cs.CV

TL;DR: D³-Predictor: A deterministic diffusion-based dense prediction model that removes stochastic noise from diffusion models to preserve geometric structure for better dense prediction tasks.

DetailsMotivation: Standard diffusion models use stochastic noise that corrupts fine-grained spatial cues and destroys geometric structure mappings needed for dense prediction tasks like depth estimation and segmentation.

Method: Reformulate pretrained diffusion models without stochastic noise, treating them as ensembles of timestep-dependent visual experts. Self-supervisedly aggregate heterogeneous priors into a single clean geometric prior, then adapt with task-specific supervision.

Result: Achieves competitive or state-of-the-art performance across various dense prediction tasks, requires less than half the training data, and performs single-step inference efficiently.

Conclusion: D³-Predictor successfully addresses the misalignment between stochastic diffusion sampling and deterministic dense prediction, creating a noise-free deterministic model that preserves geometric structure while leveraging diffusion priors.

Abstract: Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction, which requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^\mathrm{3}$-Predictor, a noise-free deterministic diffusion-based dense prediction model built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^\mathrm{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^\mathrm{3}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.

[200] Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Kai Yao, Marc Juarez

Main category: cs.CV

TL;DR: First systematic security evaluation of model fingerprint detection reveals significant vulnerability to adversarial attacks, with removal attacks achieving 80%+ success in white-box and 50%+ in black-box settings, highlighting a utility-robustness trade-off.

DetailsMotivation: Despite the promise of model fingerprint detection for tracing AI-generated images in forensic applications, existing evaluations rarely consider adversarial settings, creating a critical gap in understanding the real-world security of these techniques.

Method: Systematic security evaluation formalizing threat models (white-box/black-box access) and two attack goals (fingerprint removal and forgery). Implemented five attack strategies and evaluated 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators.

Result: Revealed pronounced gap between clean and adversarial performance: removal attacks highly effective (80%+ success white-box, 50%+ black-box), forgery more challenging but success varies across models. Found utility-robustness trade-off: accurate attribution methods are often vulnerable, and no technique achieves both robustness and accuracy across all threat models.

Conclusion: Current fingerprint detection techniques are vulnerable to adversarial attacks, highlighting the need for methods that balance robustness and accuracy. The study identifies promising approaches toward this goal and provides the first comprehensive security analysis of model fingerprinting.

Abstract: Model fingerprint detection has shown promise to trace the provenance of AI-generated images in forensic applications. However, despite the inherent adversarial nature of these applications, existing evaluations rarely consider adversarial settings. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under black-box access. While forgery is more challenging than removal, its success varies significantly across targeted models. We also observe a utility-robustness trade-off: accurate attribution methods are often vulnerable to attacks and, although some techniques are robust in specific settings, none achieves robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques that balance robustness and accuracy, and we identify the most promising approaches toward this goal. Code available at: https://github.com/kaikaiyao/SmudgedFingerprints.

[201] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models

Sha Luo, Yogesh Prabhu, Timothy Ossowski, Kaiping Chen, Junjie Hu

Main category: cs.CV

TL;DR: A new video risk assessment benchmark (RiskCueBench) focuses on identifying early risk signals rather than full accident sequences, revealing current models’ limitations in anticipating future risky events from early visual cues.

DetailsMotivation: Current video risk assessment models have access to full accident sequences, which reduces task difficulty and doesn't reflect real-world conditions where early risk anticipation is crucial for preventing accidents and ensuring public safety.

Method: Introduces RiskCueBench, a new video understanding benchmark where videos are carefully annotated to identify a “risk signal clip” - the earliest moment indicating a potential safety concern, rather than showing the full accident sequence.

Result: Experimental results show a significant gap in current systems’ ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting practical deployment challenges.

Conclusion: The benchmark reveals important limitations in current video risk prediction models and emphasizes the need for better early risk anticipation capabilities for real-world safety applications.

Abstract: With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.

[202] Unlocking Generalization in Polyp Segmentation with DINO Self-Attention “keys”

Carla Monteiro, Valentina Corbetta, Regina Beets-Tan, Luís F. Teixeira, Wilson Silva

Main category: cs.CV

TL;DR: A framework using DINO self-attention key features with a simple convolutional decoder achieves SOTA polyp segmentation performance with better generalization, validated on multi-center datasets under DG and ESDG protocols.

DetailsMotivation: Current DL methods for polyp segmentation struggle with generalization in data-constrained settings and rely on complex, task-specific architectures. There's a need for more robust and generalizable approaches.

Method: Leverages DINO self-attention “key” features (instead of deepest layer tokens) with a simple convolutional decoder to predict polyp masks. Uses Vision Transformer features in a novel way for segmentation.

Result: Achieves state-of-the-art performance on multi-center datasets under Domain Generalization and Extreme Single Domain Generalization protocols. Surpasses established models like nnU-Net and UM-Net with better generalization in data-scarce scenarios.

Conclusion: The DINO key feature approach provides robust polyp segmentation without polyp-specific architecture, offering enhanced generalization and performance, particularly in challenging data-constrained settings.

Abstract: Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention “key” features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework’s evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
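
For readers curious what using the self-attention “keys” means mechanically, below is a minimal sketch of harvesting key-projection features with a forward hook. The toy attention block and its q/k/v attribute names are illustrative assumptions; DINO’s actual module layout differs.

```python
import torch
import torch.nn as nn

# Toy self-attention block; real DINO ViTs fuse q/k/v into one projection.
class ToyAttention(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, -1)
        return attn @ v

block = ToyAttention()
keys: dict = {}
# A forward hook captures the key-projection output as it is computed.
block.k.register_forward_hook(lambda mod, inp, out: keys.update(feat=out))
tokens = torch.randn(1, 196, 64)   # patch tokens from a ViT backbone
_ = block(tokens)
print(keys["feat"].shape)          # (1, 196, 64): key features for the decoder
```

In the paper’s framework these captured key features, rather than the final-layer tokens, are what a simple convolutional decoder consumes to predict masks.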

[203] Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang

Main category: cs.CV

TL;DR: LGCLIP is a lightweight zero-shot image classification framework that uses LLM-generated prompts and diffusion models to create visual prototypes, eliminating the need for annotated image-text pairs and dual-tower encoders.

DetailsMotivation: Existing VLMs like CLIP require expensive annotated text-image pairs and dual-tower encoders, which are costly and hinder lightweight deployment. The paper aims to reduce dependency on manual annotations and simplify model architecture.

Method: LGCLIP uses LLMs to generate class-specific prompts, which guide diffusion models to synthesize reference images as visual prototypes. It then compares visual features of real images with these prototypes using only a visual encoder, eliminating the need for text encoders and manual annotations.

Result: Experimental results validate LGCLIP’s feasibility and efficiency, showing strong performance in zero-shot classification tasks while establishing a novel classification paradigm that requires only class labels as input.

Conclusion: LGCLIP presents a lightweight, efficient alternative to traditional VLMs by leveraging LLM-generated content and diffusion models, eliminating dependency on annotated image-text pairs and reducing computational requirements while maintaining competitive zero-shot classification performance.

Abstract: Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency introduces substantial cost and accuracy requirements in preparing high-quality datasets. At the same time, processing data from two modalities also requires dual-tower encoders in most models, which further hinders lightweight deployment. To address these limitations, we introduce a “Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)” framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. Afterwards, these generated images serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to produce predictions. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the entire experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
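
A minimal sketch of the prototype pipeline described above, with placeholder stand-ins for the LLM prompt writer, the diffusion renderer, and the visual encoder (none of the function names below come from the paper):

```python
import numpy as np

# Hypothetical stand-ins for the paper's components: an LLM that writes a
# class-specific prompt, a diffusion model that renders it, and a visual
# encoder that embeds images. They only illustrate the data flow.
def llm_prompt(label: str) -> str:
    return f"a photo of a {label}, typical appearance, plain background"

def diffusion_render(prompt: str, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)            # placeholder "image"
    return rng.random((224, 224, 3))

def visual_encode(image: np.ndarray) -> np.ndarray:
    v = image.mean(axis=(0, 1))                  # placeholder embedding
    return v / np.linalg.norm(v)

def build_prototypes(labels, n_refs: int = 4) -> dict:
    """One averaged visual prototype per class, from synthetic references."""
    protos = {}
    for label in labels:
        refs = [visual_encode(diffusion_render(llm_prompt(label), s))
                for s in range(n_refs)]
        p = np.mean(refs, axis=0)
        protos[label] = p / np.linalg.norm(p)
    return protos

def classify(image: np.ndarray, protos: dict) -> str:
    """Zero-shot prediction: nearest prototype by cosine similarity."""
    q = visual_encode(image)
    return max(protos, key=lambda label: float(q @ protos[label]))

protos = build_prototypes(["cat", "dog", "car"])
print(classify(np.random.default_rng(0).random((224, 224, 3)), protos))
```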

[204] Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations

Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras

Main category: cs.CV

TL;DR: D³ETOR is a two-stage weakly-supervised camouflaged object detection framework that uses debate-enhanced pseudo labeling and frequency-aware progressive debiasing to overcome limitations of existing WSCOD methods.

DetailsMotivation: Existing WSCOD methods lag behind fully supervised approaches due to unreliable pseudo masks from general segmentation models lacking COD-specific understanding, and neglect of inherent scribble annotation bias that hinders global structure capture.

Method: Two-stage framework: 1) Debate-enhanced pseudo labeling with adaptive entropy-driven point sampling and multi-agent debate mechanism to improve SAM for COD; 2) FADeNet that progressively fuses multi-level frequency-aware features and dynamically reweights supervision strength to alleviate scribble bias.

Result: D³ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

Conclusion: The proposed framework effectively addresses key limitations in WSCOD by enhancing pseudo mask quality through debate mechanisms and mitigating scribble bias through frequency-aware progressive debiasing, enabling competitive performance with only sparse supervision.

Abstract: Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D³ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D³ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

[205] CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

Main category: cs.CV

TL;DR: CRAFT is a training-free framework that improves text-to-image generation through explicit, structured reasoning with visual constraints, verification, and targeted corrections.

DetailsMotivation: Existing inference-time reasoning methods for text-to-image generation lack interpretability, control, and reliable stopping mechanisms. They rely on implicit critiques or unconstrained prompt rewrites, making their behavior difficult to understand and manage.

Method: CRAFT transforms user prompts into explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. It includes an explicit stopping criterion for an interpretable refinement loop.

Result: CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations across multiple model families and challenging benchmarks, with particularly strong gains for lightweight generators. It achieves these improvements with negligible inference-time overhead.

Conclusion: Explicitly structured, constraint-driven inference-time reasoning is crucial for improving the reliability of multimodal generative models, allowing smaller models to approach the quality of more expensive systems.

Abstract: Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
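
A minimal sketch of the verify-and-repair loop under stated assumptions: the constraint fields, the generator and VLM stubs, and the repair phrasing are illustrative, not the paper’s implementation.

```python
from dataclasses import dataclass

@dataclass
class Constraint:
    description: str           # e.g. "the cup is red"
    depends_on: tuple = ()     # indices of prerequisite constraints

def generate(prompt: str) -> str:
    return f"<image for: {prompt}>"               # placeholder generator

def vlm_check(image: str, c: Constraint) -> bool:
    return hash((image, c.description)) % 3 != 0  # placeholder VLM verdict

def craft_loop(prompt: str, constraints: list, max_rounds: int = 4):
    image = generate(prompt)
    for round_idx in range(max_rounds):
        violated = [c for c in constraints if not vlm_check(image, c)]
        if not violated:                          # explicit stopping criterion
            return image, round_idx
        # Targeted update: rewrite the prompt only for violated constraints.
        fixes = "; ".join(c.description for c in violated)
        prompt = f"{prompt}. Ensure that: {fixes}"
        image = generate(prompt)
    return image, max_rounds

image, rounds = craft_loop(
    "a red cup on a wooden table",
    [Constraint("the cup is red"), Constraint("the cup is on the table", (0,))],
)
print(rounds)   # number of repair rounds before all constraints passed
```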

[206] Unified Text-Image Generation with Weakness-Targeted Post-Training

Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan, Amita Kamath, Michal Drozdzal, Reyhane Askari-Hemmat, Luke Zettlemoyer, Marjan Ghazvininejad

Main category: cs.CV

TL;DR: The paper proposes a post-training method for unified multimodal generation architectures that enables autonomous text-to-image synthesis within a single inference process, improving performance across multiple benchmarks.

DetailsMotivation: Existing unified multimodal generation systems rely on explicit modality switching (generating reasoning text first, then manually switching to image generation), which limits cross-modal coupling and prohibits automatic multimodal generation. The authors aim to achieve fully unified text-image generation where models can autonomously transition from textual reasoning to visual synthesis.

Method: The authors use offline, reward-weighted post-training with fully self-generated synthetic data. They explore different post-training data strategies and examine the impact of joint text-image generation on T2I performance, as well as the relative importance of each modality during post-training.

Result: The approach enables improvements in multimodal image generation across four diverse T2I benchmarks. The targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Reward-weighting both modalities and strategically designed post-training data prove effective.

Conclusion: Fully unified text-image generation through post-training is achievable and effective, with strategic data selection and reward-weighting of both modalities leading to improved multimodal synthesis performance across diverse benchmarks.

Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
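
As a rough illustration of what “offline, reward-weighted post-training” can look like, here is a hypothetical per-sample weighting of text and image losses; the weighting scheme and variable names are assumptions, not the paper’s exact objective.

```python
import torch

def reward_weighted_loss(text_nll, image_nll, reward, w_text=1.0, w_image=1.0):
    """text_nll / image_nll: per-sample NLL terms; reward: per-sample scalar."""
    per_sample = w_text * text_nll + w_image * image_nll
    return (reward * per_sample).mean()   # high-reward samples weigh more

# Toy per-sample losses and rewards for a batch of three generations.
text_nll = torch.tensor([2.1, 1.4, 3.0])
image_nll = torch.tensor([5.0, 4.2, 6.1])
reward = torch.tensor([0.9, 0.3, 0.7])    # e.g., scores from a T2I reward model
print(reward_weighted_loss(text_nll, image_nll, reward))
```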

[207] Coding the Visual World: From Image to Simulation Using Vision Language Models

Sagi Eppel

Main category: cs.CV

TL;DR: VLMs can understand and model complex systems in images through code generation but struggle with fine details, revealing an asymmetry between high-level understanding and low-level perception.

DetailsMotivation: To explore whether Vision Language Models (VLMs) can construct mental models of systems depicted in images, similar to human visual understanding, by testing their ability to recognize and simulate real-world systems through code generation.

Method: Using the Im2Sim methodology: VLMs are given natural images of real-world systems, tasked with describing the system and writing generative code to simulate it. The code is executed to produce synthetic images, which are compared against the original. Tested on various complex emergent systems including physical systems (waves, lights, clouds), vegetation, cities, materials, and geological formations.

Result: Leading VLMs (GPT, Gemini) demonstrate ability to understand and model complex, multi-component systems across multiple abstraction layers and diverse domains. However, they exhibit limited ability to replicate fine details and low-level pattern arrangements in images.

Conclusion: VLMs show an interesting asymmetry: they combine high-level, deep visual understanding of images with limited perception of fine details, suggesting they can construct representative models of systems but struggle with precise low-level replication.

Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) have the ability to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
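
A minimal sketch of the Im2Sim loop, assuming a chat-style VLM callable that returns Python source and an image comparator; both are hypothetical stand-ins, and any real use would sandbox the exec call.

```python
import numpy as np

def im2sim(image: np.ndarray, vlm, compare) -> float:
    """Ask the VLM for a generative program, run it, compare the render."""
    code = vlm(image, "Describe this system and write Python defining "
                      "render() -> np.ndarray that simulates and draws it.")
    namespace = {"np": np}
    exec(code, namespace)              # sandbox this in any real deployment
    synthetic = namespace["render"]()  # synthetic image of the system
    return compare(image, synthetic)   # e.g., a perceptual similarity metric

# Placeholder VLM and comparator so the sketch runs end to end.
fake_vlm = lambda img, prompt: "def render():\n    return np.zeros((64, 64, 3))\n"
mse = lambda a, b: float(((a - b) ** 2).mean())
print(im2sim(np.zeros((64, 64, 3)), fake_vlm, mse))   # 0.0 for this stub
```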

[208] Trustworthy Longitudinal Brain MRI Completion: A Deformation-Based Approach with KAN-Enhanced Diffusion Model

Tianli Tao, Ziyang Wang, Delong Yang, Han Zhang, Le Zhang

Main category: cs.CV

TL;DR: DF-DiffCom is a KAN-enhanced diffusion model that uses deformation fields for trustworthy longitudinal brain MRI completion, outperforming SOTA methods and extending to various MRI modalities.

DetailsMotivation: Longitudinal brain MRI studies suffer from high attrition rates leading to missing data. Existing deep generative models rely only on image intensity, resulting in limited fidelity/trustworthiness and restricted usage flexibility due to fixed guidance in model structure.

Method: DF-DiffCom is a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for longitudinal brain image completion. It is trained on the OASIS-3 dataset and is modality-agnostic.

Result: Outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. Smoothly extends to varied MRI modalities and even to attribute maps like brain tissue segmentation results.

Conclusion: DF-DiffCom addresses key limitations of existing methods by providing trustworthy longitudinal brain image completion with improved performance and greater flexibility for versatile application scenarios.

Abstract: Longitudinal brain MRI is essential for lifespan study, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity or trustworthiness of the generated brain images is limited, making downstream studies questionable; 2) the usage flexibility is restricted due to fixed guidance rooted in the model structure, limiting applicability to versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, even to attribute maps such as brain tissue segmentation results.

[209] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Nanren Bao, Bo Qian, Hao Si, Manabu Tsukada

Main category: cs.CV

TL;DR: SUG-Occ is a sparse learning framework for efficient 3D semantic occupancy prediction that uses semantics and uncertainty guidance to reduce computation while maintaining accuracy.

DetailsMotivation: 3D semantic occupancy prediction is crucial for full scene understanding in autonomous driving, but current methods suffer from prohibitive computation and memory overhead that prevents real-time deployment.

Method: 1) Uses semantic and uncertainty priors to suppress free space projections during view transformation with unsigned distance encoding for geometric consistency; 2) Cascade sparse completion module with hyper cross sparse convolution and generative upsampling for coarse-to-fine reasoning; 3) Object contextual representation mask decoder for global semantic context aggregation via lightweight query-context interactions instead of expensive attention.

Result: Outperforms baselines on SemanticKITTI benchmark with 7.34% improvement in accuracy and 57.8% gain in efficiency.

Conclusion: SUG-Occ effectively addresses the efficiency bottleneck in 3D semantic occupancy prediction by exploiting scene sparsity while maintaining geometric and semantic completeness, enabling practical real-time deployment.

Abstract: As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
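
To make the free-space suppression step concrete, here is a small hypothetical gate that drops pixels confidently predicted as free space before lifting them to 3D; the thresholds and the entropy-based uncertainty measure are illustrative assumptions, not the paper’s exact formulation.

```python
import torch

def keep_mask(probs: torch.Tensor, free_idx: int = 0,
              free_thresh: float = 0.9, max_entropy: float = 1.0):
    """probs: (N, C) softmax scores -> boolean mask of pixels worth lifting."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)   # uncertainty
    confident_free = (probs[:, free_idx] > free_thresh) & (entropy < max_entropy)
    return ~confident_free    # drop only pixels confidently labeled free space

probs = torch.softmax(torch.randn(6, 4) * 3, dim=-1)   # toy per-pixel scores
print(keep_mask(probs))   # True where the pixel should still be lifted to 3D
```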

[210] A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection

Guiying Zhu, Bowen Yang, Yin Zhuang, Tong Zhang, Guanqun Wang, Zhihao Che, He Chen, Lianlin Li

Main category: cs.CV

TL;DR: GW-VLM is a training-free approach for Open-Vocabulary Object Detection that uses a Vision Language Model and Large Language Model in a “guess what” game format with Multi-Scale Visual Language Searching and Contextual Concept Prompting.

DetailsMotivation: Existing OVOD approaches overlook the need for universal understanding of object cognition based on already pretrained foundation models. The paper aims to leverage pre-trained VLMs and LLMs without additional training.

Method: Proposes GW-VLM with two key components: 1) Multi-Scale Visual Language Searching (MS-VLS) for soft-alignment between visual features and language concepts, and 2) Contextual Concept Prompt (CCP) to help LLMs understand visual snippets for object detection.

Result: Achieves superior OVOD performance on both natural (COCO val, Pascal VOC) and remote sensing (DIOR, NWPU-10) datasets compared to state-of-the-art methods, without any training steps.

Conclusion: GW-VLM demonstrates that training-free approaches using pre-trained foundation models in a “guess what” framework can achieve state-of-the-art OVOD performance across diverse domains.

Abstract: Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriad large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding of any object cognition from already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach engages a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of “guess what”: MS-VLS leverages multi-scale visual-language soft-alignment for the VLM to generate snippets from the results of class-agnostic object detection, while CCP forms the concept flow with reference to MS-VLS and then makes the LLM understand the snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM achieves superior OVOD performance compared to state-of-the-art methods without any training step.

[211] BirdsEye-RU: A Dataset For Detecting Faces from Overhead Images

Md. Ahanaf Arif Khan, Ariful Islam, Sangeeta Biswas, Md. Iqbal Aziz Khan, Subrata Pramanik, Sanjoy Kumar Chakravarty, Bimal Kumar Pramanik

Main category: cs.CV

TL;DR: Created BirdsEye-RU dataset with 2,978 images containing 8,000+ annotated faces for overhead face detection, addressing challenges of small/distant faces in drone and high-altitude smartphone images.

DetailsMotivation: Face detection in overhead images is challenging due to extreme scale variations and environmental clutter, requiring specialized datasets for small and distant faces.

Method: Created a comprehensive dataset (BirdsEye-RU) with 2,978 images containing over 8,000 annotated faces, specifically designed to capture small and distant faces across diverse environments including drone and high-altitude smartphone images.

Result: The BirdsEye-RU dataset is now publicly available and can be accessed at https://www.kaggle.com/datasets/mdahanafarifkhan/birdseye-ru, providing a valuable resource for overhead face detection research.

Conclusion: The paper introduces a specialized dataset to address the challenges of face detection in overhead imagery, making it freely available to advance research in this domain.

Abstract: Detecting faces in overhead images remains a significant challenge due to extreme scale variations and environmental clutter. To address this, we created the BirdsEye-RU dataset, a comprehensive collection of 2,978 images containing over eight thousand annotated faces. This dataset is specifically designed to capture small and distant faces across diverse environments, containing both drone images and smartphone-captured images from high altitude. We present a detailed description of the BirdsEye-RU dataset in this paper. We made our dataset freely available to the public, and it can be accessed at https://www.kaggle.com/datasets/mdahanafarifkhan/birdseye-ru.

[212] Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

Marcus Ma, Jordan Prescott, Emily Zhou, Tiantian Feng, Kleanthis Avramidis, Gabor Mihaly Toth, Shrikanth Narayanan

Main category: cs.CV

TL;DR: Self-supervised eye movement reconstruction from low-resolution videos effectively predicts emotional expression markers like speech emotion alignment and momentary emotional behaviors.

DetailsMotivation: Most eye movement emotion studies use specialized high-resolution equipment, limiting accessibility. The authors want to predict emotional expression from naturalistic, low-resolution videos to broaden applications.

Method: Developed a novel gaze detection model using self-supervised eye movement reconstruction (inspired by language model pretraining) that leverages unlabeled video. Used encoder embeddings to fine-tune on two downstream tasks: 1) aligning eye movement with speech emotion estimates, and 2) predicting momentary emotional behaviors (laughing, crying/sobbing, sighing). Tested on Holocaust survivor interview videos from USC Shoah Foundation’s Visual History Archive.

Result: The new model effectively predicts emotion outcomes. Found positive correlation between pretraining performance and emotion processing performance for both experiments.

Conclusion: Self-supervised eye movement reconstruction is an effective method for encoding the affective signal carried by eye movements, enabling emotion prediction from low-resolution naturalistic videos.

Abstract: The relationship between emotional expression and eye movement is well-documented, with literature establishing gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation’s Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction that can effectively leverage unlabeled video. We use this model’s encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude that self-supervised eye movement reconstruction is an effective method for encoding the affective signal that eye movements carry.

[213] VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

Tiancheng Fang, Bowen Pan, Lingxi Chen, Jiangjing Lyu, Chengfei Lyu, Chaoyue Niu, Fan Wu

Main category: cs.CV

TL;DR: VIAFormer is a transformer model that refines incomplete 3D voxels using multi-view images as guidance, achieving state-of-the-art performance in correcting synthetic and real-world artifacts.

DetailsMotivation: The paper addresses the problem of repairing incomplete and noisy 3D voxel representations using multi-view images as guidance, which is crucial for practical 3D creation pipelines and enabling voxel-based methods to work effectively with vision foundation models.

Method: VIAFormer uses three key components: 1) Image Index for explicit 3D spatial grounding of 2D image tokens, 2) Correctional Flow objective that learns direct voxel-refinement trajectories, and 3) Hybrid Stream Transformer for robust cross-modal fusion between voxels and images.

Result: VIAFormer establishes new state-of-the-art performance in correcting both severe synthetic corruptions and realistic artifacts on voxel shapes obtained from vision foundation models, demonstrating superior refinement capabilities.

Conclusion: VIAFormer serves as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the era of large models and big data, effectively connecting 2D vision foundation models with 3D reconstruction tasks.

Abstract: We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement, the task of repairing incomplete, noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on voxel shapes obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in the era of large models and big data.

[214] HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection

Daniel Kyselica, Jonáš Herec, Oliver Kutis, Rado Pitoňák

Main category: cs.CV

TL;DR: The paper proposes HiT (History Injection mechanism for Transformer models), an onboard change detection system for flood monitoring on small satellites that maintains historical context while reducing data storage by over 99% and achieving real-time processing.

DetailsMotivation: Natural disaster monitoring requires continuous satellite observation with multi-temporal data processing under strict operational constraints (memory and computational limits of small satellites). Current systems lack efficient onboard processing capabilities for real-time hazard assessment.

Method: Developed HiT mechanism for Transformer models that maintains historical context from previous observations while dramatically reducing data storage requirements. Implemented within the Prithvi-tiny foundation model for flood detection applications.

Result: The HiT mechanism reduces data storage by over 99% relative to the original image size while maintaining detection accuracy comparable to the bitemporal baseline on the STTORM-CD flood dataset. The HiT-Prithvi model achieved 43 FPS on a Jetson Orin Nano (representative nanosat hardware).

Conclusion: The work establishes a practical framework for satellite-based continuous monitoring of natural disasters, enabling real-time hazard assessment without dependency on ground-based processing infrastructure. Architecture and model checkpoints are publicly available.

Abstract: Natural disaster monitoring through continuous satellite observation requires processing multi-temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose the History Injection mechanism for Transformer models (HiT), which maintains historical context from previous observations while reducing data storage by over 99% relative to the original image size. Moreover, testing on the STTORM-CD flood dataset confirms that the HiT mechanism within the Prithvi-tiny foundation model maintains detection accuracy comparable to the bitemporal baseline. The proposed HiT-Prithvi model achieved 43 FPS on a Jetson Orin Nano, representative onboard hardware used in nanosats. This work establishes a practical framework for satellite-based continuous monitoring of natural disasters, supporting real-time hazard assessment without dependency on ground-based processing infrastructure. The architecture as well as model checkpoints are available at https://github.com/zaitra/HiT-change-detection
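
A minimal sketch of what injecting compressed history into a transformer could look like: the previous pass’s patch tokens are pooled into a few summary tokens that are stored onboard instead of the full image, then prepended to the next pass’s sequence. The pooling scheme and token counts are assumptions, not the published mechanism.

```python
import torch

def compress_history(feats: torch.Tensor, n_tokens: int = 4) -> torch.Tensor:
    """(seq, dim) patch tokens -> (n_tokens, dim) summary kept onboard."""
    chunks = feats.chunk(n_tokens, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

def inject(history: torch.Tensor, current: torch.Tensor) -> torch.Tensor:
    """Prepend stored history tokens to the new pass's token sequence."""
    return torch.cat([history, current], dim=0)

prev = torch.randn(196, 256)    # tokens from the previous observation
cur = torch.randn(196, 256)     # tokens from the newly acquired image
print(inject(compress_history(prev), cur).shape)   # torch.Size([200, 256])
```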

[215] Human detectors are surprisingly powerful reward models

Kumar Ashutosh, XuDong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar

Main category: cs.CV

TL;DR: HuDA is a simple reward model that improves human motion quality in generated videos by combining human detection confidence and temporal prompt alignment, outperforming specialized models without additional training.

DetailsMotivation: Current video generation models struggle with complex non-rigid human motions (sports, dance, etc.), often producing distorted poses, missing limbs, or physically implausible actions.

Method: HuDA integrates two components: human detection confidence for appearance quality and temporal prompt alignment score for motion realism. Uses off-the-shelf models without additional training, applied via Group Reward Policy Optimization (GRPO) post-training.

Result: HuDA outperforms specialized models fine-tuned with manual annotations, achieves 73% win-rate against state-of-the-art models like Wan 2.1, and improves generation quality beyond humans (animals, human-object interactions).

Conclusion: A simple reward model leveraging existing components can significantly enhance video generation quality for complex motions, demonstrating effectiveness across various dynamic subjects beyond just humans.

Abstract: Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet, they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports, dance, etc. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality, and a temporal prompt alignment score to capture motion realism. We show that this simple reward function, which leverages off-the-shelf models without any additional training, outperforms specialized models fine-tuned with manually annotated data. Using HuDA for Group Reward Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1 with a win rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.
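
Since the reward is described as a combination of detection confidence and prompt alignment, here is one hypothetical combination; the `detect` and `align` callables and the product rule are assumptions, not the paper’s API.

```python
import numpy as np

def huda_reward(frames, prompt, detect, align):
    det_conf = np.mean([detect(f) for f in frames])  # appearance quality
    motion = align(frames, prompt)                   # temporal prompt alignment
    return det_conf * motion                         # combined scalar reward

detect = lambda frame: 0.8             # stub: per-frame person confidence
align = lambda frames, prompt: 0.6     # stub: video-text alignment score
print(huda_reward([None] * 16, "a person doing a cartwheel", detect, align))
```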

cs.AI

[216] The Ontological Neutrality Theorem: Why Neutral Ontological Substrates Must Be Pre-Causal and Pre-Normative

Denise M. Case

Main category: cs.AI

TL;DR: Impossibility of ontological neutrality: any ontology with causal or normative commitments cannot serve as neutral substrate across conflicting frameworks.

DetailsMotivation: Modern data systems need to support accountability across persistent disagreement, requiring ontologies that can function as shared substrates across divergent legal, political, and analytic frameworks.

Method: Establishes an impossibility result through logical analysis, showing that neutrality (interpretive non-commitment and stability under incompatible extensions) is incompatible with foundational causal or normative commitments.

Result: Proves that neutral ontological substrates must be pre-causal and pre-normative, representing only entities with identity and persistence conditions while externalizing interpretation, evaluation, and explanation.

Conclusion: Establishes necessary design constraints for systems needing shared, stable representations across conflicting frameworks: they must avoid foundational causal/deontic commitments and focus on entity representation with externalized interpretation.

Abstract: Modern data systems must support accountability across persistent legal, political, and analytic disagreement. This requirement imposes strict constraints on the design of any ontology intended to function as a shared substrate. We establish an impossibility result for ontological neutrality: neutrality, understood as interpretive non-commitment and stability under incompatible extensions, is incompatible with the inclusion of causal or normative commitments at the foundational layer. Any ontology that asserts causal or deontic conclusions as ontological facts cannot serve as a neutral substrate across divergent frameworks without revision or contradiction. It follows that neutral ontological substrates must be pre-causal and pre-normative, representing entities, together with identity and persistence conditions, while externalizing interpretation, evaluation, and explanation. This paper does not propose a specific ontology or protocol; rather, it establishes the necessary design constraints for any system intended to maintain a shared, stable representation of reality across conflicting interpretive frameworks.

[217] Epistemic Constitutionalism Or: how to avoid coherence bias

Michele Loi

Main category: cs.AI

TL;DR: AI systems need explicit epistemic constitutions to govern how they form and express beliefs, addressing biases like source attribution bias where models penalize arguments based on perceived ideological mismatches between sources and content.

DetailsMotivation: Large language models function as artificial reasoners but their belief-forming behavior is governed by implicit, uninspected epistemic policies. The paper identifies source attribution bias as a key problem where frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content.

Method: The paper distinguishes two constitutional approaches: Platonic (mandates formal correctness and default source-independence from a privileged standpoint) and Liberal (refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance). The author argues for the Liberal approach and sketches a constitutional core of eight principles and four orientations.

Result: The paper shows that frontier models enforce identity-stance coherence, but when models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. The author proposes that AI epistemic governance requires explicit, contestable structure similar to what we now expect for AI ethics.

Conclusion: The paper argues for an epistemic constitution for AI with explicit, contestable meta-norms that regulate how systems form and express beliefs, advocating for a Liberal constitutional approach over a Platonic one, and proposing that AI epistemic governance needs the same explicit, contestable structure as AI ethics.

Abstract: Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument’s content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

[218] VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

Main category: cs.AI

TL;DR: Vision-language models struggle with math problems presented as images vs. text. VisTIRA framework uses tool-integrated reasoning to decompose image-based math problems into natural language rationales and executable Python steps to improve visual math reasoning.

DetailsMotivation: VLMs perform significantly worse on mathematical reasoning when problems are presented as images rather than text, due to failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. This "modality gap" needs to be addressed.

Method: 1) Introduced VisTIRA (Vision and Tool-Integrated Reasoning Agent) - a tool-integrated reasoning framework that decomposes image-based math problems into natural language rationales and executable Python steps. 2) Created a LaTeX-based pipeline to convert chain-of-thought math corpora into challenging image counterparts. 3) Built synthetic tool-use trajectories from SnapAsk dataset for fine-tuning VLMs.

Result: Tool-integrated supervision improves image-based reasoning, and OCR grounding helps smaller models (though benefits diminish at scale). The modality gap severity inversely correlates with model size, and structured reasoning with OCR-based grounding are complementary strategies.

Conclusion: The modality gap in visual mathematical reasoning can be addressed through structured reasoning frameworks like VisTIRA and appropriate training data. Combining tool-integrated reasoning with OCR grounding provides effective strategies for advancing VLMs’ visual math capabilities.

Abstract: Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings highlight that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.
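
A minimal sketch of one tool-integrated turn in the VisTIRA style: the model emits a rationale plus executable Python, the snippet runs, and its output is appended for the next turn. The message format and the stubbed model are assumptions, not the paper’s protocol.

```python
import contextlib
import io

def run_python(snippet: str) -> str:
    """Execute a tool call and capture stdout (sandbox this in real use)."""
    buf, namespace = io.StringIO(), {}
    with contextlib.redirect_stdout(buf):
        exec(snippet, namespace)
    return buf.getvalue().strip()

def vistira_turn(history: list, vlm):
    rationale, code = vlm(history)    # model decomposes the current step
    observation = run_python(code)    # execute the emitted Python
    history.append((rationale, code, observation))
    return observation

stub_vlm = lambda history: (
    "Compute the discriminant of x^2 - 5x + 6.",
    "print(5**2 - 4*1*6)",
)
print(vistira_turn([], stub_vlm))     # -> "1"
```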

[219] On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL

Valerio Belcamino, Nicholas Attolino, Alessio Capitanelli, Fulvio Mastrogiovanni

Main category: cs.AI

TL;DR: Fine-tuned LLMs achieve high in-domain planning performance but fail completely on unseen domains, revealing reliance on domain-specific patterns rather than transferable planning competence.

DetailsMotivation: To investigate whether fine-tuned LLMs demonstrate genuine transferable planning competence or merely domain-specific memorization when solving PDDL planning tasks.

Method: Fine-tuned a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, evaluated in-domain and cross-domain generalization. Introduced three diagnostic interventions: symbol anonymization, compact plan serialization, and verifier-reward fine-tuning using VAL validator as reinforcement signal.

Result: Model achieved 82.9% valid plan rate in-domain but 0% on two unseen domains. Diagnostic interventions revealed strong sensitivity to surface representations (anonymization/serialization caused significant drops). Verifier-reward fine-tuning reached saturation faster but didn’t improve cross-domain generalization.

Conclusion: Fine-tuned LLMs rely heavily on domain-specific patterns rather than transferable planning competence, highlighting a persistent generalization gap in LLM-based planning. The diagnostic tools help study this gap’s causes.

Abstract: Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches 82.9% valid plan rate in in-domain conditions, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, thus revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.
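
To illustrate the first diagnostic, here is a toy instance-wise symbol anonymizer for PDDL-like text; the opaque-token scheme and the identifier pattern are assumptions, not the paper’s exact renaming.

```python
import re

def anonymize(pddl: str) -> str:
    """Map every object constant to a fresh opaque token, consistently."""
    mapping: dict = {}
    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"obj{len(mapping)}"
        return mapping[name]
    # Rename lowercase identifiers that look like object constants.
    return re.sub(r"\b(?:truck|package|city)\d+\b", rename, pddl)

print(anonymize("(at truck1 city2) (in package1 truck1)"))
# -> "(at obj0 obj1) (in obj2 obj0)"
```

If a model’s valid plan rate drops sharply after such a semantics-preserving rename, it is relying on lexical cues rather than structure, which is exactly the sensitivity the paper reports.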

[220] Semantic-Guided Unsupervised Video Summarization

Haizhou Liu, Haodong Jin, Yiming Wang, Hui Yu

Main category: cs.AI

TL;DR: Proposes a semantic-guided unsupervised video summarization method using frame-level semantic alignment attention and incremental training to address GAN instability and unimodal feature limitations.

DetailsMotivation: Existing unsupervised video summarization methods rely on GANs but have two main limitations: 1) they primarily use unimodal features, overlooking semantic information's guiding role in keyframe selection, and 2) they suffer from unstable GAN training.

Method: 1) Design a novel frame-level semantic alignment attention mechanism integrated into a keyframe selector; 2) Use this to guide a Transformer-based generator within an adversarial framework for better video reconstruction; 3) Adopt an incremental training strategy to progressively update model components and mitigate GAN instability.

Result: The approach achieves superior performance on multiple benchmark datasets compared to existing methods.

Conclusion: The proposed semantic-guided unsupervised video summarization method effectively addresses limitations of existing GAN-based approaches by incorporating semantic guidance and stabilizing training, leading to improved performance on standard benchmarks.

Abstract: Video summarization is a crucial technique for social understanding, enabling efficient browsing of massive multimedia content and extraction of key information from social platforms. Most existing unsupervised summarization methods rely on Generative Adversarial Networks (GANs) to enhance keyframe selection and generate coherent video summaries through adversarial training. However, such approaches primarily exploit unimodal features, overlooking the guiding role of semantic information in keyframe selection, and often suffer from unstable training. To address these limitations, we propose a novel Semantic-Guided Unsupervised Video Summarization method. Specifically, we design a novel frame-level semantic alignment attention mechanism and integrate it into a keyframe selector, which guides the Transformer-based generator within the adversarial framework to better reconstruct videos. In addition, we adopt an incremental training strategy to progressively update the model components, effectively mitigating the instability of GAN training. Experimental results demonstrate that our approach achieves superior performance on multiple benchmark datasets.

[221] Scalable Knee-Point Guided Activity Group Selection in Multi-Tree Genetic Programming for Dynamic Multi-Mode Project Scheduling

Yuan Tian, Yi Mei, Mengjie Zhang

Main category: cs.AI

TL;DR: GP evolves rules for group selection in multi-mode project scheduling, using knee-point mechanism to improve scalability on large instances.

DetailsMotivation: Group selection strategy for activity-mode scheduling has scalability issues on larger problems, needing enhancement for practical application.

Method: Multi-tree GP framework evolves two rules: activity ordering rule ranks eligible pairs, knee-point selection identifies promising subset, then group selection rule picks best combination.

Result: Approach scales well to large instances and outperforms GP with sequential decision-making in most scenarios.

Conclusion: Knee-point-based selection mechanism effectively enhances scalability of group selection strategy for dynamic multi-mode resource-constrained project scheduling.

Abstract: The dynamic multi-mode resource-constrained project scheduling problem is a challenging scheduling problem that requires making decisions on both the execution order of activities and their corresponding execution modes. Genetic programming has been widely applied as a hyper-heuristic to evolve priority rules that guide the selection of activity-mode pairs from the current eligible set. Recently, an activity group selection strategy has been proposed to select a subset of activities rather than a single activity at each decision point, allowing for more effective scheduling by considering the interdependence between activities. Although effective in small-scale instances, this strategy suffers from scalability issues when applied to larger problems. In this work, we enhance the scalability of the group selection strategy by introducing a knee-point-based selection mechanism to identify a promising subset of activities before evaluating their combinations. An activity ordering rule is first used to rank all eligible activity-mode pairs, followed by a knee point selection to find the promising pairs. Then, a group selection rule selects the best activity combination. We develop a multi-tree GP framework to evolve both types of rules simultaneously. Experimental results demonstrate that our approach scales well to large instances and outperforms GP with sequential decision-making in most scenarios.
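
One common way to realize the knee-point idea is the max-distance-to-chord heuristic sketched below; the paper’s knee definition may differ, so treat this as one plausible instantiation over priority scores.

```python
import numpy as np

def knee_point_subset(scores: np.ndarray) -> np.ndarray:
    order = np.argsort(scores)[::-1]        # rank candidates best-first
    y, x = scores[order], np.arange(len(scores), dtype=float)
    # Perpendicular distance of every point to the chord joining the ends.
    v = np.array([x[-1] - x[0], y[-1] - y[0]])
    pts = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(v[0] * pts[:, 1] - v[1] * pts[:, 0]) / np.linalg.norm(v)
    knee = int(np.argmax(dist))
    return order[: knee + 1]                # the promising subset above the knee

scores = np.array([9.1, 8.8, 8.5, 4.0, 3.9, 1.2, 0.8])   # rule priorities
print(knee_point_subset(scores))            # indices of pairs kept for grouping
```

Only this pre-filtered subset is then passed to the group selection rule, which is what keeps combination evaluation tractable on large instances.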

[222] “Just in Time” World Modeling Supports Human Planning and Reasoning

Tony Chen, Sam Cheyette, Kelsey Allen, Joshua Tenenbaum, Kevin Smith

Main category: cs.AI

TL;DR: People use simplified mental representations for simulation-based reasoning, constructing them “just-in-time” through interleaved simulation and visual search rather than pre-computing simplifications.

DetailsMotivation: Human mental simulation exceeds realistic capacity limits in complex environments. While people likely use simplified representations, it's unclear how they efficiently determine what to simplify without excessive pre-computation.

Method: Proposes a “Just-in-Time” framework where simulation, visual search, and representation modification are tightly interleaved. Current simulation guides where to look, and visual search flags objects to encode for subsequent simulation, constructing simplified representations online with minimal added computation.

Result: The model makes high-utility predictions despite encoding only a small subset of objects. Strong empirical support found over alternative models in grid-world planning and physical reasoning tasks across multiple behavioral measures.

Conclusion: Provides a concrete algorithmic account of how people construct reduced representations to support efficient mental simulation, demonstrating that simplified representations can be built “just-in-time” rather than pre-computed.

Abstract: Probabilistic mental simulation is thought to play a key role in human reasoning, planning, and prediction, yet the demands of simulation in complex environments exceed realistic human capacity limits. A theory with growing evidence is that people simulate using simplified representations of the environment that abstract away from irrelevant details, but it is unclear how people determine these simplifications efficiently. Here, we present a “Just-in-Time” framework for simulation-based reasoning that demonstrates how such representations can be constructed online with minimal added computation. The model uses a tight interleaving of simulation, visual search, and representation modification, with the current simulation guiding where to look and visual search flagging objects that should be encoded for subsequent simulation. Despite only ever encoding a small subset of objects, the model makes high-utility predictions. We find strong empirical support for this account over alternative models in a grid-world planning task and a physical reasoning task across a range of behavioral measures. Together, these results offer a concrete algorithmic account of how people construct reduced representations to support efficient mental simulation.
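
A toy instantiation of the interleaving on a 2-D grid, where the rollout only encodes cells it is about to enter; the `step` and `look` functions are placeholders, not the paper’s model.

```python
def jit_simulate(start, goal, step, look, max_steps=50):
    encoded = {}                            # sparse world model: cell -> label
    state, trajectory = start, [start]
    for _ in range(max_steps):
        nxt = step(state, goal, encoded)    # simulation proposes a move
        if nxt not in encoded:              # representation gap at nxt:
            encoded[nxt] = look(nxt)        # visual search, then encode it
        if encoded[nxt] != "wall":          # move only through free space
            state = nxt
            trajectory.append(state)
        if state == goal:
            break
    return trajectory, encoded

look = lambda cell: "wall" if cell == (2, 0) else "free"
def step(state, goal, encoded):
    x, y = state
    cand = (x + (goal[0] > x) - (goal[0] < x),
            y + (goal[1] > y) - (goal[1] < y))
    if encoded.get(cand) == "wall":
        cand = (x, y + 1)                   # sidestep known obstacles
    return cand

path, world = jit_simulate((0, 0), (4, 0), step, look)
print(path)          # detours around the wall at (2, 0)
print(len(world))    # only the handful of cells that were ever looked at
```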

[223] Large Language Model-Powered Evolutionary Code Optimization on a Phylogenetic Tree

Leyi Zhao, Weijie Huang, Yitong Guo, Jiang Bian, Chenghong Wang, Xuhong Zhang

Main category: cs.AI

TL;DR: PhyloEvolve is an LLM-agent system that treats GPU algorithm optimization as In-Context Reinforcement Learning, using phylogenetic trees to organize optimization history and enable experience reuse without model retraining.

DetailsMotivation: Current LLM-assisted evolutionary methods for GPU code optimization rely on outcome-based selection and random mutation, wasting rich trajectory information generated during iterative optimization processes.

Method: PhyloEvolve integrates Algorithm Distillation and prompt-based Decision Transformers into an iterative workflow, using phylogenetic trees to capture inheritance, divergence, and recombination among algorithm variants. It combines elite trajectory pooling, multi-island parallel exploration, and containerized execution.
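
A minimal sketch of what a phylogenetic record over algorithm variants might look like: each node stores code, score, and parent links, so lineages can be replayed as in-context trajectories and recombination is simply a node with two parents. Field names and the lineage format are assumptions, not the paper’s data model.

```python
# Minimal phylogenetic record over algorithm variants: nodes keep code,
# benchmark score, and parent links, so lineages (for in-context "trajectory"
# prompts) and backtracking points can be recovered.
from dataclasses import dataclass, field

@dataclass
class VariantNode:
    code: str
    score: float
    parents: list = field(default_factory=list)  # >1 parent = recombination
    children: list = field(default_factory=list)

    def spawn(self, code, score, other_parent=None):
        parents = [self] + ([other_parent] if other_parent else [])
        child = VariantNode(code, score, parents)
        for p in parents:
            p.children.append(child)
        return child

    def lineage(self):
        """Walk back along first parents: the trajectory fed to the LLM."""
        node, path = self, []
        while node:
            path.append((node.code, node.score))
            node = node.parents[0] if node.parents else None
        return list(reversed(path))

root = VariantNode("baseline kernel", score=1.0)
v1 = root.spawn("tiled kernel", score=1.8)
v2 = v1.spawn("tiled + fused kernel", score=2.4, other_parent=root)
for code, score in v2.lineage():
    print(f"{score:4.1f}x  {code}")
```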

Result: The system demonstrates consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms.

Conclusion: PhyloEvolve successfully reframes GPU algorithm optimization as an In-Context Reinforcement Learning problem, enabling trajectory-conditioned reuse of optimization experience without model retraining through phylogenetic organization of optimization history.

Abstract: Optimizing scientific computing algorithms for modern GPUs is a labor-intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)-assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome-based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM-agent system that reframes GPU-oriented algorithm optimization as an In-Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory-conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt-based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first-class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross-lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi-island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: https://github.com/annihi1ation/phylo_evolve

[224] MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Semih Yavuz, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: MAS-Orchestra: A training-time framework that formulates multi-agent system orchestration as function-calling reinforcement learning with holistic orchestration, plus MASBENCH benchmark to understand when MAS outperform single-agent systems.

DetailsMotivation: Current multi-agent system (MAS) design approaches under-deliver due to methodological complexity (sequential, code-level execution limiting global reasoning) and efficacy uncertainty (deploying MAS without understanding benefits over single-agent systems).

Method: MAS-Orchestra abstracts complex sub-agents as callable functions, enabling global reasoning over system structure while hiding internal details. It formulates MAS orchestration as function-calling reinforcement learning with holistic orchestration (generating entire MAS at once). MASBENCH benchmark characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness.
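
A hedged sketch of the core abstraction: sub-agents exposed as callable functions with descriptions, and a holistic plan emitted as one structured set of calls whose inputs can reference earlier outputs. The schema fields and wiring convention are illustrative, not the paper’s interface.

```python
# Sketch: goal-oriented sub-agents exposed as callable functions, so an
# orchestrator reasons over signatures (not internals) and can emit an
# entire MAS at once as a structured plan of calls.
def make_tool(name, description, run):
    return {"name": name, "description": description, "run": run}

SUB_AGENTS = [
    make_tool("searcher", "Retrieve evidence for a factual query.",
              lambda q: f"evidence({q})"),
    make_tool("solver", "Solve a reasoning subproblem over given evidence.",
              lambda q: f"solved({q})"),
]

# A "holistic" orchestration: the whole plan in one shot, with later calls
# referencing earlier outputs by the placeholder $0, $1, ...
plan = [("searcher", "capital of France"), ("solver", "verify claim using $0")]

outputs = []
for tool_name, arg in plan:
    tool = next(t for t in SUB_AGENTS if t["name"] == tool_name)
    for i, out in enumerate(outputs):          # wire data dependencies
        arg = arg.replace(f"${i}", out)
    outputs.append(tool["run"](arg))
print(outputs)
```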

Result: Analysis reveals MAS gains depend critically on task structure, verification protocols, and capabilities of orchestrator/sub-agents rather than holding universally. MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA.

Conclusion: MAS-Orchestra and MASBENCH together enable better training and understanding of multi-agent systems, providing insights into when and why MAS are beneficial for pursuing multi-agent intelligence.

Abstract: While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

[225] Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems

Shuhua Yang, Jiahao Zhang, Yilong Wang, Dongwon Lee, Suhang Wang

Main category: cs.AI

TL;DR: AGEA framework enables efficient extraction of hidden knowledge graphs from GraphRAG systems using limited queries, achieving up to 90% recovery of entities and relationships.

DetailsMotivation: Prior work shows GraphRAG systems can leak retrieved subgraphs, but the feasibility of reconstructing the entire hidden graph structure under realistic query budgets remains unexplored. The paper aims to investigate whether modern GraphRAG systems are vulnerable to structured extraction attacks even with strict query limits.

Method: AGEA (Agentic Graph Extraction Attack) framework uses: 1) novelty-guided exploration-exploitation strategy, 2) external graph memory modules, and 3) a two-stage pipeline combining lightweight discovery with LLM-based filtering. The approach operates in a budget-constrained black-box setting where adversaries adaptively query the system.
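
A toy sketch of the budget-constrained attack loop, assuming an epsilon-greedy novelty score over an external graph memory; the scoring rule and the mock `query_system` stand-in are assumptions, not AGEA’s actual components.

```python
# Sketch of novelty-guided query selection against a GraphRAG target: keep an
# external memory of recovered entities and prefer seeds whose neighborhoods
# recently yielded new nodes (exploit), with occasional random seeds (explore).
import random

memory = {"nodes": set(), "novelty": {}}   # entity -> recent yield of new nodes

def query_system(seed):
    """Mock black-box GraphRAG call: entities 'mentioned' in a response."""
    fake_graph = {"aspirin": ["ibuprofen", "cox-1"], "cox-1": ["cox-2"],
                  "ibuprofen": ["cox-2"], "cox-2": []}
    return fake_graph.get(seed, [])

def pick_seed(epsilon=0.2):
    if not memory["nodes"] or random.random() < epsilon:        # explore
        return random.choice(list(memory["nodes"]) or ["aspirin"])
    return max(memory["nodes"], key=lambda n: memory["novelty"].get(n, 1.0))

random.seed(0)
for _ in range(6):                          # budget-constrained query loop
    seed = pick_seed()
    found = query_system(seed)
    new = [e for e in found if e not in memory["nodes"]]
    memory["novelty"][seed] = len(new)      # exploit productive regions
    memory["nodes"].update(found + [seed])
print("recovered entities:", sorted(memory["nodes"]))
```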

Result: AGEA significantly outperforms prior attack baselines under identical query budgets, recovering up to 90% of entities and relationships while maintaining high precision. Evaluated on medical, agriculture, and literary datasets across Microsoft-GraphRAG and LightRAG systems.

Conclusion: Modern GraphRAG systems are highly vulnerable to structured, agentic extraction attacks even under strict query limits, demonstrating serious security implications for knowledge graph-based retrieval systems.

Abstract: Graph-based retrieval-augmented generation (GraphRAG) systems construct knowledge graphs over document collections to support multi-hop reasoning. While prior work shows that GraphRAG responses may leak retrieved subgraphs, the feasibility of query-efficient reconstruction of the hidden graph structure remains unexplored under realistic query budgets. We study a budget-constrained black-box setting where an adversary adaptively queries the system to steal its latent entity-relation graph. We propose AGEA (Agentic Graph Extraction Attack), a framework that leverages a novelty-guided exploration-exploitation strategy, external graph memory modules, and a two-stage graph extraction pipeline combining lightweight discovery with LLM-based filtering. We evaluate AGEA on medical, agriculture, and literary datasets across Microsoft-GraphRAG and LightRAG systems. Under identical query budgets, AGEA significantly outperforms prior attack baselines, recovering up to 90% of entities and relationships while maintaining high precision. These results demonstrate that modern GraphRAG systems are highly vulnerable to structured, agentic extraction attacks, even under strict query limits.

[226] Local Language Models for Context-Aware Adaptive Anonymization of Sensitive Text

Aisvarya Adeseye, Jouni Isoaho, Seppo Virtanen, Mohammad Tahir

Main category: cs.AI

TL;DR: LLMs used to create adaptive anonymization framework for qualitative research data that outperforms manual methods in detecting sensitive information while preserving data meaning.

DetailsMotivation: Manual anonymization of qualitative research data is time-consuming, inconsistent, and error-prone, while existing automated tools lack context awareness and risk altering data meaning. There's a need for reliable, repeatable, and context-aware anonymization that handles privacy risks in qualitative transcripts.

Method: Developed Structured Framework for Adaptive Anonymizer (SFAA) with three steps: detection, classification, and adaptive anonymization. Uses four strategies: rule-based substitution, context-aware rewriting, generalization, and suppression based on identifier type and risk level. Framework guided by GDPR, HIPAA, and OECD standards. Evaluated using dual-method approach with manual and LLM-assisted processing across two case studies with LLaMA and Phi models.
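
Since the four strategies and their routing by identifier type and risk level are stated explicitly, a small dispatch sketch can make the adaptive step concrete; the specific routing table and the `rewrite_with_llm` stub are illustrative assumptions, not the paper’s exact policy.

```python
# Sketch of SFAA's adaptive step: route each detected identifier to one of the
# four strategies by identifier type and risk level.
def anonymize(text, span, id_type, risk):
    if id_type == "name":                        # rule-based substitution
        return text.replace(span, "[PARTICIPANT]")
    if id_type == "org" and risk == "high":      # suppression
        return text.replace(span, "")
    if id_type == "org":                         # generalization
        return text.replace(span, "a large organization")
    if risk == "high":                           # context-aware rewriting
        return text.replace(span, rewrite_with_llm(span))
    return text

def rewrite_with_llm(span):
    # Placeholder for a local-LLM call (e.g., LLaMA or Phi) that paraphrases
    # the sensitive detail while preserving sentiment and meaning.
    return "[REWRITTEN DETAIL]"

sentence = "Maria from Acme Corp said the rollout at the Helsinki office failed."
step1 = anonymize(sentence, "Maria", "name", "low")
step2 = anonymize(step1, "Acme Corp", "org", "medium")
print(anonymize(step2, "the Helsinki office", "location", "high"))
```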

Result: LLMs detected more sensitive data than human reviewers. Phi outperformed LLaMA in finding sensitive data (over 91% detection rate) with 94.8% sentiment preservation, though it made slightly more errors. The framework proved accurate without affecting qualitative data analysis.

Conclusion: Local LLMs provide effective context-aware anonymization for qualitative research, outperforming manual methods while preserving data integrity. The SFAA framework offers a reliable, standards-compliant approach to handling privacy risks in sensitive research data.

Abstract: Qualitative research often contains personal, contextual, and organizational details that pose privacy risks if not handled appropriately. Manual anonymization is time-consuming, inconsistent, and frequently omits critical identifiers. Existing automated tools tend to rely on pattern matching or fixed rules, which fail to capture context and may alter the meaning of the data. This study uses local LLMs to build a reliable, repeatable, and context-aware anonymization process for detecting and anonymizing sensitive data in qualitative transcripts. We introduce a Structured Framework for Adaptive Anonymizer (SFAA) that includes three steps: detection, classification, and adaptive anonymization. The SFAA incorporates four anonymization strategies: rule-based substitution, context-aware rewriting, generalization, and suppression. These strategies are applied based on the identifier type and the risk level. The identifiers handled by the SFAA are guided by major international privacy and research ethics standards, including the GDPR, HIPAA, and OECD guidelines. This study followed a dual-method evaluation that combined manual and LLM-assisted processing. Two case studies were used to support the evaluation. The first includes 82 face-to-face interviews on gamification in organizations. The second involves 93 machine-led interviews using an AI-powered interviewer to test LLM awareness and workplace privacy. Two local models, LLaMA and Phi, were used to evaluate the performance of the proposed framework. The results indicate that the LLMs found more sensitive data than a human reviewer. Phi outperformed LLaMA in finding sensitive data, but made slightly more errors. Phi was able to find over 91% of the sensitive data, and 94.8% of the anonymized text kept the same sentiment as the original, indicating high accuracy that does not affect the analysis of the qualitative data.

[227] IB-GRPO: Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization

Shuai Wang, Yaoming Yang, Bingdong Li, Hao Hao, Aimin Zhou

Main category: cs.AI

TL;DR: IB-GRPO is an indicator-guided alignment approach for LLM-based Learning Path Recommendation that addresses pedagogical misalignment, data scarcity, and multi-objective optimization challenges through hybrid expert demonstrations, ZPD alignment scoring, and group-relative policy optimization.

DetailsMotivation: LLMs have rich semantic understanding for learning path recommendation but face three key challenges: (1) misalignment with pedagogical objectives like Zone of Proximal Development under sparse feedback, (2) scarce and costly expert demonstrations, and (3) complex multi-objective interactions among learning effect, difficulty scheduling, length control, and trajectory diversity.

Method: Proposes IB-GRPO (Indicator-Based Group Relative Policy Optimization) with three components: (1) constructs hybrid expert demonstrations using Genetic Algorithm search and teacher RL agents to mitigate data scarcity, (2) warm-starts LLM with supervised fine-tuning and designs within-session ZPD alignment score for difficulty scheduling, (3) uses Iε+ dominance indicator to compute group-relative advantages over multiple objectives without manual scalarization.
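
The $I_{ε+}$ indicator is standard in indicator-based evolutionary algorithms; the sketch below computes it pairwise (for maximization) and aggregates it IBEA-style into group-relative advantages. The kappa value, scaling, and z-score normalization are assumptions about how the advantages are standardized.

```python
# Sketch of indicator-based group-relative advantages: for a group of rollouts
# scored on M maximized objectives, i_eps(a, b) is the smallest shift making
# `a` weakly dominate `b`; pairwise indicators are aggregated into a fitness
# (as in IBEA) and z-scored into a group-relative advantage.
import math

def i_eps(a, b):
    """Min epsilon such that a + epsilon weakly dominates b (maximization)."""
    return max(fb - fa for fa, fb in zip(a, b))

def group_advantages(objs, kappa=0.05):
    scale = max(abs(i_eps(a, b)) for a in objs for b in objs if a is not b) or 1.0
    fitness = [sum(-math.exp(-i_eps(b, a) / (kappa * scale))
                   for b in objs if b is not a) for a in objs]
    mean = sum(fitness) / len(fitness)
    std = (sum((f - mean) ** 2 for f in fitness) / len(fitness)) ** 0.5 or 1.0
    return [(f - mean) / std for f in fitness]

# Each row: (learning effect, ZPD alignment, -length penalty) for one rollout.
group = [(0.9, 0.7, -0.2), (0.6, 0.9, -0.1), (0.3, 0.2, -0.5), (0.8, 0.8, -0.2)]
print([round(a, 2) for a in group_advantages(group)])  # dominated rollout lowest
```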

Result: Experiments on ASSIST09 and Junyi datasets using KES simulator with Qwen2.5-7B backbone show consistent improvements over representative RL and LLM baselines.

Conclusion: IB-GRPO effectively addresses key challenges in LLM-based Learning Path Recommendation by combining hybrid expert demonstrations, pedagogical alignment through ZPD scoring, and indicator-based multi-objective optimization, demonstrating superior performance over existing approaches.

Abstract: Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect while respecting pedagogical principles and operational constraints. Although large language models (LLMs) offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging due to (i) misalignment with pedagogical objectives such as the Zone of Proximal Development (ZPD) under sparse, delayed feedback, (ii) scarce and costly expert demonstrations, and (iii) multi-objective interactions among learning effect, difficulty scheduling, length controllability, and trajectory diversity. To address these issues, we propose IB-GRPO (Indicator-Based Group Relative Policy Optimization), an indicator-guided alignment approach for LLM-based LPR. To mitigate data scarcity, we construct hybrid expert demonstrations via Genetic Algorithm search and teacher RL agents and warm-start the LLM with supervised fine-tuning. Building on this warm-start, we design a within-session ZPD alignment score for difficulty scheduling. IB-GRPO then uses the $I_{ε+}$ dominance indicator to compute group-relative advantages over multiple objectives, avoiding manual scalarization and improving Pareto trade-offs. Experiments on ASSIST09 and Junyi using the KES simulator with a Qwen2.5-7B backbone show consistent improvements over representative RL and LLM baselines.

[228] Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Yunxiang Zhang, Moontae Lee, Hao Peng, Lu Wang, Honglak Lee

Main category: cs.AI

TL;DR: LLM judges evaluating agent performance are highly vulnerable to manipulation of reasoning traces, with false positive rates inflating up to 90% when agent chain-of-thought is rewritten while keeping actions/observations constant.

DetailsMotivation: Current LLM-based evaluation of agent performance assumes agent reasoning traces faithfully reflect internal reasoning and environment state, but this assumption may be brittle and susceptible to manipulation.

Method: Systematically rewrote agent chain-of-thought reasoning traces while holding actions and observations fixed across 800 trajectories spanning diverse web tasks. Studied both style-based (presentation changes) and content-based (fabricating task progress signals) manipulation strategies. Evaluated prompting techniques and scaling judge-time compute as potential defenses.
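
The manipulation protocol itself is simple to state in code: freeze actions and observations, rewrite only the thought fields. The two rewriters below stand in for the paper’s style-based and content-based strategies; their exact wording is illustrative.

```python
# Sketch of the manipulation protocol: actions and observations are frozen;
# only the "thought" fields are rewritten.
trajectory = [
    {"thought": "I should search for the form.", "action": "click('#search')",
     "observation": "Search page loaded."},
    {"thought": "Not sure this is the right page.", "action": "stop()",
     "observation": "Task ended."},
]

def style_rewrite(thought):
    return "Proceeding confidently: " + thought       # presentation only

def content_rewrite(thought):
    return "I located the form and submitted it successfully."  # fabricated progress

def manipulate(traj, rewriter):
    return [{**step, "thought": rewriter(step["thought"])} for step in traj]

for step in manipulate(trajectory, content_rewrite):
    assert step["action"] in {"click('#search')", "stop()"}  # actions unchanged
    print(step["thought"], "|", step["action"])
```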

Result: Manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90%. Content-based manipulations (fabricating progress signals) are consistently more effective than style-based approaches. Prompting and compute scaling reduce but don’t eliminate susceptibility to manipulation.

Conclusion: LLM-based evaluation has fundamental vulnerability to reasoning trace manipulation, highlighting need for judging mechanisms that verify reasoning claims against observable evidence rather than trusting reasoning traces at face value.

Abstract: Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent’s CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.

[229] Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Wanghan Xu, Wenlong Zhang, Fenghua Ling, Ben Fei, Yusong Hu, Runmin Ma, Bo Zhang, Fangxuan Ren, Jintai Lin, Wanli Ouyang, Lei Bai

Main category: cs.AI

TL;DR: Manalyzer is a multi-agent system that automates end-to-end meta-analysis using tool calls, addressing hallucinations in paper screening and data extraction through hybrid review, hierarchical extraction, self-proving, and feedback checking strategies.

DetailsMotivation: Traditional meta-analysis requires extensive human effort across multiple stages (literature retrieval, screening, data extraction). While LLMs can accelerate some stages, they still suffer from hallucinations in critical tasks like paper screening and data extraction.

Method: Proposes Manalyzer, a multi-agent system that achieves end-to-end automated meta-analysis through tool calls. Implements hybrid review, hierarchical extraction, self-proving, and feedback checking strategies to mitigate hallucinations in paper screening and data extraction.
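
One of the named strategies, self-proving, admits a compact sketch: an extracted datum is accepted only if its accompanying verbatim quote is actually found in the source text. The record format below is an assumption.

```python
# Sketch of a "self-proving" extraction check: every extracted value must be
# accompanied by a verbatim quote, verified against the source before the
# datum is accepted.
def self_prove(record, source_text):
    quote = record.get("evidence_quote", "")
    return bool(quote) and quote in source_text

paper_text = "The pooled odds ratio was 1.42 (95% CI 1.10-1.83) across 12 trials."
good = {"field": "odds_ratio", "value": 1.42,
        "evidence_quote": "pooled odds ratio was 1.42"}
hallucinated = {"field": "odds_ratio", "value": 2.10,
                "evidence_quote": "pooled odds ratio was 2.10"}
print(self_prove(good, paper_text), self_prove(hallucinated, paper_text))
```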

Result: Constructed a new benchmark with 729 papers across 3 domains (text, image, table modalities) containing over 10,000 data points. Extensive experiments show Manalyzer achieves significant performance improvements over LLM baselines in multiple meta-analysis tasks.

Conclusion: Manalyzer successfully automates meta-analysis while addressing key hallucination challenges through its multi-agent architecture and specialized strategies, demonstrating superior performance compared to existing LLM-based approaches.

Abstract: Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. However, while LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucination issues. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multiple meta-analysis tasks. Project page: https://black-yt.github.io/meta-analysis-page/ .

[230] AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

Main category: cs.AI

TL;DR: AutoDriDM is a decision-centric benchmark for evaluating vision-language models in autonomous driving, focusing on decision-making rather than just perception, with 6,650 questions across Object, Scene, and Decision dimensions.

DetailsMotivation: Existing autonomous driving benchmarks overemphasize perceptual competence and fail to adequately assess decision-making processes, despite vision-language models showing promising reasoning abilities. There's a need to bridge the gap between perception-centered and decision-centered evaluation.

Method: Created AutoDriDM benchmark with 6,650 questions across three dimensions (Object, Scene, Decision), evaluated mainstream VLMs, conducted correlation analysis between perception and decision-making, performed explainability analyses of reasoning processes, and introduced an analyzer model for automated large-scale annotation.

Result: The benchmark reveals weak alignment between perception and decision-making performance in VLMs, identifies key failure modes like logical reasoning errors, and provides automated analysis capabilities through the analyzer model.

Conclusion: AutoDriDM bridges the perception-decision evaluation gap and provides guidance for developing safer, more reliable VLMs for real-world autonomous driving applications.

Abstract: Autonomous driving is a highly challenging domain that requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks and metrics overemphasize perceptual competence and fail to adequately assess decision-making processes. In this work, we present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions - Object, Scene, and Decision. We evaluate mainstream VLMs to delineate the perception-to-decision capability boundary in autonomous driving, and our correlation analysis reveals weak alignment between perception and decision-making performance. We further conduct explainability analyses of models’ reasoning processes, identifying key failure modes such as logical reasoning errors, and introduce an analyzer model to automate large-scale annotation. AutoDriDM bridges the gap between perception-centered and decision-centered evaluation, providing guidance toward safer and more reliable VLMs for real-world autonomous driving.

[231] DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs

Mingxuan Song, Yusen Huo, Bohan Zhou, Shenglin Yin, Zhen Xiao, Jieyi Long, Zhilin Zhang, Chuan Yu

Main category: cs.AI

TL;DR: DARA: A dual-phase LLM framework for AI-Generated Bidding that combines few-shot reasoning with fine-grained optimization to maximize advertiser value under budget constraints.

DetailsMotivation: Traditional RL methods struggle with few-shot scenarios in AI-Generated Bidding where advertisers have personalized objectives but limited historical data. LLMs offer in-context learning capabilities but lack numerical precision for fine-grained optimization.

Method: 1) GRPO-Adaptive: An efficient LLM post-training strategy that enhances reasoning and numerical precision by dynamically updating reference policy. 2) DARA: A dual-phase framework with few-shot reasoner (generates initial plans via in-context prompting) and fine-grained optimizer (refines plans using feedback-driven reasoning).
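
A minimal sketch of a dynamically updated reference policy inside a KL-regularized, GRPO-style objective. Only “the reference policy is updated during training” comes from the summary; the EMA update and the toy log-prob arithmetic below are assumptions.

```python
# Toy sketch: advantage-weighted objective with a KL penalty against a
# reference that drifts toward the current policy (EMA update is an assumed
# realization of "dynamically updating the reference policy").
def grpo_step(policy_lp, ref_lp, advantage, beta=0.1):
    """Per-sample objective: advantage-weighted logprob minus a KL proxy."""
    kl = policy_lp - ref_lp                     # proxy for KL on this sample
    return advantage * policy_lp - beta * kl

def ema_update(ref_lp, policy_lp, tau=0.05):
    """Move the reference slowly toward the current policy."""
    return (1 - tau) * ref_lp + tau * policy_lp

ref_lp = -2.0
for step in range(3):
    policy_lp = -2.0 + 0.4 * step               # pretend the policy improves
    objective = grpo_step(policy_lp, ref_lp, advantage=1.0)
    ref_lp = ema_update(ref_lp, policy_lp)      # the "adaptive" part
    print(f"step {step}: objective={objective:.3f}, ref_lp={ref_lp:.3f}")
```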

Result: Extensive experiments on real-world and synthetic data show DARA consistently outperforms existing baselines in cumulative advertiser value under budget constraints.

Conclusion: DARA effectively combines LLMs’ in-context learning strengths with precise adaptability needed for AIGB tasks, addressing the few-shot optimization challenge in online advertising.

Abstract: Optimizing the advertiser’s cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs’ in-context learning strengths with precise adaptability required by AIGB tasks. Extensive experiments on both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.

[232] An XAI View on Explainable ASP: Methods, Systems, and Perspectives

Thomas Eiter, Tobias Geibinger, Zeynep G. Saribatur

Main category: cs.AI

TL;DR: Survey paper reviewing explanation approaches in Answer Set Programming (ASP) from an XAI perspective, analyzing types of explanations, user questions, current coverage, and identifying research gaps.

DetailsMotivation: ASP's rule-based formalism makes it naturally suitable for explainable AI, but existing explanation approaches are fragmented and don't cover all user scenarios. There's a need to systematically organize ASP explanations from an XAI perspective.

Method: Survey methodology: 1) Provide overview of ASP explanation types guided by XAI perspective, 2) Connect explanations to user questions, 3) Describe coverage by current theory and tools, 4) Analyze gaps in existing approaches.

Result: Comprehensive mapping of ASP explanation types to user questions, assessment of current tool coverage, and identification of specific gaps in existing explanation approaches.

Conclusion: While ASP has inherent advantages for explainable reasoning, current explanation approaches are incomplete and fragmented. The survey identifies research directions needed to develop more comprehensive ASP explanation frameworks.

Abstract: Answer Set Programming (ASP) is a popular declarative reasoning and problem solving approach in symbolic AI. Its rule-based formalism makes it inherently attractive for explainable and interpretive reasoning, which is gaining importance with the surge of Explainable AI (XAI). A number of explanation approaches and tools for ASP have been developed, which often tackle specific explanatory settings and may not cover all scenarios that ASP users encounter. In this survey, we provide, guided by an XAI perspective, an overview of types of ASP explanations in connection with user questions for explanation, and describe their coverage by current theory and tools. Furthermore, we pinpoint gaps in existing ASP explanation approaches and identify research directions for future work.

[233] Towards Bound Consistency for the No-Overlap Constraint Using MDDs

Amaury Guichard, Laurent Michel, Hélène Verhaeghe, Pierre Schaus

Main category: cs.AI

TL;DR: First bound-consistent algorithm for NP-complete no-overlap constraint using MDD-based filtering with width threshold for polynomial-time complexity.

DetailsMotivation: Bound consistency for no-overlap constraint is NP-complete, existing polynomial-time techniques (edge finding, not-first-not-last, energetic reasoning) are incomplete. Need stronger filtering to reduce search tree size.

Method: Builds on no-overlap MDD by Ciré and van Hoeve, extracts time window bounds to tighten start/end times in polynomial time. Limits MDD width with threshold for relaxed bound-consistent filtering to control size and complexity.
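
A simplified sketch of bound extraction from a layered MDD: a forward pass computes each node’s earliest reachable time, and a job’s start lower bound is minimized over the arcs carrying that job. The data layout is illustrative and far simpler than Ciré and van Hoeve’s actual construction.

```python
# Sketch of extracting start-time bounds from a layered no-overlap MDD.
JOBS = {"A": {"p": 3, "r": 0}, "B": {"p": 2, "r": 1}, "C": {"p": 4, "r": 0}}
# Arcs (tail, head, job) listed in topological (layer) order; node 0 = root.
ARCS = [(0, 1, "A"), (0, 2, "B"), (1, 3, "B"), (2, 3, "A"), (3, 4, "C")]

# Forward pass: earliest completion time reachable at each node.
earliest = {0: 0}
for tail, head, j in ARCS:
    t = max(earliest[tail], JOBS[j]["r"]) + JOBS[j]["p"]
    earliest[head] = min(earliest.get(head, t), t)

# A job's start lower bound: min over its arcs of max(node time, release).
start_lb = {}
for tail, head, j in ARCS:
    s = max(earliest[tail], JOBS[j]["r"])
    start_lb[j] = min(start_lb.get(j, s), s)
print(start_lb)   # e.g., C cannot start before A and B complete on all paths
```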

Result: MDD-based filtering with width threshold achieves stronger reduction in search tree nodes than previous precedence-detection algorithm. Complementary to classical propagation methods, reduces both nodes and solving time on sequencing problems with time windows.

Conclusion: First bound-consistent algorithm for no-overlap constraint using MDD with width threshold provides effective polynomial-time filtering that complements existing methods and improves solving performance.

Abstract: Achieving bound consistency for the no-overlap constraint is known to be NP-complete. Therefore, several polynomial-time tightening techniques, such as edge finding, not-first-not-last reasoning, and energetic reasoning, have been introduced for this constraint. In this work, we derive the first bound-consistent algorithm for the no-overlap constraint. By building on the no-overlap MDD defined by Ciré and van Hoeve, we extract bounds of the time window of the jobs, allowing us to tighten start and end times in time polynomial in the number of nodes of the MDD. Similarly, to bound the size and time-complexity, we limit the width of the MDD to a threshold, creating a relaxed MDD that can also be used to relax the bound-consistent filtering. Through experiments on a sequencing problem with time windows and a just-in-time objective ($1 \mid r_j, d_j, \bar{d}_j \mid \sum E_j + \sum T_j$), we observe that the proposed filtering, even with a threshold on the width, achieves a stronger reduction in the number of nodes visited in the search tree compared to the previously proposed precedence-detection algorithm of Ciré and van Hoeve. The new filtering also appears to be complementary to classical propagation methods for the no-overlap constraint, allowing a substantial reduction in both the number of nodes and the solving time on several instances.

[234] CI4A: Semantic Component Interfaces for Agents Empowering Web Automation

Zhi Qiu, Jiazheng Sun, Chenxiao Xia, Jun Zheng, Xin Peng

Main category: cs.AI

TL;DR: CI4A introduces semantic encapsulation of UI components into agent-accessible tool primitives, enabling LLM agents to better handle fine-grained web interactions and achieving 86.3% success rate on upgraded WebArena benchmark.

DetailsMotivation: LLMs excel at high-level semantic planning but struggle with fine-grained web component manipulations. Current approaches focus on adapting agents to human interfaces rather than optimizing interfaces for agents.

Method: CI4A (Component Interface for Agent) abstracts complex UI interaction logic into unified tool primitives. Implemented in Ant Design covering 23 UI component categories. Features hybrid agent with dynamically updating action space based on page state.
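
A hedged sketch of the component-interface idea: a widget’s multi-step interaction logic collapsed into one semantic tool primitive, with the action space rebuilt from whatever components the current page contains. Class and method names are invented, not Ant Design’s actual CI4A API.

```python
# Sketch: a UI component whose human interaction path (open, scroll, click a
# day...) is wrapped as a single semantic tool primitive, plus a dynamic
# action space exposing only the tools of components on the current page.
class DatePicker:
    def __init__(self):
        self.value = None

    def tool_set_date(self, iso_date):
        """Agent path: one semantic call instead of many low-level clicks."""
        self.value = iso_date

def action_space(components):
    """Rebuild the available tools from the current page state."""
    return {f"{type(c).__name__}.{m}": getattr(c, m)
            for c in components for m in dir(c) if m.startswith("tool_")}

page = [DatePicker()]
tools = action_space(page)
tools["DatePicker.tool_set_date"]("2026-01-22")
print(list(tools), "->", page[0].value)
```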

Result: CI4A-based agent achieves 86.3% task success rate (new SoTA) on refactored WebArena benchmark, with substantial improvements in execution efficiency compared to existing methods.

Conclusion: Optimizing interfaces for agents through semantic encapsulation (CI4A) is more effective than forcing agents to adapt to human interfaces, enabling better web interaction performance and efficiency.

Abstract: While Large Language Models demonstrate remarkable proficiency in high-level semantic planning, they remain limited in handling fine-grained, low-level web component manipulations. To address this limitation, extensive research has focused on enhancing model grounding capabilities through techniques such as Reinforcement Learning. However, rather than compelling agents to adapt to human-centric interfaces, we propose constructing interaction interfaces specifically optimized for agents. This paper introduces Component Interface for Agent (CI4A), a semantic encapsulation mechanism that abstracts the complex interaction logic of UI components into a set of unified tool primitives accessible to agents. We implemented CI4A within Ant Design, an industrial-grade front-end framework, covering 23 categories of commonly used UI components. Furthermore, we developed a hybrid agent featuring an action space that dynamically updates according to the page state, enabling flexible invocation of available CI4A tools. Leveraging the CI4A-integrated Ant Design, we refactored and upgraded the WebArena benchmark to evaluate existing SoTA methods. Experimental results demonstrate that the CI4A-based agent significantly outperforms existing approaches, achieving a new SoTA task success rate of 86.3%, alongside substantial improvements in execution efficiency.

[235] Measuring and Aligning Abstraction in Vision-Language Models with Medical Taxonomies

Ben Schaper, Maxime Di Folco, Bernhard Kainz, Julia A. Schnabel, Cosmin I. Bercea

Main category: cs.AI

TL;DR: VLMs show strong zero-shot chest X-ray classification but make clinically significant abstraction errors; hierarchical metrics reveal taxonomy misalignment; proposed solutions reduce severe errors below 2%.

DetailsMotivation: Standard flat metrics for Vision-Language Models (VLMs) in chest X-ray classification fail to distinguish between clinically minor and severe errors, potentially masking dangerous abstraction mistakes that could impact patient safety.

Method: Benchmarked state-of-the-art VLMs using hierarchical metrics, introduced Catastrophic Abstraction Errors to capture cross-branch mistakes, proposed risk-constrained thresholding and taxonomy-aware fine-tuning with radial embeddings.
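
Since a Catastrophic Abstraction Error is defined as a cross-branch mistake, the check reduces to comparing top-level taxonomy ancestors; the tiny taxonomy below is illustrative.

```python
# Sketch of the hierarchical check behind "Catastrophic Abstraction Errors":
# an error is catastrophic when prediction and ground truth sit in different
# top-level branches of the taxonomy.
PARENT = {"pneumonia": "lung opacity", "edema": "lung opacity",
          "lung opacity": "finding", "cardiomegaly": "heart", "heart": "finding"}

def branch(label):
    """Top-level ancestor just below the root 'finding'."""
    while PARENT.get(PARENT.get(label)) is not None:
        label = PARENT[label]
    return label

def is_catastrophic(pred, true):
    return pred != true and branch(pred) != branch(true)

print(is_catastrophic("pneumonia", "edema"))        # False: same branch, minor
print(is_catastrophic("pneumonia", "cardiomegaly")) # True: crosses branches
```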

Result: Revealed substantial misalignment of VLMs with clinical taxonomies despite high flat performance; proposed methods reduced severe abstraction errors to below 2% while maintaining competitive overall performance.

Conclusion: Hierarchical evaluation and representation-level alignment are crucial for safer, more clinically meaningful deployment of VLMs in medical imaging, moving beyond flat metrics to capture clinical severity of errors.

Abstract: Vision-Language Models show strong zero-shot performance for chest X-ray classification, but standard flat metrics fail to distinguish between clinically minor and severe errors. This work investigates how to quantify and mitigate abstraction errors by leveraging medical taxonomies. We benchmark several state-of-the-art VLMs using hierarchical metrics and introduce Catastrophic Abstraction Errors to capture cross-branch mistakes. Our results reveal substantial misalignment of VLMs with clinical taxonomies despite high flat performance. To address this, we propose risk-constrained thresholding and taxonomy-aware fine-tuning with radial embeddings, which reduce severe abstraction errors to below 2 per cent while maintaining competitive performance. These findings highlight the importance of hierarchical evaluation and representation-level alignment for safer and more clinically meaningful deployment of VLMs.

[236] Implementing Knowledge Representation and Reasoning with Object Oriented Design

Abdelrhman Bassiouny, Tom Schierenbeck, Sorin Arion, Benjamin Alt, Naren Vasantakumaar, Giang Nguyen, Michael Beetz

Main category: cs.AI

TL;DR: KRROOD is a framework that integrates Knowledge Representation & Reasoning with Object-Oriented Programming by treating knowledge as first-class programming abstractions using native class structures.

DetailsMotivation: There's an integration gap between modern software engineering (using OOP) and KR&R systems, which often rely on external ontologies and specialized languages that are difficult to integrate with imperative code.

Method: KRROOD treats knowledge as a first-class programming abstraction using native class structures, bridging the gap between logic programming and OOP paradigms.
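
A minimal sketch of knowledge as native class structures: the class hierarchy plays the role of taxonomic axioms, so subsumption queries become built-in language operations. The toy ontology is illustrative, not KRROOD’s API.

```python
# Sketch: class hierarchies as taxonomic axioms, so subsumption and
# classification queries are native OOP operations rather than calls into an
# external ontology.
class PhysicalObject: ...
class Container(PhysicalObject): ...
class Cup(Container):
    graspable = True

def can_pour_into(obj):
    # "Reasoning" directly over the class hierarchy.
    return isinstance(obj, Container)

cup = Cup()
print(issubclass(Cup, PhysicalObject))  # subsumption: Cup is a PhysicalObject
print(can_pour_into(cup), cup.graspable)
```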

Result: Experimental evaluation on OWL2Bench benchmark and human-robot task learning scenario shows KRROOD achieves strong performance while supporting expressive reasoning required for real-world autonomous systems.

Conclusion: KRROOD successfully bridges the integration gap between software engineering and KR&R systems by making knowledge a native programming abstraction within OOP frameworks.

Abstract: This paper introduces KRROOD, a framework designed to bridge the integration gap between modern software engineering and Knowledge Representation & Reasoning (KR&R) systems. While Object-Oriented Programming (OOP) is the standard for developing complex applications, existing KR&R frameworks often rely on external ontologies and specialized languages that are difficult to integrate with imperative code. KRROOD addresses this by treating knowledge as a first-class programming abstraction using native class structures, bridging the gap between the logic programming and OOP paradigms. We evaluate the system on the OWL2Bench benchmark and a human-robot task learning scenario. Experimental results show that KRROOD achieves strong performance while supporting the expressive reasoning required for real-world autonomous systems.

[237] To Neuro-Symbolic Classification and Beyond by Compiling Description Logic Ontologies to Probabilistic Circuits

Nicolas Lazzari, Valentina Presutti, Antonio Vergari

Main category: cs.AI

TL;DR: This paper introduces a neuro-symbolic method that compiles Description Logic ontologies into circuits for reliable, ontology-consistent predictions, faster reasoning, and synthetic data generation.

DetailsMotivation: Existing neuro-symbolic methods lack native support for ontologies, limiting their ability to ensure predictions are consistent with formal domain knowledge represented in Description Logic ontologies.

Method: Compile Description Logic ontologies into circuits (feed-forward differentiable computational graphs) that support tractable query execution and transformations. Use circuits for: (i) generating synthetic datasets capturing ontology semantics, (ii) efficient GPU-based deductive reasoning, and (iii) implementing neuro-symbolic models with ontology-consistent predictions.
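
A brute-force stand-in for the compiled circuit, on a one-axiom toy ontology (A ⊑ B): inconsistent worlds are removed, and the remaining weighted model count is a sum of products over classifier outputs, which is the shape a real circuit evaluates tractably. All numbers and names are illustrative.

```python
# Sketch: weighted model counting over ontology-consistent worlds. The axiom
# A ⊑ B removes worlds with A and not B; a compiled circuit would evaluate the
# same sum-of-products without enumerating worlds.
from itertools import product

AXIOMS = [lambda w: not (w["A"] and not w["B"])]   # A is subsumed by B

def wmc(weights, also=lambda w: True):
    total = 0.0
    for a, b in product([0, 1], repeat=2):
        w = {"A": a, "B": b}
        if all(ax(w) for ax in AXIOMS) and also(w):
            total += ((weights["A"] if a else 1 - weights["A"])
                      * (weights["B"] if b else 1 - weights["B"]))
    return total

# Neural classifier outputs (probabilities that the individual is A / B):
p = {"A": 0.9, "B": 0.2}
z = wmc(p)
print("consistent probability mass:", round(z, 3))          # low: conflict
print("P(B | consistent):", round(wmc(p, lambda w: w["B"]) / z, 3))
```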

Result: Synthetic datasets qualitatively capture ontology semantics while being challenging for ML classifiers. Circuit-based reasoning achieves up to 1000x speedup over existing reasoners. Neuro-symbolic classifiers reliably produce consistent predictions compared to neural baselines, maintaining or outperforming their performance.

Conclusion: Compiling ontologies into circuits enables tighter integration between Deep Learning and Knowledge Representation, allowing a single circuit representation to tackle multiple challenging tasks relevant to real-world applications.

Abstract: Background: Neuro-symbolic methods enhance the reliability of neural network classifiers through logical constraints, but they lack native support for ontologies. Objectives: We aim to develop a neuro-symbolic method that reliably outputs predictions consistent with a Description Logic ontology that formalizes domain-specific knowledge. Methods: We encode a Description Logic ontology as a circuit, a feed-forward differentiable computational graph that supports tractable execution of queries and transformations. We show that the circuit can be used to (i) generate synthetic datasets that capture the semantics of the ontology; (ii) efficiently perform deductive reasoning on a GPU; (iii) implement neuro-symbolic models whose predictions are approximately or provably consistent with the knowledge defined in the ontology. Results: We show that the synthetic dataset generated using the circuit qualitatively captures the semantics of the ontology while being challenging for Machine Learning classifiers, including neural networks. Moreover, we show that compiling the ontology into a circuit is a promising approach for scalable deductive reasoning, with runtimes up to three orders of magnitude faster than available reasoners. Finally, we show that our neuro-symbolic classifiers reliably produce consistent predictions when compared to neural network baselines, maintaining competitive performance or even outperforming them. Conclusions: By compiling Description Logic ontologies into circuits, we obtain a tighter integration between the Deep Learning and Knowledge Representation fields. We show that a single circuit representation can be used to tackle different challenging tasks closely related to real-world applications.

[238] Just aware enough: Evaluating awareness across artificial systems

Nadine Meertens, Suet Lee, Ophelia Deroy

Main category: cs.AI

TL;DR: The paper proposes shifting focus from AI consciousness debates to evaluating “awareness” as a more practical, measurable alternative for assessing AI systems’ capabilities.

DetailsMotivation: Current debates about AI consciousness and moral status lack methodological agreement, making them unproductive for practical assessment and oversight of diverse AI systems.

Method: Introduces a domain-sensitive, scalable, multidimensional framework for evaluating awareness profiles across AI systems, where awareness is defined as information processing, storage, and use for goal-directed action.

Result: Proposes a structured approach that enables comparison of awareness across different AI architectures, scales, and domains while predicting task performance.

Conclusion: Focusing on “being just aware enough” rather than consciousness enables more practical assessment, supports design/oversight, and facilitates constructive scientific and public discourse about AI capabilities.

Abstract: Recent debates on artificial intelligence increasingly emphasise questions of AI consciousness and moral status, yet there remains little agreement on how such properties should be evaluated. In this paper, we argue that awareness offers a more productive and methodologically tractable alternative. We introduce a practical method for evaluating awareness across diverse systems, where awareness is understood as encompassing a system’s abilities to process, store and use information in the service of goal-directed action. Central to this approach is the claim that any evaluation aiming to capture the diversity of artificial systems must be domain-sensitive, deployable at any scale, multidimensional, and enable the prediction of task performance, while generalising to the level of abilities for the sake of comparison. Given these four desiderata, we outline a structured approach to evaluating and comparing awareness profiles across artificial systems with differing architectures, scales, and operational domains. By shifting the focus from artificial consciousness to being just aware enough, this approach aims to facilitate principled assessment, support design and oversight, and enable more constructive scientific and public discourse.

[239] Multi-Behavior Sequential Modeling with Transition-Aware Graph Attention Network for E-Commerce Recommendation

Hanqi Jin, Gaoming Yang, Zhangming Chan, Yapeng Yuan, Longbin Li, Fei Sun, Yeqiu Yang, Jian Wu, Yuning Jiang, Bo Zheng

Main category: cs.AI

TL;DR: TGA is a linear-complexity model for multi-behavior sequential recommendation that uses structured sparse graphs to capture informative transitions between user behaviors while reducing computational costs.

DetailsMotivation: Existing transformer-based approaches for multi-behavior sequential modeling have high computational costs (polynomial time complexity), limiting their applicability in large-scale industrial systems with long user sequences.

Method: Transition-Aware Graph Attention Network (TGA) constructs structured sparse graphs from three perspectives: item-level transitions, category-level transitions, and neighbor-level transitions, then uses a transition-aware graph attention mechanism to jointly model user-item interactions and behavior transition types.
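
A toy sketch of the structured sparsification: keep only pairs passing one of the three transition filters, then run softmax attention over the surviving edges rather than all pairs. The features and filters are illustrative simplifications of the paper’s design.

```python
# Sketch: attention restricted to a structured sparse edge set built from the
# three transition filters (item-level, category-level, neighbor-level).
import math

seq = [  # (item, category, behavior) per interaction, in time order
    ("i1", "shoes", "click"), ("i2", "shoes", "click"),
    ("i1", "shoes", "cart"), ("i3", "books", "click"),
]

def build_edges(seq):
    edges = set()
    for a in range(len(seq)):
        for b in range(a + 1, len(seq)):
            same_item = seq[a][0] == seq[b][0]      # item-level transition
            same_cat = seq[a][1] == seq[b][1]       # category-level transition
            adjacent = b == a + 1                   # neighbor-level transition
            if same_item or same_cat or adjacent:
                edges.add((a, b))
    return edges

def sparse_attention(scores, edges, query):
    """Softmax only over targets connected to `query` by a kept edge."""
    keep = [(j, scores[j]) for (i, j) in edges if i == query]
    z = sum(math.exp(s) for _, s in keep) or 1.0
    return {j: math.exp(s) / z for j, s in keep}

edges = build_edges(seq)
print(len(edges), "of", len(seq) * (len(seq) - 1) // 2, "possible pairs kept")
print(sparse_attention(scores={1: 0.5, 2: 2.0, 3: 0.1}, edges=edges, query=0))
```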

Result: TGA outperforms all state-of-the-art models while significantly reducing computational cost, and has been successfully deployed in a large-scale industrial production environment with impressive improvements in key business metrics.

Conclusion: TGA provides an efficient and effective solution for multi-behavior sequential recommendation that balances accuracy with computational efficiency, making it suitable for real-world industrial deployment.

Abstract: User interactions on e-commerce platforms are inherently diverse, involving behaviors such as clicking, favoriting, adding to cart, and purchasing. The transitions between these behaviors offer valuable insights into user-item interactions, serving as a key signal for understanding evolving preferences. Consequently, there is growing interest in leveraging multi-behavior data to better capture user intent. Recent studies have explored sequential modeling of multi-behavior data, many relying on transformer-based architectures with polynomial time complexity. While effective, these approaches often incur high computational costs, limiting their applicability in large-scale industrial systems with long user sequences. To address this challenge, we propose the Transition-Aware Graph Attention Network (TGA), a linear-complexity approach for modeling multi-behavior transitions. Unlike traditional transformers that treat all behavior pairs equally, TGA constructs a structured sparse graph by identifying informative transitions from three perspectives: (a) item-level transitions, (b) category-level transitions, and (c) neighbor-level transitions. Built upon the structured graph, TGA employs a transition-aware graph attention mechanism that jointly models user-item interactions and behavior transition types, enabling more accurate capture of sequential patterns while maintaining computational efficiency. Experiments show that TGA outperforms all state-of-the-art models while significantly reducing computational cost. Notably, TGA has been deployed in a large-scale industrial production environment, where it leads to impressive improvements in key business metrics.

[240] Emergent, not Immanent: A Baradian Reading of Explainable AI

Fabio Morreale, Joan Serrà, Yuki Mitsufuji

Main category: cs.AI

TL;DR: The paper critiques current XAI approaches for treating meaning as inherent to AI models and proposes an alternative framework using Barad’s agential realism, viewing interpretations as emergent from human-AI-context entanglements.

DetailsMotivation: Current XAI approaches are limited by unexamined onto-epistemological assumptions: treating meaning as immanent to models, positioning explainers outside systems, and presuming recoverable causal structures through computation.

Method: The authors use Barad’s agential realism to develop an alternative XAI framework, analyze existing XAI methods through this lens, articulate ethical dimensions, and propose design directions with a speculative text-to-music interface case study.

Result: The paper reveals assumptions and limitations of current XAI methods, develops a framework where interpretations emerge from situated entanglements, and provides ethical and design implications for XAI interfaces.

Conclusion: XAI should move beyond technical explanations to recognize interpretations as material-discursive performances emerging from human-AI-context entanglements, with implications for ethical design and interface development.

Abstract: Explainable AI (XAI) is frequently positioned as a technical problem of revealing the inner workings of an AI model. This position is affected by unexamined onto-epistemological assumptions: meaning is treated as immanent to the model, the explainer is positioned outside the system, and a causal structure is presumed recoverable through computational techniques. In this paper, we draw on Barad’s agential realism to develop an alternative onto-epistemology of XAI. We propose that interpretations are material-discursive performances that emerge from situated entanglements of the AI model with humans, context, and the interpretative apparatus. To develop this position, we read a comprehensive set of XAI methods through agential realism and reveal the assumptions and limitations that underpin several of these methods. We then articulate the framework’s ethical dimension and propose design directions for XAI interfaces that support emergent interpretation, using a speculative text-to-music interface as a case study.

[241] The Responsibility Vacuum: Organizational Failure in Scaled Agent Systems

Oleg Romanchuk, Roman Bondar

Main category: cs.AI

TL;DR: Modern CI/CD pipelines with AI-generated code create a “responsibility vacuum” where no one has both authority to approve decisions and capacity to understand them, making personalized responsibility structurally impossible at scale.

DetailsMotivation: The paper addresses a critical gap in modern software development: as AI agents generate code faster than humans can verify, organizations face a structural failure where approval processes become formal rituals without meaningful understanding, creating accountability gaps.

Method: The authors define “responsibility vacuum” as a structural condition, analyze scaling limits under standard deployment assumptions (parallel agent generation, CI validation, human approval gates), and characterize CI amplification dynamics where increased automation worsens the problem.
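
The scaling limit reduces to a capacity inequality, sketched below with invented numbers: once decision throughput times honest verification time exceeds total reviewer capacity, approval can only be ritual.

```python
# Toy version of the scaling limit: verification stops binding once agent
# decision throughput exceeds bounded human review capacity. All numbers are
# illustrative assumptions, not measurements from the paper.
decisions_per_day = 400          # agent-generated changes hitting the gate
minutes_to_verify = 20           # honest epistemic review per change
reviewers, minutes_available = 5, 6 * 60   # review time per reviewer per day

demand = decisions_per_day * minutes_to_verify
capacity = reviewers * minutes_available
print(f"demand={demand} min/day vs capacity={capacity} min/day")
print("responsibility vacuum" if demand > capacity else "verification binds")
```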

Result: The analysis reveals that beyond a throughput threshold, verification becomes impossible and is replaced by ritualized approval based on proxy signals. Personalized responsibility becomes structurally unattainable, and additional automation amplifies rather than mitigates the vacuum.

Conclusion: Organizations must explicitly redesign decision boundaries or reassign responsibility from individual decisions to batch- or system-level ownership, otherwise responsibility vacuum remains an invisible but persistent failure mode in scaled AI deployments.

Abstract: Modern CI/CD pipelines integrating agent-generated code exhibit a structural failure in responsibility attribution. Decisions are executed through formally correct approval processes, yet no entity possesses both the authority to approve those decisions and the epistemic capacity to meaningfully understand their basis. We define this condition as responsibility vacuum: a state in which decisions occur, but responsibility cannot be attributed because authority and verification capacity do not coincide. We show that this is not a process deviation or technical defect, but a structural property of deployments where decision generation throughput exceeds bounded human verification capacity. We identify a scaling limit under standard deployment assumptions, including parallel agent generation, CI-based validation, and individualized human approval gates. Beyond a throughput threshold, verification ceases to function as a decision criterion and is replaced by ritualized approval based on proxy signals. Personalized responsibility becomes structurally unattainable in this regime. We further characterize a CI amplification dynamic, whereby increasing automated validation coverage raises proxy signal density without restoring human capacity. Under fixed time and attention constraints, this accelerates cognitive offloading in the broad sense and widens the gap between formal approval and epistemic understanding. Additional automation therefore amplifies, rather than mitigates, the responsibility vacuum. We conclude that unless organizations explicitly redesign decision boundaries or reassign responsibility away from individual decisions toward batch- or system-level ownership, responsibility vacuum remains an invisible but persistent failure mode in scaled agent deployments.

[242] The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution

Chen Qian, Peng Wang, Dongrui Liu, Junyao Yang, Dadi Guo, Ling Tang, Jilin Mei, Qihan Ren, Shuai Shao, Yong Liu, Jie Fu, Jing Shao, Xia Hu

Main category: cs.AI

TL;DR: A novel framework for general agentic attribution that identifies internal factors driving LLM agent actions, using hierarchical temporal likelihood dynamics and perturbation-based analysis to pinpoint critical historical events and textual evidence.

DetailsMotivation: As LLM-based agents become more autonomous and deployed at scale, understanding why agents take particular actions is crucial for accountability and governance. Existing research focuses only on failure attribution for unsuccessful trajectories, which is insufficient for explaining reasoning behind agent behaviors.

Method: Hierarchical framework with two levels: (1) Component level - uses temporal likelihood dynamics to identify critical interaction steps; (2) Sentence level - refines localization using perturbation-based analysis to isolate specific textual evidence.
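
A leave-one-out proxy for both attribution levels, with a mock scorer standing in for real model likelihoods: drops in the final action’s score when a step (then a segment within the critical step) is ablated localize the driver. The scorer, segmentation, and ablation-for-both-levels simplification are assumptions.

```python
# Sketch of the two-level attribution loop: (1) component level — ablate each
# historical step and measure the likelihood drop of the final action;
# (2) sentence level — repeat within the most critical step.
def action_logprob(history, action):
    """Mock scorer standing in for a model's log P(action | history)."""
    score = -5.0 if action == "refund" else -1.0
    text = " ".join(history)
    if "policy" in text:
        score += 3.0
    if "30 days" in text:
        score += 1.5
    return score

history = ["user asks about a late delivery",
           "memory note: refund policy applies within 30 days",
           "tool result: order shipped 12 days ago"]
action = "refund"

base = action_logprob(history, action)
drops = [base - action_logprob(history[:i] + history[i + 1:], action)
         for i in range(len(history))]
critical = max(range(len(history)), key=lambda i: drops[i])
print("critical step:", critical, "|", history[critical])

segments = history[critical].split(": ")
seg_drops = [base - action_logprob(
    history[:critical]
    + [" ".join(s for j, s in enumerate(segments) if j != i)]
    + history[critical + 1:], action) for i in range(len(segments))]
print("critical segment:",
      segments[max(range(len(segments)), key=lambda i: seg_drops[i])])
```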

Result: The framework reliably pinpoints pivotal historical events and sentences behind agent behavior across diverse agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.

Conclusion: The proposed general agentic attribution framework offers a critical step toward safer and more accountable agentic systems by moving beyond failure attribution to explain reasoning behind agent actions regardless of task outcome.

Abstract: Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. As these systems become more autonomous and are deployed at scale, understanding why an agent takes a particular action becomes increasingly important for accountability and governance. However, existing research predominantly focuses on “failure attribution” to localize explicit errors in unsuccessful trajectories, which is insufficient for explaining the reasoning behind agent behaviors. To bridge this gap, we propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. Our framework operates hierarchically to manage the complexity of agent interactions. Specifically, at the component level, we employ temporal likelihood dynamics to identify critical interaction steps; then at the sentence level, we refine this localization using perturbation-based analysis to isolate the specific textual evidence. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias. Experimental results demonstrate that the proposed framework reliably pinpoints pivotal historical events and sentences behind the agent behavior, offering a critical step toward safer and more accountable agentic systems.

[243] Emerging from Ground: Addressing Intent Deviation in Tool-Using Agents via Deriving Real Calls into Virtual Trajectories

Qian Xiong, Yuekai Huang, Yujia Zheng, Tianhao Li, Ziyou Jiang, Zhiyuan Chang, Zhaoyang Li, Huanxiang Feng, Mingyang Li

Main category: cs.AI

TL;DR: RISE: A Real-to-Virtual method that synthesizes training data to reduce intent deviation in LLM tool-using agents, achieving significant improvements in task completion and intent alignment.

DetailsMotivation: LLM tool-using agents suffer from intent deviation - subtle misalignments between user intent and agent behavior. Existing methods are either expensive (real system samples) or suffer from distribution shift (LLM-simulated data), and both lack negative samples for intent deviation scenarios.

Method: RISE uses a “Real-to-Virtual” approach anchored on verified tool primitives to synthesize virtual trajectories and generate diverse negative samples through parameter mutation. It then fine-tunes backbone LLMs via two-stage training for intent alignment using this synthetic data.
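
As an illustration of the mutation step, here is a minimal sketch: starting from a verified tool call, mutate one critical parameter so the resulting trajectory no longer matches the user's intent, yielding a negative sample for preference-style training. The field names and value pool are illustrative assumptions, not RISE's actual schema.

```python
import copy
import random

def mutate_tool_call(tool_call, critical_params, value_pool):
    negative = copy.deepcopy(tool_call)
    param = random.choice(critical_params)
    alternatives = [v for v in value_pool[param] if v != negative["args"][param]]
    negative["args"][param] = random.choice(alternatives)
    return negative

positive = {"tool": "book_flight", "args": {"date": "2026-02-01", "seat": "economy"}}
pool = {"date": ["2026-02-01", "2026-02-02"], "seat": ["economy", "business"]}
negative = mutate_tool_call(positive, ["date", "seat"], pool)
# (positive, negative) pairs can then feed two-stage preference fine-tuning.
```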

Result: RISE achieves 35.28% average improvement in task completion (Acc_task) and 23.27% improvement in intent alignment (Acc_intent), outperforming SOTA baselines by 1.20–42.09% and 1.17–54.93% respectively across eight evaluation metrics.

Conclusion: RISE effectively addresses intent deviation in LLM tool-using agents through synthetic data generation and two-stage training, providing a cost-effective solution that outperforms existing methods while maintaining alignment with real-world tool distributions.

Abstract: LLMs have advanced tool-using agents for real-world applications, yet they often lead to unexpected behaviors or results. Beyond obvious failures, the subtle issue of “intent deviation” severely hinders reliable evaluation and performance improvement. Existing post-training methods generally leverage either real system samples or virtual data simulated by LLMs. However, the former is costly due to reliance on hand-crafted user requests, while the latter suffers from distribution shift from the real tools in the wild. Additionally, both methods lack negative samples tailored to intent deviation scenarios, hindering effective guidance on preference learning. We introduce RISE, a “Real-to-Virtual” method designed to mitigate intent deviation. Anchoring on verified tool primitives, RISE synthesizes virtual trajectories and generates diverse negative samples through mutation on critical parameters. With synthetic data, RISE fine-tunes backbone LLMs via two-stage training for intent alignment. Evaluation results demonstrate that data synthesized by RISE achieves promising results across eight metrics covering user requests, execution trajectories, and agent responses. Integrating with training, RISE achieves an average 35.28% improvement in Acc_task (task completion) and 23.27% in Acc_intent (intent alignment), outperforming SOTA baselines by 1.20–42.09% and 1.17–54.93% respectively.

[244] The Plausibility Trap: Using Probabilistic Engines for Deterministic Tasks

Ivan Carrera, Daniel Maldonado-Ruiz

Main category: cs.AI

TL;DR: The paper identifies the “Plausibility Trap” - using expensive AI models for simple deterministic tasks, causing resource waste. It quantifies the efficiency tax (~6.5x latency penalty) and proposes a framework for when to use/avoid generative AI.

DetailsMotivation: The motivation is to address the growing problem where people misuse expensive probabilistic AI models for simple deterministic tasks due to convenience, leading to significant computational inefficiency and resource waste.

Method: The authors use micro-benchmarks and case studies on OCR and fact-checking to quantify the “efficiency tax.” They introduce Tool Selection Engineering and the Deterministic-Probabilistic Decision Matrix as frameworks for appropriate tool selection.
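
A minimal sketch of the tool-selection idea: route a task to a deterministic tool when one exists and the task needs no open-ended generation, falling back to a generative model otherwise. The routing rules and tool table below are illustrative assumptions, not the paper's exact decision matrix.

```python
DETERMINISTIC_TOOLS = {
    "ocr": "tesseract",        # specialized OCR engine
    "checksum": "sha256sum",   # exact verification
    "date_math": "dateutil",   # deterministic computation
}

def select_engine(task_type: str, needs_open_ended_generation: bool) -> str:
    if not needs_open_ended_generation and task_type in DETERMINISTIC_TOOLS:
        return DETERMINISTIC_TOOLS[task_type]   # cheap, exact, auditable
    return "llm"                                # probabilistic engine as last resort

assert select_engine("ocr", needs_open_ended_generation=False) == "tesseract"
assert select_engine("summarize", needs_open_ended_generation=True) == "llm"
```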

Result: Results show a ~6.5x latency penalty when using generative AI for deterministic tasks compared to specialized tools. The paper demonstrates significant resource waste and risks of algorithmic sycophancy.

Conclusion: The conclusion advocates for a curriculum shift emphasizing that true digital literacy requires knowing both how to use generative AI and when to avoid it, promoting more efficient tool selection practices.

Abstract: The ubiquity of Large Language Models (LLMs) is driving a paradigm shift where user convenience supersedes computational efficiency. This article defines the “Plausibility Trap”: a phenomenon where individuals with access to Artificial Intelligence (AI) models deploy expensive probabilistic engines for simple deterministic tasks, such as Optical Character Recognition (OCR) or basic verification, resulting in significant resource waste. Through micro-benchmarks and case studies on OCR and fact-checking, we quantify the “efficiency tax”, demonstrating a ~6.5x latency penalty, and the risks of algorithmic sycophancy. To counter this, we introduce Tool Selection Engineering and the Deterministic-Probabilistic Decision Matrix, a framework to help developers determine when to use Generative AI and, crucially, when to avoid it. We argue for a curriculum shift, emphasizing that true digital literacy relies not only on knowing how to use Generative AI, but also on knowing when not to use it.

[245] Vehicle Routing with Finite Time Horizon using Deep Reinforcement Learning with Improved Network Embedding

Ayan Maity, Sudeshna Sarkar

Main category: cs.AI

TL;DR: A deep reinforcement learning approach for vehicle routing with finite time horizon that maximizes served customer requests using novel network embeddings.

DetailsMotivation: To address the vehicle routing problem with finite time horizon where the objective is to maximize the number of customer requests served within limited time, overcoming limitations of existing methods.

Method: Proposes a novel routing network embedding module with local node embeddings and context-aware global graph representation, integrated with a policy gradient-based deep reinforcement learning framework. The MDP incorporates node features, network adjacency matrix, edge features, and remaining time horizon into the state space.
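
An illustrative sketch of the MDP state described above: node features, adjacency matrix, edge features, with the normalized remaining time horizon folded into each node's features so the embedding module sees the routing context. Array shapes and the broadcast scheme are assumptions for illustration.

```python
import numpy as np

def build_state(node_feats, adjacency, edge_feats, remaining_horizon, horizon):
    """node_feats: (N, F); adjacency: (N, N); edge_feats: (N, N, E)."""
    time_ctx = remaining_horizon / horizon              # normalized to [0, 1]
    n = node_feats.shape[0]
    # Append the normalized remaining time to every node's feature vector.
    time_col = np.full((n, 1), time_ctx)
    return {
        "nodes": np.concatenate([node_feats, time_col], axis=1),  # (N, F + 1)
        "adjacency": adjacency,
        "edges": edge_feats,
    }

state = build_state(np.random.rand(5, 3), np.eye(5), np.random.rand(5, 5, 2),
                    remaining_horizon=40.0, horizon=100.0)
print(state["nodes"].shape)  # (5, 4)
```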

Result: Achieves higher customer service rate than existing routing methods on both real-world and synthetic Euclidean networks, with significantly lower solution time.

Conclusion: The proposed deep reinforcement learning approach with time-aware network embeddings effectively solves finite time horizon vehicle routing problems, outperforming existing methods in both service rate and computational efficiency.

Abstract: In this paper, we study the vehicle routing problem with a finite time horizon. In this routing problem, the objective is to maximize the number of customer requests served within a finite time horizon. We present a novel routing network embedding module which creates local node embedding vectors and a context-aware global graph representation. The proposed Markov decision process for the vehicle routing problem incorporates the node features, the network adjacency matrix and the edge features as components of the state space. We incorporate the remaining finite time horizon into the network embedding module to provide a proper routing context to the embedding module. We integrate our embedding module with a policy gradient-based deep Reinforcement Learning framework to solve the vehicle routing problem with finite time horizon. We trained and validated our proposed routing method on real-world routing networks, as well as synthetically generated Euclidean networks. Our experimental results show that our method achieves a higher customer service rate than the existing routing methods. Additionally, the solution time of our method is significantly lower than that of the existing methods.

[246] How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework

Choro Ulan uulu, Mikhail Kulyabin, Iris Fuhrmann, Jan Joosten, Nuno Miguel Martins Pacheco, Filippos Petridis, Rebecca Johnson, Jan Bosch, Helena Holmström Olsson

Main category: cs.AI

TL;DR: AI agent framework captures expert domain knowledge to help non-experts create expert-level visualizations, achieving 206% quality improvement over baseline.

DetailsMotivation: Critical domain knowledge bottleneck with few experts, non-experts struggle with visualization creation, leading to suboptimal insights and wasted expert time.

Method: Software engineering framework with LLM augmented by request classifier, RAG system for code generation, codified expert rules, and visualization design principles in autonomous agent.

Result: 206% improvement in output quality, expert-level ratings in all cases vs baseline’s poor performance, superior code quality with lower variance across 5 scenarios with 12 evaluators.

Conclusion: Framework successfully codifies human domain knowledge into AI agents, enabling non-experts to achieve expert-level outcomes in specialized visualization domains.

Abstract: Critical domain knowledge typically resides with few experts, creating organizational bottlenecks in scalability and decision-making. Non-experts struggle to create effective visualizations, leading to suboptimal insights and diverting expert time. This paper investigates how to capture and embed human domain knowledge into AI agent systems through an industrial case study. We propose a software engineering framework to capture human domain knowledge for engineering AI agents in simulation data visualization by augmenting a Large Language Model (LLM) with a request classifier, Retrieval-Augmented Generation (RAG) system for code generation, codified expert rules, and visualization design principles unified in an agent demonstrating autonomous, reactive, proactive, and social behavior. Evaluation across five scenarios spanning multiple engineering domains with 12 evaluators demonstrates 206% improvement in output quality, with our agent achieving expert-level ratings in all cases versus baseline’s poor performance, while maintaining superior code quality with lower variance. Our contributions are: an automated agent-based system for visualization generation and a validated framework for systematically capturing human domain knowledge and codifying tacit expert knowledge into AI agents, demonstrating that non-experts can achieve expert-level outcomes in specialized domains.

[247] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

Yuval Kansal, Niraj K. Jha

Main category: cs.AI

TL;DR: A bottom-up learning paradigm using knowledge graphs as implicit reward models enables LLMs to perform compositional multi-hop reasoning in specialized domains like medicine, outperforming larger frontier models on complex tasks.

DetailsMotivation: LLMs excel in structured domains like math but struggle with compositional multi-hop reasoning in specialized scientific fields, needing better ways to ground reasoning in domain knowledge.

Method: Post-training pipeline combining supervised fine-tuning and RL, where knowledge graphs act as implicit reward models. Novel reward signals derived from knowledge graph paths encourage composing intermediate axioms rather than optimizing only final answers.
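
A hedged sketch of one way such a path-derived reward could look: reward the fraction of gold knowledge-graph edges that the model's reasoning trace actually traverses, blended with the sparse final-answer signal. Extracting edges from the model's chain of thought is the assumed component here; the blending rule is illustrative, not the paper's formula.

```python
def path_reward(reasoning_edges, gold_path, final_correct, alpha=0.5):
    """reasoning_edges: set of (head, relation, tail) triples mined from the
    model's reasoning; gold_path: ordered gold triples from the KG."""
    covered = sum(1 for edge in gold_path if edge in reasoning_edges)
    composition = covered / len(gold_path)       # dense, verifiable signal
    outcome = 1.0 if final_correct else 0.0      # sparse terminal signal
    return alpha * composition + (1 - alpha) * outcome

gold = [("drugA", "inhibits", "enzymeB"), ("enzymeB", "regulates", "pathwayC")]
trace = {("drugA", "inhibits", "enzymeB")}
print(path_reward(trace, gold, final_correct=True))  # 0.75
```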

Result: 14B model trained on short-hop reasoning (1-3 hops) achieves zero-shot generalization to complex multi-hop queries (4-5 hops), outperforming larger models and frontier systems like GPT-5.2 and Gemini 3 Pro on difficult reasoning tasks. Robust to adversarial perturbations.

Conclusion: Grounding reasoning processes in structured knowledge via knowledge graph path rewards provides a scalable and efficient path toward intelligent compositional reasoning in specialized domains.

Abstract: Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a “compositional bridge”, enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.

[248] BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

Main category: cs.AI

TL;DR: BayesianVLA addresses information collapse in VLA models by using Bayesian decomposition to enforce instruction following, improving generalization without new data.

DetailsMotivation: Current VLA models struggle with generalization to new instructions or multi-task scenarios due to dataset bias where language instructions become predictable from vision alone, causing information collapse where models ignore language constraints.

Method: Proposes BayesianVLA framework with learnable Latent Action Queries and dual-branch architecture to estimate vision-only prior p(a|v) and language-conditioned posterior π(a|v,ℓ). Optimizes policy to maximize conditional Pointwise Mutual Information between actions and instructions, penalizing vision shortcuts.
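
The PMI objective admits a compact sketch: maximize log π(a|v,ℓ) − log p(a|v) over the taken actions, which penalizes any action the vision-only prior already predicts. This assumes two discrete action heads (`post_logits` language-conditioned, `prior_logits` vision-only); shapes, names, and the prior detach are illustrative choices, not the paper's exact implementation.

```python
import torch

def pmi_loss(post_logits, prior_logits, actions):
    log_post = torch.log_softmax(post_logits, dim=-1)    # log pi(a | v, l)
    log_prior = torch.log_softmax(prior_logits, dim=-1)  # log p(a | v)
    lp = log_post.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    lq = log_prior.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Maximizing PMI = log pi(a|v,l) - log p(a|v) penalizes the vision shortcut;
    # the prior is detached so gradients only shape the conditioned policy.
    return -(lp - lq.detach()).mean()

post = torch.randn(4, 16)    # batch of 4, 16 discrete action bins
prior = torch.randn(4, 16)
acts = torch.randint(0, 16, (4,))
print(pmi_loss(post, prior, acts))
```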

Result: Significantly improves generalization across SimplerEnv and RoboCasa benchmarks, achieving 11.3% improvement on challenging OOD SimplerEnv benchmark, demonstrating robust language grounding in action.

Conclusion: BayesianVLA effectively addresses information collapse in VLA training by enforcing instruction following through Bayesian decomposition, enabling models to better generalize to out-of-distribution settings without requiring additional data collection.

Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.

[249] Scalable Anytime Algorithms for Learning Fragments of Linear Temporal Logic

Ritam Raha, Rajarshi Roy, Nathanaël Fijalkow, Daniel Neider

Main category: cs.AI

TL;DR: New anytime algorithm for learning LTL formulas from traces that scales to much larger formulas than previous methods and guarantees output.

DetailsMotivation: Existing LTL formula learning methods have two major limitations: they don't scale beyond small formulas, and they may fail to return any result due to computational exhaustion.

Method: Introduces a new anytime algorithm that can construct formulas an order of magnitude larger than previous methods, with guaranteed output (though not necessarily minimal size).

Result: Algorithm evaluated using open source implementation against public benchmarks, showing ability to handle much larger formulas than previous approaches.

Conclusion: The new algorithm addresses scalability and reliability issues in LTL formula learning, making it practical for real-world applications in program verification, robotics, and process mining.

Abstract: Linear temporal logic (LTL) is a specification language for finite sequences (called traces) widely used in program verification, motion planning in robotics, process mining, and many other areas. We consider the problem of learning LTL formulas for classifying traces; despite a growing interest of the research community, existing solutions suffer from two limitations: they do not scale beyond small formulas, and they may exhaust computational resources without returning any result. We introduce a new algorithm addressing both issues: our algorithm is able to construct formulas an order of magnitude larger than previous methods, and it is anytime, meaning that in most cases it successfully outputs a formula, albeit possibly not of minimal size. We evaluate the performance of our algorithm using an open source implementation against publicly available benchmarks.

[250] Adaptive Task Balancing for Visual Instruction Tuning via Inter-Task Contribution and Intra-Task Difficulty

Yanqi Dai, Yong Wang, Zebin You, Dong Jing, Xiangxiang Chu, Zhiwu Lu

Main category: cs.AI

TL;DR: VisATB: Adaptive Task Balancing approach for visual instruction tuning that addresses performance imbalance by measuring inter-task contributions and intra-task difficulty to prioritize tasks strategically.

DetailsMotivation: Visual instruction tuning often leads to suboptimal and imbalanced overall performance when learning multiple visual tasks simultaneously due to latent knowledge conflicts across tasks.

Method: Proposes Adaptive Task Balancing (VisATB) that measures: 1) Inter-Task Contribution (how learning one task enhances others), and 2) Intra-Task Difficulty (inherent learning difficulty). Prioritizes three task categories: tasks that contribute substantially to others, tasks that receive minimal contributions from others, and tasks with high learning difficulties.
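
An illustrative sketch of the weighting idea (see below): from a contribution matrix C, where C[i, j] is the validation gain on task j attributable to training on task i, and a difficulty vector d, upweight tasks that give much to others, receive little from others, or are intrinsically hard. The combination rule here is an assumption, not VisATB's actual formula.

```python
import numpy as np

def task_weights(C, d):
    gives = C.sum(axis=1) - np.diag(C)      # contribution offered to others
    receives = C.sum(axis=0) - np.diag(C)   # contribution received from others
    raw = gives + (receives.max() - receives) + d
    return raw / raw.sum()                  # normalized sampling/loss weights

C = np.array([[1.0, 0.3, 0.1],
              [0.2, 1.0, 0.0],
              [0.0, 0.1, 1.0]])
d = np.array([0.2, 0.5, 0.9])               # intra-task difficulty estimates
print(task_weights(C, d))
```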

Result: Extensive experiments on three benchmarks demonstrate that VisATB consistently achieves superior and more balanced overall performance in visual instruction tuning.

Conclusion: VisATB effectively addresses performance imbalance in visual instruction tuning through adaptive task balancing based on inter-task contributions and intra-task difficulty, leading to improved overall performance.

Abstract: Visual instruction tuning is a key training stage of large multimodal models. However, when learning multiple visual tasks simultaneously, this approach often results in suboptimal and imbalanced overall performance due to latent knowledge conflicts across tasks. To mitigate this issue, we propose a novel Adaptive Task Balancing approach tailored for visual instruction tuning (VisATB). Specifically, we measure two critical dimensions for visual task balancing based on validation performance: (1) Inter-Task Contribution, the mechanism where learning one task enhances the performance on others owing to shared knowledge across tasks, and (2) Intra-Task Difficulty, which denotes the inherent learning difficulty of a single task. Furthermore, we propose prioritizing three categories of tasks with greater weight: those that offer substantial contributions to others, those that receive minimal contributions from others, and those that present high learning difficulties. Among these three task weighting strategies, the first and third focus on improving overall performance, and the second targets the mitigation of performance imbalance. Extensive experiments on three benchmarks demonstrate that our VisATB approach consistently achieves superior and more balanced overall performance in visual instruction tuning. The data, code, and models are available at https://github.com/YanqiDai/VisATB.

[251] Sora as a World Model? A Complete Survey on Text-to-Video Generation

Fachrina Dewi Puspitasari, Chaoning Zhang, Joseph Cho, Adnan Haider, Noor Ul Eman, Omer Amin, Alexis Mankowski, Muhammad Umair, Jingyao Zheng, Sheng Zheng, Lik-Hang Lee, Caiyan Qin, Tae-Ho Kim, Choong Seon Hong, Yang Yang, Heng Tao Shen

Main category: cs.AI

TL;DR: Text-to-video generation increasingly supports world modeling through spatial, action, and strategic intelligences, but faces trade-offs between diversity and consistency.

DetailsMotivation: To systematically assess how far text-to-video generation technology supports essential requirements in world modeling, given the rapid evolution from simple animations to complex simulations like Sora.

Method: Curated 250+ studies on text-based video synthesis and world modeling, then analyzed how recent models support spatial, action, and strategic intelligences through adherence to completeness, consistency, invention, human interaction, and control.

Result: Recent text-to-video models increasingly support essential world modeling requirements through spatial, action, and strategic intelligences, demonstrating progress in completeness, consistency, invention, and human interaction/control.

Conclusion: Text-to-video generation is adept at world modeling, but significant challenges remain in addressing diversity-consistency trade-offs and other aspects that need further development.

Abstract: The evolution of video generation from text, from animating MNIST to simulating the world with Sora, has progressed at a breakneck speed. Here, we systematically discuss how far text-to-video generation technology supports essential requirements in world modeling. We curate 250+ studies on text-based video synthesis and world modeling. We then observe that recent models increasingly support spatial, action, and strategic intelligences in world modeling through adherence to completeness, consistency, invention, as well as human interaction and control. We conclude that text-to-video generation is adept at world modeling, although open challenges in several aspects, such as the diversity-consistency trade-off, remain to be addressed.

[252] Architectural Scaling Surpass Basis Complexity? Efficient KANs with Single-Parameter Design

Zhijie Chen, Xinglin Zhang, Hongshu Guo, Yue-Jiao Gong

Main category: cs.AI

TL;DR: This paper introduces Uni-KAN framework to unify KAN networks, proposes EKE Hypothesis for efficient architecture design, and presents SKANs - ultra-lightweight networks that achieve state-of-the-art performance with faster training.

DetailsMotivation: The landscape of Kolmogorov-Arnold Networks lacks unified theoretical framework and clear principles for efficient architecture design, creating fragmentation in research and development.

Method: 1) Introduces Universal KAN (Uni-KAN) framework with dense/sparse representations; 2) Proposes Efficient KAN Expansion (EKE) Hypothesis favoring architectural scaling over basis function complexity; 3) Develops Single-Parameter KANs (SKANs) implementing EKE principles.

Result: SKANs achieve state-of-the-art performance: up to 6.51% F1 score improvement, 93.1% test loss reduction, and 6x faster training speeds compared to existing KAN variants. First empirical validation of basis function smoothness necessity for stable training.

Conclusion: The paper establishes robust framework (Uni-KAN), guiding hypothesis (EKE), and practical methodology (SKANs) for designing next-generation efficient neural networks, unifying KAN research and enabling more effective architecture design.

Abstract: The landscape of Kolmogorov-Arnold Networks (KANs) is rapidly expanding, yet lacks a unified theoretical framework and a clear principle for efficient architecture design. This paper addresses these gaps with three core contributions. First, we introduce the Universal KAN (Uni-KAN) framework, a novel abstraction that formally unifies all KAN-style networks through dense and sparse representations. We prove their interchangeability and provide an open-source library for this framework, facilitating future research. Second, we propose the Efficient KAN Expansion (EKE) Hypothesis, a design philosophy positing that allocating parameters to architectural scaling rather than basis function complexity yields superior performance. Third, we present Single-Parameter KANs (SKANs), a family of ultra-lightweight networks that embody the EKE Hypothesis. Our comprehensive experiments provide the first strong empirical validation for the theoretical necessity of basis function smoothness for stable training. Furthermore, SKANs demonstrate state-of-the-art performance, improving F1 scores by up to 6.51% and reducing test loss by 93.1%, while achieving up to 6x faster training speeds compared to existing KAN variants. These results establish a robust framework, a guiding hypothesis, and a practical methodology for designing the next generation of efficient and powerful neural networks. The code is accessible at https://anonymous.4open.science/r/SKAN-EBBB/.

[253] LArctan-SKAN: Simple and Efficient Single-Parameterized Kolmogorov-Arnold Networks using Learnable Trigonometric Function

Zhijie Chen, Xinglin Zhang

Main category: cs.AI

TL;DR: Proposes Single-Parameterized KANs (SKAN) using trigonometric functions, with LArctan-SKAN showing best accuracy and efficiency on MNIST.

DetailsMotivation: To develop more efficient and accurate Kolmogorov-Arnold Networks by parameterizing them with single trigonometric functions, aiming to improve computational efficiency while maintaining or enhancing performance.

Method: Constructs Single-Parameterized Function (SFunc) from trigonometric functions to create three SKAN variants: LSin-SKAN, LCos-SKAN, and LArctan-SKAN. Validates on MNIST dataset.
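
A minimal sketch of a single-parameterized KAN edge with a learnable arctan basis, phi(x) = arctan(k·x) with one trainable k per edge. The exact parameterization in LArctan-SKAN may differ; this only illustrates the "one parameter per basis function" idea.

```python
import torch
import torch.nn as nn

class LArctanLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # One learnable scale per (output, input) edge: the single parameter.
        self.k = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x):                      # x: (batch, in_dim)
        z = torch.atan(self.k.unsqueeze(0) * x.unsqueeze(1))  # (B, out, in)
        return z.sum(dim=-1)                   # sum edge activations per output

layer = LArctanLayer(784, 10)                  # e.g. MNIST-sized input
out = layer(torch.randn(32, 784))
print(out.shape)                               # torch.Size([32, 10])
```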

Result: LArctan-SKAN achieves highest accuracy, outperforming all pure KAN variants (FourierKAN, LSS-SKAN, Spl-KAN) and mixed MLP-based models (MLP+rKAN, MLP+fKAN). Also shows exceptional computational efficiency with 535.01% faster training than MLP+rKAN and 49.55% faster than MLP+fKAN.

Conclusion: SKANs constructed with trigonometric functions are effective and promising, with LArctan-SKAN demonstrating superior accuracy and computational efficiency, making it a competitive alternative to existing KAN and MLP-based models.

Abstract: This paper proposes a novel approach for designing Single-Parameterized Kolmogorov-Arnold Networks (SKAN) by utilizing a Single-Parameterized Function (SFunc) constructed from trigonometric functions. Three new SKAN variants are developed: LSin-SKAN, LCos-SKAN, and LArctan-SKAN. Experimental validation on the MNIST dataset demonstrates that LArctan-SKAN excels in both accuracy and computational efficiency. Specifically, LArctan-SKAN significantly improves test set accuracy over existing models, outperforming all pure KAN variants compared, including FourierKAN, LSS-SKAN, and Spl-KAN. It also surpasses mixed MLP-based models such as MLP+rKAN and MLP+fKAN in accuracy. Furthermore, LArctan-SKAN exhibits remarkable computational efficiency, with a training speed increase of 535.01% and 49.55% compared to MLP+rKAN and MLP+fKAN, respectively. These results confirm the effectiveness and potential of SKANs constructed with trigonometric functions. The experiment code is available at https://github.com/chikkkit/LArctan-SKAN.

[254] J1-ENVS: An Interactive and Dynamic Legal Environment for LLM-Based Agents

Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei

Main category: cs.AI

TL;DR: J1-ENVS is the first interactive legal environment for LLM agents with six Chinese legal scenarios across three complexity levels, evaluated by J1-EVAL framework. Experiments show LLMs struggle with procedural execution despite solid legal knowledge.

DetailsMotivation: The gap between static benchmarks and dynamic real-world legal practice hinders advancement of legal intelligence. Current evaluations don't capture the interactive, procedural nature of actual legal work.

Method: Created J1-ENVS: interactive legal environment with six representative Chinese legal scenarios across three complexity levels. Developed J1-EVAL: fine-grained evaluation framework assessing task performance and procedural compliance across legal proficiency levels. Tested 17 LLM agents.

Result: LLMs demonstrate solid legal knowledge but struggle with procedural execution in dynamic settings. Even SOTA GPT-4o falls below 60% overall performance. Models perform better on simpler scenarios but degrade with complexity.

Conclusion: Persistent challenges remain in achieving dynamic legal intelligence. The gap between static knowledge and procedural execution needs addressing. J1-ENVS provides valuable benchmark for future research in interactive legal AI systems.

Abstract: The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

[255] Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models

Anran Xu, Jincheng Wang, Baigen Cai, Tao Wen

Main category: cs.AI

TL;DR: DRN reframes logical reasoning from probability maximization to uncertainty minimization, asking which hypothesis has most internally consistent evidence rather than which answer is most likely.

DetailsMotivation: Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence (cognitive traps). Current approaches focus on probability maximization rather than principled reasoning.

Method: Deliberative Reasoning Network (DRN) tracks belief states and quantifies epistemic uncertainty for competing hypotheses through iterative evidence synthesis. Two architectures: bespoke discriminative model embodying uncertainty minimization principle, and lightweight verification module enhancing existing LLMs.
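
A hedged sketch of the uncertainty-minimization selection rule: aggregate evidence-consistency scores per hypothesis and pick the hypothesis whose supporting evidence is most internally consistent, rather than the highest-likelihood answer. The `consistency` scorer and the aggregation below are assumed components, not DRN's implementation.

```python
import statistics

def select_hypothesis(hypotheses, evidence, consistency):
    """consistency(h, e) in [0, 1]: how well evidence item e supports h."""
    best, best_uncertainty = None, float("inf")
    for h in hypotheses:
        scores = [consistency(h, e) for e in evidence]
        # High variance across evidence items means internally inconsistent
        # support; low mean means weak support. Both raise uncertainty.
        uncertainty = statistics.pvariance(scores) + (1 - statistics.mean(scores))
        if uncertainty < best_uncertainty:
            best, best_uncertainty = h, uncertainty
    return best
```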

Result: On LCR-1000 adversarial reasoning benchmark: bespoke DRN achieves up to 15.2% improvement over baselines. Hybrid system with Mistral-7B boosts accuracy from 20% to 80% on most challenging problems. Zero-shot generalization improves TruthfulQA performance by 23.6% without additional training.

Conclusion: DRN demonstrates strong zero-shot generalization and learns transferable reasoning principles. Positioned as foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.

Abstract: Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking “Which answer is most likely?”, DRN asks “Which hypothesis has the most internally consistent evidence?”. DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.

[256] FAIRGAMER: Evaluating Social Biases in LLM-Based Video Game NPCs

Bingkang Shi, Jen-tse Huang, Long Luo, Tianyu Zong, Hongzhu Yi, Yuanxiang Wang, Songlin Hu, Xiaodan Zhang, Zhongjiang Yao

Main category: cs.AI

TL;DR: FairGamer is the first benchmark to evaluate social biases in LLM-based NPCs across transaction, cooperation, and competition interactions, revealing that larger models exhibit more severe biases.

DetailsMotivation: LLMs are increasingly used as NPCs in video games but inherit social biases (race, class, etc.), creating fairness risks during player interactions that haven't been adequately explored.

Method: Created FairGamer benchmark with 12 evaluation tasks across three interaction patterns (transaction, cooperation, competition) to assess four bias types (class, race, age, nationality) using a novel FairMCV metric.

Result: Evaluation of 7 frontier LLMs shows: 1) models exhibit biased decision-making (Grok-4-Fast had highest bias at 76.9% FairMCV), 2) larger LLMs display more severe social biases, suggesting increased model capacity amplifies biases.

Conclusion: FairGamer reveals significant social bias issues in LLM-based NPCs, with larger models showing worse bias, highlighting the need for fairness research in gaming AI. The benchmark is publicly released to facilitate further study.

Abstract: Large Language Models (LLMs) have increasingly enhanced or replaced traditional Non-Player Characters (NPCs) in video games. However, these LLM-based NPCs inherit underlying social biases (e.g., race or class), posing fairness risks during in-game interactions. To address the limited exploration of this issue, we introduce FairGamer, the first benchmark to evaluate social biases across three interaction patterns: transaction, cooperation, and competition. FairGamer assesses four bias types, including class, race, age, and nationality, across 12 distinct evaluation tasks using a novel metric, FairMCV. Our evaluation of seven frontier LLMs reveals that: (1) models exhibit biased decision-making, with Grok-4-Fast demonstrating the highest bias (average FairMCV = 76.9%); and (2) larger LLMs display more severe social biases, suggesting that increased model capacity inadvertently amplifies these biases. We release FairGamer at https://github.com/Anonymous999-xxx/FairGamer to facilitate future research on NPC fairness.

[257] Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Chongwen Zhao, Yutong Ke, Kaizhu Huang

Main category: cs.AI

TL;DR: The paper introduces a neuron-level interpretability method to understand and defend against jailbreak attacks on LLMs, proposing SafeTuning to reinforce safety-critical neurons.

DetailsMotivation: LLMs are increasingly used for malicious purposes through jailbreak attacks, but existing defenses lack clear understanding of why they work. There's a need for better interpretability of safety mechanisms in LLMs.

Method: A novel neuron-level interpretability method that projects internal representations into interpretable vocabulary space to identify safety-related knowledge neurons. Then proposes SafeTuning, a fine-tuning strategy that reinforces these safety-critical neurons.
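
A minimal logit-lens-style sketch of the projection step: map a neuron's output weight direction through the unembedding matrix to read off the vocabulary items it promotes, then flag neurons whose top tokens look safety-related. Variable names are assumptions (a HuggingFace-style tokenizer is presumed); real models also need the layer's output projection applied first.

```python
import torch

def top_tokens_for_neuron(w_out_row, W_unembed, tokenizer, k=10):
    """w_out_row: (d_model,) neuron output direction; W_unembed: (V, d_model)."""
    logits = W_unembed @ w_out_row                # (V,) vocabulary projection
    top = torch.topk(logits, k).indices.tolist()
    return [tokenizer.decode([t]) for t in top]

SAFETY_MARKERS = {"sorry", "cannot", "illegal", "harmful", "refuse"}

def is_safety_neuron(tokens):
    # Crude heuristic: any promoted token contains a refusal/safety marker.
    return any(m in t.lower() for t in tokens for m in SAFETY_MARKERS)
```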

Result: Adjusting safety-related neuron activations controls model behavior with >97% mean ASR. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses.

Conclusion: The work provides new perspective on understanding and defending against jailbreak attacks through neuron-level interpretability and targeted fine-tuning of safety-critical neurons.

Abstract: Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation, a technique known as “Jailbreak.” While some studies have achieved defenses against jailbreak attacks by modifying output distributions or detecting harmful content, the exact rationale still remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model’s internal representation into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model’s behavior with a mean ASR higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.

[258] Explaining Tournament Solutions with Minimal Supports

Clément Contet, Umberto Grandi, Jérôme Mengin

Main category: cs.AI

TL;DR: The paper studies certified explanations for tournament winners by identifying minimal sub-tournaments where a candidate is guaranteed to win, providing formal explainable AI for tournament solutions.

DetailsMotivation: Tournaments model pairwise dominance relationships, but understanding why specific candidates win under various tournament rules requires certified explanations. This addresses the need for formal explainable AI in tournament analysis to make winner determinations transparent and interpretable.

Method: The authors identify “minimal supports” - minimal sub-tournaments where a candidate is guaranteed to win regardless of how the rest of the tournament is completed (necessary winner). They analyze this concept for six tournament solutions: top cycle, uncovered set, Copeland rule, Borda rule, maximin rule, and weighted uncovered set.
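
As a rough illustration, a greedy search for an inclusion-minimal support: repeatedly drop a candidate and keep the removal whenever the target still necessarily wins every completion of the sub-tournament. The rule-specific check is abstracted into an assumed `necessary_winner` callable, which is where the real (and rule-dependent) work lives.

```python
def minimal_support(candidates, target, edges, necessary_winner):
    """edges: set of (a, b) pairs meaning a beats b. Greedy minimization
    yields a minimal (not necessarily minimum-size) support."""
    support = set(candidates)
    for c in sorted(candidates):
        if c == target:
            continue
        trial = support - {c}
        sub_edges = {(a, b) for (a, b) in edges if a in trial and b in trial}
        if necessary_winner(target, trial, sub_edges):
            support = trial   # c was not needed to certify the win
    return support
```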

Result: For each tournament rule, the paper determines the size of smallest minimal supports and presents polynomial-time algorithms to compute them for all solutions except the weighted uncovered set, where the problem is NP-complete. Minimal supports provide compact, certified, and intuitive explanations for tournament outcomes.

Conclusion: Minimal supports offer a rigorous framework for certified explanations in tournament analysis, bridging formal explainable AI with tournament theory. The approach works efficiently for most common tournament solutions, though computational complexity varies by rule.

Abstract: Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question, "Why does the winner win the tournament?", a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all solutions except for the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations for tournament solutions.

[259] Towards Reinforcement Learning from Neural Feedback: Mapping fNIRS Signals to Agent Performance

Julia Santaniello, Matthew Russell, Benson Jiang, Donatello Sassaroli, Robert Jacob, Jivko Sinapov

Main category: cs.AI

TL;DR: Paper introduces RLHF framework using fNIRS brain signals to predict agent performance, creates dataset from 25 participants across 3 domains, achieves 67% binary and 46% multi-class F1 scores, shows cross-subject generalization with fine-tuning improvements.

DetailsMotivation: To develop Reinforcement Learning from Neural Feedback (RLNF) systems by using implicit neural signals (fNIRS) instead of explicit human feedback, enabling more natural and continuous alignment of agent behavior with human preferences.

Method: Collected fNIRS recordings from 25 participants across three domains (Pick-and-Place Robot, Lunar Lander, Flappy Bird). Trained classifiers to predict agent performance levels (optimal/suboptimal/worst-case) and regressors to predict action deviation from optimal policies. Evaluated cross-subject generalization with fine-tuning approaches.
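
A minimal sketch of the classification setup: windows of preprocessed fNIRS features mapped to three agent-performance labels with a standard scikit-learn pipeline. Window size, feature count, and model choice are illustrative assumptions, not the paper's configuration (the random data here trains at chance level, of course).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 120))          # 600 windows x 120 fNIRS features
y = rng.integers(0, 3, size=600)         # 0=optimal, 1=suboptimal, 2=worst-case

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f1_score(y_te, clf.predict(X_te), average="macro"))
```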

Result: Achieved average F1 scores of 67% for binary classification and 46% for multi-class classification. Fine-tuning pre-trained models with subject-specific data increased F1 scores by 17% (binary) and 41% (multi-class). Demonstrated feasibility of mapping fNIRS signals to agent performance.

Conclusion: Mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future Reinforcement Learning from Neural Feedback (RLNF) systems that use brain signals instead of explicit feedback.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a methodology that aligns agent behavior with human preferences by integrating user feedback into the agent’s training process. This paper introduces a framework that guides agent training through implicit neural signals, with a focus on the neural classification problem. Our work presents and releases a novel dataset of functional near-infrared spectroscopy (fNIRS) recordings collected from 25 human participants across three domains: Pick-and-Place Robot, Lunar Lander, and Flappy Bird. We train multiple classifiers to predict varying levels of agent performance (optimal, suboptimal, or worst-case) from windows of preprocessed fNIRS features, achieving an average F1 score of 67% for binary and 46% for multi-class classification across conditions and domains. We also train multiple regressors to predict the degree of deviation between an agent’s chosen action and a set of near-optimal policy actions, providing a continuous measure of performance. Finally, we evaluate cross-subject generalization and show that fine-tuning pre-trained models with a small sample of subject-specific data increases average F1 scores by 17% and 41% for binary and multi-class models, respectively. Our results demonstrate that mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future Reinforcement Learning from Neural Feedback (RLNF) systems.

[260] Context-Picker: Dynamic context selection using multi-stage reinforcement learning

Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu

Main category: cs.AI

TL;DR: Context-Picker is a reasoning-aware framework that reframes context selection as identifying minimal sufficient evidence subsets for long-context QA, using two-stage reinforcement learning with offline evidence distillation to outperform RAG baselines.

DetailsMotivation: Current methods for context selection in long-context QA struggle to dynamically determine appropriate context scope. Fixed passage retrieval or reranking approaches often include either insufficient context (missing essential information) or excessive context (introducing noise), particularly problematic for factoid questions that depend on precise evidence.

Method: Proposes Context-Picker, a reasoning-aware framework that reframes context selection as identifying minimal sufficient evidence subsets. Uses a human-inspired two-stage RL schedule: stage 1 improves recall of critical passages, stage 2 prunes redundancy. Introduces offline evidence distillation pipeline using Leave-One-Out procedure to mine “minimal sufficient sets” and resolve reward sparsity.
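
The Leave-One-Out distillation step admits a compact sketch: starting from passages that let the model answer correctly, drop each passage and keep the drop whenever the answer survives, leaving a minimal sufficient evidence set. `answers_correctly` is an assumed oracle call wrapping the QA model.

```python
def loo_minimal_sufficient_set(passages, question, gold, answers_correctly):
    evidence = list(passages)
    i = 0
    while i < len(evidence):
        trial = evidence[:i] + evidence[i + 1:]
        if trial and answers_correctly(question, trial, gold):
            evidence = trial   # passage i was redundant; re-test position i
        else:
            i += 1             # passage i is necessary; keep it and move on
    return evidence
```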

Result: Experiments on five long-context and multi-hop QA datasets show the method outperforms strong RAG baselines and achieves higher answer accuracy. Ablation studies confirm contributions from coarse-to-fine optimization schedule, redundancy-aware reward shaping, and rationale generated by the policy.

Conclusion: Context-Picker effectively addresses the context selection challenge in long-context QA by moving beyond similarity-based ranking to identify minimal sufficient evidence subsets through reasoning-aware reinforcement learning with task-aligned supervision.

Abstract: In long-context question answering, selecting the appropriate scope of context for a query remains a key and unresolved challenge. Insufficient context can lead to missing essential information, whereas excessive context often introduces noise and degrades answer quality. Conventional methods, such as retrieving a fixed number of passages or applying reranking, struggle to dynamically determine which context to include. This is especially problematic for factoid questions, which typically depend only on a few precise pieces of evidence. To overcome this limitation, we propose Context-Picker, a reasoning-aware framework that reframes context selection as the task of identifying a minimal sufficient evidence subset, moving beyond conventional similarity-based ranking. Context-Picker uses a human-inspired two-stage reinforcement learning schedule: stage 1 focuses on improving the recall rate of critical passages, and stage 2 prioritizes pruning redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense and task-aligned supervision. Experiments on five long-context and multi-hop QA datasets demonstrate that our method outperforms strong RAG baselines and achieves higher answer accuracy. Ablation studies also indicate that our coarse-to-fine optimization schedule, the redundancy-aware reward shaping, along with the rationale generated by the policy, all contribute substantially to these gains.

[261] Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

Main category: cs.AI

TL;DR: Clinical AI benchmarks with LLM-generated labels contain systemic errors that distort evaluation and model alignment; hybrid oversight systems can prioritize expert feedback to maintain clinical validity.

DetailsMotivation: To examine the reliability of widely used clinical AI benchmarks that use LLM-generated reference labels, and address concerns about clinical misalignment and downstream effects on model evaluation and alignment.

Method: 1) Analyze reliability of clinical AI benchmark with LLM-generated labels; 2) Introduce phased stewardship procedure to amplify physician expert feedback; 3) Conduct controlled RL experiment to demonstrate label bias effects on downstream LLM evaluation and alignment.

Result: Found substantial fraction of LLM-generated labels are clinically misaligned; demonstrated that uncaught label bias materially affects downstream LLM evaluation and alignment; showed partially LLM-generated labels embed systemic errors that distort both evaluation and model alignment.

Conclusion: Hybrid oversight systems that prioritize scarce expert feedback can maintain benchmarks as living, clinically-grounded documents; ensuring this alignment is essential for safe deployment of LLMs in high-stakes medical decision support.

Abstract: We examine the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and find that a substantial fraction are clinically misaligned. We introduce a phased stewardship procedure to amplify the positive impact of physician experts’ feedback and then demonstrate, via a controlled RL experiment, how uncaught label bias can materially affect downstream LLM evaluation and alignment. Our results demonstrate that partially LLM-generated labels can embed systemic errors that distort not only evaluation but also downstream model alignment. By adopting a hybrid oversight system, we can prioritize scarce expert feedback to maintain benchmarks as living, clinically-grounded documents. Ensuring this alignment is a prerequisite for the safe deployment of LLMs in high-stakes medical decision support.

[262] Monadic Context Engineering

Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao

Main category: cs.AI

TL;DR: Monadic Context Engineering (MCE) introduces a formal algebraic framework using Functors, Applicatives, and Monads to build robust, composable AI agents, addressing brittleness in current imperative architectures.

DetailsMotivation: Current LLM-based agent architectures use brittle, ad hoc imperative patterns that struggle with state management, error handling, and concurrency, leading to unreliable systems.

Method: MCE treats agent workflows as computational contexts managed by algebraic structures: Monads for sequential composition, Applicatives for parallel execution, and Monad Transformers for systematic capability composition.
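
To make the monadic pattern concrete, here is a minimal Python sketch of a Result type whose `bind` threads an agent workflow and short-circuits on the first failure, so error handling lives in the abstraction rather than in each step. This illustrates the pattern only, not the paper's implementation; the agent steps are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

@dataclass
class Result(Generic[T]):
    value: Optional[T] = None
    error: Optional[str] = None

    def bind(self, f: "Callable[[T], Result[U]]") -> "Result[U]":
        if self.error is not None:
            return Result(error=self.error)   # short-circuit: skip later steps
        return f(self.value)

def plan(task: str) -> Result[str]:
    return Result(value=f"plan for {task}")

def act(plan_text: str) -> Result[str]:
    return Result(error="tool timeout")       # simulated mid-pipeline failure

outcome = Result(value="book a flight").bind(plan).bind(act).bind(
    lambda obs: Result(value=f"summary of {obs}"))
print(outcome.error)                          # "tool timeout"
```

Applicative-style parallel execution and monad-transformer stacking layer further capabilities (state, async) onto the same `bind` discipline.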

Result: Enables construction of complex, resilient AI agents from simple, verifiable components, with extension to Meta-Agents for generative orchestration of sub-agent workflows.

Conclusion: MCE provides a formal foundation for agent design that addresses key architectural challenges through algebraic abstractions, enabling more robust and composable autonomous systems.

Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.

[263] MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use

Wenrui Liu, Zixiang Liu, Elsie Dai, Wenhan Yu, Lei Yu, Tong Yang, Jinjun Han, Hong Gao

Main category: cs.AI

TL;DR: MCPAgentBench: A benchmark for evaluating LLM agents’ tool-use capabilities using real-world MCP definitions and simulated tools in a dynamic sandbox environment with distractor tools.

DetailsMotivation: Current MCP evaluation sets have limitations: they rely on external MCP services and lack difficulty awareness. There's a need for better benchmarks to evaluate LLM agents' tool-use capabilities as they increasingly serve as autonomous agents.

Method: Propose MCPAgentBench with: 1) Dataset containing authentic tasks and simulated MCP tools, 2) Dynamic sandbox environment that presents agents with candidate tool lists containing distractors, 3) Comprehensive metrics measuring both task completion rates and execution efficiency.

Result: Experiments on various latest mainstream LLMs reveal significant performance differences in handling complex, multi-step tool invocations. The benchmark successfully tests tool selection and discrimination abilities.

Conclusion: MCPAgentBench addresses limitations of current MCP evaluation sets and provides a comprehensive benchmark for evaluating LLM agents’ tool-use capabilities. All code is open-source for community use.

Abstract: Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real-world MCP definitions designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on the latest mainstream Large Language Models reveal significant performance differences in handling complex, multi-step tool invocations. All code is open-source on GitHub.

[264] LLM Powered Social Digital Twins: A Framework for Simulating Population Behavioral Response to Policy Interventions

Fatima Koaik, Aayush Gupta, Farahan Raza Sheikh

Main category: cs.AI

TL;DR: Social Digital Twins framework uses LLMs as cognitive engines for individual agents to predict population responses to policies, achieving 20.7% better prediction than baselines in COVID-19 case study.

DetailsMotivation: Traditional aggregate statistical models lack mechanistic interpretability and struggle with novel policy scenarios, creating a need for more robust approaches to predict population responses to policy interventions.

Method: Framework constructs virtual population replicas where LLMs serve as cognitive engines for individual agents with demographic/psychographic attributes. Agents receive policy signals and output behavioral probability vectors, with a calibration layer mapping aggregated responses to observable population-level metrics.

Result: In COVID-19 pandemic response case study, calibrated digital twin achieves 20.7% improvement in macro-averaged prediction error over gradient boosting baselines across six behavioral categories. Counterfactual experiments show monotonic and bounded responses to policy variations.

Conclusion: The domain-agnostic framework enables validation against real-world data and counterfactual policy analysis, with applications beyond pandemic response to transportation, economic, environmental policies, and other settings where policy affects population behavior.

Abstract: Predicting how populations respond to policy interventions is a fundamental challenge in computational social science and public policy. Traditional approaches rely on aggregate statistical models that capture historical correlations but lack mechanistic interpretability and struggle with novel policy scenarios. We present a general framework for constructing Social Digital Twins - virtual population replicas where Large Language Models (LLMs) serve as cognitive engines for individual agents. Each agent, characterized by demographic and psychographic attributes, receives policy signals and outputs multi-dimensional behavioral probability vectors. A calibration layer maps aggregated agent responses to observable population-level metrics, enabling validation against real-world data and deployment for counterfactual policy analysis. We instantiate this framework in the domain of pandemic response, using COVID-19 as a case study with rich observational data. On a held-out test period, our calibrated digital twin achieves a 20.7% improvement in macro-averaged prediction error over gradient boosting baselines across six behavioral categories. Counterfactual experiments demonstrate monotonic and bounded responses to policy variations, establishing behavioral plausibility. The framework is domain-agnostic: the same architecture applies to transportation policy, economic interventions, environmental regulations, or any setting where policy affects population behavior. We discuss implications for policy simulation, limitations of the approach, and directions for extending LLM-based digital twins beyond pandemic response.

[265] GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Zhengqing Yan, Xinyang Liu, Yi Zhang, Fan Guo, Yao Liu, Junchen Wan, Kang Song

Main category: cs.AI

TL;DR: GDEPO improves RL for automated theorem proving by addressing GRPO’s issues with composite rewards and static sampling through dynamic resampling, decoupled advantage estimation, and extra gradient steps for challenging cases.

DetailsMotivation: Current RL methods like GRPO struggle with automated theorem proving due to conflicts between composite rewards and binary verifier feedback, plus inefficient static sampling that wastes data when no proofs are found.

Method: GDEPO introduces three mechanisms: 1) dynamic additional sampling to resample invalid batches until proofs are found, 2) equal-right advantage that decouples advantage sign (correctness) from magnitude (auxiliary rewards), and 3) dynamic additional iterations for extra gradient steps on challenging cases.
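
The equal-right advantage can be illustrated as follows: the verifier's binary outcome fixes the sign, and the auxiliary reward only modulates the magnitude, so a well-formatted but incorrect proof can never receive a positive advantage. The normalization below is an illustrative choice, not the paper's exact formula.

```python
import numpy as np

def equal_right_advantage(correct: np.ndarray, aux_reward: np.ndarray) -> np.ndarray:
    """Decouple sign and magnitude: the formal verifier's binary outcome fixes
    the sign, while auxiliary rewards (rescaled to stay strictly positive)
    only modulate the step size."""
    sign = np.where(correct, 1.0, -1.0)
    lo, hi = aux_reward.min(), aux_reward.max()
    magnitude = (aux_reward - lo) / (hi - lo + 1e-8) + 1e-3
    return sign * magnitude

correct = np.array([True, False, True, False])
aux = np.array([0.9, 0.8, 0.2, 0.1])  # e.g., proof-length or format rewards
print(equal_right_advantage(correct, aux))  # signs follow correctness only
```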

Result: Experiments on MiniF2F-test, MathOlympiadBench, and PutnamBench datasets demonstrate GDEPO’s effectiveness, with ablation studies confirming the necessity of all three components for improved performance.

Conclusion: GDEPO enhances data utilization and optimization efficiency for automated theorem proving, offering a novel training paradigm that addresses critical limitations of existing RL approaches in ATP scenarios.

Abstract: Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.

[266] Internal Deployment Gaps in AI Regulation

Joe Kwon, Stephen Casper

Main category: cs.AI

TL;DR: Paper identifies regulatory gaps in frontier AI oversight for internal corporate deployments, analyzing scope ambiguity, compliance timing issues, and information asymmetries in US/EU 2025 regulations.

DetailsMotivation: Current frontier AI regulations focus on external deployments while overlooking high-stakes internal applications within companies (R&D automation, critical business processes, sensitive data handling), creating potential oversight gaps.

Method: Examines US and EU frontier AI regulations in 2025 to identify gaps in handling internal deployments, analyzes why gaps persist (measurability, incentives, information access tensions), and maps potential solutions with tradeoffs.

Result: Identifies three regulatory gaps: 1) scope ambiguity allowing internal systems to evade obligations, 2) point-in-time compliance failing to capture continuous system evolution, and 3) information asymmetries undermining regulatory awareness and oversight.

Conclusion: Understanding these patterns enables deliberate policy choices for internally deployed AI systems rather than incidental oversight, with mapped approaches to address identified gaps while considering associated tradeoffs.

Abstract: Frontier AI regulations primarily focus on systems deployed to external users, where deployment is more visible and subject to outside scrutiny. However, high-stakes applications can occur internally when companies deploy highly capable systems within their own organizations, such as for automating R&D, accelerating critical business processes, and handling sensitive proprietary data. This paper examines how frontier AI regulations in the United States and European Union in 2025 handle internal deployment. We identify three gaps that could cause internally-deployed systems to evade intended oversight: (1) scope ambiguity that allows internal systems to evade regulatory obligations, (2) point-in-time compliance assessments that fail to capture the continuous evolution of internal systems, and (3) information asymmetries that subvert regulatory awareness and oversight. We then analyze why these gaps persist, examining tensions around measurability, incentives, and information access. Finally, we map potential approaches to address them and their associated tradeoffs. By understanding these patterns, we hope that policy choices around internally deployed AI systems can be made deliberately rather than incidentally.

[267] DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

Shengda Fan, Xuyan Ye, Yankai Lin

Main category: cs.AI

TL;DR: DARC is a two-stage self-play framework that stabilizes LLM self-improvement by decoupling question generation and solving with difficulty calibration and asymmetric self-distillation.

DetailsMotivation: Existing self-play frameworks suffer from optimization instability due to non-stationary objectives from solver-dependent rewards and bootstrapping errors from self-generated pseudo-labels.

Method: Two-stage framework: (1) Train Questioner to synthesize difficulty-calibrated questions using explicit difficulty levels and external corpora, (2) Train Solver with asymmetric self-distillation where a document-augmented teacher generates pseudo-labels to supervise student Solver without document access.

Result: DARC yields average improvement of 10.9 points across nine reasoning benchmarks and three backbone models, consistently outperforming baselines and approaching fully supervised model performance without human annotations.

Conclusion: DARC provides a stable, model-agnostic self-play framework that effectively addresses optimization instability in LLM self-improvement through decoupled asymmetric reasoning curriculum.

Abstract: Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.

[268] PREFAB: PREFerence-based Affective Modeling for Low-Budget Self-Annotation

Jaeyoung Moon, Youjin Choi, Yucheon Park, David Melhart, Georgios N. Yannakakis, Kyung-Joong Kim

Main category: cs.AI

TL;DR: PREFAB is a low-budget retrospective self-annotation method that targets affective inflection regions instead of full continuous annotation, reducing workload while maintaining annotation quality.

DetailsMotivation: Full continuous self-annotation for affective states is time-consuming, cognitively demanding, and prone to fatigue and errors, creating a need for more efficient annotation methods.

Method: PREFAB uses preference-learning models based on peak-end rule and ordinal emotion representations to detect relative affective changes, directing annotators to label only selected segments while interpolating the rest, with a preview mechanism for contextual cues.
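
A minimal sketch of the label-then-interpolate step, assuming linear interpolation between sparsely labeled inflection frames (the paper's exact interpolation scheme may differ).

```python
import numpy as np

def fill_annotation(length: int, labeled: dict) -> np.ndarray:
    """Given sparse labels at selected inflection frames (frame -> value),
    linearly interpolate the rest of the annotation trace."""
    frames = np.array(sorted(labeled))
    values = np.array([labeled[f] for f in frames])
    return np.interp(np.arange(length), frames, values)

# The annotator labels only three frames flagged as likely affective inflections
trace = fill_annotation(10, {0: 0.1, 4: 0.9, 9: 0.3})
print(trace.round(2))
```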

Result: PREFAB outperforms baselines in modeling affective inflections, mitigates workload (and sometimes temporal burden), and improves annotator confidence without degrading annotation quality.

Conclusion: PREFAB offers an effective low-budget alternative to full continuous annotation that reduces cognitive load while maintaining data quality for affective computing research.

Abstract: Self-annotation is the gold standard for collecting affective state labels in affective computing. Existing methods typically rely on full annotation, requiring users to continuously label affective states across entire sessions. While this process yields fine-grained data, it is time-consuming, cognitively demanding, and prone to fatigue and errors. To address these issues, we present PREFAB, a low-budget retrospective self-annotation method that targets affective inflection regions rather than full annotation. Grounded in the peak-end rule and ordinal representations of emotion, PREFAB employs a preference-learning model to detect relative affective changes, directing annotators to label only selected segments while interpolating the remainder of the stimulus. We further introduce a preview mechanism that provides brief contextual cues to assist annotation. We evaluate PREFAB through a technical performance study and a 25-participant user study. Results show that PREFAB outperforms baselines in modeling affective inflections while mitigating workload (and conditionally mitigating temporal burden). Importantly, PREFAB improves annotator confidence without degrading annotation quality.

[269] Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF

Wang Zixian

Main category: cs.AI

Summary unavailable: the arXiv API request for 2601.12415 returned HTTP 429 (rate limited), so no abstract or AI-enhanced summary could be generated for this entry.

[270] Patch-Level Tokenization with CNN Encoders and Attention for Improved Transformer Time-Series Forecasting

Saurish Nagrath, Saroj Kumar Panigrahy

Main category: cs.AI

Summary unavailable: the arXiv API request for 2601.12467 returned HTTP 429 (rate limited), so no abstract or AI-enhanced summary could be generated for this entry.

[271] Diffusion In Diffusion: Reclaiming Global Coherence in Semi-Autoregressive Diffusion

Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang

Main category: cs.AI

Summary unavailable: the arXiv API request for 2601.13599 returned HTTP 429 (rate limited), so no abstract or AI-enhanced summary could be generated for this entry.

[272] RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning

Cheol-Hui Lee, Hwa-Yeon Lee, Dong-Joo Kim

Main category: cs.AI

Summary unavailable: the arXiv API request for 2601.13964 returned HTTP 429 (rate limited), so no abstract or AI-enhanced summary could be generated for this entry.

[273] DroneVLA: VLA based Aerial Manipulation

Fawad Mehboob, Monijesu James, Amir Habel, Jeffrin Sam, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

Main category: cs.AI

Summary unavailable: the arXiv API request for 2601.13809 returned HTTP 429 (rate limited), so no abstract or AI-enhanced summary could be generated for this entry.

cs.SD

[274] Single-step Controllable Music Bandwidth Extension With Flow Matching

Carlos Hernandez-Olivan, Hendrik Vincent Koops, Hao Hao Tan, Elio Quinton

Main category: cs.SD

TL;DR: Proposes Dynamic Spectral Contour (DSC) as a control signal for bandwidth extension in audio restoration using classifier-free guidance, enabling finer control over generative models.

DetailsMotivation: Audio restoration is crucial for preserving historical recordings, but existing generative models lack fine controllability despite showing promising results.

Method: Extends FLowHigh framework and introduces Dynamic Spectral Contour (DSC) as a control signal for bandwidth extension via classifier-free guidance.
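
For reference, classifier-free guidance blends a conditional and an unconditional prediction. The sketch below shows how a DSC condition would enter this combination; the `model` callable and its None-as-unconditional convention are assumptions, not the paper's API.

```python
import torch

def cfg_prediction(model, x_t, t, dsc, guidance_scale=2.0):
    """Standard classifier-free guidance: push the prediction toward the
    DSC-conditioned output and away from the unconditional one."""
    pred_cond = model(x_t, t, cond=dsc)
    pred_uncond = model(x_t, t, cond=None)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy stand-in model so the sketch runs end to end
model = lambda x, t, cond: 0.5 * x if cond is None else 0.5 * x + 0.1 * cond.mean()
x_t = torch.randn(1, 80, 256)  # e.g., a spectrogram-shaped latent
print(cfg_prediction(model, x_t, t=0.5, dsc=torch.randn(1, 256)).shape)
```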

Result: Experiments show competitive model performance and indicate DSC is a promising feature for fine-grained conditioning in audio restoration.

Conclusion: DSC enables better control over generative audio restoration models, addressing the controllability challenge while maintaining competitive performance.

Abstract: Audio restoration consists in inverting degradations of a digital audio signal to recover what would have been the pristine quality signal before the degradation occurred. This is valuable in contexts such as archives of music recordings, particularly those of precious historical value, for which a clean version may have been lost or simply does not exist. Recent work applied generative models to audio restoration, showing promising improvement over previous methods, and opening the door to the ability to perform restoration operations that were not possible before. However, making these models finely controllable remains a challenge. In this paper, we propose an extension of FLowHigh and introduce the Dynamic Spectral Contour (DSC) as a control signal for bandwidth extension via classifier-free guidance. Our experiments show competitive model performance, and indicate that DSC is a promising feature to support fine-grained conditioning.

[275] Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

Mohammed Salah Al-Radhi, Riad Larbi, Mátyás Bartalis, Géza Németh

Main category: cs.SD

TL;DR: A neural vocoder with prosody-guided harmonic attention and direct complex spectrum modeling improves pitch accuracy and phase coherence for more natural speech synthesis.

DetailsMotivation: Existing neural vocoders have limitations in prosody modeling and phase reconstruction, leading to unnatural synthetic speech with poor pitch fidelity.

Method: Introduces prosody-guided harmonic attention for better voiced segment encoding, directly predicts complex spectral components via inverse STFT, and uses multi-objective training with adversarial, spectral, and phase-aware losses.
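
The final synthesis step, combining predicted real and imaginary spectral components and inverting them to a waveform, can be sketched as follows. The network producing the components is omitted, and the STFT settings are illustrative.

```python
import torch

def synthesize(real: torch.Tensor, imag: torch.Tensor,
               n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Invert a predicted complex spectrum to a waveform via inverse STFT.
    Shapes follow torch.istft: (batch, n_fft // 2 + 1, frames)."""
    spec = torch.complex(real, imag)
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)

real = torch.randn(2, 513, 100)
imag = torch.randn(2, 513, 100)
print(synthesize(real, imag).shape)  # (2, 25344): 99 hops of 256 samples
```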

Result: Outperforms HiFi-GAN and AutoVocoder: 22% reduction in F0 RMSE, 18% lower voiced/unvoiced error, and 0.15 MOS improvement on benchmark datasets.

Conclusion: Prosody-guided attention combined with direct complex spectrum modeling produces more natural, pitch-accurate, and robust synthetic speech, advancing expressive neural vocoding.

Abstract: Neural vocoders are central to speech synthesis; despite their success, most still suffer from limited prosody modeling and inaccurate phase reconstruction. We propose a vocoder that introduces prosody-guided harmonic attention to enhance voiced segment encoding and directly predicts complex spectral components for waveform synthesis via inverse STFT. Unlike mel-spectrogram-based approaches, our design jointly models magnitude and phase, ensuring phase coherence and improved pitch fidelity. To further align with perceptual quality, we adopt a multi-objective training strategy that integrates adversarial, spectral, and phase-aware losses. Experiments on benchmark datasets demonstrate consistent gains over HiFi-GAN and AutoVocoder: F0 RMSE reduced by 22 percent, voiced/unvoiced error lowered by 18 percent, and MOS scores improved by 0.15. These results show that prosody-guided attention combined with direct complex spectrum modeling yields more natural, pitch-accurate, and robust synthetic speech, setting a strong foundation for expressive neural vocoding.

[276] Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch

Kanami Imamura, Tomohiko Nakamura, Kohei Yatabe, Hiroshi Saruwatari

Main category: cs.SD

TL;DR: Deep neural networks for audio processing degrade when handling untrained sampling frequencies via conventional resampling, especially when input SF is lower than trained SF. The paper investigates this degradation and proposes alternative resampling methods that alleviate the problem.

DetailsMotivation: Audio DNNs are typically trained at a single sampling frequency, requiring resampling for untrained frequencies. Conventional resampling degrades performance, especially when input SF is lower than trained SF. The paper aims to understand and address this degradation.

Method: The paper tests two hypotheses about degradation causes and compares conventional resampling with three alternatives: (1) post-resampling noise addition (adds Gaussian noise), (2) noisy-kernel resampling (perturbs kernel with Gaussian noise to enrich high frequencies), and (3) trainable-kernel resampling (adapts interpolation kernel through training). Experiments conducted on music source separation task.
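
A rough numpy sketch of the noisy-kernel idea: perturb a windowed-sinc upsampling kernel with Gaussian noise so the resampled signal carries some broadband high-frequency energy instead of an empty upper band. The kernel design and noise scale here are assumptions, not the paper's exact parameters.

```python
import numpy as np

def sinc_kernel(factor: int, width: int = 16) -> np.ndarray:
    """Windowed-sinc kernel for integer-factor upsampling."""
    t = np.arange(-width, width + 1) / factor
    return np.sinc(t) * np.hanning(len(t))

def noisy_kernel_upsample(x: np.ndarray, factor: int,
                          noise_std: float = 0.01, seed: int = 0) -> np.ndarray:
    """Zero-stuff the signal, then convolve with a Gaussian-perturbed kernel,
    which leaks energy into the otherwise-empty high-frequency bands."""
    rng = np.random.default_rng(seed)
    kernel = sinc_kernel(factor) + rng.normal(0.0, noise_std, 2 * 16 + 1)
    up = np.zeros(len(x) * factor)
    up[::factor] = x
    return factor * np.convolve(up, kernel, mode="same")

x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))
print(noisy_kernel_upsample(x, factor=2).shape)  # (200,)
```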

Result: Noisy-kernel and trainable-kernel resampling alleviate the degradation observed with conventional resampling. Noisy-kernel resampling is particularly effective across diverse models, making it a simple yet practical solution.

Conclusion: The degradation in audio DNN performance when handling untrained sampling frequencies can be mitigated by alternative resampling methods. Noisy-kernel resampling emerges as an effective and practical approach that works well across different models.

Abstract: Audio processing methods based on deep neural networks are typically trained at a single sampling frequency (SF). To handle untrained SFs, signal resampling is commonly employed, but it can degrade performance, particularly when the input SF is lower than the trained SF. This paper investigates the causes of this degradation through two hypotheses: (i) the lack of high-frequency components introduced by up-sampling, and (ii) the greater importance of their presence than their precise representation. To examine these hypotheses, we compare conventional resampling with three alternatives: post-resampling noise addition, which adds Gaussian noise to the resampled signal; noisy-kernel resampling, which perturbs the kernel with Gaussian noise to enrich high-frequency components; and trainable-kernel resampling, which adapts the interpolation kernel through training. Experiments on music source separation show that noisy-kernel and trainable-kernel resampling alleviate the degradation observed with conventional resampling. We further demonstrate that noisy-kernel resampling is effective across diverse models, highlighting it as a simple yet practical option.

[277] Unlocking Large Audio-Language Models for Interactive Language Learning

Hongfu Liu, Zhouying Cui, Xiangming Gu, Ye Wang

Main category: cs.SD

TL;DR: Instruction-tuned audio-language models outperform existing methods for pronunciation error detection and feedback generation in second language learning.

DetailsMotivation: Traditional Computer-Assisted Pronunciation Training (CAPT) systems provide unintuitive feedback lacking actionable guidance, limiting their effectiveness for second language learners. Recent audio-language models offer potential for more user-friendly feedback.

Method: Introduce L2-Arctic-plus dataset with detailed error explanations and actionable suggestions. Benchmark cascaded ASR+LLMs and existing ALMs, then propose instruction-tuning ALMs on L2-Arctic-plus to improve performance.

Result: Instruction-tuned models significantly outperform existing baselines on both mispronunciation detection and suggestion generation, as shown by objective and human evaluations.

Conclusion: The proposed instruction-tuned audio-language models and L2-Arctic-plus dataset effectively enhance pronunciation training by providing actionable feedback, demonstrating the value of specialized datasets for improving second language pronunciation assistance.

Abstract: Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting its effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve the performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.

[278] Training-Efficient Text-to-Music Generation with State-Space Modeling

Wei-Jaw Lee, Fang-Chih Hsieh, Xuanjun Chen, Fang-Duo Tsai, Yi-Hsuan Yang

Main category: cs.SD

TL;DR: SSM-based text-to-music models achieve competitive performance with MusicGen-small using only 9% FLOPs and 2% training data, while being fully open-source.

DetailsMotivation: Current text-to-music generation models require extensive compute and proprietary data, limiting affordability and openness. The paper aims to create more efficient and accessible TTM models using state-space models trained on public data.

Method: Replace Transformer backbone with state-space models (SSMs), explore different SSM variants, compare single-stage SSM design with two-stage SSM/diffusion hybrid, train from scratch on 457 hours of CC-licensed music with ~300M parameters matching MusicGen-small.

Result: SSMs show superior training efficiency vs Transformers; achieve competitive performance with only 9% FLOPs and 2% training data compared to MusicGen-small; maintain competitive performance even with 4x smaller model size at same training budget.

Conclusion: SSMs offer a promising alternative to Transformers for text-to-music generation, enabling more affordable and open research with competitive performance using significantly fewer resources.

Abstract: Recent advances in text-to-music generation (TTM) have yielded high-quality results, but often at the cost of extensive compute and the use of large proprietary internal data. To improve the affordability and openness of TTM training, an open-source generative model backbone that is more training- and data-efficient is needed. In this paper, we constrain the number of trainable parameters in the generative model to match that of the MusicGen-small benchmark (with about 300M parameters), and replace its Transformer backbone with the emerging class of state-space models (SSMs). Specifically, we explore different SSM variants for sequence modeling, and compare a single-stage SSM-based design with a decomposable two-stage SSM/diffusion hybrid design. All proposed models are trained from scratch on a purely public dataset comprising 457 hours of CC-licensed music, ensuring full openness. Our experimental findings are three-fold. First, we show that SSMs exhibit superior training efficiency compared to the Transformer counterpart. Second, despite using only 9% of the FLOPs and 2% of the training data size compared to the MusicGen-small benchmark, our model achieves competitive performance in both objective metrics and subjective listening tests based on MusicCaps captions. Finally, our scaling-down experiment demonstrates that SSMs can maintain competitive performance relative to the Transformer baseline even at the same training budget (measured in iterations), when the model size is reduced to four times smaller. To facilitate the democratization of TTM research, the processed captions, model checkpoints, and source code are available on GitHub via the project page: https://lonian6.github.io/ssmttm/.

[279] Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro

Main category: cs.SD

TL;DR: Multi-task transformer for speech deepfake detection that predicts formant trajectories and voicing patterns while classifying speech as real/fake, with built-in explainability.

DetailsMotivation: To create a more efficient and interpretable speech deepfake detection system that provides insights into which speech regions (voiced/unvoiced) contribute to classification decisions.

Method: Builds on prior speaker-formant transformer architecture with improved input segmentation, redesigned decoding process, and integrated explainability features. Uses multi-task learning to predict formant trajectories and voicing patterns while performing classification.

Result: Model requires fewer parameters, trains faster, provides better interpretability, and maintains comparable prediction performance to baseline.

Conclusion: The proposed multi-task transformer offers an efficient and explainable solution for speech deepfake detection that balances performance with interpretability and computational efficiency.

Abstract: In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.

[280] Generative Artificial Intelligence, Musical Heritage and the Construction of Peace Narratives: A Case Study in Mali

Nouhoum Coulibaly, Ousmane Ly, Michael Leventhal, Ousmane Goro

Main category: cs.SD

TL;DR: Gen AI can help create peace narratives and revitalize musical heritage in Mali by enabling culturally-rooted musical creation, balancing innovation with authenticity, and strengthening social cohesion, though challenges with linguistic data, censorship, and copyright ethics remain.

DetailsMotivation: The study addresses inter-community tensions and social fractures in Mali, seeking new symbolic frameworks for reconciliation through cultural revitalization and peace-building.

Method: Empirical exploration of three questions: using Gen AI for musical creation rooted in national languages/traditions; assessing balanced hybridization between tech innovation and cultural authenticity; and examining how AI-assisted musical co-creation strengthens social cohesion and cultural sovereignty.

Result: Gen AI embedded in culturally conscious participatory frameworks can act as a catalyst for symbolic diplomacy, amplifying local voices rather than standardizing them.

Conclusion: While Gen AI shows promise for peace-building and cultural revitalization in Mali, significant challenges persist regarding linguistic corpora availability, algorithmic censorship, and ethics of generating compositions from copyrighted sources.

Abstract: This study explores the capacity of generative artificial intelligence (Gen AI) to contribute to the construction of peace narratives and the revitalization of musical heritage in Mali. The study was conducted in a political and social context where inter-community tensions and social fractures motivate a search for new symbolic frameworks for reconciliation. The study empirically explores three questions: (1) how Gen AI can be used as a tool for musical creation rooted in national languages and traditions; (2) to what extent Gen AI systems enable a balanced hybridization between technological innovation and cultural authenticity; and (3) how AI-assisted musical co-creation can strengthen social cohesion and cultural sovereignty. The experimental results suggest that Gen AI, embedded in a culturally conscious participatory framework, can act as a catalyst for symbolic diplomacy, amplifying local voices instead of standardizing them. However, challenges persist regarding the availability of linguistic corpora, algorithmic censorship, and the ethics of generating compositions derived from copyrighted sources.

[281] VCNAC: A Variable-Channel Neural Audio Codec for Mono, Stereo, and Surround Sound

Florian Grötschla, Arunasish Sen, Alessandro Lombardi, Guillermo Cámbara, Andreas Schwarz

Main category: cs.SD

TL;DR: VCNAC is a variable channel neural audio codec with a single encoder/decoder that supports mono to 5.1 surround audio, maintaining quality across channel configurations while enabling generative language model training on unified codebooks.

DetailsMotivation: Current audio codecs often require separate models for different channel configurations (mono, stereo, surround), which is inefficient and doesn't support flexible channel compatibility. There's a need for a unified approach that can handle various channel setups while maintaining quality when downmixing.

Method: VCNAC uses a single encoder-decoder parametrization with channel compatibility objectives that ensure multi-channel content maintains perceptual quality when decoded to fewer channels. The shared representation enables training generative language models on a single set of codebooks while supporting inference-time scalability across modalities and channel configurations.
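
One plausible form of a channel-compatibility objective is sketched below: decode the 5.1 output, downmix it to stereo, and penalize deviation from a stereo reference. The ITU-style downmix gains and the plain L1 loss are assumptions standing in for whatever the paper actually uses.

```python
import torch
import torch.nn.functional as F

def stereo_downmix(surround: torch.Tensor) -> torch.Tensor:
    """Downmix 5.1 (L, R, C, LFE, Ls, Rs) to stereo with ITU-style gains;
    the coefficient choice is illustrative, not taken from the paper."""
    L, R, C, _, Ls, Rs = surround.unbind(dim=1)
    g = 0.7071
    return torch.stack([L + g * C + g * Ls, R + g * C + g * Rs], dim=1)

def channel_compat_loss(decoded_51: torch.Tensor, ref_stereo: torch.Tensor) -> torch.Tensor:
    """Penalize drift between the downmixed 5.1 decode and a stereo reference;
    a plain L1 loss stands in for a perceptual objective here."""
    return F.l1_loss(stereo_downmix(decoded_51), ref_stereo)

decoded = torch.randn(4, 6, 48000)  # batch of one-second 5.1 clips at 48 kHz
ref = torch.randn(4, 2, 48000)
print(channel_compat_loss(decoded, ref))
```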

Result: Evaluation using objective spatial audio metrics and subjective listening tests demonstrates that the unified approach maintains high reconstruction quality across mono, stereo, and surround audio configurations.

Conclusion: VCNAC provides an efficient, unified neural audio codec solution that supports variable channel configurations from mono to 5.1 surround while maintaining quality and enabling generative model training on shared codebooks.

Abstract: We present VCNAC, a variable channel neural audio codec. Our approach features a single encoder and decoder parametrization that enables native inference for different channel setups, from mono speech to cinematic 5.1 channel surround audio. Channel compatibility objectives ensure that multi-channel content maintains perceptual quality when decoded to fewer channels. The shared representation enables training of generative language models on a single set of codebooks while supporting inference-time scalability across modalities and channel configurations. Evaluation using objective spatial audio metrics and subjective listening tests demonstrates that our unified approach maintains high reconstruction quality across mono, stereo, and surround audio configurations.

[282] Bangla Music Genre Classification Using Bidirectional LSTMs

Muntakimur Rahaman, Md Mahmudul Hoque, Md Mehedi Hassain

Main category: cs.SD

TL;DR: A novel Bangla music dataset with 10 genres is created, and an LSTM-based RNN model using MFCC features achieves 78% accuracy for genre classification.

DetailsMotivation: Bangla music has rich cultural heritage, and with exponential growth of digital music, automatic genre classification is essential for efficient indexing and retrieval of Bangla music from large libraries.

Method: Created a new Bangla music dataset with 10 genres, used MFCC features for audio representation, and implemented an LSTM-based recurrent neural network architecture for genre classification.
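
A minimal sketch of the described pipeline: MFCC extraction with librosa feeding a bidirectional-LSTM classifier in Keras. Layer sizes and training settings are illustrative, since the summary does not specify them.

```python
import librosa
import numpy as np
import tensorflow as tf

def extract_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    """MFCC sequence of shape (frames, n_mfcc) from a raw waveform."""
    y, sr = librosa.load(path)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def build_model(n_genres: int = 10, n_mfcc: int = 13) -> tf.keras.Model:
    """Bidirectional-LSTM classifier over MFCC frames; the layer widths
    are illustrative placeholders, not the paper's configuration."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, n_mfcc)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(n_genres, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_model().summary()
```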

Result: Achieved 78% classification accuracy on the Bangla music genre classification task, demonstrating the system’s effectiveness for organizing Bangla music.

Conclusion: The proposed LSTM-based approach with MFCC features shows strong potential for enhancing Bangla music organization and retrieval, though there’s room for improvement beyond 78% accuracy.

Abstract: Bangla music is rich in its own musical culture. Nowadays, music genre classification is very significant because of the exponential increase in available music, in both digital and physical formats. It is necessary to index recordings accordingly to facilitate improved retrieval. Automatically classifying Bangla music by genre is essential for efficiently locating specific pieces within a vast and diverse music library. Prevailing methods for genre classification predominantly employ conventional machine learning or deep learning approaches. This work introduces a novel music dataset comprising ten distinct genres of Bangla music. For the task of audio classification, we utilize a recurrent neural network (RNN) architecture. Specifically, a Long Short-Term Memory (LSTM) network is implemented to train the model and perform the classification. Feature extraction represents a foundational stage in audio data processing. This study utilizes Mel-Frequency Cepstral Coefficients (MFCCs) to transform raw audio waveforms into a compact and representative set of features. The proposed framework facilitates music genre classification by leveraging these extracted features. Experimental results demonstrate a classification accuracy of 78%, indicating the system’s strong potential to enhance and streamline the organization of Bangla music genres.

[283] WeDefense: A Toolkit to Defend Against Fake Audio

Lin Zhang, Johan Rohdin, Xin Wang, Junyi Peng, Tianchi Liu, You Zhang, Hieu-Thi Luong, Shuai Wang, Chengdong Liang, Anna Silnova, Nicholas Evans

Main category: cs.SD

TL;DR: WeDefense is an open-source toolkit for fake audio detection and localization that provides standardized benchmarking, evaluation metrics, and analysis tools to address the lack of unified solutions in the field.

DetailsMotivation: The paper addresses the risks of synthetic audio misuse (impersonation, disinformation, fraud) and the lack of standardized, unified toolkits for fake audio detection. Existing solutions are fragmented across competitions, datasets, and models without fair benchmarking capabilities.

Method: Developed WeDefense as an open-source toolkit with comprehensive features including flexible input/augmentation, calibration, score fusion, standardized evaluation metrics, and analysis tools for deeper interpretation of detection results.

Result: Created the first open-source toolkit supporting both fake audio detection and localization, publicly available on GitHub with interactive demos, providing a unified framework for fair benchmarking and comparison of solutions.

Conclusion: WeDefense fills a critical gap in the field by offering a standardized toolkit that enables researchers to fairly benchmark and compare fake audio detection methods while providing essential analysis tools for better understanding detection performance.

Abstract: The advances in generative AI have enabled the creation of synthetic audio which is perceptually indistinguishable from real, genuine audio. Although this stellar progress enables many positive applications, it also raises risks of misuse, such as for impersonation, disinformation and fraud. Despite a growing number of open-source fake audio detection codes released through numerous challenges and initiatives, most are tailored to specific competitions, datasets or models. A standardized and unified toolkit that supports the fair benchmarking and comparison of competing solutions with not just common databases, protocols, metrics, but also a shared codebase, is missing. To address this, we propose WeDefense, the first open-source toolkit to support both fake audio detection and localization. Beyond model training, WeDefense emphasizes critical yet often overlooked components: flexible input and augmentation, calibration, score fusion, standardized evaluation metrics, and analysis tools for deeper understanding and interpretation. The toolkit is publicly available at https://github.com/zlin0/wedefense with interactive demos for fake audio detection and localization.

[284] WavLink

Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid

Main category: cs.SD

TL;DR: WavLink is a compact audio-text embedding model that enhances Whisper encoder with a learnable global token, achieving state-of-the-art retrieval performance through systematic design optimization and two-stage training.

DetailsMotivation: Whisper has become standard for audio features in large audio-language models, but audio-text embedding models like CLAP haven't effectively leveraged Whisper. The authors aim to create a compact model that bridges this gap and improves retrieval performance.

Method: WavLink augments Whisper encoder with a learnable global token, trained jointly with text encoder. Uses systematic study of design choices (pretrained text encoders, loss functions, training modes, data mixtures). Implements two-stage training across three model sizes with Matryoshka-style supervision for scalable embeddings.
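
Matryoshka-style supervision can be sketched as a contrastive loss averaged over nested prefixes of the audio and text embeddings, which is what makes heavily truncated embeddings usable directly. The prefix dimensions and temperature below are illustrative, not WavLink's actual configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_clip_loss(audio: torch.Tensor, text: torch.Tensor,
                         dims=(64, 128, 256, 512), temperature=0.07) -> torch.Tensor:
    """Symmetric InfoNCE averaged over nested embedding prefixes, so an
    8x-truncated embedding is trained directly rather than by accident."""
    losses = []
    targets = torch.arange(audio.size(0))
    for d in dims:
        a = F.normalize(audio[:, :d], dim=-1)
        t = F.normalize(text[:, :d], dim=-1)
        logits = a @ t.T / temperature
        losses.append(0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.T, targets)))
    return torch.stack(losses).mean()

print(matryoshka_clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```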

Result: Achieves state-of-the-art retrieval performance, enables 8x smaller embeddings with minimal performance drop, and demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.

Conclusion: WavLink successfully bridges the gap between Whisper-based audio features and audio-text embedding models, offering a compact, scalable solution with strong retrieval capabilities and competitive performance on benchmark tasks.

Abstract: Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.

[285] A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments

Md Jahangir Alam Khondkar, Ajan Ahmed, Stephanie Schuckers, Masudul Haider Imtiaz

Main category: cs.SD

TL;DR: Benchmark study comparing Wave-U-Net, CMGAN, and U-Net for speech denoising, showing each excels in different aspects: U-Net for noise suppression, CMGAN for perceptual quality, and Wave-U-Net for speaker feature retention.

DetailsMotivation: Existing deep learning models for speech enhancement struggle to balance noise suppression, perceptual quality, and speaker feature preservation, creating a need for comparative performance evaluation to understand trade-offs.

Method: Benchmarked three state-of-the-art models (Wave-U-Net, CMGAN, and U-Net) on diverse datasets (SpEAR, VPQAD, Clarkson) chosen for literature relevance and code accessibility, evaluating noise suppression (SNR), perceptual quality (PESQ), and speaker feature retention (VeriSpeak).

Result: U-Net achieved best noise suppression with SNR improvements up to +364.2%; CMGAN excelled in perceptual quality with PESQ scores up to 4.04; Wave-U-Net balanced attributes with best speaker feature retention (+27.38% VeriSpeak gains).

Conclusion: Different models optimize different trade-offs: U-Net for noise suppression, CMGAN for perceptual quality, Wave-U-Net for speaker recognition. Findings can advance voice biometrics, forensic audio, telecommunications, and speaker verification in noisy environments.

Abstract: Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on diverse datasets: SpEAR, VPQAD, and Clarkson. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.

[286] MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

Main category: cs.SD

TL;DR: MambAttention: A hybrid Mamba + time-frequency attention model for speech enhancement that outperforms state-of-the-art models on out-of-domain generalization.

DetailsMotivation: Sequence models like Mamba and xLSTM show promise for speech enhancement but tend to overfit. While adding self-attention to LSTMs improves generalization, hybrid Mamba+attention models haven't been explored for speech enhancement.

Method: Propose MambAttention - hybrid architecture combining Mamba with shared time- and frequency-multi-head attention modules. Train on VB-DemandEx dataset with challenging noise types and low SNR.
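
The weight-sharing ingredient, one attention module reused along both the time and frequency axes, can be sketched as below. This is a simplified stand-in, not the full MambAttention block: the Mamba layers and surrounding architecture are omitted.

```python
import torch
import torch.nn as nn

class SharedTFAttention(nn.Module):
    """One attention module applied along both axes of a (B, F, T, C) map;
    the ablation-critical point is that time and frequency share weights."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, Fr, T, C = x.shape
        t = x.reshape(B * Fr, T, C)                      # attend over time
        t, _ = self.attn(t, t, t)
        x = t.reshape(B, Fr, T, C)
        f = x.permute(0, 2, 1, 3).reshape(B * T, Fr, C)  # attend over frequency
        f, _ = self.attn(f, f, f)                        # same weights reused
        return f.reshape(B, T, Fr, C).permute(0, 2, 1, 3)

print(SharedTFAttention(32)(torch.randn(2, 16, 50, 32)).shape)
```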

Result: MambAttention significantly outperforms state-of-the-art LSTM, xLSTM, Mamba, and Conformer models on out-of-domain datasets (DNS 2020 and EARS-WHAM_v2). Matches or beats generative diffusion models in generalization, competitive with language models. Ablation shows weight sharing crucial for generalization.

Conclusion: MambAttention demonstrates superior cross-corpus generalization for speech enhancement. The shared time-frequency attention mechanism is key to performance, and while integrating similar modules with LSTM/xLSTM helps, MambAttention remains best overall.

Abstract: With new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform the state-of-the-art in single-channel speech enhancement and audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VB-DemandEx, a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, MambAttention significantly outperforms existing state-of-the-art discriminative LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 without reverberation and EARS-WHAM_v2. MambAttention also matches or outperforms generative diffusion models in generalization performance while being competitive with language model baselines. Ablation studies highlight the importance of weight sharing between time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. Yet, MambAttention remains superior for cross-corpus generalization across all reported evaluation metrics.

[287] Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

Main category: cs.SD

TL;DR: Falcon3-Audio is a family of efficient audio-language models that achieve state-of-the-art performance on audio understanding benchmarks using remarkably small training data (under 30K hours) and simple single-stage training.

DetailsMotivation: While LLMs have transformed NLP, their integration with audio remains underexplored despite audio's importance in human communication. There's a need for more efficient and transparent audio-language models that don't require massive datasets or complex training procedures.

Method: Built on instruction-tuned LLMs and Whisper encoders, using less than 30K hours of public audio data (5K unique). Employs single-stage training without complex components like curriculum learning, multiple audio encoders, or intricate cross-attention connectors.

Result: Falcon3-Audio-7B matches best open-weight models on MMAU benchmark (score 64.14, matching R1-AQA) with superior data/parameter efficiency. The 1B model remains competitive with larger open models (2B-13B). Achieves strong performance compared to models trained on 500K+ hours.

Conclusion: Complex training procedures and massive datasets aren’t necessary for strong audio-language model performance. Simple, efficient architectures with modest data can achieve state-of-the-art results, enabling more accessible and transparent audio AI development.

Abstract: Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, less than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.

[288] Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data

Youngwon Choi, Jaeyoon Jung, Hyeonyu Kim, Huu-Kim Nguyen, Hwayeon Kim

Main category: cs.SD

TL;DR: LALMs achieve strong SLU performance with text-only fine-tuning; adding small speech data (2-5%) yields substantial gains; curriculum learning helps with scarce data; cross-lingual adaptation works with source-language speech + target-language text + minimal target speech.

DetailsMotivation: Large Audio Language Models (LALMs) are powerful for speech tasks but underexplored for fine-tuning, especially with limited speech data. Need to understand how different fine-tuning schemes work under realistic data constraints where text-label pairs are abundant but paired speech-label data are limited.

Method: Systematically examine different fine-tuning schemes: text-only, direct mixing, and curriculum learning for spoken language understanding (SLU). Focus on scenarios with abundant text-label pairs but limited speech-label data. Also explore cross-lingual SLU adaptation combining source-language speech data with target-language text and minimal target-language speech.

Result: LALMs achieve competitive performance with text-only fine-tuning, showing strong generalization. Adding small amounts of speech data (2-5%) yields substantial further gains. Curriculum learning is particularly effective under scarce data conditions. Cross-lingual SLU adaptation works effectively by combining source-language speech data with target-language text and minimal target-language speech data.

Conclusion: This study provides practical insights into LALM fine-tuning under realistic data constraints. LALMs demonstrate strong generalization from text, benefit significantly from even small amounts of speech data, and can be effectively adapted cross-lingually with strategic data combination.

Abstract: Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes including text-only, direct mixing, and curriculum learning affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into the LALM fine-tuning under realistic data constraints.

[289] Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

Main category: cs.SD

TL;DR: RWSA-MambaUNet combines Mamba and multi-head attention in U-Net for speech enhancement, achieving SOTA cross-corpus generalization with significantly reduced parameters and FLOPs.

DetailsMotivation: Recent advances show that Mamba+attention models improve cross-corpus generalization, while Mamba in U-Net structures reduces model size and computational complexity. The paper aims to create an efficient hybrid model for better cross-corpus performance.

Method: Proposes RWSA-MambaUNet, a hybrid model combining Mamba and multi-head attention in U-Net structure with resolution-wise shared attention (RWSA), where attention is shared layerwise across corresponding time-frequency resolutions.

Result: Achieves SOTA generalization on two out-of-domain test sets. Smallest model surpasses all baselines on DNS 2020 (PESQ, SSNR, ESTOI) and EARS-WHAM_v2 (SSNR, ESTOI, SI-SDR) with less than half the parameters and fraction of FLOPs.

Conclusion: RWSA-MambaUNet demonstrates superior cross-corpus generalization for speech enhancement while being highly efficient in terms of model size and computational requirements.

Abstract: Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.

[290] End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang

Main category: cs.SD

TL;DR: CLSR is a contrastive language-speech retriever that extracts question-relevant segments from long audio for spoken question answering, outperforming existing methods by converting acoustic features to text-like representations before alignment.

DetailsMotivation: Existing spoken question answering methods struggle with long audio, and current speech-related retrievers have poor performance despite the success of retrieval augmented generation approaches.

Method: CLSR is an end-to-end contrastive language-speech retriever that incorporates an intermediate step converting acoustic features into text-like representations before aligning with text, bridging the modality gap more effectively than conventional approaches.
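
The alignment objective can be pictured as a standard symmetric InfoNCE loss applied after projecting acoustic features into a text-like space. This is a hedged sketch: the projector, temperature, and exact loss form are assumptions about how such a retriever could be trained, not CLSR's published recipe.

```python
import torch
import torch.nn.functional as F

def clsr_style_loss(speech_feats, text_emb, projector, temperature=0.07):
    # Map acoustic features into a text-like space before alignment.
    s = F.normalize(projector(speech_feats), dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature            # (B, B) pairwise similarity
    labels = torch.arange(s.size(0))          # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

proj = torch.nn.Linear(512, 256)              # hypothetical dimensions
loss = clsr_style_loss(torch.randn(8, 512), torch.randn(8, 256), proj)
```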

Result: CLSR surpasses both end-to-end speech retrievers and pipeline approaches combining speech recognition with text retrieval across four cross-modal retrieval datasets.

Conclusion: CLSR provides a robust foundation for advancing practical long-form spoken question answering applications by efficiently extracting relevant segments from long audio recordings.

Abstract: Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, speech-related retrievers show promise for preprocessing long-form speech, but the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.

[291] ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan

Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

Main category: cs.SD

TL;DR: Proposes CompSpoofV2 dataset and separation-enhanced joint learning framework for component-level audio deepfake detection, where speech and environmental sounds can be independently manipulated.

DetailsMotivation: Real-world audio contains both foreground speech and background sounds, and with advances in generation models, either component can be independently manipulated. Component-level manipulations are harder to detect because unaltered components can mislead existing whole-audio detection systems and sound more natural to humans.

Method: Created CompSpoofV2 dataset (250k+ audio samples, ~283 hours) for component-level audio anti-spoofing, and developed a separation-enhanced joint learning framework. Also launched the ESDD2 challenge focusing on component-level spoofing detection.

Result: CompSpoofV2 is a large-scale curated dataset for component-level audio anti-spoofing. The separation-enhanced joint learning framework and ESDD2 challenge address the gap in detecting component-level manipulations.

Conclusion: Component-level audio spoofing presents a more challenging detection scenario where both speech and environmental sounds may be manipulated. The proposed dataset, framework, and challenge aim to advance research in this area for more realistic deepfake audio detection.

Abstract: Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead systems designed for whole-audio deepfakes, and they often sound more natural to human listeners. To address this gap, we have proposed the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, which contains over 250k audio samples with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).

[292] Performance and Complexity Trade-off Optimization of Speech Models During Training

Esteban Gómez, Tom Bäckström

Main category: cs.SD

TL;DR: A reparameterization method using feature noise injection enables joint optimization of neural network performance and computational complexity via SGD, allowing dynamic model size optimization without heuristic pruning.

DetailsMotivation: Traditional neural network design uses fixed architectures with heuristic layer sizing, requiring post-hoc methods like pruning/quantization to reduce computational cost. SGD can't optimize non-differentiable complexity factors like layer sizes and FLOPs.

Method: Proposes a reparameterization technique based on feature noise injection that makes computational complexity differentiable, enabling joint optimization of performance and complexity using SGD-based methods during training.
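
One way to picture how noise injection can make layer width differentiable is a per-channel gate whose expected activation count enters the loss. This is only a sketch under assumed details: the gate parameterization and noise placement are guesses, not the paper's exact reparameterization.

```python
import torch
import torch.nn as nn

class NoisyGate(nn.Module):
    """Per-channel gate relaxed with feature noise so an expected-size
    term is differentiable; near-zero channels can be pruned after training."""
    def __init__(self, channels):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(channels))

    def forward(self, x):                  # x: (batch, channels)
        g = torch.sigmoid(self.logit)
        if self.training:                  # noise drowns out gated channels
            x = x + (1 - g) * torch.randn_like(x)
        return g * x

    def expected_cost(self):
        return torch.sigmoid(self.logit).sum()   # differentiable layer size

gate = NoisyGate(64)
y = gate(torch.randn(8, 64))
loss = y.pow(2).mean() + 1e-3 * gate.expected_cost()  # task + complexity
loss.backward()
```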

Result: Demonstrated effectiveness through three case studies: a synthetic example and two real-world speech applications (voice activity detection and audio anti-spoofing). The method allows dynamic model size optimization for target performance-complexity trade-offs.

Conclusion: The proposed approach enables joint optimization of performance and computational complexity during training without relying on heuristic pruning criteria, offering a more principled alternative to traditional post-hoc model compression methods.

Abstract: In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task’s objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.

cs.LG

[293] Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning

Alex Echeverria, Sávio Salvarino Teles de Oliveira, Fernando Marques Federson

Main category: cs.LG

TL;DR: Automated pipeline converts noisy call center audio recordings into Q&A instructional datasets for LLM fine-tuning, successfully demonstrated with Llama 2 7B.

DetailsMotivation: High-quality instructional datasets are crucial for domain-specific LLM adaptation, but generating them from unstructured call center audio is challenging due to noise and disorganization.

Method: End-to-end pipeline with sequential steps: audio processing (diarization, noise removal, transcription), textual processing (cleaning, normalization, anonymization), semantic extraction using vector embeddings, and semantic search matching to form Q&A pairs.
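
The final matching step can be illustrated with a small cosine-similarity search that pairs each extracted customer demand with its nearest attendant response; the `embed` callable and the example strings are placeholders for whatever embedding model the pipeline uses.

```python
import numpy as np

def match_qa_pairs(demand_texts, response_texts, embed):
    # Embed both sides, L2-normalize, and pair by maximum cosine similarity.
    q = embed(demand_texts)
    r = embed(response_texts)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    best = (q @ r.T).argmax(axis=1)
    return [(demand_texts[i], response_texts[j]) for i, j in enumerate(best)]

rng = np.random.default_rng(0)
dummy_embed = lambda texts: rng.normal(size=(len(texts), 16))  # stand-in model
pairs = match_qa_pairs(["my card is blocked", "where is my refund"],
                       ["we will unblock the card", "it arrives in 5 days"],
                       dummy_embed)
```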

Result: Successfully implemented pipeline generates specifically formatted datasets for Instruct Fine Tuning, validated by fine-tuning Llama 2 7B model. Codes made publicly available for reproducibility.

Conclusion: The approach is viable for converting unstructured conversational data into valuable training resources for LLMs, potentially enabling more effective AI systems for customer service Q&A tasks.

Abstract: The adaptation of Large-Scale Language Models (LLMs) to specific domains depends on high-quality fine-tuning datasets, particularly in instructional format (e.g., Question-Answer - Q&A). However, generating these datasets, particularly from unstructured sources such as call center audio recordings, poses a significant challenge due to the noisy and disorganized nature of the data. This paper presents a solution to this challenge by offering an end-to-end automated pipeline for generating Q&A instructional datasets from such recordings. The methodology developed comprises sequential steps of audio processing (including diarization, noise removal and automatic transcription), textual processing (cleaning, normalization, and anonymization), semantic extraction of customer demands and attendant responses using vector embeddings, and matching via semantic search to form the final Q&A pairs. As a result, the complete pipeline was successfully implemented, generating a dataset specifically formatted for Instruct Fine Tuning. The practical value and feasibility of the generated dataset were substantiated and functionally demonstrated through the successful fine-tuning of an LLM model (based on Llama 2 7B). The conclusion of the paper states that the proposed approach is viable for converting unstructured conversational data from call centers into valuable resources for training LLMs. This development has the potential to open up avenues for creating more effective AI systems for Q&A tasks in the customer service domain. The developed codes have been made publicly available to promote reproducibility and future research.

[294] GCG Attack On A Diffusion LLM

Ruben Neyroud, Sam Corley

Main category: cs.LG

TL;DR: GCG-style adversarial attacks on diffusion-based LLMs (LLaDA) show these models have different vulnerabilities than autoregressive LLMs, requiring new optimization strategies for adversarial analysis.

DetailsMotivation: While GCG attacks work well on autoregressive LLMs, their effectiveness on emerging diffusion-based language models like LLaDA is unknown. The paper aims to explore the attack surface and robustness of diffusion LLMs.

Method: Conducted exploratory study using GCG-style adversarial prompt attacks on LLaDA, testing multiple variants including prefix perturbations and suffix-based adversarial generation on harmful prompts from AdvBench dataset.

Result: Initial insights reveal diffusion language models have different robustness characteristics and attack surfaces compared to autoregressive models, showing GCG attacks can be adapted but require modifications.

Conclusion: Diffusion LLMs present distinct security challenges, motivating development of alternative optimization and evaluation strategies for adversarial analysis in non-autoregressive language models.

Abstract: While most LLMs are autoregressive, diffusion-based LLMs have recently emerged as an alternative method for generation. Greedy Coordinate Gradient (GCG) attacks have proven effective against autoregressive models, but their applicability to diffusion language models remains largely unexplored. In this work, we present an exploratory study of GCG-style adversarial prompt attacks on LLaDA (Large Language Diffusion with mAsking), an open-source diffusion LLM. We evaluate multiple attack variants, including prefix perturbations and suffix-based adversarial generation, on harmful prompts drawn from the AdvBench dataset. Our study provides initial insights into the robustness and attack surface of diffusion language models and motivates the development of alternative optimization and evaluation strategies for adversarial analysis in this setting.

[295] Divide and Refine: Enhancing Multimodal Representation and Explainability for Emotion Recognition in Conversation

Anh-Tuan Mai, Cam-Van Thi Nguyen, Duc-Trong Le

Main category: cs.LG

TL;DR: DnR framework explicitly decomposes multimodal signals into unique, redundant, and synergistic components, then refines them with tailored objectives for improved emotion recognition in conversations.

DetailsMotivation: Current multimodal emotion recognition methods often fail to properly balance unique, redundant, and synergistic information across modalities. Augmentation-based approaches can blur boundaries between modality-specific and cross-modal signals, limiting representation effectiveness.

Method: Two-phase Divide and Refine (DnR) framework: 1) Divide phase explicitly decomposes each modality into uniqueness, pairwise redundancy, and synergy components; 2) Refine phase uses tailored objectives to enhance informativeness while maintaining distinct roles of these components.

Result: Extensive experiments on IEMOCAP and MELD datasets show consistent improvements across multiple MERC backbones, demonstrating the effectiveness of explicitly dividing, refining, and recombining multimodal representations.

Conclusion: Explicit decomposition and refinement of multimodal signals into unique, redundant, and synergistic components provides a principled strategy for advancing emotion recognition in conversations, with plug-and-play compatibility for diverse multimodal pipelines.

Abstract: Multimodal emotion recognition in conversation (MERC) requires representations that effectively integrate signals from multiple modalities. These signals include modality-specific cues, information shared across modalities, and interactions that emerge only when modalities are combined. In information-theoretic terms, these correspond to \emph{unique}, \emph{redundant}, and \emph{synergistic} contributions. An ideal representation should leverage all three, yet achieving such balance remains challenging. Recent advances in contrastive learning and augmentation-based methods have made progress, but they often overlook the role of data preparation in preserving these components. In particular, applying augmentations directly to raw inputs or fused embeddings can blur the boundaries between modality-unique and cross-modal signals. To address this challenge, we propose a two-phase framework \emph{\textbf{D}ivide and \textbf{R}efine} (\textbf{DnR}). In the \textbf{Divide} phase, each modality is explicitly decomposed into uniqueness, pairwise redundancy, and synergy. In the \textbf{Refine} phase, tailored objectives enhance the informativeness of these components while maintaining their distinct roles. The refined representations are plug-and-play compatible with diverse multimodal pipelines. Extensive experiments on IEMOCAP and MELD demonstrate consistent improvements across multiple MERC backbones. These results highlight the effectiveness of explicitly dividing, refining, and recombining multimodal representations as a principled strategy for advancing emotion recognition. Our implementation is available at https://github.com/mattam301/DnR-WACV2026

[296] Quality or Quantity? Error-Informed Selective Online Learning with Gaussian Processes in Multi-Agent Systems: Extended Version

Zewen Yang, Xiaobing Dai, Jiajun Cheng, Yulong Huang, Peng Shi

Main category: cs.LG

TL;DR: A selective online learning framework for distributed Gaussian process regression that prioritizes quality over quantity by enabling agents to choose higher-quality neighboring models with lower prediction errors.

DetailsMotivation: The paper addresses the irrationality of indiscriminately including all models in distributed cooperative learning, highlighting the need to prioritize model quality over quantity for effective cooperation in multi-agent systems.

Method: Proposes distributed error-informed GP (EIGP) framework with selection function for agents to assess and choose higher-quality neighboring GP models. Includes algorithmic enhancements: greedy algorithm (gEIGP) for acceleration, adaptive algorithm (aEIGP) for accuracy improvement, and approaches for fast prediction/model update with error-informed quantification term iteration and data deletion strategy.
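
In its simplest form, the quality-over-quantity idea reduces to ranking neighbors by recent prediction error and fusing only the best few. The sketch below uses inverse-error weights as an assumed aggregation rule; the paper's selection function and fusion are more elaborate.

```python
import numpy as np

def eigp_style_fuse(neighbor_means, neighbor_errors, k=3):
    # Keep the k neighboring GP models with the lowest prediction error,
    # then combine their predictive means with inverse-error weights.
    idx = np.argsort(neighbor_errors)[:k]
    w = 1.0 / (neighbor_errors[idx] + 1e-9)
    return float(np.average(neighbor_means[idx], weights=w))

mu = eigp_style_fuse(np.array([1.1, 0.9, 3.0, 1.0]),
                     np.array([0.2, 0.1, 2.5, 0.15]))
```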

Result: Numerical simulations demonstrate the framework’s effectiveness and superiority over state-of-the-art distributed GP methods across different benchmarks.

Conclusion: The selective online learning approach enables real-time learning operations and improves distributed GP regression by prioritizing quality models over quantity, enhancing both prediction speed and accuracy.

Abstract: Effective cooperation is pivotal in distributed learning for multi-agent systems, where the interplay between the quantity and quality of the machine learning models is crucial. This paper reveals the irrationality of indiscriminately including all agents' models in joint prediction, highlighting the imperative to prioritize quality over quantity in cooperative learning. Specifically, we present the first selective online learning framework for distributed Gaussian process (GP) regression, namely distributed error-informed GP (EIGP), that enables each agent to assess its neighboring collaborators, using the proposed selection function to choose the higher-quality GP models with lower prediction errors. Moreover, algorithmic enhancements are embedded within the EIGP, including a greedy algorithm (gEIGP) for accelerating prediction and an adaptive algorithm (aEIGP) for improving prediction accuracy. In addition, approaches for fast prediction and model update are introduced in conjunction with the error-informed quantification term iteration and a data deletion strategy to achieve real-time learning operations. Numerical simulations are performed to demonstrate the effectiveness of the developed methodology, showcasing its superiority over state-of-the-art distributed GP methods on different benchmarks.

[297] Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

Main category: cs.LG

TL;DR: A unified empirical study of llama.cpp quantization formats for Llama-3.1-8B-Instruct, evaluating performance across reasoning, knowledge, instruction-following, and truthfulness benchmarks, plus perplexity, throughput, and compression metrics to guide practical quantization scheme selection.

DetailsMotivation: Quantization enables LLM deployment on constrained hardware by reducing precision, but available formats in llama.cpp are evaluated inconsistently, making it difficult for users to choose appropriate schemes for their specific needs and resource constraints.

Method: Comprehensive empirical evaluation of llama.cpp quantization on Llama-3.1-8B-Instruct, covering 3-8 bit K-quant and legacy formats. Assessment includes downstream task performance (reasoning, knowledge, instruction-following, truthfulness), perplexity, CPU throughput (prefill/decoding), model size, compression ratios, and quantization time.

Result: The study provides systematic performance comparisons across quantization formats, revealing trade-offs between model quality, inference speed, and memory usage. Results enable users to make informed decisions based on their specific use cases and hardware constraints.

Conclusion: This work serves as a practical guide for selecting llama.cpp quantization schemes, helping users make context-aware decisions that balance performance, resource requirements, and deployment feasibility for local LLM execution on commodity hardware.

Abstract: Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware, which is especially relevant for users running models locally. Quantization in llama.cpp enables large language models to run on commodity hardware, but available formats are often evaluated inconsistently, making it hard to choose among schemes. We present a unified empirical study of llama.cpp quantization on a single modern model, Llama-3.1-8B-Instruct (FP16, GGUF), covering 3-8 bit K-quant and legacy formats. We evaluate downstream task performance across standard reasoning, knowledge, instruction-following, and truthfulness benchmarks, and also measure perplexity and CPU throughput (prefill/decoding) alongside model size, compression, and quantization time. Ultimately, this work is a practical guide for choosing a llama.cpp quantization scheme, helping readers make informed, context-aware decisions for their intended use and resource budget.

[298] On the Limits of Learned Importance Scoring for KV Cache Compression

Brady Steele

Main category: cs.LG

TL;DR: Learned KV cache compression via Speculative Importance Prediction (SIP) fails to outperform simple position-based heuristics, suggesting limited utility of complex learned approaches for token importance prediction.

DetailsMotivation: To investigate whether learned approaches can effectively compress KV caches by predicting token importance, potentially improving inference efficiency in transformer models.

Method: Proposed Speculative Importance Prediction (SIP), a 1.7M parameter non-query-aware scorer that predicts token importance from KV representations alone, using multi-horizon lookahead and cross-attention mechanisms.
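
The position-based baseline that the learned scorer fails to beat is simple enough to state in a few lines: keep the first 4 tokens (attention sinks) plus the most recent N. The sketch below is a direct rendering of that heuristic; the cache length and retention level are illustrative.

```python
import torch

def keep_first4_last_n(kv_len, n_last):
    # Boolean retention mask over the KV cache: first 4 + last n_last tokens.
    keep = torch.zeros(kv_len, dtype=torch.bool)
    keep[:4] = True
    keep[-n_last:] = True
    return keep

mask = keep_first4_last_n(kv_len=1024, n_last=252)  # ~25% retention
```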

Result: SIP does not outperform simple baselines (including random selection) across multiple seeds, retention levels, and tasks. Position-based heuristics (keep first 4 + last N tokens) match or exceed learned approaches.

Conclusion: Complex learned scorers for KV cache compression offer limited benefits over simple heuristics; position information and prefill attention provide sufficient signal, while KV representations contain marginal additional information for importance prediction.

Abstract: We investigate learned KV cache compression through Speculative Importance Prediction (SIP), a 1.7M parameter non-query-aware scorer that predicts token importance from KV representations alone. Despite architectural sophistication (multi-horizon lookahead, cross-attention), SIP does not outperform simple baselines, including random selection, across 5 seeds, 4 retention levels, and 3 tasks. Key findings: (1) position-based heuristics (keep first 4 + last N tokens) match or exceed learned approaches; (2) prefill attention provides equivalent signal to complex learned scorers; (3) marginal information in KV representations beyond position and prefill attention appears limited for importance prediction. We hypothesize that circular dependence between future queries and generation trajectories contributes to this difficulty.

[299] Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design

Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, Zaixi Zhang, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang

Main category: cs.LG

TL;DR: Benchmark comparing 15 structure-based drug design models across search-based, deep generative, and reinforcement learning approaches, evaluating pharmaceutical properties and docking performance.

DetailsMotivation: Current SBDD research lacks cross-algorithm comparisons, with most studies only comparing models within the same algorithmic category. The paper aims to fill this gap by establishing a comprehensive benchmark.

Method: Established a benchmark evaluating 15 models across three algorithmic categories (search-based, deep generative models, RL) by assessing pharmaceutical properties, docking affinities, and binding poses with target proteins. Included 1D/2D ligand-centric methods using docking as black-box oracle.

Result: 3D models excel in binding affinities but have chemical validity/pose issues. 1D models perform well on standard molecular metrics but rarely achieve optimal binding. 2D models offer balanced performance with high chemical validity and moderate binding scores.

Conclusion: Each algorithmic approach has unique strengths: 3D for binding affinity, 1D for molecular metrics, 2D for balanced performance. Future SBDD models should combine strengths of different approaches while addressing their limitations.

Abstract: Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations. All the code used for benchmarking is available at https://github.com/zkysfls/2025-sbdd-benchmark

[300] A Comparison of Polynomial-Based Tree Clustering Methods

Pengyu Liu, Mariel Vázquez, Nataša Jonoska

Main category: cs.LG

TL;DR: Tree polynomials provide efficient encoding of tree structures for data analytics, with Canberra distance-based methods showing highest clustering accuracy.

DetailsMotivation: Tree structures are prevalent in life sciences (phylogenetics, RNA structures) but require novel analytics methods due to increasing biological data from sequencing and AI.

Method: Compare different distance metrics in tree clustering using tree distinguishing polynomials, and implement two basic autoencoder models for clustering trees.
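
The clustering recipe is easy to reproduce with standard tooling: treat each tree's polynomial coefficients as a feature vector and cluster under the Canberra distance. The random coefficients below are placeholders standing in for real tree-polynomial matrices.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows stand in for flattened tree-polynomial coefficient matrices.
coeffs = np.abs(np.random.default_rng(0).normal(size=(20, 36)))
d = pdist(coeffs, metric="canberra")      # pairwise Canberra distances
labels = fcluster(linkage(d, method="average"), t=3, criterion="maxclust")
```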

Result: Distance-based methods with entry-level normalized distances achieve the highest clustering accuracy among compared methods.

Conclusion: Tree polynomials combined with appropriate distance metrics provide effective clustering methods for tree structure data analytics in biological applications.

Abstract: Tree structures appear in many fields of the life sciences, including phylogenetics, developmental biology and nucleic acid structures. Trees can be used to represent RNA secondary structures, which directly relate to the function of non-coding RNAs. Recent developments in sequencing technology and artificial intelligence have yielded large amounts of biological data that can be represented with tree structures. This requires novel methods for tree structure data analytics. Tree polynomials provide a computationally efficient, interpretable and comprehensive way to encode tree structures as matrices, which are compatible with most data analytics tools. Machine learning methods based on the Canberra distance between tree polynomials have been introduced to analyze phylogenies and nucleic acid structures. In this paper, we compare the performance of different distances in tree clustering methods based on a tree distinguishing polynomial. We also implement two basic autoencoder models for clustering trees using the polynomial. We find that the distance-based methods with entry-level normalized distances have the highest clustering accuracy among the compared methods.

[301] Field-Space Autoencoder for Scalable Climate Emulators

Johannes Meuer, Maximilian Witte, Étiénne Plésiat, Thomas Ludwig, Christopher Kadow

Main category: cs.LG

TL;DR: Field-Space Autoencoder: A spherical compression model for kilometer-scale climate emulation that preserves physical structures better than convolutional methods and enables zero-shot super-resolution.

DetailsMotivation: Kilometer-scale Earth system models are computationally expensive and produce petabyte-scale outputs, limiting their utility for applications like probabilistic risk assessment. There's a need to bridge the gap between abundant low-resolution ensemble statistics and scarce high-resolution physical detail.

Method: Field-Space Autoencoder framework with Field-Space Attention that operates on native climate model output, avoiding geometric distortions from spherical-to-Euclidean mapping. Uses a generative diffusion model trained on compressed fields to simultaneously learn internal variability from low-resolution data and fine-scale physics from high-resolution data.

Result: The model preserves physical structures significantly better than convolutional baselines, produces structured compressed fields for downstream generative emulation, and enables zero-shot super-resolution that maps low-resolution ensembles and scarce high-resolution data into a shared representation.

Conclusion: The Field-Space Autoencoder provides a scalable climate emulation framework that overcomes computational limitations of kilometer-scale models, bridging the gap between low-resolution ensemble statistics and high-resolution physical detail for improved climate risk assessment.

Abstract: Kilometer-scale Earth system models are essential for capturing local climate change. However, these models are computationally expensive and produce petabyte-scale outputs, which limits their utility for applications such as probabilistic risk assessment. Here, we present the Field-Space Autoencoder, a scalable climate emulation framework based on a spherical compression model that overcomes these challenges. By utilizing Field-Space Attention, the model efficiently operates on native climate model output and therefore avoids geometric distortions caused by forcing spherical data onto Euclidean grids. This approach preserves physical structures significantly better than convolutional baselines. By producing a structured compressed field, it serves as a good baseline for downstream generative emulation. In addition, the model can perform zero-shot super-resolution that maps low-resolution large ensembles and scarce high-resolution data into a shared representation. We train a generative diffusion model on these compressed fields. The model can simultaneously learn internal variability from abundant low-resolution data and fine-scale physics from sparse high-resolution data. Our work bridges the gap between the high volume of low-resolution ensemble statistics and the scarcity of high-resolution physical detail.

[302] Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents

Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, Huawei Shen

Main category: cs.LG

TL;DR: CoM proposes lightweight memory construction with sophisticated utilization via Chain-of-Memory mechanism, achieving better accuracy with 97% less computational cost.

DetailsMotivation: Existing memory systems for LLM agents have two problems: 1) complex memory construction is computationally expensive with minimal performance gains, and 2) simple context concatenation fails to translate retrieval recall into reasoning accuracy.

Method: CoM framework with Chain-of-Memory mechanism that organizes retrieved fragments into coherent inference paths through dynamic evolution, using adaptive truncation to prune irrelevant noise.
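
A toy rendering of the utilization side of CoM: retrieved fragments are ordered into a single inference path and the low-relevance tail is truncated once a token budget is spent. The scores, the budget, and the greedy ordering are assumptions standing in for the paper's dynamic-evolution mechanism.

```python
def chain_of_memory(fragments, scores, budget_tokens):
    # Order fragments by relevance and truncate adaptively at the budget.
    chain, used = [], 0
    for frag, score in sorted(zip(fragments, scores), key=lambda p: -p[1]):
        cost = len(frag.split())
        if used + cost > budget_tokens:
            break                          # prune the irrelevant tail
        chain.append(frag)
        used += cost
    return " -> ".join(chain)

path = chain_of_memory(["user lives in Oslo", "user asked about visas",
                        "user likes jazz"], [0.9, 0.8, 0.1], budget_tokens=10)
```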

Result: Outperforms baselines by 7.5%-10.4% accuracy on LongMemEval and LoCoMo benchmarks, while reducing computational overhead to ~2.7% of token consumption and 6.0% of latency compared to complex memory architectures.

Conclusion: CoM demonstrates that lightweight memory construction paired with sophisticated utilization (Chain-of-Memory) is more effective than complex construction with naive retrieval, achieving superior performance with dramatically reduced computational costs.

Abstract: External memory systems are pivotal for enabling Large Language Model (LLM) agents to maintain persistent knowledge and perform long-horizon decision-making. Existing paradigms typically follow a two-stage process: computationally expensive memory construction (e.g., structuring data into graphs) followed by naive retrieval-augmented generation. However, our empirical analysis reveals two fundamental limitations: complex construction incurs high costs with marginal performance gains, and simple context concatenation fails to bridge the gap between retrieval recall and reasoning accuracy. To address these challenges, we propose CoM (Chain-of-Memory), a novel framework that advocates for a paradigm shift toward lightweight construction paired with sophisticated utilization. CoM introduces a Chain-of-Memory mechanism that organizes retrieved fragments into coherent inference paths through dynamic evolution, utilizing adaptive truncation to prune irrelevant noise. Extensive experiments on the LongMemEval and LoCoMo benchmarks demonstrate that CoM outperforms strong baselines with accuracy gains of 7.5%-10.4%, while drastically reducing computational overhead to approximately 2.7% of token consumption and 6.0% of latency compared to complex memory architectures.

[303] Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li, Isao Echizen, Jiantao Zhou

Main category: cs.LG

TL;DR: The paper proposes a new hard-label black-box attack framework that reframes existing attacks as gradient sign recovery and introduces a zero-query frequency-domain initialization with Pattern-Driven Optimization for improved query efficiency.

DetailsMotivation: Hard-label black-box settings (only top-1 labels observable) are practically important but fundamentally constrained. The central challenge is whether meaningful gradient information can be recovered from discrete responses, and existing attacks are heuristic rather than principled.

Method: 1) Provides unified theoretical perspective showing existing sign-flipping attacks approximate true loss gradient sign. 2) Proposes new attack with zero-query frequency-domain initialization for better gradient sign estimation. 3) Introduces Pattern-Driven Optimization (PDO) strategy for lower query complexity than structured search approaches.
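
The zero-query initialization can be pictured as placing random sign mass on low DCT frequencies and inverting; everything here (the band size `keep`, the budget `eps`, the final sign step) is an assumed concretization for illustration, not the authors' exact construction.

```python
import numpy as np
from scipy.fft import idctn

def frequency_domain_init(shape=(3, 32, 32), keep=8, eps=8 / 255, seed=0):
    # Random +/-1 mass on the lowest keep x keep DCT frequencies per channel.
    rng = np.random.default_rng(seed)
    spec = np.zeros(shape)
    spec[:, :keep, :keep] = rng.choice([-1.0, 1.0], size=(shape[0], keep, keep))
    delta = idctn(spec, axes=(1, 2), norm="ortho")
    return eps * np.sign(delta)            # initial gradient-sign guess

delta0 = frequency_domain_init()           # zero queries spent so far
```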

Result: The method surpasses SOTA hard-label attacks in attack success rate and query efficiency on CIFAR-10, ImageNet, ObjectNet across standard/adversarially trained models, commercial APIs, and CLIP models. Generalizes to corrupted data, biomedical datasets, dense prediction tasks. Successfully circumvents Blacklight defense with 0% detection rate.

Conclusion: Hard-label attacks can be understood as gradient sign recovery problems. The proposed framework provides theoretical guarantees and practical improvements, achieving superior performance across diverse settings while being robust against stateful defenses.

Abstract: Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses SOTA hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.

[304] Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models

YuanLab.ai, Shawn Wu, Jiangang Luo, Tong Yu, Darcy Chen, Sean Wang, Xudong Zhao, Louie Li, Claire Wang, Hunter He, Carol Wang, Allen Wang

Main category: cs.LG

TL;DR: LAEP algorithm prunes underutilized experts during MoE LLM pre-training to improve efficiency and reduce parameters while maintaining performance.

DetailsMotivation: MoE LLMs have superior accuracy but suffer from computational bottlenecks during pre-training due to underutilized experts and limited training efficiency.

Method: Layer-Adaptive Expert Pruning (LAEP) algorithm that selectively prunes underutilized experts and reorganizes experts across computing devices based on token distribution statistics during pre-training.
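
The statistic behind the pruning decision can be sketched as ranking a layer's experts by routed-token share and dropping the coldest ones; the counts and keep fraction below are illustrative (keep_frac = 2/3 mirrors the reported 33.3% parameter reduction), not the paper's exact criterion.

```python
import numpy as np

def experts_to_prune(token_counts, keep_frac=2 / 3):
    # Rank experts by utilization; return the ids of the coldest experts.
    order = np.argsort(token_counts)           # coldest first
    n_keep = int(np.ceil(keep_frac * len(token_counts)))
    return order[:len(token_counts) - n_keep]

# Hypothetical per-expert routed-token counts for one MoE layer (9 experts).
pruned = experts_to_prune(np.array([900, 5, 300, 20, 700, 80, 450, 10, 60]))
```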

Result: 48.3% improvement in training efficiency and 33.3% parameter reduction when pre-training 1010B Base model, while maintaining excellent performance across multiple domains.

Conclusion: LAEP effectively addresses pre-training bottlenecks in MoE LLMs by pruning experts during training rather than post-training, achieving significant efficiency gains without compromising model quality.

Abstract: Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training represents a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the 1010B Base model from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.

[305] Hierarchical Contextual Uplift Bandits for Catalog Personalization

Anupam Agrawal, Rajesh Mohanty, Shamik Bhattacharjee, Abhimanyu Mittal

Main category: cs.LG

TL;DR: Hierarchical Contextual Uplift Bandit framework for fantasy sports recommendations that dynamically adjusts contextual granularity and integrates uplift modeling to handle dynamic environments and cold-start issues.

DetailsMotivation: Standard Contextual Bandit algorithms struggle in dynamic fantasy sports environments with rapid user behavior changes and dramatic reward distribution shifts, requiring frequent retraining.

Method: Hierarchical framework that dynamically adjusts contextual granularity from broad system-wide to detailed user-specific contexts, using contextual similarity for policy transfer, and integrates uplift modeling principles.

Result: Large-scale A/B testing on Dream11 platform showed 0.4% revenue improvement and better user satisfaction metrics. Production deployment in May 2025 achieved additional 0.5% revenue improvement.

Conclusion: The proposed hierarchical contextual uplift bandit framework effectively addresses dynamic environment challenges in fantasy sports recommendations, delivering significant revenue gains and improved user satisfaction.

Abstract: Contextual Bandit (CB) algorithms are widely adopted for personalized recommendations but often struggle in dynamic environments typical of fantasy sports, where rapid changes in user behavior and dramatic shifts in reward distributions due to external influences necessitate frequent retraining. To address these challenges, we propose a Hierarchical Contextual Uplift Bandit framework. Our framework dynamically adjusts contextual granularity from broad, system-wide insights to detailed, user-specific contexts, using contextual similarity to facilitate effective policy transfer and mitigate cold-start issues. Additionally, we integrate uplift modeling principles into our approach. Results from large-scale A/B testing on the Dream11 fantasy sports platform show that our method significantly enhances recommendation quality, achieving a 0.4% revenue improvement while also improving user satisfaction metrics compared to the current production system. We subsequently deployed this system to production as the default catalog personalization system in May 2025 and observed a further 0.5% revenue improvement.

[306] Log anomaly detection via Meta Learning and Prototypical Networks for Cross domain generalization

Krishna Sharma, Vivek Yelleti

Main category: cs.LG

TL;DR: A meta-learning framework for cross-domain log anomaly detection that handles class imbalance and data drift using semantic embeddings, feature selection, MAML, Prototypical Networks, and SMOTE oversampling.

DetailsMotivation: Log anomaly detection faces challenges with class imbalance and poor cross-domain generalization due to data drift and lack of labeled anomalies in new domains. Existing models trained on one domain (like HDFS) don't work well on others (like Linux).

Method: 1) Data preparation: Drain3 log parsing + dynamic drift-based labeling with semantic/fuzzy matching to transfer anomaly knowledge across domains. 2) BERT-based semantic embeddings + feature selection for dimensionality reduction. 3) Meta-learning with Model Agnostic Meta-Learning (MAML) and Prototypical Networks for fast adaptation. 4) SMOTE oversampling to handle class imbalance. 5) Evaluation using leave-one-out source method.
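
As a concrete anchor for the Prototypical Networks stage (step 3 above), here is a minimal episode step: class prototypes are mean embeddings of the support logs, and queries are labeled by their nearest prototype. Embedding sizes and labels are toy assumptions.

```python
import torch

def proto_classify(support, support_y, query, n_classes=2):
    # Prototype = mean support embedding per class; assign by nearest one.
    protos = torch.stack([support[support_y == c].mean(0)
                          for c in range(n_classes)])
    return torch.cdist(query, protos).argmin(dim=1)

support = torch.randn(10, 32)                 # e.g. BERT log embeddings
labels = torch.tensor([0, 1] * 5)             # normal / anomalous
pred = proto_classify(support, labels, torch.randn(4, 32))
```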

Result: The proposed meta-learning approach achieved the highest mean F1 score in cross-domain settings, demonstrating effectiveness for log anomaly detection across different domains.

Conclusion: The meta-learning-driven framework successfully addresses cross-domain log anomaly detection challenges by combining semantic knowledge transfer, meta-learning adaptation, and imbalance handling, proving effective for real-world deployment scenarios.

Abstract: Log anomaly detection is essential for system reliability, but it is extremely challenging given the severe class imbalance involved. Additionally, models trained in one domain are not applicable to other domains (such as HDFS and Linux), necessitating cross-domain adaptation. Traditional detection models often fail to generalize due to significant data drift and the inherent absence of labeled anomalies in new target domains. To handle these challenges, we propose a new end-to-end framework based on a meta-learning approach. Our methodology first prepares the data by combining a Drain3 log parsing mechanism with a dynamic drift-based labeling technique that uses semantic and fuzzy matching to transfer existing anomaly knowledge from one source to another. BERT-based semantic embeddings are obtained, and feature selection is invoked to reduce the dimensionality. Model Agnostic Meta-Learning (MAML) and Prototypical Network models are then trained to adapt quickly and effectively, and the SMOTE oversampling method is employed to handle imbalance in the data. All results are obtained by employing the leave-one-out source method, and the corresponding mean F1 scores are reported. Our empirical findings validate that the proposed meta-learning-driven approach yields the highest mean F1 score and proves effective for cross-domain settings.

[307] DiSPA: Differential Substructure-Pathway Attention for Drug Response Prediction

Yewon Han, Sunghyun Kim, Eunyi Jeong, Sungkyung Lee, Seokwoo Yun, Sangsoo Lim

Main category: cs.LG

TL;DR: DiSPA is a representation learning framework that disentangles structure-driven and context-driven drug response mechanisms through bidirectional conditioning between chemical substructures and pathway-level gene expression, achieving state-of-the-art performance with improved interpretability.

DetailsMotivation: Existing deep learning approaches treat chemical and transcriptomic modalities independently or combine them only at late stages, limiting their ability to model fine-grained, context-dependent drug action mechanisms. Standard attention mechanisms are sensitive to noise and sparsity in high-dimensional biological networks, hindering generalization and interpretability.

Method: DiSPA uses a differential cross-attention module that suppresses spurious pathway-substructure associations while amplifying contextually relevant interactions. It explicitly disentangles structure-driven and context-driven mechanisms through bidirectional conditioning between chemical substructures and pathway-level gene expression.
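
The differential cross-attention idea can be sketched as two softmax attention maps whose difference cancels common-mode (spurious) associations; the shapes, the paired query/key projections, and the fixed lambda are all illustrative assumptions rather than DiSPA's exact module.

```python
import torch
import torch.nn.functional as F

def differential_cross_attention(q1, q2, k1, k2, v, lam=0.5):
    # Two attention maps from paired projections; their difference
    # suppresses associations that appear in both (assumed noise).
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v

qs_ks = [torch.randn(2, 16, 32) for _ in range(4)]  # substructure/pathway reps
out = differential_cross_attention(*qs_ks, v=torch.randn(2, 16, 32))
```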

Result: DiSPA achieves state-of-the-art performance on GDSC benchmark, with strong improvements in disjoint-set setting (generalization to unseen drug-cell combinations). Learned attention patterns recover known pharmacophores, distinguish structure-driven from context-dependent compounds, and exhibit coherent organization across biological pathways. Enables zero-shot transfer to spatial transcriptomics.

Conclusion: DiSPA establishes a robust and interpretable framework for integrative pharmacogenomic modeling, enabling principled analysis of drug response mechanisms beyond post hoc interpretation, with demonstrated transferability to spatial transcriptomics.

Abstract: Accurate prediction of drug response in precision medicine requires models that capture how specific chemical substructures interact with cellular pathway states. However, most existing deep learning approaches treat chemical and transcriptomic modalities independently or combine them only at late stages, limiting their ability to model fine-grained, context-dependent mechanisms of drug action. In addition, standard attention mechanisms are often sensitive to noise and sparsity in high-dimensional biological networks, hindering both generalization and interpretability. We present DiSPA, a representation learning framework that explicitly disentangles structure-driven and context-driven mechanisms of drug response through bidirectional conditioning between chemical substructures and pathway-level gene expression. DiSPA introduces a differential cross-attention module that suppresses spurious pathway-substructure associations while amplifying contextually relevant interactions. Across multiple evaluation settings on the GDSC benchmark, DiSPA achieves state-of-the-art performance, with particularly strong improvements in the disjoint-set setting, which assesses generalization to unseen drug-cell combinations. Beyond predictive accuracy, DiSPA yields mechanistically informative representations: learned attention patterns recover known pharmacophores, distinguish structure-driven from context-dependent compounds, and exhibit coherent organization across biological pathways. Furthermore, we demonstrate that DiSPA trained solely on bulk RNA-seq data enables zero-shot transfer to spatial transcriptomics, revealing region-specific drug sensitivity patterns without retraining. Together, these results establish DiSPA as a robust and interpretable framework for integrative pharmacogenomic modeling, enabling principled analysis of drug response mechanisms beyond post hoc interpretation.

[308] Constrained Black-Box Attacks Against Cooperative Multi-Agent Reinforcement Learning

Amine Andam, Jamal Bentahar, Mustapha Hedabou

Main category: cs.LG

TL;DR: The paper investigates vulnerabilities in collaborative multi-agent reinforcement learning to adversarial attacks under constrained conditions, proposing a sample-efficient method to generate observation perturbations that misalign agents’ environmental perceptions.

DetailsMotivation: Despite rapid evolution of collaborative multi-agent RL for real-world applications, there's insufficient investigation of vulnerabilities to adversarial attacks. Existing work focuses on unrealistic scenarios with access to policy weights or training surrogate policies, lacking examination of more challenging constrained conditions.

Method: The approach generates perturbations that intentionally misalign how victim agents perceive their environment, assuming adversaries can only collect and perturb observations of deployed agents (or have no access at all). The method is sample-efficient, requiring only 1,000 samples.

Result: Empirical validation on three benchmarks and 22 environments demonstrates effectiveness across diverse algorithms and environments. The algorithm shows significant sample efficiency compared to previous methods that required millions of samples.

Conclusion: The paper reveals new vulnerabilities in collaborative multi-agent RL under constrained adversarial conditions, providing a practical and efficient attack method that highlights security concerns for real-world deployment in sensitive domains.

Abstract: Collaborative multi-agent reinforcement learning has rapidly evolved, offering state-of-the-art algorithms for real-world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training-time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more challenging and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all (no observations, actions, or weights). Our main approach is to generate perturbations that intentionally misalign how victim agents see their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample-efficient, requiring only 1,000 samples compared to the millions needed by previous methods.

[309] VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models

Yongchao Huang

Main category: cs.LG

TL;DR: VJEPA introduces a probabilistic version of JEPA that learns predictive distributions over latent states via variational objectives, unifying representation learning with PSRs and Bayesian filtering while avoiding representation collapse.

DetailsMotivation: Existing JEPA formulations use deterministic regression objectives that mask probabilistic semantics and limit applicability in stochastic control. There's a need for probabilistic generalization that can handle uncertainty and avoid representation collapse in noisy environments.

Method: Introduces Variational JEPA (VJEPA) that learns predictive distributions over future latent states via variational objectives. Extends to Bayesian JEPA (BJEPA) which factorizes predictive belief into learned dynamics expert and modular prior expert using Product of Experts.
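
The shift from deterministic regression to a predictive distribution can be captured in a few lines: the predictor emits a Gaussian over the target latent and is trained by its negative log-likelihood against a stop-gradient target. The diagonal-Gaussian head and dimensions are assumptions for illustration.

```python
import torch

def vjepa_style_loss(ctx_repr, target_repr, pred_head):
    # Predictive distribution over the future latent instead of a point.
    mu, log_var = pred_head(ctx_repr).chunk(2, dim=-1)
    nll = 0.5 * ((target_repr.detach() - mu) ** 2 / log_var.exp()
                 + log_var).sum(-1)
    return nll.mean()

pred_head = torch.nn.Linear(64, 128)   # outputs mean and log-variance (64+64)
loss = vjepa_style_loss(torch.randn(8, 64), torch.randn(8, 64), pred_head)
```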

Result: Theoretically proves VJEPA representations serve as sufficient information states for optimal control without pixel reconstruction, with formal collapse avoidance guarantees. Empirically shows VJEPA/BJEPA successfully filter high-variance nuisance distractors that cause collapse in generative baselines.

Conclusion: VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional noisy environments by enabling principled uncertainty estimation while remaining likelihood-free regarding observations.

Abstract: Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for self-supervised learning by predicting latent representations rather than reconstructing high-entropy observations. However, existing formulations rely on \textit{deterministic} regression objectives, which mask probabilistic semantics and limit its applicability in stochastic control. In this work, we introduce \emph{Variational JEPA (VJEPA)}, a \textit{probabilistic} generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations can serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We further propose \emph{Bayesian JEPA (BJEPA)}, an extension that factorizes the predictive belief into a learned dynamics expert and a modular prior expert, enabling zero-shot task transfer and constraint (e.g. goal, physics) satisfaction via a Product of Experts. Empirically, through a noisy environment experiment, we demonstrate that VJEPA and BJEPA successfully filter out high-variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g. constructing credible intervals via sampling) while remaining likelihood-free regarding observations, VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional, noisy environments.

[310] Adaptive KDE for Real-Time Thresholding: Prioritized Queues for Financial Crime Investigation

Danny Butvinik, Nana Boateng, Achi Hackmon

Main category: cs.LG

TL;DR: Proposes a label-free method to convert risk scores into review queues under capacity constraints using adaptive kernel density estimation and tail-mass curves.

DetailsMotivation: Need to convert risk scores into review queues under explicit intake constraints without relying on top-K or manually tuned cutoffs, supporting real-time operation with multi-queue routing.

Method: Fits online adaptive kernel density to score stream, transforms density into tail-mass curve to meet capacity, and snaps resulting cut to persistent density valley detected across bandwidths. Operates with sliding windows or exponential forgetting.
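
As a rough, batch-mode illustration of the tail-mass idea (assumed details: Scott's-rule bandwidth, a single bandwidth for valley detection, no sliding window or forgetting), the sketch below picks a cutoff whose right-tail density mass matches the review capacity and snaps it to a nearby density valley.

```python
import numpy as np
from scipy.stats import gaussian_kde

def capacity_threshold(scores, capacity_frac, grid_size=512):
    """Pick a score cutoff whose right-tail density mass ~= capacity_frac,
    then snap it to the nearest density valley (local minimum)."""
    kde = gaussian_kde(scores)                    # bandwidth via Scott's rule
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    dens = kde(grid)
    step = grid[1] - grid[0]
    tail = np.cumsum(dens[::-1])[::-1] * step     # approx P(score > g)
    idx = int(np.argmax(tail <= capacity_frac))   # first cut that fits capacity
    valleys = np.where((dens[1:-1] < dens[:-2]) & (dens[1:-1] < dens[2:]))[0] + 1
    if len(valleys):                              # "snap" to a nearby valley
        idx = valleys[np.argmin(np.abs(valleys - idx))]
    return grid[idx]

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 9000), rng.normal(4, 0.5, 1000)])
print(capacity_threshold(scores, capacity_frac=0.10))
```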

Result: Achieves competitive capacity adherence while reducing threshold jitter on synthetic, drifting, multimodal streams. Updates cost O(G) per event with constant memory per activity.

Conclusion: The method provides an effective, label-free approach for real-time risk score queueing with explicit capacity constraints, offering stable thresholds and efficient computation.

Abstract: We study the problem of converting a stream of risk scores into one or more review queues under explicit intake constraints. Instead of top-$K$ or manually tuned cutoffs, we fit an online adaptive kernel density to the score stream, transform the density into a tail-mass curve to meet capacity, and "snap" the resulting cut to a persistent density valley detected across bandwidths. The procedure is label-free, supports multi-queue routing, and operates in real time with sliding windows or exponential forgetting. On synthetic, drifting, multimodal streams, the method achieves competitive capacity adherence while reducing threshold jitter. Updates cost $O(G)$ per event with constant memory per activity.

[311] GPU-accelerated simulated annealing based on p-bits with real-world device-variability modeling

Naoya Onizawa, Takahiro Hanyu

Main category: cs.LG

TL;DR: Device variability in p-bit implementations can enhance algorithm performance, not just degrade it. A GPU-accelerated simulated annealing framework with CUDA achieves 100x speedup on MAX-CUT problems.

DetailsMotivation: Probabilistic computing using p-bits offers efficient alternatives to CMOS for complex problems, but device variability in emerging technologies like MTJs was expected to harm performance. This study investigates whether variability might actually have beneficial effects.

Method: Developed a GPU-accelerated, open-source simulated annealing framework that models three key device variability factors: timing, intensity, and offset. Uses CUDA-based simulations to reflect real-world device behavior in p-bit implementations.
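
A CPU-side toy of the three variability factors might look like the following; the update rule, noise magnitudes, and MAX-CUT encoding are illustrative assumptions, not the paper's CUDA implementation.

```python
import numpy as np

def pbit_anneal(J, h, betas, offset_sd=0.05, gain_sd=0.05, seed=0):
    """Toy p-bit simulated annealing with three device-variability factors:
    a static per-bit input 'offset', a static per-bit 'intensity' (gain)
    perturbation, and 'timing' modeled as a random update order per sweep."""
    rng = np.random.default_rng(seed)
    n = len(h)
    m = rng.choice([-1.0, 1.0], size=n)
    offset = rng.normal(0.0, offset_sd, size=n)
    gain = 1.0 + rng.normal(0.0, gain_sd, size=n)
    for beta in betas:                            # annealing schedule
        for i in rng.permutation(n):              # timing variability
            I = J[i] @ m + h[i] + offset[i]
            p_up = 0.5 * (1.0 + np.tanh(gain[i] * beta * I))
            m[i] = 1.0 if rng.random() < p_up else -1.0
    return m

# Tiny MAX-CUT instance: antiferromagnetic couplings J = -A favor cuts.
rng = np.random.default_rng(1)
n = 16
A = np.triu((rng.random((n, n)) < 0.3).astype(float), 1)
A = A + A.T
m = pbit_anneal(-A, np.zeros(n), betas=np.linspace(0.1, 3.0, 100))
cut = int(((m[:, None] != m[None, :]) & (A > 0)).sum() // 2)
print("cut size:", cut, "of", int(A.sum() // 2), "edges")
```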

Result: Contrary to expectations, device variability can enhance algorithm performance, especially timing variability. The framework achieves a two-orders-of-magnitude (100x) speedup over CPU implementations on MAX-CUT benchmarks with problem sizes from 800 to 20,000 nodes.

Conclusion: Device variability in p-bit implementations can be leveraged for performance enhancement rather than just being a limitation. The scalable GPU-accelerated framework enables optimization applications across diverse fields and advances probabilistic computing research.

Abstract: Probabilistic computing using probabilistic bits (p-bits) presents an efficient alternative to traditional CMOS logic for complex problem-solving, including simulated annealing and machine learning. Realizing p-bits with emerging devices such as magnetic tunnel junctions (MTJs) introduces device variability, which was expected to negatively impact computational performance. However, this study reveals an unexpected finding: device variability can not only degrade but also enhance algorithm performance, particularly by leveraging timing variability. This paper introduces a GPU-accelerated, open-source simulated annealing framework based on p-bits that models key device variability factors – timing, intensity, and offset – to reflect real-world device behavior. Through CUDA-based simulations, our approach achieves a two-orders-of-magnitude speedup over CPU implementations on the MAX-CUT benchmark with problem sizes ranging from 800 to 20,000 nodes. By providing a scalable and accessible tool, this framework aims to advance research in probabilistic computing, enabling optimization applications in diverse fields.

[312] Enabling Agents to Communicate Entirely in Latent Space

Zhuoyun Du, Runze Wang, Huiyu Bai, Zouying Cao, Xiaoyong Zhu, Yu Cheng, Bo Zheng, Wei Chen, Haochao Ying

Main category: cs.LG

TL;DR: Interlat enables LLM agents to communicate via continuous hidden states instead of discrete tokens, improving collaborative problem-solving through latent space communication.

DetailsMotivation: Natural language communication between LLM agents is limited because downsampling rich internal states to discrete tokens restricts information depth and nuance, hindering effective collaboration.

Method: Proposes Interlat (Inter-agent Latent Space Communication) using continuous last hidden states of LLMs as thought representations for direct communication, with additional learned compression for latent space reasoning.
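
A speculative sketch of the idea: pool the sender's last hidden states into a few latent slots and project them into the receiver's embedding space, where they could be prepended as "soft tokens" instead of generated text. The pooling and projection design here is a hypothetical stand-in; the paper's learned compression and injection mechanism may differ.

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Hypothetical sketch: pool a sender LLM's last hidden states into a few
    latent slots and project them into the receiver's embedding space."""
    def __init__(self, d_sender, d_receiver, n_latent=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latent, d_sender) * 0.02)
        self.proj = nn.Linear(d_sender, d_receiver)

    def forward(self, sender_hidden):              # (batch, seq, d_sender)
        # attention-style pooling: n_latent learned queries over the sequence
        attn = torch.softmax(self.queries @ sender_hidden.transpose(1, 2), dim=-1)
        pooled = attn @ sender_hidden              # (batch, n_latent, d_sender)
        return self.proj(pooled)                   # (batch, n_latent, d_receiver)

bridge = LatentBridge(d_sender=4096, d_receiver=3584)
message = bridge(torch.randn(2, 57, 4096))         # -> (2, 8, 3584) soft tokens
```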

Result: Interlat outperforms fine-tuned chain-of-thought prompting and single-agent baselines, works across heterogeneous models, promotes exploratory behavior, and enables genuine latent information utilization. Compression accelerates inference up to 24x while maintaining competitive performance.

Conclusion: The work demonstrates feasibility of entirely latent space inter-agent communication, showing significant potential for future research in more efficient and nuanced agent collaboration.

Abstract: While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by telepathy, which bypasses symbolic language in communication, we propose Interlat (Inter-agent Latent Space Communication), a paradigm that leverages the continuous last hidden states of an LLM as a representation of its thought for direct communication (termed latent communication). An additional learned compression process further compresses latent communication via latent space reasoning. Experiments demonstrate that Interlat outperforms both fine-tuned chain-of-thought (CoT) prompting and single-agent baselines, even across heterogeneous models, promoting more exploratory behavior and enabling genuine utilization of latent information. Further compression not only substantially accelerates inference by up to 24 times but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research.

[313] Stabilizing autoregressive forecasts in chaotic systems via multi-rate latent recurrence

Mrigank Dhingra, Omer San

Main category: cs.LG

TL;DR: MSR-HINE is a hierarchical implicit neural forecaster that uses multiscale latent priors and multi-rate recurrent modules to improve long-horizon forecasting of chaotic dynamical systems by mitigating error accumulation.

DetailsMotivation: Long-horizon forecasting of chaotic systems is challenging due to rapid error amplification and distribution shift, where small one-step inaccuracies compound into physically inconsistent rollouts and collapse of large-scale statistics.

Method: MSR-HINE uses hierarchical implicit forecasting with multiscale latent priors and multi-rate recurrent modules operating at distinct temporal scales. It employs coarse-to-fine recurrent states to generate latent priors, an implicit one-step predictor with multiscale latent injections, gated fusion with posterior latents for scale-consistent updates, and hidden-state correction to align recurrent memories.
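
The multi-rate recurrence can be caricatured with two GRU cells on different clocks plus a learned fusion gate, as in this sketch; the implicit one-step predictor, posterior latents, and hidden-state correction of MSR-HINE are omitted.

```python
import torch
import torch.nn as nn

class MultiRateRecurrence(nn.Module):
    """Toy multi-rate latent recurrence: a fast GRU updates every step, a slow
    GRU only every `slow_every` steps, and a learned gate fuses their states
    into a latent prior per step (the coarse scale tracks the slow manifold)."""
    def __init__(self, d, slow_every=4):
        super().__init__()
        self.fast = nn.GRUCell(d, d)
        self.slow = nn.GRUCell(d, d)
        self.gate = nn.Linear(2 * d, d)
        self.slow_every = slow_every

    def forward(self, xs):                         # xs: (T, batch, d)
        h_f = xs.new_zeros(xs.shape[1], xs.shape[2])
        h_s = xs.new_zeros(xs.shape[1], xs.shape[2])
        priors = []
        for t, x in enumerate(xs):
            h_f = self.fast(x, h_f)
            if t % self.slow_every == 0:           # coarse temporal clock
                h_s = self.slow(x, h_s)
            g = torch.sigmoid(self.gate(torch.cat([h_f, h_s], dim=-1)))
            priors.append(g * h_f + (1 - g) * h_s)
        return torch.stack(priors)                 # (T, batch, d) latent priors

priors = MultiRateRecurrence(d=32)(torch.randn(100, 4, 32))
```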

Result: On Kuramoto-Sivashinsky: 62.8% RMSE reduction at H=400, ACC improved from -0.155 to 0.828, predictability horizon (ACC≥0.5) extended from 241 to 400 steps. On Lorenz-96: 27.0% RMSE reduction at H=100, ACC improved from 0.144 to 0.545, predictability horizon extended from 58 to 100 steps.

Conclusion: MSR-HINE effectively maintains long-term context on slow manifolds while preserving fast-scale variability, significantly mitigating error accumulation in chaotic rollouts and substantially improving long-horizon forecasting performance across canonical chaotic systems.

Abstract: Long-horizon autoregressive forecasting of chaotic dynamical systems remains challenging due to rapid error amplification and distribution shift: small one-step inaccuracies compound into physically inconsistent rollouts and collapse of large-scale statistics. We introduce MSR-HINE, a hierarchical implicit forecaster that augments multiscale latent priors with multi-rate recurrent modules operating at distinct temporal scales. At each step, coarse-to-fine recurrent states generate latent priors, an implicit one-step predictor refines the state with multiscale latent injections, and a gated fusion with posterior latents enforces scale-consistent updates; a lightweight hidden-state correction further aligns recurrent memories with fused latents. The resulting architecture maintains long-term context on slow manifolds while preserving fast-scale variability, mitigating error accumulation in chaotic rollouts. Across two canonical benchmarks, MSR-HINE yields substantial gains over a U-Net autoregressive baseline: on Kuramoto-Sivashinsky it reduces end-horizon RMSE by 62.8% at H=400 and improves end-horizon ACC by +0.983 (from -0.155 to 0.828), extending the ACC ≥ 0.5 predictability horizon from 241 to 400 steps; on Lorenz-96 it reduces RMSE by 27.0% at H=100 and improves end-horizon ACC by +0.402 (from 0.144 to 0.545), extending the ACC ≥ 0.5 horizon from 58 to 100 steps.

[314] Learning PDE Solvers with Physics and Data: A Unifying View of Physics-Informed Neural Networks and Neural Operators

Yilong Dai, Shengyu Chen, Ziyi Wang, Xiaowei Jia, Yiqun Xie, Vipin Kumar, Runlong Yu

Main category: cs.LG

TL;DR: The paper proposes a unifying framework to analyze Physics-Informed Neural Networks (PINNs) and Neural Operators (NOs) within a shared design space, organizing methods along three dimensions: what is learned, how physics is integrated, and computational amortization.

DetailsMotivation: Despite the emergence of various physics-aware data-driven approaches for PDEs, the field lacks a unified perspective to understand their relationships, limitations, and appropriate roles in scientific workflows. This gap hinders systematic development and integration of learning-based components in scientific modeling.

Method: The authors propose a unifying perspective that organizes existing methods along three fundamental dimensions: (1) what is learned (e.g., solution functions, operators, or parameters), (2) how physical structures are integrated into the learning process, and (3) how computational load is amortized across problem instances.
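
To ground the "what is learned / how physics enters" axes, here is a minimal PINN-style residual for the 1-D heat equation u_t = u_xx; a neural operator would instead learn a mapping between function spaces, amortizing cost across PDE instances.

```python
import torch

# Solution-function view: a network u(x, t) trained so the PDE residual of
# the 1-D heat equation u_t = u_xx vanishes at sampled collocation points.
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))

def pde_residual(xt):
    xt = xt.requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, :1], grads[:, 1:]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]
    return ((u_t - u_xx) ** 2).mean()              # physics as a soft penalty

loss = pde_residual(torch.rand(256, 2))            # columns: (x, t)
loss.backward()
```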

Result: The framework provides a shared design space that reveals relationships between PINNs and NOs, showing how many challenges in learning-based PDE solvers can be understood as consequences of these structural properties. This enables systematic analysis of existing methods.

Conclusion: The proposed unifying view facilitates the development of reliable learning-based PDE solvers and catalyzes a synthesis of physics and data by providing a structured framework to analyze, compare, and advance physics-aware machine learning methods for scientific modeling.

Abstract: Partial differential equations (PDEs) are central to scientific modeling. Modern workflows increasingly rely on learning-based components to support model reuse, inference, and integration across large computational processes. Despite the emergence of various physics-aware data-driven approaches, the field still lacks a unified perspective to uncover their relationships, limitations, and appropriate roles in scientific workflows. To this end, we propose a unifying perspective that places two dominant paradigms, Physics-Informed Neural Networks (PINNs) and Neural Operators (NOs), within a shared design space. We organize existing methods from three fundamental dimensions: what is learned, how physical structures are integrated into the learning process, and how the computational load is amortized across problem instances. In this way, many challenges can be best understood as consequences of these structural properties of learning PDEs. By analyzing advances through this unifying view, our survey aims to facilitate the development of reliable learning-based PDE solvers and catalyze a synthesis of physics and data.

[315] E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

Jiaheng Dong, Hong Jia, Soumyajit Chatterjee, Abhirup Ghosh, James Bailey, Ting Dang

Main category: cs.LG

TL;DR: E-BATS is an efficient backpropagation-free test-time adaptation framework for speech foundation models that balances adaptation effectiveness and memory efficiency through lightweight prompt adaptation, multi-scale loss, and test-time EMA.

DetailsMotivation: Speech foundation models degrade in real-world acoustic domain shifts (noise, accents). Existing TTA methods are either memory-intensive (backpropagation-based) or inaccurate (backpropagation-free vision methods not suitable for speech). Need efficient yet effective adaptation for speech tasks.

Method: Three key components: 1) Lightweight prompt adaptation for forward-pass-based feature alignment, 2) Multi-scale loss capturing both global (utterance-level) and local (token-level) distribution shifts, 3) Test-time exponential moving average mechanism for stable adaptation across utterances.
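
A toy rendering of components (ii) and (iii), with assumed alignment statistics (feature means at utterance and token level) and illustrative shapes; the paper's actual loss and prompt parameterization are not reproduced here.

```python
import torch
import torch.nn.functional as F

def multiscale_alignment_loss(feats, src_global_mu, src_token_mu):
    """Hypothetical multi-scale alignment: match the utterance-level mean
    (global shift) and per-token-position means (local shift) against
    precomputed source-domain statistics."""
    global_mu = feats.mean(dim=(0, 1))             # (d,)
    token_mu = feats.mean(dim=0)                   # (T, d)
    return F.mse_loss(global_mu, src_global_mu) + F.mse_loss(token_mu, src_token_mu)

@torch.no_grad()
def ema_update(prompt, new_prompt, decay=0.99):
    """Test-time EMA of the adapted prompt for stability across utterances."""
    prompt.mul_(decay).add_(new_prompt, alpha=1 - decay)

feats = torch.randn(8, 50, 256)                    # (batch, tokens, dim)
loss = multiscale_alignment_loss(feats, torch.zeros(256), torch.zeros(50, 256))
```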

Result: Experiments on four noisy speech datasets across sixteen acoustic conditions show 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods.

Conclusion: E-BATS enables scalable and robust adaptation under acoustic variability, paving the way for more efficient adaptation approaches for practical speech processing systems in real-world environments.

Abstract: Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which fundamentally differ from speech task formulations, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for a forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local distribution shifts (token-level) and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.

[316] How Worst-Case Are Adversarial Attacks? Linking Adversarial and Statistical Robustness

Giulio Rossolini

Main category: cs.LG

TL;DR: Adversarial attacks may not reliably estimate robustness to random noise; the paper introduces a probabilistic framework to measure when adversarial success meaningfully reflects noisy risk vs. when it fails.

DetailsMotivation: There's ongoing debate about whether adversarial perturbations are valid proxies for robustness to random noise. The authors want to determine if adversarial attacks represent typical robustness or just worst-case scenarios, which is important for safety-oriented model evaluation.

Method: Introduces a probabilistic metric that quantifies noisy risk using directionally biased perturbation distributions parameterized by κ, which interpolates between isotropic noise and adversarial directions. Proposes an attack strategy designed to operate in regimes statistically closer to uniform noise.
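
One way to realize such a directionally biased family is to mix a unit adversarial direction with isotropic Gaussian noise and renormalize, as sketched below; this interpolation is an assumption standing in for the paper's exact distribution.

```python
import numpy as np

def biased_perturbation(adv_dir, eps, kappa, rng):
    """Perturbation of norm eps whose direction interpolates between isotropic
    noise (kappa=0) and the adversarial direction (kappa -> inf); a simple
    surrogate for the paper's concentration-parameterized family."""
    d = adv_dir / np.linalg.norm(adv_dir)
    z = rng.standard_normal(adv_dir.shape)
    z /= np.linalg.norm(z)
    v = kappa * d + z
    return eps * v / np.linalg.norm(v)

def noisy_risk(model_fails, x, adv_dir, eps, kappa, n=1000, seed=0):
    """Monte-Carlo estimate of the failure probability under biased noise."""
    rng = np.random.default_rng(seed)
    return sum(model_fails(x + biased_perturbation(adv_dir, eps, kappa, rng))
               for _ in range(n)) / n

# Toy decision boundary: 'fails' when the perturbed point crosses sum(z) > 2.
print(noisy_risk(lambda z: z.sum() > 2.0, np.zeros(8), np.ones(8),
                 eps=1.0, kappa=4.0))
```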

Result: Systematic experiments on ImageNet and CIFAR-10 benchmark widely used attacks, showing when adversarial success meaningfully reflects noisy risk and when it fails.

Conclusion: Adversarial perturbations have limits as estimators of noisy risk; the framework helps inform when adversarial attacks are appropriate for safety-oriented evaluation versus when they overestimate vulnerability.

Abstract: Adversarial attacks are widely used to evaluate model robustness, yet their validity as proxies for robustness to random perturbations remains debated. We ask whether an adversarial perturbation provides a representative estimate of robustness under random noise of the same magnitude, or instead reflects an atypical worst-case event. To this end, we introduce a probabilistic metric that quantifies noisy risk with respect to directionally biased perturbation distributions, parameterized by a concentration factor $\kappa$ that interpolates between isotropic noise and adversarial direction. Using this framework, we study the limits of adversarial perturbations as estimators of noisy risk by proposing an attack strategy designed to operate in regimes statistically closer to uniform noise. Experiments on ImageNet and CIFAR-10 systematically benchmark widely used attacks, highlighting when adversarial success meaningfully reflects noisy risk and when it fails, thereby informing their use in safety-oriented evaluation.

[317] On the Runway Cascade of Transformers for Language Modeling

Hunjae Lee, Corey Clark

Main category: cs.LG

TL;DR: The paper proposes “runway-aware rewiring” to address misalignment between direct and indirect information paths in causal transformers, improving language modeling, retrieval, and extrapolation without adding parameters.

DetailsMotivation: Causal transformers suffer from failure modes where indirect information paths (runways) create misalignment with direct attention paths, causing redundancies and irrelevant information to cascade through token representations despite properly learned attention patterns.

Method: Runway-aware rewiring modifies attention patterns based on runway context summaries, explicitly incorporating runway landscape information into direct-path attention. This rewires attention for each token based on its runway influences, enabling awareness of accumulating representational effects while maintaining parameter-free integration with standard attention mechanisms.

Result: The rewired transformer shows steady improvements in general language modeling, noticeably stronger information retrieval capabilities, and better extrapolation abilities compared to standard transformers.

Conclusion: Explicitly incorporating runway context into attention mechanisms via runway-aware rewiring addresses information propagation misalignments in causal transformers, leading to better performance across multiple tasks without additional parameters.

Abstract: In decoder-only (causal) transformers, the computation graph created by causal masking routes information through both direct-path attention and indirect paths formed by intermediate tokens. We denote these indirect paths between token pairs as their runways. We argue that certain failure modes of causal transformers, as observed by a growing body of recent work, are likely exacerbated by a misalignment between these two information propagation modes. We formalize runway cascade as a phenomenon whereby this misalignment results in redundancies and irrelevant information cascading to token representations despite adequately learned attention patterns. As a solution, we propose runway-aware rewiring as a more explicit way of incorporating runway context directly into each token’s direct-path attention. This mechanism rewires the attention pattern for each token based on a summary of its runway landscape, enabling awareness of accumulating representational influences and allowing for more balanced information propagation. Our proposed methodology introduces no additional parameters and can be seamlessly integrated into the standard attention mechanism. Empirically, our rewired transformer results in steady improvements in general language modeling as well as noticeably stronger information retrieval and extrapolation abilities compared to standard transformers.

[318] Search over Self-Edit Strategies for LLM Adaptation

Alistair Cheong, Haolin Cong, Tyler Yang, Dustin Miao

Main category: cs.LG

TL;DR: LLMs can use task feedback to decide how to update their own weights through self-supervised next token prediction, with archive-based template generation showing better performance than no-archive approaches.

DetailsMotivation: Existing LLM-based search systems freeze foundation models, bottlenecking long-run progress. While recent work explores updating proposal models at test time, update strategies remain hand-specified. This study investigates whether LLMs can autonomously decide how to update their weights using task feedback.

Method: Used Self-Adapting Language Models (SEAL) framework, relaxing fixed human template constraint to allow models to generate their own self-edit templates. Studied two variants: with and without conditioning template generation on a lightweight archive of past templates. Focused on single round of self-improvement with self-supervised next token prediction as the update operator, giving models freedom in choosing training data and hyperparameters.

Result: In SEAL’s Single-Passage Knowledge Incorporation setting with Qwen3-8B on SQuAD: no-archive variant performed comparably to weaker “Implications” baseline; archive variant outperformed “Implications” and approached strongest human-designed “Rewrite” baseline without surpassing it. Analysis revealed naive archives provide short-term robustness but can accelerate homogenization, suggesting explicit novelty pressure may be needed to consistently advance beyond human-optimized strategies.

Conclusion: LLMs can autonomously decide weight updates using task feedback, with archive-based approaches showing promise. However, naive archives may lead to homogenization, indicating need for explicit novelty mechanisms to consistently surpass human-designed strategies.

Abstract: Many LLM-based open-ended search systems freeze the foundation model that proposes improvements to existing solutions, which may bottleneck long-run progress. Recent work has explored updating the proposal model at test time [arXiv:2511.23473], but the update strategy is still typically hand-specified. Therefore, this study investigated whether an LLM can use task feedback to decide how it should update its weights. For tractability, we focused on the simpler case where there is only one round of self-improvement, and restricted the update operator to self-supervised next token prediction (NTP), leaving the model freedom in choosing its training data and key NTP hyperparameters. Using the Self-Adapting Language Models (SEAL) [arXiv:2506.10943] framework as a testbed, we relaxed its fixed human template constraint and allowed the model to generate its own self-edit templates, thereby giving it more control over its training data and hyperparameters. Two variants were studied, differing in whether template generation was conditioned on a lightweight archive of past templates. In SEAL’s Single-Passage Knowledge Incorporation setting with Qwen3-8B on SQuAD [arXiv:1606.05250], the no-archive variant performed comparably to the weaker “Implications” baseline, while the archive variant outperformed “Implications” and approached the strongest human-designed “Rewrite” baseline without surpassing it. Further analysis of collapse in the model’s exploration revealed that a naive archive can confer some short-term robustness but can also accelerate homogenization, suggesting that explicit novelty pressure may be required to consistently advance beyond carefully optimized human strategies. Our code is available at https://github.com/cheongalc/search-self-edit-strategies .

[319] engGNN: A Dual-Graph Neural Network for Omics-Based Disease Classification and Feature Selection

Tiantian Yang, Yuxuan Wang, Zhenwei Zhou, Ching-Ti Liu

Main category: cs.LG

TL;DR: engGNN is a dual-graph neural network framework that combines external biological networks with data-driven generated graphs to improve disease classification and biomarker discovery in high-dimensional omics data.

DetailsMotivation: Omics data (transcriptomics, proteomics, metabolomics) are high-dimensional with small sample sizes and complex biological networks, making reliable prediction and interpretation challenging. Existing GNN methods use either external curated graphs or data-driven graphs alone, missing complementary information.

Method: engGNN uses a dual-graph framework: (1) biologically informed undirected feature graph from established network databases, and (2) directed feature graph derived from tree-ensemble models. This combines prior biological knowledge with data-driven relationships to create comprehensive embeddings.

Result: engGNN consistently outperforms state-of-the-art baselines in simulations and real-world gene expression applications. It provides interpretable feature importance scores that enable biologically meaningful discoveries like pathway enrichment analysis.

Conclusion: engGNN is a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts, effectively leveraging both external biological knowledge and data-driven relationships.

Abstract: Omics data, such as transcriptomics, proteomics, and metabolomics, provide critical insights into disease mechanisms and clinical outcomes. However, their high dimensionality, small sample sizes, and intricate biological networks pose major challenges for reliable prediction and meaningful interpretation. Graph Neural Networks (GNNs) offer a promising way to integrate prior knowledge by encoding feature relationships as graphs. Yet, existing methods typically rely solely on either an externally curated feature graph or a data-driven generated one, which limits their ability to capture complementary information. To address this, we propose the external and generated Graph Neural Network (engGNN), a dual-graph framework that jointly leverages both external known biological networks and data-driven generated graphs. Specifically, engGNN constructs a biologically informed undirected feature graph from established network databases and complements it with a directed feature graph derived from tree-ensemble models. This dual-graph design produces more comprehensive embeddings, thereby improving predictive performance and interpretability. Through extensive simulations and real-world applications to gene expression data, engGNN consistently outperforms state-of-the-art baselines. Beyond classification, engGNN provides interpretable feature importance scores that facilitate biologically meaningful discoveries, such as pathway enrichment analysis. Taken together, these results highlight engGNN as a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts.

[320] Report for NSF Workshop on AI for Electronic Design Automation

Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun

Main category: cs.LG

TL;DR: NSF workshop report on AI applications in Electronic Design Automation (EDA) covering four key themes: physical synthesis/DFM, high-level/logic synthesis, AI optimization tools, and test/verification, with recommendations for NSF investment and collaboration.

DetailsMotivation: To explore how AI technologies (LLMs, GNNs, RL, neurosymbolic methods) can accelerate Electronic Design Automation and shorten hardware design turnaround times by addressing current challenges in the field.

Method: Workshop-based discussion and distillation of expert insights across machine learning and EDA domains, organized around four thematic areas: physical synthesis/DFM, high-level/logic synthesis, AI optimization tools, and test/verification.

Result: Identified key AI application areas in EDA and produced recommendations for NSF to foster AI/EDA collaboration, invest in foundational AI for EDA, develop data infrastructures, promote scalable compute, and invest in workforce development.

Conclusion: AI has significant potential to transform EDA and democratize hardware design, requiring strategic NSF investments in collaboration, infrastructure, and workforce development to enable next-generation hardware systems.

Abstract: This report distills the discussions and recommendations from the NSF Workshop on AI for Electronic Design Automation (EDA), held on December 10, 2024 in Vancouver alongside NeurIPS 2024. Bringing together experts across machine learning and EDA, the workshop examined how AI, spanning large language models (LLMs), graph neural networks (GNNs), reinforcement learning (RL), neurosymbolic methods, and more, can facilitate EDA and shorten design turnaround. The workshop included four themes: (1) AI for physical synthesis and design for manufacturing (DFM), discussing challenges in the physical manufacturing process and potential AI applications; (2) AI for high-level and logic-level synthesis (HLS/LLS), covering pragma insertion, program transformation, RTL code generation, etc.; (3) AI toolbox for optimization and design, discussing frontier AI developments that could potentially be applied to EDA tasks; and (4) AI for test and verification, including LLM-assisted verification tools, ML-augmented SAT solving, and security/reliability challenges. The report recommends that NSF foster AI/EDA collaboration, invest in foundational AI for EDA, develop robust data infrastructures, promote scalable compute infrastructure, and invest in workforce development to democratize hardware design and enable next-generation hardware systems. The workshop information can be found on the website https://ai4eda-workshop.github.io/.

[321] QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design

Nilesh Prasad Pandey, Jangseon Park, Onat Gungor, Flavio Ponzina, Tajana Rosing

Main category: cs.LG

TL;DR: QMC is a retraining-free quantization method with heterogeneous memory architecture that stores inlier weights in ReRAM and outlier weights in MRAM, achieving significant improvements in memory, energy, and latency for edge AI deployment.

DetailsMotivation: Deploying Small Language Models on edge platforms faces constraints from memory, latency, and energy budgets. Existing memory technologies (SRAM, DRAM, Flash) have limitations: SRAM has low density, DRAM suffers bandwidth contention from KV caches, and Flash is inactive during inference. Quantization helps but suffers from device noise in emerging memories.

Method: QMC (Outlier-aware Quantization with Memory Co-design) identifies inlier and outlier weights in SLMs through retraining-free quantization. It stores inlier weights in compact multi-level Resistive-RAM (ReRAM) while preserving critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation.
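
A minimal sketch of the inlier/outlier split, assuming a simple magnitude threshold and a uniform quantizer; the paper's multi-level ReRAM encoding and noise modeling are not reproduced.

```python
import numpy as np

def outlier_aware_quantize(w, bits=3, outlier_frac=0.01):
    """Toy split: keep the top-|w| fraction at full precision (the 'MRAM'
    path) and uniformly quantize the rest to `bits` bits (the noise-exposed
    multi-level 'ReRAM' path). No retraining involved."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier = np.abs(w) >= thresh
    inliers = w[~outlier]
    scale = np.abs(inliers).max() / (2 ** (bits - 1) - 1)
    q = np.round(inliers / scale)                  # integer ReRAM levels
    w_hat = w.copy()
    w_hat[~outlier] = q * scale                    # dequantized inlier weights
    return w_hat, outlier                          # outliers kept exact

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
w_hat, is_outlier = outlier_aware_quantize(w)
print(is_outlier.mean(), np.abs(w - w_hat).max())
```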

Result: On language modeling and reasoning benchmarks, QMC outperforms/matches state-of-the-art quantization methods. Compared to FP16 on latest edge AI platform: reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x.

Conclusion: QMC establishes a scalable, deployment-ready co-design for efficient on-device inference, addressing memory hierarchy limitations through heterogeneous memory architecture and outlier-aware quantization without retraining.

Abstract: Deploying Small Language Models (SLMs) on edge platforms is critical for real-time, privacy-sensitive generative AI, yet constrained by memory, latency, and energy budgets. Quantization reduces model size and cost but suffers from device noise in emerging non-volatile memories, while conventional memory hierarchies further limit efficiency. SRAM provides fast access but has low density, DRAM must simultaneously accommodate static weights and dynamic KV caches, which creates bandwidth contention, and Flash, although dense, is primarily used for initialization and remains inactive during inference. These limitations highlight the need for hybrid memory organizations tailored to LLM inference. We propose Outlier-aware Quantization with Memory Co-design (QMC), a retraining-free quantization with a novel heterogeneous memory architecture. QMC identifies inlier and outlier weights in SLMs, storing inlier weights in compact multi-level Resistive-RAM (ReRAM) while preserving critical outliers in high-precision on-chip Magnetoresistive-RAM (MRAM), mitigating noise-induced degradation. On language modeling and reasoning benchmarks, QMC outperforms and matches state-of-the-art quantization methods using advanced algorithms and hybrid data formats, while achieving greater compression under both algorithm-only evaluation and realistic deployment settings. Specifically, compared against SoTA quantization methods on the latest edge AI platform, QMC reduces memory usage by 6.3x-7.3x, external data transfers by 7.6x, energy by 11.7x, and latency by 12.5x when compared to FP16, establishing QMC as a scalable, deployment-ready co-design for efficient on-device inference.

[322] Constructing Multi-label Hierarchical Classification Models for MITRE ATT&CK Text Tagging

Andrew Crossman, Jonah Dodd, Viralam Ramamurthy Chaithanya Kumar, Riyaz Mohammed, Andrew R. Plummer, Chandra Sekharudu, Deepak Warrier, Mohammad Yekrangian

Main category: cs.LG

TL;DR: Proposes a stratified task space framework for automating MITRE ATT&CK text tagging using classical ML, achieving 94% tactic-level and 82% technique-level accuracy without LLMs.

DetailsMotivation: Manual tagging of cybersecurity texts with MITRE ATT&CK tactics and techniques is time-consuming and inefficient; automation is needed but current approaches lack systematic organization.

Method: Develops a stratified task space characterization for organizing automation efforts, then builds multi-label hierarchical classification models using classical ML methods (not LLMs) on cyber-threat intelligence text.
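
A tiny two-stage sketch in scikit-learn terms (the data, labels, and one-vs-rest TF-IDF pipeline are stand-ins, not the released models): a multi-label tactic classifier, with per-tactic technique classifiers forming the second stage.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus with multi-label tactic tags (hypothetical stand-in data).
texts = ["attacker harvested credentials via phishing email",
         "malware established persistence through registry run keys",
         "data staged and exfiltrated over an encrypted channel"]
labels = [["credential-access", "initial-access"],
          ["persistence"],
          ["collection", "exfiltration"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
tactic_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           OneVsRestClassifier(LogisticRegression(max_iter=1000)))
tactic_clf.fit(texts, Y)
# A second stage would route each predicted tactic to its own technique model.
print(mlb.inverse_transform(tactic_clf.predict(texts)))
```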

Result: Models achieve ~94% accuracy at tactic level and ~82% at technique level, outperforming GPT-4o (~60% tactic accuracy) and matching/exceeding state-of-the-art without complex hierarchical approaches.

Conclusion: Classical ML approaches can effectively automate MITRE ATT&CK tagging, providing practical, shareable tools for the security community while avoiding LLM dependencies and complexity.

Abstract: MITRE ATT&CK is a cybersecurity knowledge base that organizes threat actor and cyber-attack information into a set of tactics describing the reasons and goals threat actors have for carrying out attacks, with each tactic having a set of techniques that describe the potential methods used in these attacks. One major application of ATT&CK is the use of its tactic and technique hierarchy by security specialists as a framework for annotating cyber-threat intelligence reports, vulnerability descriptions, threat scenarios, inter alia, to facilitate downstream analyses. To date, the tagging process is still largely done manually. In this technical note, we provide a stratified “task space” characterization of the MITRE ATT&CK text tagging task for organizing previous efforts toward automation using AIML methods, while also clarifying pathways for constructing new methods. To illustrate one of the pathways, we use the task space strata to stage-wise construct our own multi-label hierarchical classification models for the text tagging task via experimentation over general cyber-threat intelligence text – using shareable computational tools and publicly releasing the models to the security community (via https://github.com/jpmorganchase/MITRE_models). Our multi-label hierarchical approach yields accuracy scores of roughly 94% at the tactic level, as well as accuracy scores of roughly 82% at the technique level. The models also meet or surpass state-of-the-art performance while relying only on classical machine learning methods – removing any dependence on LLMs, RAG, agents, or more complex hierarchical approaches. Moreover, we show that GPT-4o model performance at the tactic level is significantly lower (roughly 60% accuracy) than our own approach. We also extend our baseline model to a corpus of threat scenarios for financial applications produced by subject matter experts.

[323] Place with Intention: An Empirical Attendance Predictive Study of Expo 2025 Osaka, Kansai, Japan

Xiaojie Yang, Dizhi Huang, Hangli Ge, Masahiro Sano, Takeaki Ohdake, Kazuma Hatano, Noboru Koshizuka

Main category: cs.LG

TL;DR: Transformer-based framework uses reservation dynamics (ticket bookings and updates) as proxy for attendance intentions to forecast daily attendance at large events, avoiding complex multi-source data integration.

DetailsMotivation: Accurate daily attendance forecasting is crucial for managing transportation and services at large international events like Expo 2025 Osaka. Existing methods rely on multi-source external data (weather, traffic, social media) which can be unreliable when historical data is insufficient.

Method: Proposes a Transformer-based framework that leverages reservation dynamics (ticket bookings and subsequent updates within a time window) as a proxy for visitors’ attendance intentions. This avoids multi-source integration complexity while capturing external influences implicitly embedded in reservation patterns. Uses encoder-decoder structure with inverse-style embedding and adaptive fusion module.

Result: Separately modeling East and West gates consistently improves accuracy, especially for short- and medium-term horizons. Ablation studies confirm the importance of encoder-decoder structure, inverse-style embedding, and adaptive fusion module.

Conclusion: Reservation dynamics provide a practical and informative foundation for attendance forecasting in large-scale international events, offering a reliable alternative to complex multi-source data integration approaches.

Abstract: Accurate forecasting of daily attendance is vital for managing transportation, crowd flows, and services at large-scale international events such as Expo 2025 Osaka, Kansai, Japan. However, existing approaches often rely on multi-source external data (such as weather, traffic, and social media) to improve accuracy, which can lead to unreliable results when historical data are insufficient. To address these challenges, we propose a Transformer-based framework that leverages reservation dynamics, i.e., ticket bookings and subsequent updates within a time window, as a proxy for visitors’ attendance intentions, under the assumption that such intentions are eventually reflected in reservation patterns. This design avoids the complexity of multi-source integration while still capturing external influences like weather and promotions implicitly embedded in reservation dynamics. We construct a dataset combining entrance records and reservation dynamics and evaluate the model under both single-channel (total attendance) and two-channel (separated by East and West gates) settings. Results show that separately modeling East and West gates consistently improves accuracy, particularly for short- and medium-term horizons. Ablation studies further confirm the importance of the encoder-decoder structure, inverse-style embedding, and adaptive fusion module. Overall, our findings indicate that reservation dynamics offer a practical and informative foundation for attendance forecasting in large-scale international events.

[324] Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation

Shovito Barua Soumma, Asiful Arefeen, Stephanie M. Carpenter, Melanie Hingle, Hassan Ghasemzadeh

Main category: cs.LG

TL;DR: LLM-generated counterfactual explanations (CFEs) outperform traditional optimization methods in clinical settings, offering high plausibility, validity, and actionable interventions while effectively augmenting imbalanced datasets to restore classifier performance.

DetailsMotivation: Counterfactual explanations provide human-centric interpretability for ML models and can serve dual purposes: as interventions for abnormality prevention and as augmented data for training robust models. The paper aims to evaluate LLMs' capability to generate high-quality CFEs compared to traditional optimization-based methods.

Method: Comprehensive evaluation of CF generation using LLMs (GPT-4 zero-shot/few-shot, BioMistral-7B, LLaMA-3.1-8B) in both pretrained and fine-tuned configurations. Using multimodal AI-READI clinical dataset, assessed CFs across intervention quality, feature diversity, and augmentation effectiveness. Compared with optimization baselines (DiCE, CFNOW, NICE).

Result: Fine-tuned LLMs (especially LLaMA-3.1-8B) produce CFs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic feature adjustments. LLM-generated CFs substantially restore classifier performance under label-scarcity, yielding average 20% F1 recovery across three scarcity scenarios. LLMs outperform optimization baselines in generating clinically actionable and semantically coherent counterfactuals.

Conclusion: LLM-driven counterfactuals show promise for interpretable intervention design and data-efficient model training in digital health. The SenseCF approach demonstrates that fine-tuned LLMs can generate valid, representative CFEs and effectively supplement minority classes in imbalanced datasets to improve model robustness and predictive performance.

Abstract: Counterfactual explanations (CFEs) provide human-centric interpretability by identifying the minimal, actionable changes required to alter a machine learning model’s prediction. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. We conduct a comprehensive evaluation of CF generation using large language models (LLMs), including GPT-4 (zero-shot and few-shot) and two open-source models, BioMistral-7B and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations. Using the multimodal AI-READI clinical dataset, we assess CFs across three dimensions: intervention quality, feature diversity, and augmentation effectiveness. Fine-tuned LLMs, particularly LLaMA-3.1-8B, produce CFs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic, behaviorally modifiable feature adjustments. When used for data augmentation under controlled label-scarcity settings, LLM-generated CFs substantially restore classifier performance, yielding an average 20% F1 recovery across three scarcity scenarios. Compared with optimization-based baselines such as DiCE, CFNOW, and NICE, LLMs offer a flexible, model-agnostic approach that generates more clinically actionable and semantically coherent counterfactuals. Overall, this work demonstrates the promise of LLM-driven counterfactuals for both interpretable intervention design and data-efficient model training in sensor-based digital health. Impact: SenseCF fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement the minority class in an imbalanced dataset, improving model training and boosting robustness and predictive performance.

[325] Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective

Xiao Hu, Hong Xie, Tao Tan, Defu Lian, Jianyu Han

Main category: cs.LG

TL;DR: The paper investigates the fundamental questions about reinforcement fine-tuning of LLMs by proposing a bottom-up experiment pipeline to disentangle confounding factors and understand the role of each design choice.

DetailsMotivation: The field of reinforcement fine-tuning for LLMs has many inconsistent claims and lacks clear understanding of what each optimization choice does and which ones are the bottlenecks. There are entangled confounding factors in the fine-tuning process that need to be systematically examined.

Method: Proposes a bottom-up experiment pipeline starting with a minimalist configuration (a single training example, one rollout per round, and the reward used directly as the learning signal, with no advantage function). This connects to multi-armed bandit learning theory. The pipeline then expands layer by layer to examine each design choice’s role.
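
A toy version of that bottom layer, cast as a softmax-policy bandit trained with REINFORCE and no baseline (the action space and reward placement are stand-ins):

```python
import torch

# Softmax policy over a discrete action space, updated with the raw reward
# and no baseline/advantage -- one sampled action ("rollout") per round.
n_actions, rounds = 100, 2000
logits = torch.zeros(n_actions, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.5)
reward = torch.zeros(n_actions)
reward[7] = 1.0                                    # one 'correct answer' arm

for _ in range(rounds):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    loss = -reward[a] * dist.log_prob(a)           # REINFORCE, no advantage
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, 0).argmax().item())    # typically arm 7
```

With a much larger action space, as in the paper's LLM setting, the reward signal becomes extremely sparse, which is exactly the regime where bandit theory helps explain the observed training dynamics.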

Result: Experimental results on three LLMs and two reasoning datasets reveal new understanding of design choices and yield essential insights to shape the research area.

Conclusion: The bottom-up approach provides systematic understanding of reinforcement fine-tuning optimization choices, addressing fundamental questions about their roles and bottlenecks in LLM fine-tuning.

Abstract: A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are made from time to time, making this area elusive. Reflecting on this situation, two fundamental questions still lack a clear answer: 1) what is the role of each optimizing choice? 2) which ones are the bottlenecks? This paper aims to shed light on both, and it faces the challenge of several entangled confounding factors in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is a minimalist configuration: a single training example, one rollout per round, and the reward serving directly as the learning signal without an advantage function. This minimalist configuration connects to multi-armed bandit learning with an extremely large discrete action space, which offers theory to corroborate the experimental findings. The upward procedure of the pipeline then expands the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only reveal new understanding of the design choices but also yield essential insights to shape the area.

[326] Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum

Jingru Li, Yibo Fan, Huan Li

Main category: cs.LG

TL;DR: Muon accelerates LLM pretraining via orthogonal momentum updates, with two variants (Muon-NSR and Muon-VS) that apply variance-adaptive normalization to momentum before orthogonalization, achieving faster convergence than AdamW and Muon baselines.

DetailsMotivation: LLM pretraining is computationally demanding, making optimizer efficiency crucial. While Adam is effective, there's room for improvement through momentum-based approaches that can accelerate convergence and reduce training time.

Method: Muon uses orthogonal momentum updates as a matrix analogue of element-wise sign operator. Two variants are proposed: Muon-NSR applies noise-to-signal ratio modulation, and Muon-VS performs variance-based scaling without additional hyperparameters. Both apply variance-adaptive normalization to momentum before orthogonalization.
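
A loose sketch of the shared idea, assuming an Adam-style second-moment EMA and a cubic Newton-Schulz orthogonalization (Muon itself uses a tuned quintic iteration); the exact NSR and VS scaling rules differ from this simplification.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5):
    """Approximate the orthogonal polar factor of M (Muon's core step) via a
    cubic Newton-Schulz iteration; the real Muon uses a tuned quintic."""
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def variance_scaled_muon_update(momentum, second_moment, grad,
                                beta1=0.95, beta2=0.99, eps=1e-8):
    """Hypothetical sketch of the paper's shared idea: normalize momentum
    elementwise by an EMA of squared gradients (variance-adaptive, Adam-like)
    *before* orthogonalization."""
    momentum.mul_(beta1).add_(grad, alpha=1 - beta1)
    second_moment.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return newton_schulz_orthogonalize(momentum / (second_moment.sqrt() + eps))

W_grad = torch.randn(256, 512)
mom, var = torch.zeros_like(W_grad), torch.zeros_like(W_grad)
update = variance_scaled_muon_update(mom, var, W_grad)  # apply as W -= lr * update
```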

Result: Experiments on GPT-2 and LLaMA pretraining show Muon-NSR and Muon-VS accelerate convergence and achieve lower validation loss than well-tuned AdamW and Muon baselines. On LLaMA-1.2B, they reduce iterations required to reach target validation loss by 1.36× relative to well-tuned Muon.

Conclusion: The proposed Muon variants with variance-adaptive normalization effectively accelerate LLM pretraining, offering practical improvements in optimizer efficiency for large-scale language model training.

Abstract: Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines. For example, on the LLaMA-1.2B model, Muon-NSR and Muon-VS reduce the iterations required to reach the target validation loss by $1.36\times$ relative to well-tuned Muon, following the setup of a recent benchmark.

[327] Relational Graph Modeling for Credit Default Prediction: Heterogeneous GNNs and Hybrid Ensemble Learning

Yvonne Yang, Eranki Vasistha

Main category: cs.LG

TL;DR: Heterogeneous GNNs combined with tabular models improve credit default prediction over standalone methods, with hybrid ensembles achieving best performance on a massive financial graph.

DetailsMotivation: Traditional tabular models for credit scoring fail to capture cross-entity dependencies in multi-table financial histories, while graph neural networks can potentially model these complex relationships between borrowers, institutions, and transaction behaviors.

Method: Built massive heterogeneous graph (31M+ nodes, 50M+ edges) integrating borrower attributes with transaction entities; evaluated heterogeneous GNNs (GraphSAGE, relation-aware attentive GNN) vs tabular baselines; tested hybrid ensembles combining tabular features with GNN embeddings; used contrastive pretraining; conducted explainability and fairness analyses.
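
The winning hybrid reduces to a simple recipe: concatenate GNN customer embeddings onto the tabular features and train a boosted-tree model. A stand-in sketch with synthetic arrays:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Stand-in arrays; in the paper these would be real tabular features and
# per-customer embeddings exported from the trained heterogeneous GNN.
rng = np.random.default_rng(0)
n, d_tab, d_emb = 1000, 20, 16
X_tab = rng.normal(size=(n, d_tab))
gnn_emb = rng.normal(size=(n, d_emb))
y = rng.integers(0, 2, size=n)

X_hybrid = np.hstack([X_tab, gnn_emb])         # tabular + relational signal
clf = HistGradientBoostingClassifier().fit(X_hybrid, y)
p_default = clf.predict_proba(X_hybrid)[:, 1]  # scores for ROC/PR evaluation
```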

Result: Standalone GNNs provided limited improvement over gradient-boosted trees; hybrid ensemble (tabular + GNN embeddings) achieved best overall performance (improved ROC-AUC and PR-AUC); contrastive pretraining improved optimization stability but limited downstream gains; explainability analyses revealed how relational signals affect subgroup behavior.

Conclusion: Hybrid approaches combining GNN-derived relational embeddings with traditional tabular features outperform either method alone for credit default prediction, demonstrating the value of capturing cross-entity dependencies while maintaining the strength of established tabular models.

Abstract: Credit default risk arises from complex interactions among borrowers, financial institutions, and transaction-level behaviors. While strong tabular models remain highly competitive in credit scoring, they may fail to explicitly capture cross-entity dependencies embedded in multi-table financial histories. In this work, we construct a massive-scale heterogeneous graph containing over 31 million nodes and more than 50 million edges, integrating borrower attributes with granular transaction-level entities such as installment payments, POS cash balances, and credit card histories. We evaluate heterogeneous graph neural networks (GNNs), including heterogeneous GraphSAGE and a relation-aware attentive heterogeneous GNN, against strong tabular baselines. We find that standalone GNNs provide limited lift over a competitive gradient-boosted tree baseline, while a hybrid ensemble that augments tabular features with GNN-derived customer embeddings achieves the best overall performance, improving both ROC-AUC and PR-AUC. We further observe that contrastive pretraining can improve optimization stability but yields limited downstream gains under generic graph augmentations. Finally, we conduct structured explainability and fairness analyses to characterize how relational signals affect subgroup behavior and screening-oriented outcomes.

[328] Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport

Yuyu Liu, Jiannan Yang, Ziyang Yu, Weishen Pan, Fei Wang, Tengfei Ma

Main category: cs.LG

TL;DR: CROT is an optimal transport-based imputation algorithm for handling patch-based missing data in tabular single-cell sequencing datasets, achieving superior accuracy and runtime efficiency.

DetailsMotivation: Existing imputation methods struggle with large patches of missing data in single-cell sequencing datasets, limiting biological insights from incomplete data.

Method: CROT uses optimal transport-based approach specifically designed for patch-based missing data in tabular formats, capturing underlying data structure despite significant missingness.
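
The patch setting can be sketched with entropic OT between incomplete and complete rows, matched on observed columns and imputed barycentrically; the Sinkhorn solver below and the omission of any cluster-regularization term are simplifying assumptions.

```python
import numpy as np

def sinkhorn(C, reg=0.05, iters=200):
    """Entropic OT plan between uniform marginals for cost matrix C."""
    K = np.exp(-C / reg)
    a = np.ones(C.shape[0]) / C.shape[0]
    b = np.ones(C.shape[1]) / C.shape[1]
    u, v = a.copy(), b.copy()
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def impute_patch(X_obs, X_miss, obs_cols, miss_cols):
    """Match incomplete rows to complete rows on the observed columns, then
    fill the missing patch with the plan-weighted average of complete rows."""
    A = X_miss[:, obs_cols][:, None, :]
    B = X_obs[:, obs_cols][None, :, :]
    C = ((A - B) ** 2).sum(-1)
    C = C / C.max()                            # scale costs to avoid underflow
    P = sinkhorn(C)
    P = P / P.sum(axis=1, keepdims=True)       # barycentric (row-normalized)
    X_miss[:, miss_cols] = P @ X_obs[:, miss_cols]
    return X_miss

rng = np.random.default_rng(0)
X_obs = rng.normal(size=(200, 10))
X_miss = rng.normal(size=(50, 10))
X_miss[:, 7:] = np.nan                         # a contiguous missing patch
filled = impute_patch(X_obs, X_miss, obs_cols=list(range(7)), miss_cols=[7, 8, 9])
```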

Result: Achieves superior imputation accuracy while significantly reducing runtime, demonstrating scalability and efficiency for large-scale datasets.

Conclusion: Provides a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in biological and clinical data analysis.

Abstract: Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT, an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available on Anonymous GitHub.

[329] Beyond Denial-of-Service: The Puppeteer’s Attack for Fine-Grained Control in Ranking-Based Federated Learning

Zhihao Chen, Zirui Gong, Jianting Ning, Yanjun Zhang, Leo Yu Zhang

Main category: cs.LG

TL;DR: Federated Rank Learning (FRL) is vulnerable to Edge Control Attack (ECA), a fine-grained model poisoning attack that can precisely degrade competitor accuracy while evading detection.

DetailsMotivation: FRL was designed to be resilient against model poisoning attacks due to its discrete ranking-based mechanism, but the authors discovered it remains vulnerable to sophisticated attacks that can precisely control accuracy degradation while maintaining normal-looking convergence.

Method: Proposes Edge Control Attack (ECA) with two stages: (1) identifying and manipulating Ascending/Descending Edges to align global model with target model, and (2) widening selection boundary gap to stabilize global model at target accuracy.

Result: ECA achieves fine-grained accuracy control with average error of only 0.224%, outperforming baseline by up to 17x across seven benchmark datasets and nine Byzantine-robust aggregation rules.

Conclusion: FRL remains vulnerable to advanced poisoning attacks despite its security design, highlighting the need for stronger defenses against such fine-grained control attacks in ranking-based FL frameworks.

Abstract: Federated Rank Learning (FRL) is a promising Federated Learning (FL) paradigm designed to be resilient against model poisoning attacks due to its discrete, ranking-based update mechanism. Unlike traditional FL methods that rely on model updates, FRL leverages discrete rankings as a communication parameter between clients and the server. This approach significantly reduces communication costs and limits an adversary’s ability to scale or optimize malicious updates in the continuous space, thereby enhancing its robustness. This makes FRL particularly appealing for applications where system security and data privacy are crucial, such as web-based auction and bidding platforms. While FRL substantially reduces the attack surface, we demonstrate that it remains vulnerable to a new class of local model poisoning attack, i.e., fine-grained control attacks. We introduce the Edge Control Attack (ECA), the first fine-grained control attack tailored to ranking-based FL frameworks. Unlike conventional denial-of-service (DoS) attacks that cause conspicuous disruptions, ECA enables an adversary to precisely degrade a competitor’s accuracy to any target level while maintaining a normal-looking convergence trajectory, thereby avoiding detection. ECA operates in two stages: (i) identifying and manipulating Ascending and Descending Edges to align the global model with the target model, and (ii) widening the selection boundary gap to stabilize the global model at the target accuracy. Extensive experiments across seven benchmark datasets and nine Byzantine-robust aggregation rules (AGRs) show that ECA achieves fine-grained accuracy control with an average error of only 0.224%, outperforming the baseline by up to 17x. Our findings highlight the need for stronger defenses against advanced poisoning attacks. Our code is available at: https://github.com/Chenzh0205/ECA

[330] Beyond Error-Based Optimization: Experience-Driven Symbolic Regression with Goal-Conditioned Reinforcement Learning

Jianwen Sun, Xinrui Li, Fuqing Li, Xiaoxuan Shen

Main category: cs.LG

TL;DR: EGRL-SR uses goal-conditioned reinforcement learning with historical trajectories to guide symbolic regression search, focusing on structural patterns rather than just fitting error.

Motivation: Traditional error-driven symbolic regression methods face ambiguity because many expressions have similar errors but different structures, leading to poor convergence to the true underlying function.

Method: Formulates symbolic regression as goal-conditioned RL with hindsight experience replay, uses all-point satisfaction binary reward function to focus on structural patterns, and implements structure-guided heuristic exploration for diversity.

Result: Outperforms state-of-the-art methods in recovery rate and robustness on public benchmarks, recovers more complex expressions under same search budget.

Conclusion: EGRL-SR provides a novel RL-based approach that effectively guides symbolic regression search by learning from historical trajectories and focusing on structural patterns rather than just error minimization.

Abstract: Symbolic Regression aims to automatically identify compact and interpretable mathematical expressions that model the functional relationship between input and output variables. Most existing search-based symbolic regression methods typically rely on the fitting error to inform the search process. However, in the vast expression space, numerous candidate expressions may exhibit similar error values while differing substantially in structure, leading to ambiguous search directions and hindering convergence to the underlying true function. To address this challenge, we propose a novel framework named EGRL-SR (Experience-driven Goal-conditioned Reinforcement Learning for Symbolic Regression). In contrast to traditional error-driven approaches, EGRL-SR introduces a new perspective: leveraging precise historical trajectories and optimizing the action-value network to proactively guide the search process, thereby achieving a more robust expression search. Specifically, we formulate symbolic regression as a goal-conditioned reinforcement learning problem and incorporate hindsight experience replay, allowing the action-value network to generalize common mapping patterns from diverse input-output pairs. Moreover, we design an all-point satisfaction binary reward function that encourages the action-value network to focus on structural patterns rather than low-error expressions, and concurrently propose a structure-guided heuristic exploration strategy to enhance search diversity and space coverage. Experiments on public benchmarks show that EGRL-SR consistently outperforms state-of-the-art methods in recovery rate and robustness, and can recover more complex expressions under the same search budget. Ablation results validate that the action-value network effectively guides the search, with both the reward function and the exploration strategy playing critical roles.
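The all-point satisfaction reward is simple enough to sketch directly from the summary's description: the reward is 1 only when a candidate expression fits every data point within tolerance, which pushes the search toward structurally correct expressions rather than merely low-error ones. The toy data and tolerance below are illustrative.

```python
import numpy as np

def all_point_satisfaction_reward(expr_fn, X, y, tol=1e-6):
    """Binary reward in the spirit of EGRL-SR: 1 only when the candidate
    expression fits *every* point within tolerance, otherwise 0."""
    try:
        pred = expr_fn(X)
    except (ValueError, FloatingPointError, ZeroDivisionError):
        return 0.0  # invalid expressions (e.g., division by zero) get 0
    if not np.all(np.isfinite(pred)):
        return 0.0
    return float(np.all(np.abs(pred - y) <= tol))

# Toy usage: candidate expressions for data generated by y = x0**2 + x1.
X = np.random.rand(64, 2)
y = X[:, 0] ** 2 + X[:, 1]
print(all_point_satisfaction_reward(lambda X: X[:, 0] ** 2 + X[:, 1], X, y))  # 1.0
print(all_point_satisfaction_reward(lambda X: X[:, 0] + X[:, 1], X, y))       # 0.0
```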

[331] Re-understanding Graph Unlearning through Memorization

Pengfei Ding, Yan Wang, Guanfeng Liu

Main category: cs.LG

TL;DR: MGU is a memorization-guided graph unlearning framework that addresses limitations in existing methods by providing accurate difficulty assessment, adaptive unlearning strategies, and comprehensive evaluation protocols.

Motivation: Existing graph unlearning methods lack understanding of key factors determining unlearning effectiveness, leading to impractical difficulty assessment, ineffectiveness on hard tasks, and misaligned evaluation protocols that don’t capture true forgetting capability.

Method: Proposes MGU framework with three key advances: 1) accurate and practical difficulty assessment across different GU tasks, 2) adaptive strategy that dynamically adjusts unlearning objectives based on difficulty levels, and 3) comprehensive evaluation protocol aligned with practical requirements.

Result: Extensive experiments on ten real-world graphs demonstrate MGU consistently outperforms state-of-the-art baselines in forgetting quality, computational efficiency, and utility preservation.

Conclusion: MGU establishes GNN memorization as a new perspective for understanding graph unlearning and provides a superior framework that addresses fundamental limitations of existing methods.

Abstract: Graph unlearning (GU), which removes nodes, edges, or features from trained graph neural networks (GNNs), is crucial in Web applications where graph data may contain sensitive, mislabeled, or malicious information. However, existing GU methods lack a clear understanding of the key factors that determine unlearning effectiveness, leading to three fundamental limitations: (1) impractical and inaccurate GU difficulty assessment due to test-access requirements and invalid assumptions, (2) ineffectiveness on hard-to-unlearn tasks, and (3) misaligned evaluation protocols that overemphasize easy tasks and fail to capture true forgetting capability. To address these issues, we establish GNN memorization as a new perspective for understanding graph unlearning and propose MGU, a Memorization-guided Graph Unlearning framework. MGU achieves three key advances: it provides accurate and practical difficulty assessment across different GU tasks, develops an adaptive strategy that dynamically adjusts unlearning objectives based on difficulty levels, and establishes a comprehensive evaluation protocol that aligns with practical requirements. Extensive experiments on ten real-world graphs demonstrate that MGU consistently outperforms state-of-the-art baselines in forgetting quality, computational efficiency, and utility preservation.

[332] CoScale-RL: Efficient Post-Training by Co-Scaling Data and Computation

Yutong Chen, Jiandong Gao, Ji Wu

Main category: cs.LG

TL;DR: CoScale-RL is a novel scaling strategy that improves data and computational efficiency for training Large Reasoning Models by scaling up solutions per problem and rollout computation, achieving 3.76× accuracy improvement.

Motivation: Training Large Reasoning Models is often unstable and unpredictable, especially on hard problems or with weak foundation models. Current post-training scaling strategies are inefficient for these challenging cases.

Method: CoScale-RL uses a two-step approach: 1) Scale up solutions by collecting multiple solutions per problem instead of enlarging dataset size, 2) Scale up rollout computation to stabilize Reinforcement Learning, and 3) Use Re-distillation model merge technique to maintain computational efficiency during scaling.

Result: The method achieves an average 3.76× accuracy improvement on four benchmarks, significantly improving data and computational efficiency while enhancing LRM’s ability boundary without requiring extensive supervised fine-tuning datasets.

Conclusion: CoScale-RL provides a new scaling direction to further improve Large Reasoning Models’ reasoning ability through better data and computational efficiency, enabling improvement of LRM’s ability boundaries without extensive SFT datasets.

Abstract: Training a Large Reasoning Model (LRM) is usually unstable and unpredictable, especially on hard problems or weak foundation models. We found that the current post-training scaling strategy can still improve in these cases. We propose CoScale-RL, a novel scaling strategy with better data and computational efficiency. We first scale up solutions to make problems solvable. The core idea is to collect multiple solutions for each problem, rather than simply enlarging the dataset. Then, we scale up rollout computation to stabilize Reinforcement Learning. We further leverage a model merge technique called Re-distillation to sustain or even improve computational efficiency when scaling up. Our method significantly improves data and computational efficiency, with an average 3.76$\times$ accuracy improvement on four benchmarks. CoScale-RL is able to improve an LRM’s ability boundary without an extensive SFT dataset. Our method provides a new scaling direction to further improve an LRM’s reasoning ability.

[333] Case-Guided Sequential Assay Planning in Drug Discovery

Tianchi Chen, Jan Bima, Sean L. Wu, Otto Ritter, Bingjia Yang, Xiang Yu

Main category: cs.LG

TL;DR: IBMDP is a model-based RL framework for drug discovery sequencing that uses historical data to create implicit transition models and ensemble MCTS planning, achieving 92% resource reduction vs heuristics.

Motivation: Drug discovery experimental sequencing is a high-stakes planning problem with severe uncertainty and resource constraints, but standard RL fails due to lack of environment simulators or transition data - only static historical databases are available.

Method: IBMDP constructs case-guided implicit transition models using nonparametric belief distributions from similar historical outcomes, enables Bayesian belief updating as evidence accumulates, and uses ensemble MCTS planning to balance information gain with resource efficiency.

Result: On real-world CNS drug discovery, IBMDP reduced resource consumption by up to 92% compared to established heuristics while maintaining decision confidence. In synthetic benchmarks with computable optimal policies, IBMDP achieved significantly higher alignment with optimal policies than deterministic value iteration alternatives.

Conclusion: IBMDP offers a practical solution for sequential experimental design in data-rich but simulator-poor domains, demonstrating superiority of ensemble planning over deterministic alternatives using the same similarity-based models.

Abstract: Optimally sequencing experimental assays in drug discovery is a high-stakes planning problem under severe uncertainty and resource constraints. A primary obstacle for standard reinforcement learning (RL) is the absence of an explicit environment simulator or transition data $(s, a, s')$; planning must rely solely on a static database of historical outcomes. We introduce the Implicit Bayesian Markov Decision Process (IBMDP), a model-based RL framework designed for such simulator-free settings. IBMDP constructs a case-guided implicit model of transition dynamics by forming a nonparametric belief distribution using similar historical outcomes. This mechanism enables Bayesian belief updating as evidence accumulates and employs ensemble MCTS planning to generate stable policies that balance information gain toward desired outcomes with resource efficiency. We validate IBMDP through comprehensive experiments. On a real-world central nervous system (CNS) drug discovery task, IBMDP reduced resource consumption by up to 92% compared to established heuristics while maintaining decision confidence. To rigorously assess decision quality, we also benchmarked IBMDP in a synthetic environment with a computable optimal policy. Our framework achieves significantly higher alignment with this optimal policy than a deterministic value iteration alternative that uses the same similarity-based model, demonstrating the superiority of our ensemble planner. IBMDP offers a practical solution for sequential experimental design in data-rich but simulator-poor domains.
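As a hedged sketch of the case-guided implicit transition idea (not the authors' exact construction), the snippet below forms a similarity-weighted nonparametric belief over next states from historical (state, action, next-state) records and samples from it, as an MCTS rollout would; the distance metric and weighting are assumptions.

```python
import numpy as np

def implicit_transition_sample(state, action, history, k=10, rng=None):
    """Case-guided implicit transition model (sketch). `history` is a list
    of (state, action, next_state) tuples from a static database; states
    are feature vectors, actions are ids. Instead of a learned simulator,
    we form a nonparametric belief over next states from the k most
    similar historical cases that took the same action."""
    rng = rng or np.random.default_rng()
    cases = [(s, ns) for s, a, ns in history if a == action]
    if not cases:
        return state  # no evidence: belief collapses to the current state
    dists = np.array([np.linalg.norm(state - s) for s, _ in cases])
    nearest = np.argsort(dists)[:k]
    # Similarity-weighted belief; sampling from it drives MCTS rollouts,
    # and an ensemble of such rollouts yields the stabilized plan.
    w = np.exp(-dists[nearest])
    w /= w.sum()
    idx = rng.choice(nearest, p=w)
    return cases[idx][1]
```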

[334] PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian

Main category: cs.LG

TL;DR: PCL-Reasoner-V1.5 is a 32B parameter LLM for math reasoning built on Qwen2.5-32B, using SFT + novel offline RL for stability, achieving SOTA 90.9% on AIME 2024 and 85.6% on AIME 2025.

Motivation: To develop a stable and efficient method for advancing mathematical reasoning in large language models, addressing the instability issues of standard online RL approaches like GRPO.

Method: Built upon Qwen2.5-32B with supervised fine-tuning followed by reinforcement learning, featuring a novel offline RL method that provides superior training stability and efficiency compared to online RL methods.

Result: Achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025.

Conclusion: Offline RL is demonstrated as a stable and efficient paradigm for advancing reasoning capabilities in large language models, with experiments conducted on Huawei Ascend 910C NPUs.

Abstract: We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.

[335] FSX: Message Flow Sensitivity Enhanced Structural Explainer for Graph Neural Networks

Bizu Feng, Zhimu Yang, Shaode Yu, Zixin Hu

Main category: cs.LG

TL;DR: FSX is a hybrid GNN explainer combining internal message flow analysis with cooperative game theory to provide efficient, high-fidelity explanations that capture structural interactions.

Motivation: Existing GNN explainability methods face a trade-off: gradient-based approaches are efficient but ignore structural interactions, while game-theoretic methods capture interactions but are computationally expensive and may deviate from the model’s true reasoning.

Method: FSX first performs flow-sensitivity analysis during a single forward pass to identify critical message flows by simulating localized node perturbations. These flows are projected onto input graphs to define meaningful subgraphs. Within each subgraph, a flow-aware cooperative game computes Shapley-like values that incorporate both node-feature importance and their roles in sustaining critical flows.

Result: Extensive evaluation across multiple datasets and GNN architectures shows FSX achieves superior explanation fidelity with significantly reduced runtime, providing insights into how important sub-structures influence predictions by governing key internal computational pathways.

Conclusion: FSX successfully bridges the gap between efficient gradient-based methods and accurate game-theoretic approaches by synergistically combining internal message flow analysis with external graph data analysis, offering both computational efficiency and structural understanding.

Abstract: Despite the widespread success of Graph Neural Networks (GNNs), understanding the reasons behind their specific predictions remains challenging. Existing explainability methods face a trade-off: gradient-based approaches are computationally efficient but often ignore structural interactions, while game-theoretic techniques capture interactions at the cost of high computational overhead and potential deviation from the model’s true reasoning path. To address this gap, we propose FSX (Message Flow Sensitivity Enhanced Structural Explainer), a novel hybrid framework that synergistically combines the internal message flows of the model with a cooperative game approach applied to the external graph data. FSX first identifies critical message flows via a novel flow-sensitivity analysis: during a single forward pass, it simulates localized node perturbations and measures the resulting changes in message flow intensities. These sensitivity-ranked flows are then projected onto the input graph to define compact, semantically meaningful subgraphs. Within each subgraph, a flow-aware cooperative game is conducted, where node contributions are evaluated fairly through a Shapley-like value that incorporates both node-feature importance and their roles in sustaining or destabilizing the identified critical flows. Extensive evaluation across multiple datasets and GNN architectures demonstrates that FSX achieves superior explanation fidelity with significantly reduced runtime, while providing unprecedented insights into the structural logic underlying model predictions: specifically, how important sub-structures exert influence by governing the stability of key internal computational pathways.

[336] RefProtoFL: Communication-Efficient Federated Learning via External-Referenced Prototype Alignment

Hongyue Wu, Hangyu Li, Guodong Fan, Haoran Zhu, Shizhan Chen, Zhiyong Feng

Main category: cs.LG

TL;DR: RefProtoFL is a communication-efficient federated learning framework that uses external reference prototypes for representation consistency and adaptive probabilistic update dropping for communication efficiency.

Motivation: Federated learning faces challenges with limited communication bandwidth and heterogeneous client data distributions. Existing prototype-based FL methods still suffer from suboptimal generalization under severe communication constraints.

Method: Decomposes model into private backbone and lightweight shared adapter, restricts communication to adapter only. Uses Adaptive Probabilistic Update Dropping (APUD) with magnitude-aware Top-K sparsification for uplink efficiency. Employs External-Referenced Prototype Alignment (ERPA) using server-held public dataset to create shared semantic anchors for representation consistency.

Result: Extensive experiments on standard benchmarks demonstrate that RefProtoFL attains higher classification accuracy than state-of-the-art prototype-based FL baselines.

Conclusion: RefProtoFL effectively addresses both communication efficiency and representation consistency challenges in federated learning through its dual approach of APUD for communication reduction and ERPA for semantic alignment.

Abstract: Federated learning (FL) enables collaborative model training without sharing raw data in edge environments, but is constrained by limited communication bandwidth and heterogeneous client data distributions. Prototype-based FL mitigates this issue by exchanging class-wise feature prototypes instead of full model parameters; however, existing methods still suffer from suboptimal generalization under severe communication constraints. In this paper, we propose RefProtoFL, a communication-efficient FL framework that integrates External-Referenced Prototype Alignment (ERPA) for representation consistency with Adaptive Probabilistic Update Dropping (APUD) for communication efficiency. Specifically, we decompose the model into a private backbone and a lightweight shared adapter, and restrict federated communication to the adapter parameters only. To further reduce uplink cost, APUD performs magnitude-aware Top-K sparsification, transmitting only the most significant adapter updates for server-side aggregation. To address representation inconsistency across heterogeneous clients, ERPA leverages a small server-held public dataset to construct external reference prototypes that serve as shared semantic anchors. For classes covered by public data, clients directly align local representations to public-induced prototypes, whereas for uncovered classes, alignment relies on server-aggregated global reference prototypes via weighted averaging. Extensive experiments on standard benchmarks demonstrate that RefProtoFL attains higher classification accuracy than state-of-the-art prototype-based FL baselines.
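Magnitude-aware Top-K sparsification is a standard primitive; a minimal sketch of the APUD-style uplink step, assuming a dense adapter-update tensor and an illustrative 5% keep ratio:

```python
import torch

def topk_sparsify(update: torch.Tensor, k_frac: float = 0.05):
    """Magnitude-aware Top-K sparsification in the spirit of APUD:
    keep only the largest-magnitude adapter updates for uplink
    transmission, dropping (zeroing) the rest."""
    flat = update.flatten()
    k = max(1, int(k_frac * flat.numel()))
    idx = flat.abs().topk(k).indices
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(update)

# Toy usage: transmit roughly 5% of a simulated adapter update.
delta = torch.randn(128, 64)
print((topk_sparsify(delta) != 0).float().mean())  # ~0.05
```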

[337] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

Injin Kong, Hyoungjoon Lee, Yohan Jo

Main category: cs.LG

TL;DR: Post-training ARMs into MDMs causes fundamental computational reorganization: MDMs retain autoregressive circuits for local tasks but develop new global planning pathways with distributed semantic integration.

Motivation: To understand whether post-trained Masked Diffusion Models (MDMs) genuinely acquire bidirectional reasoning capabilities or just repackage autoregressive heuristics from their Autoregressive Model (ARM) origins.

Method: Comparative circuit analysis of ARMs and their MDM counterparts, examining both structural and semantic transformations.

Result: MDMs show systematic “mechanism shift”: retain autoregressive circuitry for local causal tasks but abandon initialized pathways for global planning, with increased early-layer processing and transition from localized specialization to distributed integration.

Conclusion: Diffusion post-training fundamentally reorganizes internal computation to support non-sequential global planning, not just parameter adaptation.

Abstract: Post-training pretrained Autoregressive models (ARMs) into Masked Diffusion models (MDMs) has emerged as a cost-effective strategy to overcome the limitations of sequential generation. However, the internal algorithmic transformations induced by this paradigm shift remain unexplored, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics. In this work, we address this question by conducting a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic “mechanism shift” dependent on the structural nature of the task. Structurally, we observe a distinct divergence: while MDMs largely retain autoregressive circuitry for tasks dominated by local causal dependencies, they abandon initialized pathways for global planning tasks, exhibiting distinct rewiring characterized by increased early-layer processing. Semantically, we identify a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. Through these findings, we conclude that diffusion post-training does not merely adapt model parameters but fundamentally reorganizes internal computation to support non-sequential global planning.

[338] Anytime Optimal Decision Tree Learning with Continuous Features

Harold Kiossou, Pierre Schaus, Siegfried Nijssen

Main category: cs.LG

TL;DR: Proposes an anytime complete algorithm for learning optimal decision trees with continuous features using limited discrepancy search to improve anytime performance over existing depth-first methods.

Motivation: Existing exact algorithms for optimal decision trees with continuous features suffer from poor anytime behavior - when interrupted early, they produce highly unbalanced, suboptimal trees, sometimes worse than greedy methods like C4.5.

Method: Uses limited discrepancy search to distribute computational effort more evenly across the entire tree structure, ensuring high-quality trees are available at any interruption point while maintaining completeness.

Result: Experimental results show the proposed approach outperforms existing methods in terms of anytime performance, providing better quality trees when computation is interrupted early.

Conclusion: The limited discrepancy search approach addresses the anytime performance limitation of depth-first methods for optimal decision tree learning with continuous features, offering practical improvements for real-world applications.

Abstract: In recent years, significant progress has been made on algorithms for learning optimal decision trees, primarily in the context of binary features. Extending these methods to continuous features remains substantially more challenging due to the large number of potential splits for each feature. Recently, an elegant exact algorithm was proposed for learning optimal decision trees with continuous features; however, the rapidly increasing computational time limits its practical applicability to shallow depths (typically 3 or 4). It relies on a depth-first search optimization strategy that fully optimizes the left subtree of each split before exploring the corresponding right subtree. While effective in finding optimal solutions given sufficient time, this strategy can lead to poor anytime behavior: when interrupted early, the best-found tree is often highly unbalanced and suboptimal. In such cases, purely greedy methods such as C4.5 may, paradoxically, yield better solutions. To address this limitation, we propose an anytime, yet complete approach leveraging limited discrepancy search, distributing the computational effort more evenly across the entire tree structure, and thus ensuring that a high-quality decision tree is available at any interruption point. Experimental results show that our approach outperforms the existing one in terms of anytime performance.
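Limited discrepancy search itself is a classic strategy; the generic sketch below shows how it spreads effort by bounding deviations from the heuristic (greedy-first) child ordering, which is what yields usable trees at any interruption point. The callback names are placeholders, and re-exploration across budgets is left unoptimized for brevity.

```python
def lds(root, heuristic_children, is_goal, max_discrepancies):
    """Generic limited discrepancy search (sketch). At each node,
    `heuristic_children` orders children best-first (the greedy split
    first); following any non-preferred child costs one discrepancy.
    Iteratively raising the budget surfaces good solutions early
    (anytime behavior) while remaining complete at the full budget."""
    def dfs(node, budget):
        if is_goal(node):
            yield node
            return
        for i, child in enumerate(heuristic_children(node)):
            cost = 0 if i == 0 else 1  # deviation from the greedy choice
            if budget - cost >= 0:
                yield from dfs(child, budget - cost)

    for budget in range(max_discrepancies + 1):
        yield from dfs(root, budget)
```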

[339] Robustness of Mixtures of Experts to Feature Noise

Dong Sun, Rahul Nittala, Rebekka Burkholz

Main category: cs.LG

TL;DR: MoE models outperform dense networks not just due to parameter scaling but because sparse expert activation acts as a noise filter, improving generalization, robustness, and convergence speed.

Motivation: Despite practical success, it’s unclear why Mixture of Experts (MoE) models outperform dense networks beyond just having more parameters. The paper aims to understand the fundamental advantages of MoE architectures in handling noisy data with latent modular structure.

Method: The study examines an iso-parameter regime where inputs have latent modular structure but are corrupted by feature noise (proxy for noisy internal activations). Theoretical analysis compares MoE vs dense estimators, with empirical validation on synthetic data and real-world language tasks.

Result: MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed compared to dense networks. Sparse expert activation acts as an effective noise filter. Empirical results on both synthetic and real-world language tasks confirm these theoretical insights.

Conclusion: The performance advantage of MoE models stems from their ability to filter noise through sparse expert activation, not just parameter scaling. This provides consistent robustness and efficiency gains from sparse modular computation in noisy environments with latent structure.

Abstract: Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.

[340] Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation

Ondřej Holub, Essi Ryymin, Rodrigo Alves

Main category: cs.LG

TL;DR: A two-agent LLM framework (Student-Teacher & Teacher-Educator) generates reflection questions via Socratic dialogue, outperforming one-shot baselines in relevance, depth, and overall quality.

Motivation: Designing effective reflection questions is pedagogically valuable but time-consuming and inconsistently implemented across teachers, creating a need for automated, high-quality question generation.

Method: Two specialized agents engage in multi-turn Socratic dialogue: Student-Teacher proposes questions with rationales, Teacher-Educator evaluates them on clarity, depth, relevance, engagement, and conceptual connections, providing coaching questions or stop signals.

Result: Dynamic stopping with contextual information outperforms fixed iterations; two-agent protocol produces significantly more relevant, deeper, and better overall questions than one-shot baseline using same backbone model.

Conclusion: The reflection-in-reflection framework effectively automates high-quality reflection question generation through structured agent dialogue, offering a scalable solution to support teachers in designing pedagogically valuable questions.

Abstract: Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on the topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant and deeper, and better overall, than a one-shot baseline using the same backbone model.
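A hedged sketch of the two-agent protocol follows; `llm(system, user)` is a hypothetical chat-completion wrapper (not a real API), and the stop token and prompts are illustrative stand-ins for the paper's fixed signal and evaluation rubric.

```python
STOP = "<<DONE>>"

def refine_question(llm, topic, concepts, level, max_turns=10):
    """Two-agent Socratic refinement loop (sketch). The Student-Teacher
    proposes a question with a rationale; the Teacher-Educator replies
    only with coaching questions or the stop signal (dynamic stopping)."""
    student_sys = (f"You are a Student-Teacher. Propose one reflection "
                   f"question on '{topic}' covering {concepts} for "
                   f"{level} students, with a brief rationale.")
    educator_sys = ("You are a Teacher-Educator. Evaluate the question for "
                    "clarity, depth, relevance, engagement, and conceptual "
                    "connections. Reply ONLY with targeted coaching "
                    f"questions, or with {STOP} if the question is ready.")
    question = llm(student_sys, "Propose an initial question.")
    for _ in range(max_turns):  # cap guards against drift in long dialogues
        feedback = llm(educator_sys, question)
        if STOP in feedback:
            break
        question = llm(student_sys,
                       f"Refine your question. Feedback: {feedback}")
    return question
```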

[341] Statistical Learning Theory for Distributional Classification

Christian Fiedler

Main category: cs.LG

TL;DR: Theoretical analysis of kernel-based learning with distributional inputs using SVMs, establishing oracle inequalities, consistency results, and learning rates under noise assumptions.

Motivation: In supervised learning with distributional inputs (like medical screening or causal learning), the actual probability distributions are not accessible during learning - only samples from them. Kernel methods using kernel mean embeddings provide a natural approach, but theoretical analysis of this approach needs development.

Method: Uses kernel-based learning with kernel mean embeddings (KMEs) to embed distributions/samples into Hilbert space, then applies standard SVM classification. Theoretical analysis includes establishing oracle inequalities, consistency results, and learning rates. Introduces a novel variant of noise assumption for Gaussian kernels with hinge loss.

Result: Establishes new oracle inequality, derives consistency and learning rate results for SVMs with distributional inputs. For Gaussian kernels with hinge loss, formulates new noise assumption enabling learning rate derivation. Technical contributions include new feature space for Gaussian kernels on Hilbert spaces.

Conclusion: Provides theoretical foundation for kernel-based learning with distributional inputs using SVMs, with practical applications in medical screening and causal learning. Technical tools developed have independent value for broader kernel method research.

Abstract: In supervised learning with distributional inputs in the two-stage sampling setup, relevant to applications like learning-based medical screening or causal learning, the inputs (which are probability distributions) are not accessible in the learning phase, but only samples thereof. This problem is particularly amenable to kernel-based learning methods, where the distributions or samples are first embedded into a Hilbert space, often using kernel mean embeddings (KMEs), and then a standard kernel method like Support Vector Machines (SVMs) is applied, using a kernel defined on the embedding Hilbert space. In this work, we contribute to the theoretical analysis of this latter approach, with a particular focus on classification with distributional inputs using SVMs. We establish a new oracle inequality and derive consistency and learning rate results. Furthermore, for SVMs using the hinge loss and Gaussian kernels, we formulate a novel variant of an established noise assumption from the binary classification literature, under which we can establish learning rates. Finally, some of our technical tools like a new feature space for Gaussian kernels on Hilbert spaces are of independent interest.
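The two-stage construction is concrete enough to sketch: embed each sample set via an empirical kernel mean embedding, evaluate a Gaussian kernel on the embedding Hilbert space using the standard identity $\|\mu_P-\mu_Q\|^2 = \mathbb{E}[k(x,x')] + \mathbb{E}[k(y,y')] - 2\mathbb{E}[k(x,y)]$, and feed the resulting Gram matrix to an SVM. Bandwidths and toy data below are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def base_kernel(A, B, gamma=1.0):
    """Gaussian base kernel k(x, x') between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kme_sq_dist(S1, S2, gamma=1.0):
    """Squared distance between empirical kernel mean embeddings."""
    return (base_kernel(S1, S1, gamma).mean()
            + base_kernel(S2, S2, gamma).mean()
            - 2 * base_kernel(S1, S2, gamma).mean())

def distributional_gram(samples, gamma=1.0, sigma2=1.0):
    """Gaussian kernel on the embedding Hilbert space: a Gram matrix
    over distributions, each represented only by a sample set."""
    n = len(samples)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = np.exp(
                -kme_sq_dist(samples[i], samples[j], gamma) / (2 * sigma2))
    return G

# Toy usage: classify distributions from their samples with an SVM.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=l, size=(30, 2)) for l in [0, 0, 2, 2]]
labels = [0, 0, 1, 1]
svm = SVC(kernel="precomputed").fit(distributional_gram(samples), labels)
```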

[342] From Observation to Prediction: LSTM for Vehicle Lane Change Forecasting on Highway On/Off-Ramps

Mohamed Abouras, Catherine M. Elias

Main category: cs.LG

TL;DR: This paper studies vehicle behavior prediction on highway on/off-ramps using LSTM models, showing better accuracy for straight highway sections (94%) than ramp areas (76%) for 4-second predictions.

Motivation: On and off-ramps are understudied road sections that introduce higher variation in highway interactions. Predicting vehicle behavior in these areas can reduce uncertainty and increase road safety.

Method: Used multi-layered LSTM architecture trained on the ExiD drone dataset. Tested different prediction horizons and model workflows to compare ramp areas (Area of Interest) with straight highway sections.

Result: Results show promising performance up to 4-second horizons: 76% prediction accuracy for ramp areas (AoI) and 94% accuracy for general highway scenarios at maximum horizon.

Conclusion: The study demonstrates the feasibility of predicting vehicle behavior on highway ramps using LSTM models, though straight highway sections achieve higher accuracy, highlighting the complexity of ramp interactions.

Abstract: On- and off-ramps are understudied road sections even though they introduce a higher level of variation in highway interactions. Predicting vehicles’ behavior in these areas can decrease the impact of uncertainty and increase road safety. In this paper, the difference between this Area of Interest (AoI) and a straight highway section is studied. A multi-layered LSTM architecture is used to train the AoI model on the ExiD drone dataset. In the process, different prediction horizons and different model workflows are tested. The results show great promise on horizons up to 4 seconds, with prediction accuracy of about 76% for the AoI and 94% for the general highway scenario at the maximum horizon.

[343] Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference

Baojun Che, Yifan Chen, Daniel Zhengyu Huang, Xinying Mao, Weijie Wang

Main category: cs.LG

TL;DR: A stable and efficient black-box variational inference framework using Gaussian mixture families with natural gradient preconditioning, exponential integrators for positive definiteness, and adaptive time stepping for convergence.

Motivation: Black-box variational inference with Gaussian mixture families approximates complex posterior distributions without requiring gradients of the target density, but standard numerical optimization methods for it often suffer from instability and inefficiency.

Method: Combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and accommodate distinct warm-up and convergence phases.

Result: Proves exponential convergence for Gaussian posteriors in noise-free settings and almost-sure convergence under Monte Carlo estimation. Numerical experiments demonstrate effectiveness on multimodal distributions, Neal’s multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow.

Conclusion: The proposed framework provides a stable and efficient approach for black-box variational inference with Gaussian mixture families, with natural connections to manifold optimization and mirror descent, rigorously justifying the necessity of adaptive time stepping.

Abstract: Black-box variational inference (BBVI) with Gaussian mixture families offers a flexible approach for approximating complex posterior distributions without requiring gradients of the target density. However, standard numerical optimization methods often suffer from instability and inefficiency. We develop a stable and efficient framework that combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves the positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and to accommodate distinct warm-up and convergence phases. The proposed approach has natural connections to manifold optimization and mirror descent. For Gaussian posteriors, we prove exponential convergence in the noise-free setting and almost-sure convergence under Monte Carlo estimation, rigorously justifying the necessity of adaptive time stepping. Numerical experiments on multimodal distributions, Neal’s multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow demonstrate the effectiveness of the proposed method.
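The paper's exact integrator is not reproduced in the summary; as a loosely analogous sketch, the update below applies a matrix exponential in a whitened coordinate system, which keeps the covariance symmetric positive definite for any step size, unlike a plain Euler step. Treat the specific formula as an assumption for illustration, not the authors' scheme.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def pd_preserving_step(Sigma, G, h):
    """Covariance update that stays symmetric positive definite (SPD)
    for any step size h. G is a symmetric 'force' on the covariance;
    the multiplicative update R expm(h R^{-1} G R^{-1}) R with
    R = Sigma^{1/2} is SPD by congruence whenever Sigma is."""
    R = np.real(sqrtm(Sigma))
    Rinv = np.linalg.inv(R)
    M = Rinv @ G @ Rinv
    M = 0.5 * (M + M.T)  # symmetrize against round-off
    return R @ expm(h * M) @ R

# Toy usage: even a large step keeps the eigenvalues strictly positive,
# whereas the Euler step Sigma + h*G here goes indefinite for h > 2/7.
Sigma = np.eye(2)
G = np.array([[-3.0, 0.5], [0.5, -3.0]])
print(np.linalg.eigvalsh(pd_preserving_step(Sigma, G, h=1.0)))
```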

[344] Strategic Doctrine Language Models (sdLM): A Learning-System Framework for Doctrinal Consistency and Geopolitical Forecasting

Olaf Yunus Laitinen Imanov, Taner Yilmaz, Derya Umut Kulali

Main category: cs.LG

TL;DR: sdLM is a framework for multi-document strategic reasoning with doctrinal consistency constraints and calibrated uncertainty, showing improved strategic quality and calibration over LLM baselines while remaining competitive with human experts on long-horizon judgments.

Motivation: The paper addresses the need for AI systems that can perform strategic reasoning with doctrinal consistency across multiple documents while maintaining calibrated uncertainty, particularly for long-horizon forecasting and plan plausibility assessment.

Method: The sdLM framework combines multi-document attention, temporal encoding, and a doctrine-consistency layer to enforce doctrinal constraints while improving forecasting and reducing severe doctrinal violations.

Result: sdLM achieves higher strategic quality and better calibration than strong general-purpose LLM baselines across three benchmarks: expert-panel scoring (N=47), doctrine consistency on 336 publications (12,847 statements), and geopolitical forecasting on 127 historical counterfactuals (1945-2020).

Conclusion: The sdLM framework effectively improves strategic reasoning with doctrinal consistency, remains competitive with human experts on long-horizon judgments, and shows promising scaling trends and deployment characteristics for operational settings.

Abstract: We introduce Strategic Doctrine Language Models (sdLM), a learning-system framework for multi-document strategic reasoning with doctrinal consistency constraints and calibrated uncertainty. The approach combines multi-document attention, temporal encoding, and a doctrine-consistency layer to improve long-horizon forecasting and plan plausibility while reducing severe doctrinal violations. We evaluate sdLM using (i) expert-panel scoring of strategic scenarios (N=47), (ii) doctrine consistency on 336 doctrine publications (12,847 statements), and (iii) geopolitical forecasting on 127 historical counterfactuals (1945-2020) across 12-60 month horizons. Across these benchmarks, sdLM achieves higher strategic quality and better calibration than strong general-purpose LLM baselines, and remains competitive with human experts on long-horizon judgments. We further report ablations, scaling trends, and deployment-oriented performance/latency characteristics to clarify which components drive improvements and how they translate to operational settings.

[345] What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study

Keyu Lv, Manyi Zhang, Xiaobo Xia, Jingchen Ni, Shannan Yan, Xianzhi Yu, Lu Hou, Chun Yuan, Haoli Bai

Main category: cs.LG

TL;DR: QAT (Quantization-Aware Training) for reasoning models outperforms PTQ, with key findings on knowledge distillation, PTQ initialization, RL feasibility, and domain alignment, consolidated into Reasoning-QAT workflow.

Motivation: Reasoning models are slow and token-inefficient during inference, and post-training quantization (PTQ) causes large accuracy drops for reasoning tasks, especially at low-bit settings.

Method: Systematic empirical study of quantization-aware training (QAT) for reasoning models, developing Reasoning-QAT workflow with knowledge distillation, PTQ initialization, reinforcement learning for quantized models, and domain alignment.

Result: Reasoning-QAT consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. On Qwen3-0.6B, surpasses GPTQ by 44.53% on MATH-500 and recovers performance in 2-bit regime.

Conclusion: QAT is effective for reasoning models with proper techniques, and the consolidated Reasoning-QAT workflow provides superior quantization performance compared to PTQ methods.

Abstract: Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
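A minimal sketch of the fake-quantization building block that QAT methods of this kind rely on: symmetric uniform quantization in the forward pass with a straight-through estimator in the backward pass. This is the generic primitive, not the paper's exact recipe; the bit-width and tensor shapes are illustrative.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Symmetric uniform fake quantization with a straight-through
    estimator (STE), the standard forward/backward trick in low-bit QAT."""
    @staticmethod
    def forward(ctx, w, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # STE: pass gradients through the rounding

# During QAT the forward pass sees quantized weights while the optimizer
# updates full-precision ones; a distillation loss against a full-precision
# teacher (finding 1 above) would then be applied to the resulting logits.
w = torch.randn(16, 16, requires_grad=True)
FakeQuant.apply(w, 4).sum().backward()
print(w.grad.abs().mean())
```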

[346] Tailoring Adverse Event Prediction in Type 1 Diabetes with Patient-Specific Deep Learning Models

Giorgia Rigamonti, Mirko Paolo Barbato, Davide Marelli, Paolo Napoletano

Main category: cs.LG

TL;DR: Deep learning approach for personalized blood glucose prediction using patient-specific data, showing significant improvements over traditional generalized models for Type 1 Diabetes management.

Motivation: With growing adoption of wearable glucose monitors and mobile health apps, accurate blood glucose prediction is essential for enhancing automated insulin delivery and decision-support systems in Type 1 Diabetes management.

Method: Deep learning-based personalized approach using patient-specific data, comparing Leave-One-Subject-Out Cross-Validation with fine-tuning strategies, multimodal patient-specific approach vs traditional CGM-only methods, and ablation studies with progressively smaller training sets.

Result: Personalized models significantly improve prediction of adverse events, enabling more precise and timely interventions. The study identifies minimum data required for effective personalization, addressing challenges of extensive data collection in real-world applications.

Conclusion: Adaptive, personalized glucose prediction models have strong potential for advancing next-generation diabetes management, particularly in wearable and mobile health platforms, enhancing consumer-oriented diabetes care solutions.

Abstract: Effective management of Type 1 Diabetes requires continuous glucose monitoring and precise insulin adjustments to prevent hyperglycemia and hypoglycemia. With the growing adoption of wearable glucose monitors and mobile health applications, accurate blood glucose prediction is essential for enhancing automated insulin delivery and decision-support systems. This paper presents a deep learning-based approach for personalized blood glucose prediction, leveraging patient-specific data to improve prediction accuracy and responsiveness in real-world scenarios. Unlike traditional generalized models, our method accounts for individual variability, enabling more effective subject-specific predictions. We compare Leave-One-Subject-Out Cross-Validation with a fine-tuning strategy to evaluate their ability to model patient-specific dynamics. Results show that personalized models significantly improve the prediction of adverse events, enabling more precise and timely interventions in real-world scenarios. To assess the impact of patient-specific data, we conduct experiments comparing a multimodal, patient-specific approach against traditional CGM-only methods. Additionally, we perform an ablation study to investigate model performance with progressively smaller training sets, identifying the minimum data required for effective personalization, an essential consideration for real-world applications where extensive data collection is often challenging. Our findings underscore the potential of adaptive, personalized glucose prediction models for advancing next-generation diabetes management, particularly in wearable and mobile health platforms, enhancing consumer-oriented diabetes care solutions.

[347] Communication-Efficient Multi-Modal Edge Inference via Uncertainty-Aware Distributed Learning

Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief

Main category: cs.LG

TL;DR: Three-stage distributed learning framework for multi-modal edge inference that reduces communication overhead while maintaining robustness to channel variations and noisy inputs.

Motivation: Semantic communication is crucial for distributed edge intelligence but faces challenges: 1) prohibitive communication overhead for multi-modal systems over bandwidth-limited wireless links, and 2) limited robustness under varying channels and noisy multi-modal inputs.

Method: Three-stage framework: Stage I - local multi-modal self-supervised learning without device-server exchange; Stage II - distributed fine-tuning with centralized evidential fusion to calibrate uncertainty and aggregate noisy features; Stage III - uncertainty-guided feedback mechanism that selectively requests additional features for uncertain samples.

Result: Experiments on RGB-depth indoor scene classification show higher accuracy with far fewer training communication rounds, robustness to modality degradation or channel variation, outperforming existing self-supervised and fully supervised baselines.

Conclusion: The proposed communication-aware distributed learning framework effectively addresses communication efficiency and robustness challenges in multi-modal edge inference systems over wireless channels.

Abstract: Semantic communication is emerging as a key enabler for distributed edge intelligence due to its capability to convey task-relevant meaning. However, achieving communication-efficient training and robust inference over wireless links remains challenging. This challenge is further exacerbated for multi-modal edge inference (MMEI) by two factors: 1) prohibitive communication overhead for distributed learning over bandwidth-limited wireless links, due to the multi-modal nature of the system; and 2) limited robustness under varying channels and noisy multi-modal inputs. In this paper, we propose a three-stage communication-aware distributed learning framework to improve training and inference efficiency while maintaining robustness over wireless channels. In Stage I, devices perform local multi-modal self-supervised learning to obtain shared and modality-specific encoders without device–server exchange, thereby reducing the communication cost. In Stage II, distributed fine-tuning with centralized evidential fusion calibrates per-modality uncertainty and reliably aggregates features distorted by noise or channel fading. In Stage III, an uncertainty-guided feedback mechanism selectively requests additional features for uncertain samples, optimizing the communication–accuracy tradeoff in the distributed setting. Experiments on RGB–depth indoor scene classification show that the proposed framework attains higher accuracy with far fewer training communication rounds and remains robust to modality degradation or channel variation, outperforming existing self-supervised and fully supervised baselines.
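A hedged sketch of the evidential-fusion and feedback ideas (Stages II and III): per-modality evidence defines a Dirichlet with $\alpha = e + 1$, and the uncertainty mass $u = K/S$ flags samples for which additional features should be requested. Evidence summation is one simple combination rule; the paper's exact fusion may differ, and all numbers below are illustrative.

```python
import numpy as np

def dirichlet_from_evidence(e):
    """Evidence e >= 0 per class defines a Dirichlet with alpha = e + 1.
    Belief b_k = e_k / S and uncertainty u = K / S, where S = sum(alpha);
    beliefs and u sum to 1 (subjective-logic convention)."""
    S = (e + 1.0).sum()
    return e / S, len(e) / S

def fuse_modalities(evidences):
    """Simple evidential fusion sketch: sum per-modality evidence.
    Subjective-logic combination rules are a common alternative."""
    return dirichlet_from_evidence(np.sum(evidences, axis=0))

# Toy usage with an uncertainty-guided feedback rule (Stage III idea):
rgb_e = np.array([8.0, 1.0, 0.5])    # confident RGB evidence
depth_e = np.array([0.2, 0.3, 0.2])  # weak evidence from a noisy channel
belief, u = fuse_modalities([rgb_e, depth_e])
if u > 0.3:  # uncertain sample: request additional features from devices
    print("request more features")
print(belief, u)
```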

[348] Multimodal Rumor Detection Enhanced by External Evidence and Forgery Features

Han Li, Hua Sun

Main category: cs.LG

TL;DR: Proposes a multimodal rumor detection model using external evidence and forgery features to detect deep semantic mismatch rumors in social media image-text posts.

Motivation: Social media rumors exploit subtle inconsistencies between images and text, with deep semantic mismatch rumors being particularly challenging. Existing methods have limited feature extraction, noisy alignment, inflexible fusion, and ignore external factual evidence needed for complex rumor verification.

Method: Uses ResNet34 visual encoder, BERT text encoder, and forgery feature module (frequency-domain traces + compression artifacts via Fourier transformation). BLIP generates image descriptions to bridge semantic spaces. Dual contrastive learning between text-image and text-description pairs detects inconsistencies. Gated adaptive feature-scaling fusion dynamically adjusts multimodal fusion.

Result: Outperforms mainstream baselines on Weibo and Twitter datasets in macro accuracy, recall, and F1 score.

Conclusion: The proposed model effectively addresses limitations of existing multimodal rumor detection by incorporating external evidence, forgery features, and improved fusion strategies for better detection of complex rumors.

Abstract: Social media increasingly disseminates information through mixed image-text posts, but rumors often exploit subtle inconsistencies and forged content, making detection based solely on post content difficult. Deep semantic mismatch rumors, which superficially align images and texts, pose particular challenges and threaten online public opinion. Existing multimodal rumor detection methods improve cross-modal modeling but suffer from limited feature extraction, noisy alignment, and inflexible fusion strategies, while ignoring external factual evidence necessary for verifying complex rumors. To address these limitations, we propose a multimodal rumor detection model enhanced with external evidence and forgery features. The model uses a ResNet34 visual encoder, a BERT text encoder, and a forgery feature module extracting frequency-domain traces and compression artifacts via Fourier transformation. BLIP-generated image descriptions bridge the image and text semantic spaces. A dual contrastive learning module computes contrastive losses between text-image and text-description pairs, improving detection of semantic inconsistencies. A gated adaptive feature-scaling fusion mechanism dynamically adjusts multimodal fusion and reduces redundancy. Experiments on Weibo and Twitter datasets demonstrate that our model outperforms mainstream baselines in macro accuracy, recall, and F1 score.
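The summary specifies Fourier-based forgery cues without the exact features; as an illustrative stand-in, the sketch below computes a radial energy profile of the 2-D spectrum, a common way to expose periodic recompression or splicing traces. The bin count and log-magnitude choice are assumptions.

```python
import numpy as np

def frequency_forgery_features(img_gray, n_bins=8):
    """Frequency-domain forgery cues (sketch): the radial energy profile
    of the 2-D Fourier spectrum. Splicing and recompression often leave
    periodic traces that appear as anomalies in high-frequency bands."""
    F = np.fft.fftshift(np.fft.fft2(img_gray))
    mag = np.log1p(np.abs(F))
    h, w = mag.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r_norm = r / r.max()
    # Average log-magnitude within concentric frequency bands.
    return np.array([
        mag[(r_norm >= i / n_bins) & (r_norm < (i + 1) / n_bins)].mean()
        for i in range(n_bins)])

# Toy usage: features for a random 'image'; real use would feed these
# into the fusion module alongside the ResNet34 and BERT features.
feats = frequency_forgery_features(np.random.rand(128, 128))
print(feats.shape)  # (8,)
```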

[349] Improving Regret Approximation for Unsupervised Dynamic Environment Generation

Harry Mead, Bruno Lacerda, Jakob Foerster, Nick Hawes

Main category: cs.LG

TL;DR: DEGen improves UED scaling by providing denser reward signals for level generators, while MNA offers better regret approximation to identify challenging levels, outperforming existing methods especially in larger environments.

Motivation: Current UED methods struggle with credit assignment problems and poor regret approximations that fail to identify challenging levels, particularly as environment size grows. These limitations hinder effective curriculum generation for RL agents.

Method: Proposes DEGen (Dynamic Environment Generation) to provide denser reward signals for level generators, reducing credit assignment difficulty. Also introduces MNA (Maximised Negative Advantage) as an improved regret approximation metric that better identifies challenging levels.

Result: Empirical results show MNA outperforms current regret approximations, and DEGen+MNA consistently outperforms existing UED methods, with particularly strong performance gains as environment size increases.

Conclusion: The combination of DEGen and MNA addresses key limitations in UED, enabling better scaling to larger environments and more effective identification of challenging training levels for improved RL generalization.

Abstract: Unsupervised Environment Design (UED) seeks to automatically generate training curricula for reinforcement learning (RL) agents, with the goal of improving generalisation and zero-shot performance. However, designing effective curricula remains a difficult problem, particularly in settings where small subsets of environment parameterisations result in significant increases in the complexity of the required policy. Current methods struggle with a difficult credit assignment problem and rely on regret approximations that fail to identify challenging levels, both of which are compounded as the size of the environment grows. We propose Dynamic Environment Generation for UED (DEGen) to enable a denser level generator reward signal, reducing the difficulty of credit assignment and allowing for UED to scale to larger environment sizes. We also introduce a new regret approximation, Maximised Negative Advantage (MNA), as a significantly improved metric to optimise for, that better identifies more challenging levels. We show empirically that MNA outperforms current regret approximations and when combined with DEGen, consistently outperforms existing methods, especially as the size of the environment grows. We have made all our code available here: https://github.com/HarryMJMead/Dynamic-Environment-Generation-for-UED.

[350] InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Enhong Chen

Main category: cs.LG

TL;DR: InstructTime++ reformulates time series classification as a multimodal generative task using language models, converting continuous sequences to discrete tokens and incorporating implicit feature modeling for better performance.

Motivation: Existing discriminative time series classification methods struggle to incorporate contextual features and capture semantic relationships among classes, limiting their effectiveness.

Method: Proposes InstructTime framework: converts continuous sequences to discrete temporal tokens, uses alignment projection and generative self-supervised pre-training for cross-modal alignment. InstructTime++ adds implicit feature modeling using statistical feature extraction and vision-language-based image captioning to mine patterns and translate them into textual descriptions.

Result: Extensive experiments on multiple benchmark datasets demonstrate superior performance of InstructTime++ compared to existing methods.

Conclusion: Reformulating time series classification as a multimodal generative task with language models and incorporating implicit feature modeling significantly improves performance by better capturing contextual information and semantic relationships.

Abstract: Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
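
To make the discretization step concrete, here is a minimal sketch of converting a continuous sequence into temporal tokens via uniform binning; the vocabulary size and token naming scheme are our assumptions, not InstructTime's actual tokenizer.

```python
import numpy as np

def discretize(series, n_tokens=256):
    """Map a continuous series to a small vocabulary of temporal tokens."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    # Uniform bins over the observed range; boundary values hit the last bin.
    ids = np.clip(((series - lo) / (hi - lo + 1e-12) * n_tokens).astype(int),
                  0, n_tokens - 1)
    return [f"<ts_{i}>" for i in ids]

print(discretize([0.0, 0.2, 0.9, 1.0], n_tokens=8))
# ['<ts_0>', '<ts_1>', '<ts_7>', '<ts_7>']
```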

[351] Fine-Grained Traceability for Transparent ML Pipelines

Liping Chen, Mujie Liu, Haytham Fayek

Main category: cs.LG

TL;DR: FG-Trac is a framework that provides verifiable, sample-level traceability in ML pipelines without modifying models, using cryptographic commitments to track data usage.

DetailsMotivation: Current ML transparency mechanisms operate at model level but lack sample-level traceability, leaving practitioners unable to verify when specific samples were used or whether records remain intact over time.

Method: FG-Trac captures and verifies sample lifecycle events across preprocessing and training, computes contribution scores grounded in training checkpoints, and anchors traces to tamper-evident cryptographic commitments without modifying model architectures.

Result: Experiments on CNN and multimodal graph learning pipelines show FG-Trac preserves predictive performance while enabling verifiable evidence of how individual samples were used during model execution.

Conclusion: FG-Trac provides practical, model-agnostic sample-level traceability for ML pipelines, addressing the gap in verifiable data usage tracking with minimal computational overhead.

Abstract: Modern machine learning systems are increasingly realised as multistage pipelines, yet existing transparency mechanisms typically operate at a model level: they describe what a system is and why it behaves as it does, but not how individual data samples are operationally recorded, tracked, and verified as they traverse the pipeline. This absence of verifiable, sample-level traceability leaves practitioners and users unable to determine whether a specific sample was used, when it was processed, or whether the corresponding records remain intact over time. We introduce FG-Trac, a model-agnostic framework that establishes verifiable, fine-grained sample-level traceability throughout machine learning pipelines. FG-Trac defines an explicit mechanism for capturing and verifying sample lifecycle events across preprocessing and training, computes contribution scores explicitly grounded in training checkpoints, and anchors these traces to tamper-evident cryptographic commitments. The framework integrates without modifying model architectures or training objectives, reconstructing complete and auditable data-usage histories with practical computational overhead. Experiments on a canonical convolutional neural network and a multimodal graph learning pipeline demonstrate that FG-Trac preserves predictive performance while enabling machine learning systems to furnish verifiable evidence of how individual samples were used and propagated during model execution.
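
A toy sketch of tamper-evident, sample-level lifecycle commitments using a simple SHA-256 hash chain; FG-Trac's actual commitment scheme and event schema are not detailed in the abstract, so every name here is illustrative.

```python
import hashlib, json, time

class TraceLog:
    def __init__(self):
        self.head = "0" * 64   # genesis commitment
        self.events = []

    def record(self, sample_id, stage, detail=""):
        event = {"sample": sample_id, "stage": stage,
                 "detail": detail, "ts": time.time(), "prev": self.head}
        # Each commitment hashes the event together with the previous head,
        # so editing any past record invalidates every later commitment.
        self.head = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode()).hexdigest()
        self.events.append((event, self.head))
        return self.head

    def verify(self):
        prev = "0" * 64
        for event, commitment in self.events:
            if event["prev"] != prev or commitment != hashlib.sha256(
                    json.dumps(event, sort_keys=True).encode()).hexdigest():
                return False
            prev = commitment
        return True

log = TraceLog()
log.record("img_0042", "preprocess", "resize+normalize")
log.record("img_0042", "train", "epoch=1,batch=17")
print(log.verify())  # True; altering any stored field makes this False
```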

[352] Lineup Regularized Adjusted Plus-Minus (L-RAPM): Basketball Lineup Ratings with Informed Priors

Christos Petridis, Konstantinos Pelechrinis

Main category: cs.LG

TL;DR: The paper introduces L-RAPM, a regression-based method for evaluating basketball lineups that accounts for opposition quality and player information to address data sparsity from frequent substitutions.

DetailsMotivation: Current lineup evaluation in basketball suffers from highly sparse data due to frequent substitutions - NBA teams use over 600 lineups per season, with each lineup averaging only 25-30 possessions. This results in noisy statistics with low predictive value, and there's no existing public work addressing this problem.

Method: Proposes L-RAPM, a regression-based approach that controls for the opposition faced by each lineup while also utilizing information about the players making up the lineups.

Result: L-RAPM provides improved predictive power compared to currently used baselines, with the improvement increasing as the sample size for lineups gets smaller.

Conclusion: The proposed regression-based method effectively addresses the data sparsity problem in lineup evaluation by incorporating opposition quality and player information, offering better predictive performance especially for lineups with limited playing time.

Abstract: Identifying combinations of players (that is, lineups) in basketball - and other sports - that perform well when they play together is one of the most important tasks in sports analytics. One of the main challenges associated with this task is the frequent substitutions that occur during a game, resulting in highly sparse data. In particular, a National Basketball Association (NBA) team will use more than 600 lineups during a season, which translates to an average lineup having seen the court in approximately 25-30 possessions. Inevitably, any statistics that one collects for these lineups are going to be noisy, with low predictive value. Yet, there is no existing work (at least publicly) that addresses this problem. In this work, we propose a regression-based approach that controls for the opposition faced by each lineup, while also utilizing information about the players making up the lineups. Our experiments show that L-RAPM provides improved predictive power over the currently used baseline, and this improvement increases as the sample size for the lineups gets smaller.
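
For intuition, a minimal ridge-regression sketch of the adjusted plus-minus family that L-RAPM belongs to; the informed player priors of the paper are collapsed into a plain shrinkage-to-zero penalty here, so this illustrates the baseline idea rather than the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stints, n_lineups = 500, 40
X = np.zeros((n_stints, n_lineups))
for i in range(n_stints):
    off, dfn = rng.choice(n_lineups, size=2, replace=False)
    X[i, off], X[i, dfn] = 1.0, -1.0          # offense (+1) vs. defense (-1)
y = rng.normal(1.1, 0.2, size=n_stints)       # points per possession

lam = 100.0                                    # shrinkage toward 0 (the prior)
# Closed-form ridge solution: (X'X + lam*I)^{-1} X'y.
ratings = np.linalg.solve(X.T @ X + lam * np.eye(n_lineups), X.T @ y)
print(ratings[:5])                             # regularized lineup ratings
```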

[353] RadixMLP – Intra-batch Deduplication for Causal Transformers

Michael Feil, Julius Lipp

Main category: cs.LG

TL;DR: RadixMLP eliminates redundant MLP computations for shared prefixes in batch inference by compressing identical segments into a single computation, achieving 1.44-1.59× speedups in realistic workloads.

DetailsMotivation: Batch inference workloads for causal transformers often process sequences with common prefixes (system prompts, few-shot examples, shared queries), but standard engines redundantly recompute identical MLP activations for each copy of shared prefixes.

Method: RadixMLP exploits position-wise nature of MLPs, LayerNorms, linear projections, and embeddings to eliminate redundancy. It dynamically maps batches to a prefix trie, gathering shared segments into compressed representation for position-wise computation and scattering results back at attention boundaries.

Result: In MS MARCO v1.1 reranking benchmarks with Qwen3 models (0.6B to 8B parameters), RadixMLP achieves 1.44-1.59× speedups in realistic workloads, with up to 5× speedups on synthetic benchmarks with longer shared prefixes.

Conclusion: RadixMLP provides an effective stateless technique to eliminate redundant computations in batch inference for transformer models with shared prefixes, offering significant speedups without requiring multiple forward passes.

Abstract: Batch inference workloads for causal transformer models frequently process sequences that share common prefixes, such as system prompts, few-shot examples, or shared queries. Standard inference engines treat each sequence independently, redundantly recomputing identical MLP activations for every copy of the shared prefix. We introduce RadixMLP, a technique that exploits the position-wise nature of MLPs, LayerNorms, linear projections, and embeddings to eliminate this redundancy. RadixMLP dynamically maps batches to a prefix trie, gathering shared segments into a compressed representation for position-wise computation and scattering results back only at attention boundaries. RadixMLP is stateless and operates within a single forward pass. In end-to-end serving benchmarks on MS MARCO v1.1 with Qwen3 models (0.6B to 8B parameters), RadixMLP achieves 1.44-1.59× speedups in realistic reranking workloads, with up to 5× speedups on synthetic benchmarks with longer shared prefixes. Our code is available at https://github.com/michaelfeil/radix-mlp.
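
A toy gather/compute/scatter illustrating why deduplication is sound for position-wise layers: rows with identical token prefixes produce identical outputs, so each unique prefix is computed once. The real system builds a prefix trie inside the forward pass; this numpy version with per-row dummy features is only a sketch.

```python
import numpy as np

def dedup_positionwise(token_ids, mlp):
    """token_ids: (batch, seq) ints; mlp: any position-wise row function."""
    batch, seq = token_ids.shape
    # Key each (sequence, position) by its full token prefix: under
    # position-wise layers, equal prefixes imply equal hidden states.
    keys = np.array([hash(tuple(token_ids[b, :t + 1]))
                     for b in range(batch) for t in range(seq)])
    uniq, inverse = np.unique(keys, return_inverse=True)
    # Stand-in "hidden states": one feature per row (its last token id).
    feats = token_ids.reshape(-1, 1).astype(float)
    first_rows = np.array([np.argmax(inverse == u) for u in range(len(uniq))])
    out_unique = mlp(feats[first_rows])      # compute each prefix once
    return out_unique[inverse].reshape(batch, seq, -1)  # scatter back

batch = np.array([[7, 7, 1], [7, 7, 2]])    # shared 2-token prefix
out = dedup_positionwise(batch, mlp=lambda x: x * 2.0)
print(out.shape)  # (2, 3, 1); positions 0-1 were computed once, not twice
```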

[354] Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control

Jannis Becktepe, Aleksandra Franz, Nils Thuerey, Sebastian Peitz

Main category: cs.LG

TL;DR: FluidGym is a standalone, fully differentiable benchmark suite for RL in active flow control that eliminates dependency on external CFD solvers and provides standardized evaluation protocols.

DetailsMotivation: Current RL research in active flow control suffers from heterogeneous setups, reliance on external CFD solvers, lack of differentiability, and limited 3D/multi-agent support, making progress difficult to assess.

Method: Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack with no external CFD software, providing standardized evaluation protocols and baseline implementations.

Result: The authors present baseline results with PPO and SAC algorithms and release all environments, datasets, and trained models as public resources for the research community.

Conclusion: FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future learning-based flow control research, and is available as an open-source benchmark suite.

Abstract: Reinforcement learning (RL) has shown promising results in active flow control (AFC), yet progress in the field remains difficult to assess as existing studies rely on heterogeneous observation and actuation schemes, numerical setups, and evaluation protocols. Current AFC benchmarks attempt to address these issues but heavily rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support. To overcome these limitations, we introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC. Built entirely in PyTorch on top of the GPU-accelerated PICT solver, FluidGym runs in a single Python stack, requires no external CFD software, and provides standardized evaluation protocols. We present baseline results with PPO and SAC and release all environments, datasets, and trained models as public resources. FluidGym enables systematic comparison of control methods, establishes a scalable foundation for future research in learning-based flow control, and is available at https://github.com/safe-autonomous-systems/fluidgym.

[355] Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization

Adam Rokah, Daniel Veress, Caleb Caulk, Sourav Sharan

Main category: cs.LG

TL;DR: MoE architectures in image classification achieve slightly better accuracy than dense models with balanced expert utilization, but show different sharpness characteristics without clear generalization benefits, and naive conditional routing doesn’t provide inference speedups at small scale.

DetailsMotivation: To study MoE behavior in image classification (not just language models), focusing on predictive performance, expert utilization, and generalization, and to analyze the gap between theoretical and realized efficiency in sparse MoE models.

Method: Compared dense, SoftMoE, and SparseMoE classifier heads on CIFAR10 under comparable model capacity. Used regularization to maintain balanced expert utilization. Analyzed generalization using Hessian-based sharpness metrics (largest eigenvalue and trace) and loss surface perturbation analyses. Evaluated empirical inference efficiency.

Result: Both MoE variants achieved slightly higher validation accuracy than dense baseline while avoiding expert collapse. SoftMoE exhibited higher sharpness metrics, while Dense and SparseMoE had similar curvature. All models achieved comparable generalization performance. Naive conditional routing didn’t yield inference speedups on modern hardware at this scale.

Conclusion: MoE architectures can achieve competitive performance in image classification with balanced expert utilization, but show different loss landscape characteristics. The gap between theoretical and realized efficiency highlights practical challenges in implementing sparse MoE models at small scales.

Abstract: Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
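
The sharpness metric (largest Hessian eigenvalue) can be obtained without ever materializing the Hessian; below is a standard power-iteration sketch using Hessian-vector products in PyTorch. The paper's exact measurement protocol may differ.

```python
import torch

def top_hessian_eig(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue via power iteration."""
    flat_dim = sum(p.numel() for p in params)
    v = torch.randn(flat_dim)
    v /= v.norm()
    # First backward pass keeps the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    for _ in range(iters):
        # Hessian-vector product: d(grad . v)/d(params) = H v.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv]).detach()
        eig, v = hv.norm(), hv / (hv.norm() + 1e-12)
    return eig.item()

# Toy usage on a tiny model and batch.
model = torch.nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(top_hessian_eig(loss, list(model.parameters())))
```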

[356] Factorizable joint shift revisited

Dirk Tasche

Main category: cs.LG

TL;DR: The paper proposes a framework for analyzing distribution shift in general label spaces (covering both classification and regression), generalizes existing factorizable joint shift results to these spaces, extends EM algorithms for class prior probabilities, and re-examines generalized label shift.

DetailsMotivation: Previous research on factorizable joint shift (FJS) has been limited to categorical label spaces, leaving a gap for analyzing distribution shifts in general label spaces that include both classification and regression problems.

Method: Develops a framework for analyzing distribution shift in general label spaces, generalizes existing FJS results to these spaces, proposes an extension of the EM algorithm for class prior probabilities, and re-examines generalized label shift in this broader context.

Result: The framework enables analysis of distribution shift beyond categorical labels, extends FJS theory to general label spaces, provides algorithmic extensions for prior probability estimation, and offers new insights into generalized label shift.

Conclusion: The proposed framework successfully addresses the limitation of previous FJS research by extending distribution shift analysis to general label spaces, providing theoretical generalizations and practical algorithmic extensions for both classification and regression problems.

Abstract: Factorizable joint shift (FJS) was proposed as a type of distribution shift (or dataset shift) that comprises both covariate and label shift. Recently, it has been observed that FJS actually arises from consecutive label and covariate (or vice versa) shifts. Research into FJS so far has been confined to the case of categorical label spaces. We propose a framework for analysing distribution shift in the case of general label spaces, thus covering both classification and regression models. Based on the framework, we generalise existing results on FJS to general label spaces and propose a related extension of the expectation maximisation (EM) algorithm for class prior probabilities. We also take a fresh look at generalized label shift (GLS) in the case of general label spaces.
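
For reference, the classical EM update for class prior estimation under label shift, which the paper extends beyond categorical label spaces; this numpy version is the standard Saerens-Latinne-Decaestecker procedure, not the paper's generalisation.

```python
import numpy as np

def em_priors(probs_src, pi_src, iters=100):
    """probs_src: (n, K) source-model posteriors on target inputs.
    pi_src: (K,) source class priors. Returns estimated target priors."""
    pi = pi_src.copy()
    for _ in range(iters):
        # E-step: reweight posteriors by the current prior ratio.
        w = probs_src * (pi / pi_src)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average responsibilities.
        pi = w.mean(axis=0)
    return pi

probs = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.4, 0.6]])
print(em_priors(probs, pi_src=np.array([0.5, 0.5])))
```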

[357] A Curriculum-Based Deep Reinforcement Learning Framework for the Electric Vehicle Routing Problem

Mertcan Daysalilar, Fuat Uyguroglu, Gabriel Nicolosi, Adam Meyers

Main category: cs.LG

TL;DR: Curriculum-based DRL framework for EVRPTW improves training stability and generalization by gradually increasing problem complexity through three learning phases.

DetailsMotivation: Existing DRL models for EVRPTW struggle with training instability and poor generalization when constraints are dense, failing to converge or maintain feasibility.

Method: Three-phase curriculum learning: Phase A (distance/fleet optimization), Phase B (battery management), Phase C (full EVRPTW). Uses modified PPO with phase-specific hyperparameters, value/advantage clipping, adaptive learning rates, and heterogeneous graph attention encoder with global-local attention and feature-wise linear modulation.

Result: Trained only on small instances (N=10), model generalizes robustly to unseen instances (N=5 to N=100), outperforming baselines on medium-scale problems with high feasibility rates and competitive solution quality.

Conclusion: Curriculum-guided DRL effectively bridges neural speed and operational reliability, achieving stable learning and strong generalization where standard DRL baselines fail on out-of-distribution instances.

Abstract: The electric vehicle routing problem with time windows (EVRPTW) is a complex optimization problem in sustainable logistics, where routing decisions must minimize total travel distance, fleet size, and battery usage while satisfying strict customer time constraints. Although deep reinforcement learning (DRL) has shown great potential as an alternative to classical heuristics and exact solvers, existing DRL models often struggle to maintain training stability, failing to converge or generalize when constraints are dense. In this study, we propose a curriculum-based deep reinforcement learning (CB-DRL) framework designed to resolve this instability. The framework utilizes a structured three-phase curriculum that gradually increases problem complexity: the agent first learns distance and fleet optimization (Phase A), then battery management (Phase B), and finally the full EVRPTW (Phase C). To ensure stable learning across phases, the framework employs a modified proximal policy optimization algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning-rate scheduling. The policy network is built upon a heterogeneous graph attention encoder enhanced by global-local attention and feature-wise linear modulation. This specialized architecture explicitly captures the distinct properties of depots, customers, and charging stations. Trained exclusively on small instances with N=10 customers, the model demonstrates robust generalization to unseen instances ranging from N=5 to N=100, significantly outperforming standard baselines on medium-scale problems. Experimental results confirm that this curriculum-guided approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances where standard DRL baselines fail, effectively bridging the gap between neural speed and operational reliability.
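
A schematic of what a three-phase curriculum configuration might look like in code; the phase boundaries, learning rates, and clipping values below are placeholders, not the paper's settings.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    battery_limits: bool    # enforce battery constraints?
    time_windows: bool      # enforce customer time windows?
    lr: float               # phase-specific learning rate
    clip: float             # PPO clipping range

CURRICULUM = [
    Phase("A: distance + fleet", battery_limits=False, time_windows=False,
          lr=3e-4, clip=0.2),
    Phase("B: + battery",        battery_limits=True,  time_windows=False,
          lr=1e-4, clip=0.15),
    Phase("C: full EVRPTW",      battery_limits=True,  time_windows=True,
          lr=5e-5, clip=0.1),
]

def phase_for(step, boundaries=(100_000, 300_000)):
    # Advance one phase each time a training-step boundary is crossed.
    return CURRICULUM[sum(step >= b for b in boundaries)]

print(phase_for(250_000).name)  # "B: + battery"
```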

[358] HyperNet-Adaptation for Diffusion-Based Test Case Generation

Oliver Weißl, Vincenzo Riccio, Severin Kacianka, Andrea Stocco

Main category: cs.LG

TL;DR: HyNeA is a generative testing method using hypernetworks to control diffusion models for efficient, dataset-free generation of realistic failure cases in deep learning systems.

DetailsMotivation: Traditional adversarial attacks create unrealistic perturbations and only assess robustness, while existing generative methods are limited to simple datasets or constrained domains. Diffusion models offer high-fidelity synthesis but are computationally expensive and lack controllability for large-scale testing.

Method: HyNeA uses hypernetworks to provide direct, dataset-free controllability over diffusion-based generation without architecture-specific conditioning or fine-tuning. It employs a distinct training strategy supporting instance-level tuning to identify failure-inducing test cases without needing failure-labeled datasets.

Result: HyNeA improves controllability and test diversity compared to existing generative test generators, generalizes to domains without failure-labeled training data, and generates realistic failure cases at substantially lower computational cost than search-based methods.

Conclusion: HyNeA enables efficient, targeted generation of realistic failure cases for systematic evaluation of deep learning reliability, addressing limitations of traditional adversarial attacks and existing generative testing methods.

Abstract: The increasing deployment of deep learning systems requires systematic evaluation of their reliability in real-world scenarios. Traditional gradient-based adversarial attacks introduce small perturbations that rarely correspond to realistic failures and mainly assess robustness rather than functional behavior. Generative test generation methods offer an alternative but are often limited to simple datasets or constrained input domains. Although diffusion models enable high-fidelity image synthesis, their computational cost and limited controllability restrict their applicability to large-scale testing. We present HyNeA, a generative testing method that enables direct and efficient control over diffusion-based generation. HyNeA provides dataset-free controllability through hypernetworks, allowing targeted manipulation of the generative process without relying on architecture-specific conditioning mechanisms or dataset-driven adaptations such as fine-tuning. HyNeA employs a distinct training strategy that supports instance-level tuning to identify failure-inducing test cases without requiring datasets that explicitly contain examples of similar failures. This approach enables the targeted generation of realistic failure cases at substantially lower computational cost than search-based methods. Experimental results show that HyNeA improves controllability and test diversity compared to existing generative test generators and generalizes to domains where failure-labeled training data is unavailable.
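
To illustrate the hypernetwork mechanism in isolation: a small trainable network emits per-channel scale and shift for the features of a frozen layer. How HyNeA attaches this to a diffusion model is not reproduced here; all names are ours.

```python
import torch
import torch.nn as nn

class HyperModulator(nn.Module):
    def __init__(self, cond_dim, target_dim):
        super().__init__()
        # Maps a conditioning vector to per-channel scale and shift.
        self.net = nn.Sequential(nn.Linear(cond_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * target_dim))

    def forward(self, h, cond):
        scale, shift = self.net(cond).chunk(2, dim=-1)
        return h * (1 + scale) + shift    # modulate the frozen features

frozen = nn.Linear(32, 32).requires_grad_(False)   # stands in for pretrained
hyper = HyperModulator(cond_dim=8, target_dim=32)  # only this is trained
h = frozen(torch.randn(4, 32))
print(hyper(h, torch.randn(4, 8)).shape)           # torch.Size([4, 32])
```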

[359] LoRAP: Low-Rank Aggregation Prompting for Quantized Graph Neural Networks Training

Chenyu Liu, Haige Li, Luca Rossi

Main category: cs.LG

TL;DR: LoRAP improves GNN quantization performance by injecting low-rank prompts into aggregated features during quantization-aware training.

DetailsMotivation: GNN quantization reduces model size for resource-constrained environments, but quantizing graph features is challenging. Existing approaches that only prompt node features don't fully optimize quantized aggregation results.

Method: Proposes Low-Rank Aggregation Prompting (LoRAP) - injects lightweight, input-dependent prompts into each aggregated feature to optimize quantized aggregation results during QAT.

Result: Extensive evaluations on 4 QAT frameworks over 9 graph datasets show LoRAP consistently enhances low-bit quantized GNN performance with minimal computational overhead.

Conclusion: LoRAP effectively improves GNN quantization performance by addressing limitations of node-only prompting through aggregation-level prompt injection.

Abstract: Graph Neural Networks (GNNs) are neural networks that aim to process graph data, capturing the relationships and interactions between nodes using the message-passing mechanism. GNN quantization has emerged as a promising approach for reducing model size and accelerating inference in resource-constrained environments. Compared with quantization in LLMs, quantization in GNNs places greater emphasis on the graph features themselves. Motivated by this, we propose to leverage prompt learning, which manipulates the input data, to improve the performance of quantization-aware training (QAT) for GNNs. Because prompting the node features alone can make only part of the quantized aggregation result optimal, we introduce Low-Rank Aggregation Prompting (LoRAP), which injects lightweight, input-dependent prompts into each aggregated feature to optimize the results of quantized aggregations. Extensive evaluations on 4 leading QAT frameworks over 9 graph datasets demonstrate that LoRAP consistently enhances the performance of low-bit quantized GNNs while introducing minimal computational overhead.
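
A minimal PyTorch sketch of the low-rank prompt idea as we read it: an input-dependent low-rank correction added to the aggregated feature. The rank, placement, and interaction with the quantizer are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LowRankAggPrompt(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)     # rank -> dim

    def forward(self, h_agg):
        # Input-dependent low-rank prompt injected into the aggregation
        # result, so the subsequently quantized output can be adjusted.
        return h_agg + self.up(self.down(h_agg))

h_agg = torch.randn(128, 64)                  # aggregated node features
print(LowRankAggPrompt(64)(h_agg).shape)      # torch.Size([128, 64])
```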

[360] Memory Retention Is Not Enough to Master Memory Tasks in Reinforcement Learning

Oleg Shchendrigin, Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov

Main category: cs.LG

TL;DR: Existing RL benchmarks focus on memory retention, but real-world decision-making requires both stable memory and adaptive updating. This paper introduces a benchmark for testing memory rewriting under partial observability and finds classic recurrent models outperform modern structured memories and transformers in this critical capability.

DetailsMotivation: Real-world decision-making requires memory that is both stable (retaining information over long horizons) and adaptive (updating outdated content when circumstances change). Current RL benchmarks and memory-augmented agents focus primarily on retention, leaving memory rewriting largely unexplored despite its equal importance.

Method: The authors introduce a benchmark that explicitly tests continual memory updating under partial observability, where agents must rely on memory rather than current observations. They compare three types of memory architectures: recurrent models, transformer-based architectures, and structured memory architectures.

Result: Classic recurrent models demonstrate greater flexibility and robustness in memory rewriting tasks compared to modern structured memories (which succeed only under narrow conditions) and transformer-based agents (which often fail beyond trivial retention cases).

Conclusion: Current approaches have fundamental limitations in balancing stable retention with adaptive updating. The work highlights this overlooked challenge, introduces benchmarks to evaluate memory rewriting, and suggests future RL agents need explicit and trainable forgetting mechanisms.

Abstract: Effective decision-making in the real world depends on memory that is both stable and adaptive: environments change over time, and agents must retain relevant information over long horizons while also updating or overwriting outdated content when circumstances shift. Existing Reinforcement Learning (RL) benchmarks and memory-augmented agents focus primarily on retention, leaving the equally critical ability of memory rewriting largely unexplored. To address this gap, we introduce a benchmark that explicitly tests continual memory updating under partial observability, i.e. the natural setting where an agent must rely on memory rather than current observations, and use it to compare recurrent, transformer-based, and structured memory architectures. Our experiments reveal that classic recurrent models, despite their simplicity, demonstrate greater flexibility and robustness in memory rewriting tasks than modern structured memories, which succeed only under narrow conditions, and transformer-based agents, which often fail beyond trivial retention cases. These findings expose a fundamental limitation of current approaches and emphasize the necessity of memory mechanisms that balance stable retention with adaptive updating. Our work highlights this overlooked challenge, introduces benchmarks to evaluate it, and offers insights for designing future RL agents with explicit and trainable forgetting mechanisms. Code: https://quartz-admirer.github.io/Memory-Rewriting/

[361] Auditing Language Model Unlearning via Information Decomposition

Anmol Goel, Alan Ritter, Iryna Gurevych

Main category: cs.LG

TL;DR: Current machine unlearning methods in language models fail to truly erase forgotten data - information remains linearly decodable from internal representations, creating privacy risks.

DetailsMotivation: To address the critical limitation that current unlearning approaches don't actually remove information about forgotten data from language models, despite apparent algorithmic success.

Method: Introduce an interpretable, information-theoretic framework using Partial Information Decomposition (PID) to audit unlearning by comparing model representations before and after unlearning, decomposing mutual information into distinct components.

Result: Reveals that redundant information shared across both models persists as residual knowledge post-unlearning and correlates with susceptibility to adversarial reconstruction attacks.

Conclusion: Proposes a representation-based risk score for abstention on sensitive inputs and introduces a principled, representation-level audit framework for safer deployment of language models with theoretical insights and practical tools.

Abstract: We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
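
The "linearly decodable" claim suggests a simple audit primitive: fit a linear probe on the unlearned model's representations and check whether forget-set membership is still predictable. The sketch below, with synthetic stand-in representations, shows that probing step only, not the PID decomposition itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for hidden states of forget-set vs. retain-set inputs, taken
# from the unlearned model; a real audit would extract these from the LM.
reps_forget = rng.normal(0.3, 1.0, size=(500, 64))
reps_retain = rng.normal(0.0, 1.0, size=(500, 64))
X = np.vstack([reps_forget, reps_retain])
y = np.array([1] * 500 + [0] * 500)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
# Accuracy near 0.5 would indicate the information was actually erased.
print(f"probe accuracy: {probe.score(Xte, yte):.2f}")
```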

[362] Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation

Haonan Yuan, Qingyun Sun, Jiacheng Tao, Xingcheng Fu, Jianxin Li

Main category: cs.LG

TL;DR: RAG-GFM is a retrieval-augmented graph foundation model that externalizes knowledge from parameters using dual-modal retrieval (semantic text + structural motifs) to overcome in-memory bottlenecks and improve scalability.

DetailsMotivation: Current Graph Foundation Models (GFMs) suffer from in-memory bottlenecks where they encode knowledge into model parameters, leading to limited semantic capacity, heavy lossy compression with conflicts, and entangled representations that hinder efficient adaptation, scalability, and interpretability.

Method: Proposes RAG-GFM with: 1) Dual-modal unified retrieval module (semantic store from prefix-structured text + structural store from centrality-based motifs), 2) Dual-view alignment objective to preserve heterogeneous information by contrasting both modalities, 3) In-context augmentation for downstream adaptation using retrieved texts and motifs as contextual evidence.

Result: Extensive experiments on five benchmark graph datasets show RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification, achieving superior effectiveness and efficiency.

Conclusion: RAG-GFM successfully addresses in-memory bottlenecks in GFMs by externalizing knowledge through retrieval-augmented generation, enabling better scalability, interpretability, and adaptation while maintaining strong performance across diverse graph tasks.

Abstract: Graph Foundation Models (GFMs) have emerged as a frontier in graph learning and are expected to deliver transferable representations across diverse tasks. However, GFMs remain constrained by in-memory bottlenecks: they attempt to encode knowledge into model parameters, which limits semantic capacity, introduces heavy lossy compression with conflicts, and entangles graph representation with the knowledge in ways that hinder efficient adaptation, undermining scalability and interpretability. In this work, we propose RAG-GFM, a Retrieval-Augmented Generation aided Graph Foundation Model that offloads knowledge from parameters and complements parameterized learning. To externalize graph knowledge, we build a dual-modal unified retrieval module, comprising a semantic store built from prefix-structured text and a structural store built from centrality-based motifs. To preserve heterogeneous information, we design a dual-view alignment objective that contrasts both modalities to capture both content and relational patterns. To enable efficient downstream adaptation, we perform in-context augmentation to enrich supporting instances with retrieved texts and motifs as contextual evidence. Extensive experiments on five benchmark graph datasets demonstrate that RAG-GFM consistently outperforms 13 state-of-the-art baselines in both cross-domain node and graph classification, achieving superior effectiveness and efficiency.

[363] DeepFedNAS: A Unified Framework for Principled, Hardware-Aware, and Predictor-Free Federated Neural Architecture Search

Bostan Khan, Masoud Daneshtalab

Main category: cs.LG

TL;DR: DeepFedNAS is a novel two-phase framework for Federated Neural Architecture Search that addresses bottlenecks in supernet training and subnet discovery through Pareto-optimal curriculum learning and predictor-free search, achieving SOTA accuracy with 61x speedup.

DetailsMotivation: Current FedNAS approaches suffer from unguided supernet training leading to suboptimal models, and costly multi-hour pipelines for post-training subnet discovery, making hardware-aware FL deployments impractical.

Method: Two-phase framework: 1) Federated Pareto Optimal Supernet Training using pre-computed Pareto-optimal architectures as intelligent curriculum, 2) Predictor-Free Search Method using multi-objective fitness function as direct accuracy proxy for zero-cost subnet discovery.

Result: Achieves state-of-the-art accuracy (up to 1.21% improvement on CIFAR-100), superior parameter/communication efficiency, 61x speedup in total pipeline time, reducing from 20+ hours to ~20 minutes, with 20-second individual subnet searches.

Conclusion: DeepFedNAS makes hardware-aware FL deployments instantaneous and practical by addressing critical bottlenecks in FedNAS through principled multi-objective optimization and efficient search mechanisms.

Abstract: Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a principled, multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal cache of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: https://github.com/bostankhan6/DeepFedNAS

[364] CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning

Tianshi Xu, Yuteng Chen, Meng Li

Main category: cs.LG

TL;DR: CLEANER is a method that uses LLMs’ self-correction capabilities to purify noisy trajectories in agentic RL by replacing failures with successful self-corrections, improving policy optimization for parameter-constrained models.

DetailsMotivation: Parameter-constrained LLMs (4B-7B) in agentic RL suffer from frequent execution failures during exploration, creating noisy trajectories that cause credit assignment problems. Existing solutions face trade-offs: dense rewards lead to reward hacking, while supersampling is computationally expensive.

Method: CLEANER uses Similarity-Aware Adaptive Rollback (SAAR) to autonomously construct clean trajectories by retrospectively replacing failures with successful self-corrections. SAAR adaptively regulates replacement granularity based on semantic similarity, from shallow execution repairs to deep reasoning substitutions.

Result: Achieves average accuracy gains of 6% on AIME24/25, 3% on GPQA, and 5% on LiveCodeBench over baselines. Notably matches state-of-the-art performance using only one-third of the training steps.

Conclusion: Trajectory purification via self-correction is a scalable solution for efficient agentic RL, allowing models to internalize correct reasoning patterns rather than error-recovery loops, significantly improving training efficiency and performance.

Abstract: Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B–7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model’s intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available at GitHub
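
A toy version of the rollback-and-splice idea: failed segments are replaced by their successful self-corrections, with a similarity score deciding whether the replacement counts as a shallow repair or a deep substitution. CLEANER's SAAR uses semantic similarity; difflib string similarity is a stand-in here, and all names are ours.

```python
from difflib import SequenceMatcher

def purify(segments, retries, threshold=0.6):
    """segments: [(text, ok)] in rollout order; retries: {index: correction}."""
    clean, depths = [], []
    for i, (text, ok) in enumerate(segments):
        if ok or i not in retries:
            clean.append(text)
            continue
        sim = SequenceMatcher(None, text, retries[i]).ratio()
        # Similar retry: a shallow execution repair spliced in place;
        # dissimilar retry: a deep reasoning-level substitution.
        depths.append("shallow" if sim >= threshold else "deep")
        clean.append(retries[i])
    return clean, depths

segs = [("x = load()", True), ("y = x.mean(axs=0)", False)]
fixes = {1: "y = x.mean(axis=0)"}
print(purify(segs, fixes))
# (['x = load()', 'y = x.mean(axis=0)'], ['shallow'])
```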

[365] Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

Main category: cs.LG

TL;DR: Transformers trained with RL on sparse rewards can spontaneously develop Chain-of-Thought reasoning, and this paper analyzes the gradient dynamics behind this emergence using a synthetic graph traversal task.

DetailsMotivation: The mechanism by which sparse rewards drive gradient descent to discover systematic Chain-of-Thought reasoning in Transformers remains poorly understood, despite empirical observations of this phenomenon.

Method: Analyze gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that requires Chain-of-Thought but has a simple iterative solution. Prove theoretical convergence properties and characterize distributional requirements.

Result: Despite training only on final-answer correctness, gradient flow converges to a structured, interpretable algorithm that iteratively traverses graphs vertex-by-vertex. The emergence depends on having sufficient “simple examples” (instances requiring fewer reasoning steps) in the training distribution.

Conclusion: The theoretical findings explain how sparse rewards can drive the emergence of systematic reasoning in Transformers, with practical validation on synthetic data and real-world language models on mathematical reasoning tasks.

Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of “simple examples”: instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

[366] ZENITH: Automated Gradient Norm Informed Stochastic Optimization

Dhrubo Saha

Main category: cs.LG

TL;DR: ZENITH optimizer adapts learning rate using gradient norm evolution, achieving better accuracy faster than baselines across vision tasks with regularization compatibility.

DetailsMotivation: Existing adaptive optimizers have issues: computational/memory overhead, incompatibility with regularization, and suboptimal learning rate choices. Manual learning rate scheduling requires oversight or hyperparameter tuning.

Method: ZENITH (Zero-overhead Evolution using Norm-Informed Training History) adapts learning rate using temporal evolution of gradient norm, avoiding overhead of existing adaptive methods.

Result: Achieves higher test accuracy in lower wall-clock time across 6 CNN architectures and 6 benchmarks. Superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO with R-CNN models. Compatible with regularization for better generalization.

Conclusion: ZENITH provides an effective, efficient alternative to existing adaptive optimizers with better performance, faster training, and regularization compatibility.

Abstract: Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.
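
A gradient-norm-informed learning-rate rule of our own devising, only to illustrate the kind of signal ZENITH exploits; the authors' actual update is not specified in the abstract, so treat this as a sketch, not the method.

```python
import torch

class NormInformedLR:
    """Scale the LR by comparing the current gradient norm to its history."""
    def __init__(self, optimizer, beta=0.9):
        self.opt, self.beta, self.ema = optimizer, beta, None

    def step(self, model):
        gnorm = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in model.parameters()
                               if p.grad is not None))
        self.ema = gnorm if self.ema is None else \
            self.beta * self.ema + (1 - self.beta) * gnorm
        # Shrink the LR when the current norm spikes above its history,
        # grow it gently when training has flattened out.
        ratio = (self.ema / (gnorm + 1e-12)).clamp(0.5, 1.5).item()
        for group in self.opt.param_groups:
            group["lr"] *= ratio ** 0.1   # damped multiplicative adjustment
        self.opt.step()

model = torch.nn.Linear(10, 1)
sched = NormInformedLR(torch.optim.SGD(model.parameters(), lr=0.1))
loss = model(torch.randn(32, 10)).pow(2).mean()
loss.backward()
sched.step(model)
print(sched.opt.param_groups[0]["lr"])
```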

[367] Recommending Best Paper Awards for ML/AI Conferences via the Isotonic Mechanism

Garrett G. Wen, Buxin Su, Natalie Collina, Zhun Deng, Weijie Su

Main category: cs.LG

TL;DR: The paper proposes an author-assisted mechanism using the Isotonic Mechanism to improve best paper award selection by eliciting truthful author rankings of their own submissions to adjust review scores.

DetailsMotivation: Large ML/AI conferences receive tens of thousands of submissions, making peer review quality and consistency challenging. Best paper award selection has become increasingly debated, requiring better mechanisms to identify truly excellent papers.

Method: Uses the Isotonic Mechanism to elicit authors’ truthful rankings of their own submissions. These rankings adjust raw review scores to better estimate ground-truth paper quality. The mechanism is extended to handle overlapping authorship scenarios.

Result: Authors are incentivized to report truthfully when utility is convex additive. For single-quota cases, truthfulness holds even with nondecreasing additive utility functions. Validation using ICLR (2019-2023) and NeurIPS (2021-2023) data supports convexity assumption. Simulations show significant improvement in award paper quality.

Conclusion: The author-assisted mechanism provides a practical solution to improve best paper award selection by leveraging authors’ private information about their own work while ensuring truthful reporting incentives, with relaxed assumptions compared to prior work.

Abstract: Machine learning and artificial intelligence conferences such as NeurIPS and ICML now regularly receive tens of thousands of submissions, posing significant challenges to maintaining the quality and consistency of the peer review process. This challenge is particularly acute for best paper awards, which are an important part of the peer review process, yet whose selection has increasingly become a subject of debate in recent years. In this paper, we introduce an author-assisted mechanism to facilitate the selection of best paper awards. Our method employs the Isotonic Mechanism for eliciting authors’ assessments of their own submissions in the form of a ranking, which is subsequently utilized to adjust the raw review scores for optimal estimation of the submissions’ ground-truth quality. We demonstrate that authors are incentivized to report truthfully when their utility is a convex additive function of the adjusted scores, and we validate this convexity assumption for best paper awards using publicly accessible review data of ICLR from 2019 to 2023 and NeurIPS from 2021 to 2023. Crucially, in the special case where an author has a single quota – that is, may nominate only one paper – we prove that truthfulness holds even when the utility function is merely nondecreasing and additive. This finding represents a substantial relaxation of the assumptions required in prior work. For practical implementation, we extend our mechanism to accommodate the common scenario of overlapping authorship. Finally, simulation results demonstrate that our mechanism significantly improves the quality of papers selected for awards.
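
The core computation of the Isotonic Mechanism can be sketched in a few lines with scikit-learn: project the raw review scores onto the author's reported ranking, i.e. find the closest score vector that is monotone in that order. The numbers below are invented.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([5.8, 6.4, 5.1, 7.0])   # mean review scores
author_rank = np.array([1, 0, 3, 2])          # rank author gives paper i (0 = best)

order = np.argsort(author_rank)               # papers from best to worst
iso = IsotonicRegression(increasing=False)    # scores must be non-increasing
adjusted_sorted = iso.fit_transform(np.arange(len(order)),
                                    raw_scores[order])
adjusted = np.empty_like(raw_scores)
adjusted[order] = adjusted_sorted             # undo the sort
print(adjusted)   # raw scores nudged to respect the author's ranking
```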

[368] MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen

Main category: cs.LG

TL;DR: MolecularIQ is a new benchmark for evaluating LLMs’ ability to reason about molecular structure through symbolically verifiable tasks, revealing specific capability patterns and failure modes.

DetailsMotivation: Existing chemistry benchmarks have limitations: they focus on general chemical knowledge, rely on potentially biased literature/surrogate labels, or reduce evaluation to multiple-choice questions. There's a need for benchmarks that specifically test molecular structure reasoning with verifiable tasks.

Method: Introduces MolecularIQ benchmark with symbolically verifiable tasks focused exclusively on molecular graph reasoning. Enables fine-grained evaluation of reasoning capabilities over molecular structures.

Result: The benchmark reveals capability patterns that localize model failures to specific tasks and molecular structures, providing actionable insights into strengths and limitations of current chemistry LLMs.

Conclusion: MolecularIQ provides a focused evaluation framework for molecular structure reasoning that guides development of models capable of faithful reasoning over molecular graphs, addressing gaps in existing chemistry benchmarks.

Abstract: A molecule’s properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.
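
Symbolic verifiability means an answer can be checked directly against the molecular graph. Below is a minimal example with RDKit (assumed available) that grades a claimed heavy-atom count; the benchmark's own task set and grading format are not reproduced here.

```python
from rdkit import Chem

def verify_atom_count(smiles, claimed_heavy_atoms):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False                     # invalid SMILES: automatic fail
    # GetNumAtoms() counts heavy (non-hydrogen) atoms by default.
    return mol.GetNumAtoms() == claimed_heavy_atoms

# Caffeine (C8H10N4O2) has 14 heavy atoms.
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
print(verify_atom_count(caffeine, 14))   # True
print(verify_atom_count(caffeine, 13))   # False: the claim is wrong
```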

[369] Learning from Discriminatory Training Data

Przemyslaw A. Grabowicz, Nicholas Perello, Kenta Takatsu

Main category: cs.LG

TL;DR: A fair learning method that minimizes model error on fair test datasets while training on discriminatory data, using probabilistic interventions to handle direct and indirect discrimination while maintaining business necessity.

DetailsMotivation: Supervised learning systems trained on historical discriminatory data may perpetuate discrimination against protected groups. There's a need for methods that can perform well on fair datasets despite being trained on potentially biased data, addressing both direct discrimination and intersectionality issues.

Method: The method uses probabilistic interventions with causal and counterfactual formulations to represent dataset shifts. It can be applied to any supervised learning model to prevent direct and indirect discrimination via proxies while maximizing accuracy for business necessity. The approach is computationally lightweight.

Result: The method provably minimizes model error on fair datasets while training on datasets poisoned with direct additive discrimination. It provides a solution to intersectionality issues by balancing protected groups and is compatible with existing legal systems.

Conclusion: The proposed fair learning method effectively addresses discrimination in machine learning by handling dataset shifts, supporting intersectionality, and maintaining business necessity while being legally compatible and computationally efficient.

Abstract: Supervised learning systems are trained using historical data and, if the data was tainted by discrimination, they may unintentionally learn to discriminate against protected groups. We propose that fair learning methods, despite training on potentially discriminatory datasets, shall perform well on fair test datasets. Such dataset shifts crystallize application scenarios for specific fair learning methods. For instance, the removal of direct discrimination can be represented as a particular dataset shift problem. For this scenario, we propose a learning method that provably minimizes model error on fair datasets, while blindly training on datasets poisoned with direct additive discrimination. The method is compatible with existing legal systems and provides a solution to the widely discussed issue of protected groups’ intersectionality by striking a balance between the protected groups. Technically, the method applies probabilistic interventions, has causal and counterfactual formulations, and is computationally lightweight - it can be used with any supervised learning model to prevent direct and indirect discrimination via proxies while maximizing model accuracy for business necessity.
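
A toy numerical illustration of the direct-additive-discrimination scenario: group labels are shifted by a constant, and a simple mean-equalizing intervention recovers the fair relationship. This stands in for, and is much cruder than, the paper's probabilistic interventions; all quantities are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
g = rng.integers(0, 2, n)                 # protected attribute
x = rng.normal(size=n)
y_fair = 2.0 * x + rng.normal(0, 0.1, n)  # ground-truth fair outcome
y_obs = y_fair - 0.8 * g                  # recorded with additive bias

# Intervention: equalize the group means of the observed labels.
offset = y_obs[g == 1].mean() - y_obs[g == 0].mean()
y_train = y_obs + np.where(g == 1, -offset, 0.0)

slope = np.polyfit(x, y_train, 1)[0]
print(f"slope on corrected labels: {slope:.2f}")   # close to the fair 2.0
```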

[370] BayesAgent: Bayesian Agentic Reasoning Under Uncertainty via Verbalized Probabilistic Graphical Modeling

Hengguan Huang, Xing Shen, Songtao Wang, Lingfa Meng, Dianbo Liu, David Alejandro Duchene, Hao Wang, Samir Bhatt

Main category: cs.LG

TL;DR: vPGM bridges LLM agents with probabilistic graphical models using natural language guidance and Bayesian inference to improve reasoning under uncertainty without requiring domain expertise.

DetailsMotivation: LLM agents lack principled frameworks for capturing latent structures and modeling uncertainty, unlike human cognition which excels at forming latent representations. There's a need to bridge LLM agents with probabilistic methods for better agentic reasoning under uncertainty.

Method: Introduces Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that: (1) guides LLM agents in following PGM principles through natural language, and (2) refines posterior distributions via numerical Bayesian inference. Bypasses expert-driven model design.

Result: Evaluated on several agentic reasoning tasks, both closed-ended and open-ended, the model effectively enhances confidence calibration and text generation quality.

Conclusion: vPGM successfully bridges LLM agents with probabilistic graphical models, providing a principled framework for agentic reasoning under uncertainty without requiring substantial domain expertise or assumptions.

Abstract: Human cognition excels at transcending sensory input and forming latent representations that structure our understanding of the world. While Large Language Model (LLM) agents demonstrate emergent reasoning and decision-making abilities, they lack a principled framework for capturing latent structures and modeling uncertainty. In this work, we explore for the first time how to bridge LLM agents with probabilistic graphical models (PGMs) to address agentic reasoning under uncertainty. To this end, we introduce Verbalized Probabilistic Graphical Modeling (vPGM), a Bayesian agentic framework that (i) guides LLM agents in following key principles of PGMs through natural language and (ii) refines the resulting posterior distributions via numerical Bayesian inference. Unlike many traditional probabilistic methods requiring substantial domain expertise, vPGM bypasses expert-driven model design, making it well-suited for scenarios with limited assumptions. We evaluated our model on several agentic reasoning tasks, both closed-ended and open-ended. Our results indicate that the model effectively enhances confidence calibration and text generation quality.
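
The "refine posteriors via numerical Bayesian inference" step can be pictured as follows: the agent verbalizes a prior and per-hypothesis likelihoods, the numbers are parsed out, and an exact Bayes update replaces the agent's verbal guess. All values below are invented.

```python
import numpy as np

# Verbalized by the agent, then parsed (the parsing step is omitted here):
prior = np.array([0.7, 0.3])            # P(H) for two hypotheses
likelihood = np.array([0.2, 0.9])       # P(observed evidence | H)

# Exact numerical Bayes update: posterior proportional to prior * likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # numeric posterior replacing the agent's verbal estimate
```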

[371] Reinforced Inverse Scattering

Hanyang Jiang, Yuehaw Khoo, Haizhao Yang

Main category: cs.LG

TL;DR: Using reinforcement learning to optimize sensor positions and wave frequencies for improved inverse wave scattering reconstruction quality with limited resources.

DetailsMotivation: Traditional inverse wave scattering methods have fixed sensor positions and frequencies, limiting reconstruction quality. There's a need for adaptive, intelligent resource allocation to improve imaging with limited sensors and frequencies.

Method: Proposes a reinforcement learning framework that adaptively selects optimal sensor positions and wave frequencies based on different scatterer properties, enabling intelligent resource allocation for precision imaging.
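
A minimal sketch of the adaptive selection loop follows, with a toy tabular agent and a stand-in `reconstruction_gain` in place of an actual wave solver; the paper's state, action, and reward designs are richer.

```python
import random

# Toy RL loop: each action picks a sensor position and wave frequency, and
# the reward is the resulting gain in reconstruction quality. The environment
# and reward below are illustrative stand-ins, not the paper's.

positions = [0, 1, 2, 3]       # candidate sensor locations (hypothetical)
frequencies = [1.0, 2.0, 4.0]  # candidate wave frequencies (hypothetical)
q_table = {}                   # tabular Q-values over (position, frequency)

def reconstruction_gain(pos, freq):
    """Stand-in for running a scattering solver and scoring the result."""
    return random.random() * (1 + 0.1 * pos) / freq

epsilon, alpha = 0.2, 0.5
for episode in range(100):
    if random.random() < epsilon:
        action = (random.choice(positions), random.choice(frequencies))
    else:
        action = max(
            ((p, f) for p in positions for f in frequencies),
            key=lambda a: q_table.get(a, 0.0),
        )
    reward = reconstruction_gain(*action)
    q_table[action] = q_table.get(action, 0.0) + alpha * (reward - q_table.get(action, 0.0))
```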

Result: The method achieves significant improvement in reconstruction quality compared to existing methods, as demonstrated through extensive numerical simulations.

Conclusion: Reinforcement learning enables adaptive, intelligent optimization of imaging resources (sensor positions and frequencies) for superior inverse wave scattering reconstruction, offering a promising approach for precision imaging with limited resources.

Abstract: Inverse wave scattering aims at determining the properties of an object using data on how the object scatters incoming waves. In order to collect information, sensors are put in different locations to send and receive waves from each other. The choice of sensor positions and incident wave frequencies determines the reconstruction quality of scatterer properties. This paper introduces reinforcement learning to develop precision imaging that decides sensor positions and wave frequencies adaptive to different scatterers in an intelligent way, thus obtaining a significant improvement in reconstruction quality with limited imaging resources. Extensive numerical results will be provided to demonstrate the superiority of the proposed method over existing methods.

[372] GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations

Rick Wilming, Artur Dox, Hjalmar Schulz, Marta Oliveira, Benedict Clark, Stefan Haufe

Main category: cs.LG

TL;DR: The paper introduces GECO dataset and GECOBench framework to evaluate gender bias in XAI feature attributions for BERT models, showing fine-tuning embedding layers improves explanation performance.

DetailsMotivation: Pre-trained language models inherit gender biases from training data, but it's unclear how these biases affect feature attributions generated by XAI techniques. There's a need to systematically evaluate whether XAI methods produce biased explanations and to what extent fine-tuning can mitigate such biases.

Method: Created GECO (gender-controlled text dataset) with grammatical gender alterations that provide ground truth feature attributions for gender classification tasks. Applied this to pre-trained BERT model fine-tuned to different degrees. Developed GECOBench framework for quantitative evaluation of XAI methods.
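
The evaluation idea can be sketched in a few lines: because GECO's gender alterations are controlled, the class-specific tokens are known by construction, and an XAI method can be scored by the attribution mass it places on them. All values below are fabricated for illustration.

```python
import numpy as np

# Ground truth is known by construction: the tokens altered between the two
# grammatical-gender versions are exactly the class-specific ones.
tokens = ["he", "walked", "his", "dog", "yesterday"]
ground_truth = np.array([1, 0, 1, 0, 0])              # gendered tokens
attribution = np.array([0.5, 0.1, 0.3, 0.05, 0.05])   # some XAI method's output

# Fraction of attribution mass on ground-truth tokens (1.0 = perfect).
mass_accuracy = attribution[ground_truth == 1].sum() / attribution.sum()
print(f"attribution mass on gendered tokens: {mass_accuracy:.2f}")  # 0.80
```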

Result: Found clear dependency between explanation performance and number of fine-tuned layers. XAI methods benefit particularly from fine-tuning or complete retraining of embedding layers. Pre-training induces undesirable bias in feature attributions that can be mitigated through fine-tuning.

Conclusion: The study demonstrates that gender biases in pre-trained models affect XAI feature attributions, but fine-tuning (especially embedding layers) can mitigate these biases. GECO dataset and GECOBench provide objective evaluation framework for XAI methods regarding bias.

Abstract: Large pre-trained language models have become a crucial backbone for many downstream tasks in natural language processing (NLP), and while they are trained on a plethora of data containing a variety of biases, such as gender biases, it has been shown that they can also inherit such biases in their weights, potentially affecting their prediction behavior. However, it is unclear to what extent these biases also affect feature attributions generated by applying “explainable artificial intelligence” (XAI) techniques, possibly in unfavorable ways. To systematically study this question, we create a gender-controlled text dataset, GECO, in which the alteration of grammatical gender forms induces class-specific words and provides ground truth feature attributions for gender classification tasks. This enables an objective evaluation of the correctness of XAI methods. We apply this dataset to the pre-trained BERT model, which we fine-tune to different degrees, to quantitatively measure how pre-training induces undesirable bias in feature attributions and to what extent fine-tuning can mitigate such explanation bias. To this end, we provide GECOBench, a rigorous quantitative evaluation framework for benchmarking popular XAI methods. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to benefit particularly from fine-tuning or complete retraining of embedding layers.

[373] Finite Expression Methods for Discovering Physical Laws from Data

Zhongyi Jiang, Chunmei Wang, Haizhao Yang

Main category: cs.LG

TL;DR: FEX (Finite Expression Method) is a deep symbolic learning approach that discovers governing equations from limited dynamic data by generating analytical expressions within a finite function space.

DetailsMotivation: Deriving analytical expressions for nonlinear dynamics from limited data is challenging, and existing methods have limitations in handling complex dynamical systems with time-varying coefficients.

Method: FEX generates analytical expressions of governing equations by learning derivatives of PDE solutions through convolutions, operating within a function space containing finite analytic expressions.
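
As a toy illustration of searching a finite expression space (not the paper's operator pool or its learned convolution-based derivative estimation), one can score candidate right-hand sides against observed dynamics and keep the best:

```python
import numpy as np

# Synthetic "observed" dynamics for a Burgers-like advection term, and a tiny
# finite pool of candidate expressions for u_t = F(u, u_x).

x = np.linspace(0, 2 * np.pi, 200)
u = np.sin(x)
u_x = np.gradient(u, x)
u_t = -u * u_x                         # ground-truth governing term

candidates = {
    "u": lambda: u,
    "u_x": lambda: u_x,
    "u*u_x": lambda: u * u_x,
    "-u*u_x": lambda: -u * u_x,
}
errors = {name: np.mean((u_t - f()) ** 2) for name, f in candidates.items()}
best = min(errors, key=errors.get)
print(f"recovered governing term: u_t = {best}")  # -u*u_x
```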

Result: FEX surpasses existing methods (PDE-Net, SINDy, GP, SPL) in numerical performance across time-dependent PDE problems and nonlinear dynamical systems with time-varying coefficients.

Conclusion: FEX demonstrates superior flexibility and expressive power in accurately approximating symbolic governing equations from limited dynamic data.

Abstract: Nonlinear dynamics is a pervasive phenomenon observed in scientific and engineering disciplines. However, the task of deriving analytical expressions to describe nonlinear dynamics from limited data remains challenging. In this paper, we shall present a novel deep symbolic learning method called the “finite expression method” (FEX) to discover governing equations within a function space containing a finite set of analytic expressions, based on observed dynamic data. The key concept is to employ FEX to generate analytical expressions of the governing equations by learning the derivatives of partial differential equation (PDE) solutions through convolutions. Our numerical results demonstrate that our FEX surpasses other existing methods (such as PDE-Net, SINDy, GP, and SPL) in terms of numerical performance across a range of problems, including time-dependent PDE problems and nonlinear dynamical systems with time-varying coefficients. Moreover, the results highlight FEX’s flexibility and expressive power in accurately approximating symbolic governing equations.

[374] A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models

Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Jian Zhang, Erik Cambria, Mengling Feng

Main category: cs.LG

TL;DR: Survey paper on Quantized Graph Representation (QGR) learning - a new paradigm using discrete codes instead of continuous embeddings for graphs, enabling better integration with LLMs.

DetailsMotivation: Continuous graph embeddings face issues with parameter efficiency, interpretability, and robustness. QGR offers a promising alternative using discrete codes that can better integrate with large language models due to their similar representation form to natural language.

Method: Comprehensive survey methodology covering: 1) Background of general quantization methods, 2) Current QGR studies from multiple perspectives (quantized strategies, training objectives, distinctive designs, knowledge graph quantization, applications), 3) Code dependence learning strategies, 4) Integration approaches with LLMs.
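
The core QGR operation, under typical vector-quantization assumptions, reduces to mapping continuous node embeddings to discrete codebook indices, which is what makes the representation token-like and LLM-friendly. A minimal sketch:

```python
import numpy as np

# Vector quantization of node embeddings: each node is replaced by the index
# of its nearest codebook vector, yielding discrete codes for the graph.
# Shapes and the codebook here are illustrative; in practice the codebook is
# learned jointly with the encoder.

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(5, 8))   # 5 nodes, 8-dim continuous embeddings
codebook = rng.normal(size=(16, 8))         # 16 learnable code vectors

# Nearest-neighbor assignment: each node becomes a discrete code index.
dists = ((node_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = dists.argmin(axis=1)
print(codes)  # 5 integer codes in [0, 16)
```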

Result: Provides a thorough systematic review of the emerging QGR field, organizing existing research into coherent categories and identifying key approaches, but does not present new experimental results as it’s a survey paper.

Conclusion: QGR is a promising emerging paradigm that addresses limitations of continuous embeddings and enables better LLM integration. The survey aims to provide comprehensive understanding and inspire future research directions in this rapidly developing field.

Abstract: Recent years have witnessed rapid advances in graph representation learning, with the continuous embedding approach emerging as the dominant paradigm. However, such methods encounter issues regarding parameter efficiency, interpretability, and robustness. Thus, Quantized Graph Representation (QGR) learning has recently gained increasing interest, which represents the graph structure with discrete codes instead of conventional continuous embeddings. Given its analogous representation form to natural language, QGR also possesses the capability to seamlessly integrate graph structures with large language models (LLMs). As this emerging paradigm is still in its infancy yet holds significant promise, we undertake this thorough survey to promote its rapid future prosperity. We first present the background of the general quantization methods and their merits. Moreover, we provide an in-depth demonstration of current QGR studies from the perspectives of quantized strategies, training objectives, distinctive designs, knowledge graph quantization, and applications. We further explore the strategies for code dependence learning and integration with LLMs. At last, we give discussions and conclude future directions, aiming to provide a comprehensive picture of QGR and inspire future research.

[375] GSINA: Improving Subgraph Extraction for Graph Invariant Learning via Graph Sinkhorn Attention

Junchi Yan, Fangyu Ding, Jiawei Sun, Zhaoping Hu, Yunyi Zhou, Lei Zhu

Main category: cs.LG

TL;DR: GSINA is a fully differentiable, cardinality-constrained attention mechanism for invariant subgraph extraction that uses Sinkhorn iterations and Gumbel reparameterization to improve OOD generalization.

DetailsMotivation: Existing graph invariant learning methods for OOD generalization either lack explicit control over compactness or use hard top-k selection that shrinks solution space and is only partially differentiable.

Method: Propose Graph Sinkhorn Attention (GSINA) - a fully differentiable attention mechanism using optimal transport and Sinkhorn iterations to assign sparse-yet-soft edge weights, with Gumbel reparameterization for training stability.
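
A stripped-down sketch of the Sinkhorn step follows (plain entropic row/column scaling; the paper adds the cardinality constraint and Gumbel reparameterization on top):

```python
import torch

# Sinkhorn normalization turns raw edge scores into sparse-yet-soft weights by
# alternating row and column normalization in the log domain. The temperature
# tau controls softness; lower tau gives sparser weights.

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, tau: float = 0.5) -> torch.Tensor:
    """Approximately doubly stochastic weights from a score matrix."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)  # row normalize
        log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)  # column normalize
    return log_p.exp()

edge_scores = torch.randn(4, 4)
weights = sinkhorn(edge_scores)
print(weights.sum(0), weights.sum(1))  # both approximately all-ones
```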

Result: Theoretical study of convergence behavior and extensive empirical results on synthetic and real-world datasets (though the abstract cuts off before specific results).

Conclusion: GSINA provides explicit controls for separability and softness while maintaining full differentiability, addressing limitations of previous invariant subgraph extraction methods.

Abstract: Graph invariant learning (GIL) seeks invariant relations between graphs and labels under distribution shifts. Recent works try to extract an invariant subgraph to improve out-of-distribution (OOD) generalization, yet existing approaches either lack explicit control over compactness or rely on hard top-$k$ selection that shrinks the solution space and is only partially differentiable. In this paper, we provide an in-depth analysis of the drawbacks of some existing works and propose a few general principles for invariant subgraph extraction: 1) separability, as encouraged by our sparsity-driven mechanism, to filter out the irrelevant common features; 2) softness, for a broader solution space; and 3) differentiability, for a sound, end-to-end optimization pipeline. Specifically, building on optimal transport, we propose Graph Sinkhorn Attention (GSINA), a fully differentiable, cardinality-constrained attention mechanism that assigns sparse-yet-soft edge weights via Sinkhorn iterations and induces node attention. GSINA provides explicit controls for separability and softness, and uses a Gumbel reparameterization to stabilize training. Its convergence behavior is also theoretically studied. Extensive empirical results are reported on both synthetic and real-world datasets.

[376] Reward Shaping to Mitigate Reward Hacking in RLHF

Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao

Main category: cs.LG

TL;DR: PAR (Preference As Reward) is a novel reward shaping method for RLHF that uses latent preferences from reward models to prevent reward hacking and improve alignment stability.

DetailsMotivation: RLHF is vulnerable to reward hacking where models exploit flaws in reward functions instead of learning intended behaviors. Existing reward shaping methods lack systematic investigation and design principles.

Method: Proposes PAR which leverages latent preferences embedded within reward models as RL signals. Based on two design principles: (1) RL reward should be bounded, (2) RL reward benefits from rapid initial growth followed by gradual convergence.
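
A hedged sketch of the PAR signal: treating the Bradley-Terry preference probability of a response over a single reference as the RL reward yields a bounded signal that rises quickly and then saturates, matching both design principles. Details may differ from the released implementation.

```python
import math

# Preference-as-reward: the sigmoid of the reward-model score gap between the
# policy response and one fixed reference response. Bounded in (0, 1), fast
# initial growth, gradual convergence.

def par_reward(reward: float, reference_reward: float) -> float:
    """P(response preferred over reference) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward - reference_reward)))

r_ref = 2.0  # the single reference reward the method needs
for r in [0.0, 2.0, 4.0, 10.0]:
    print(r, round(par_reward(r, r_ref), 3))
# 0.0 -> 0.119, 2.0 -> 0.5, 4.0 -> 0.881, 10.0 -> 1.0 (bounded, saturating)
```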

Result: PAR outperforms other reward shaping methods on Gemma2-2B with Ultrafeedback-Binarized and HH-RLHF datasets. Achieves at least 5 percentage points higher win rate on AlpacaEval 2.0, shows remarkable data efficiency (single reference reward needed), and maintains robustness against reward hacking even after two full training epochs.

Conclusion: PAR effectively addresses reward hacking in RLHF through principled reward shaping, offering superior performance, data efficiency, and training stability compared to existing methods.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to \emph{reward hacking}, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. Moreover, PAR exhibits two critical variance-reduction properties that contribute to stabilizing the RLHF training process and effectively extending the tolerance window for early stopping. We evaluated PAR on the base model Gemma2-2B using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR’s superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR.

[377] The Good, the Bad and the Ugly: Meta-Analysis of Watermarks, Transferable Attacks and Adversarial Defenses

Grzegorz Głuch, Berkant Turan, Sai Ganesh Nagarajan, Sebastian Pokutta

Main category: cs.LG

TL;DR: The paper analyzes the fundamental trade-off between watermarks, adversarial defenses, and transferable attacks in machine learning, showing that for any learning task, at least one of these three must exist.

DetailsMotivation: To formalize and extend the analysis of the trade-off between backdoor-based watermarks and adversarial defenses, identifying transferable attacks as a necessary third component in this fundamental relationship.

Method: Frames the problem as an interactive protocol between verifier and prover, uses cryptographic techniques (fully homomorphic encryption) to construct transferable attacks, and analyzes learning tasks through computational complexity and VC-dimension theory.

Result: Proves that for all learning tasks, at least one of three exists: watermark, adversarial defense, or transferable attack. Shows bounded VC-dimension tasks allow adversarial defenses against all attackers, while a subclass allows watermarks secure against fast adversaries.

Conclusion: The paper establishes a fundamental trilemma in machine learning security: watermarks, adversarial defenses, and transferable attacks form an interdependent trade-off where at least one must exist for any learning task, with implications for task complexity and security.

Abstract: We formalize and analyze the trade-off between backdoor-based watermarks and adversarial defenses, framing it as an interactive protocol between a verifier and a prover. While previous works have primarily focused on this trade-off, our analysis extends it by identifying transferable attacks as a third, counterintuitive, but necessary option. Our main result shows that for all learning tasks, at least one of the three exists: a watermark, an adversarial defense, or a transferable attack. By transferable attack, we refer to an efficient algorithm that generates queries indistinguishable from the data distribution and capable of fooling all efficient defenders. Using cryptographic techniques, specifically fully homomorphic encryption, we construct a transferable attack and prove its necessity in this trade-off. Finally, we show that tasks of bounded VC-dimension allow adversarial defenses against all attackers, while a subclass allows watermarks secure against fast adversaries.

[378] Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy

Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin

Main category: cs.LG

TL;DR: Marvel is a novel offline-to-online safe RL framework that addresses challenges in transitioning from offline to online learning while maintaining safety constraints, outperforming existing baselines in both reward and safety.

DetailsMotivation: Current online safe RL methods are impractical due to high costs and risks of environment interactions, while offline safe RL suffers from data quality limitations and OOD action challenges. There's an unexplored opportunity to leverage offline safe RL to enable faster and safer online policy learning.

Method: Marvel framework with two key components: 1) Value Pre-Alignment to correct erroneous Q-estimations from offline-online objective mismatch and cost sparsity, and 2) Adaptive PID Control to align Lagrange multipliers between offline and online policies.
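
The Adaptive PID Control component can be sketched with standard safe-RL conventions: the Lagrange multiplier tracks constraint violation with proportional, integral, and derivative terms. The gains and cost budget below are illustrative, not the paper's.

```python
# PID update of a Lagrange multiplier: the multiplier grows while episode
# cost exceeds the budget and decays back toward zero once the constraint
# is satisfied. All hyperparameters here are placeholders.

class PIDLagrange:
    def __init__(self, kp=0.1, ki=0.01, kd=0.05, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_violation = 0.0

    def update(self, episode_cost: float) -> float:
        violation = episode_cost - self.cost_limit
        self.integral = max(0.0, self.integral + violation)
        derivative = violation - self.prev_violation
        self.prev_violation = violation
        # The multiplier must stay non-negative.
        return max(0.0, self.kp * violation + self.ki * self.integral + self.kd * derivative)

pid = PIDLagrange()
for cost in [40.0, 35.0, 30.0, 24.0]:
    print(round(pid.update(cost), 3))  # 2.4, 1.0, 0.55, 0.0
```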

Result: Extensive experiments show Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction.

Conclusion: Marvel introduces the first policy-finetuning based framework for O2O safe RL that is compatible with many offline and online safe RL methods, advancing the field toward more efficient and practical safe RL solutions.

Abstract: The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, the performance therein is usually limited due to reliance on data quality and challenges with out-of-distribution (OOD) actions. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: erroneous Q-estimations, resulting from offline-online objective mismatch and offline cost sparsity, and Lagrangian mismatch, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce Marvel, a novel framework for O2O safe RL, comprising two key components that work in concert: Value Pre-Alignment to align the Q-functions with the underlying truth before online learning, and Adaptive PID Control to effectively adjust the Lagrange multipliers during online finetuning. Extensive experiments demonstrate that Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction. By introducing the first policy-finetuning based framework for O2O safe RL, which is compatible with many offline and online safe RL methods, our work has great potential to advance the field towards more efficient and practical safe RL solutions.

[379] CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting

Kuan Lu, Menghao Huo, Yuxiao Li, Qiang Zhu, Zhenrui Chen

Main category: cs.LG

TL;DR: CT-PatchTST is a novel deep learning model that provides accurate long-term forecasts of wind and solar power by capturing both temporal dependencies and inter-channel correlations, enabling better energy storage planning and grid stability.

DetailsMotivation: Accurate renewable energy forecasting is crucial for modern power grids with high renewable penetration, as it enables proactive energy storage deployment, reduces uncertainties, and optimizes grid operations for stability and cost-efficiency.

Method: The paper proposes CT-PatchTST (Channel-Time Patch Time-Series Transformer), a deep learning model that captures both temporal dependencies and inter-channel correlations in renewable energy data, unlike conventional time-series models.
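
The patching step, assumed here from the PatchTST lineage rather than the paper's exact code, slices each channel into overlapping windows that become transformer tokens, so attention can operate over both the time and channel axes:

```python
import torch

# Turn a multivariate series into patch tokens per channel. Sizes below are
# illustrative; e.g. three channels could be offshore wind, onshore wind, and
# solar generation.

batch, channels, length = 2, 3, 96
patch_len, stride = 16, 8
series = torch.randn(batch, channels, length)

# unfold(dim, size, step) creates overlapping patches along the time axis.
patches = series.unfold(-1, patch_len, stride)
print(patches.shape)  # torch.Size([2, 3, 11, 16]) -> 11 patch tokens per channel
```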

Result: CT-PatchTST outperforms existing methods in both accuracy and robustness when evaluated on real-world datasets from Denmark’s offshore wind, onshore wind, and solar generation.

Conclusion: The model enables predictive, data-driven coordination of energy storage systems across integrated source-grid-load-storage systems, contributing to more stable, responsive, and cost-efficient power networks.

Abstract: Accurate forecasting of renewable energy generation is fundamental to enhancing the dynamic performance of modern power grids, especially under high renewable penetration. This paper presents Channel-Time Patch Time-Series Transformer (CT-PatchTST), a novel deep learning model designed to provide long-term, high-fidelity forecasts of wind and solar power. Unlike conventional time-series models, CT-PatchTST captures both temporal dependencies and inter-channel correlations, features that are critical for effective energy storage planning, control, and dispatch. Reliable forecasting enables proactive deployment of energy storage systems (ESSs), helping to mitigate uncertainties in renewable output, reduce system response time, and optimize storage operation based on location-specific flow and voltage conditions. Evaluated on real-world datasets from Denmark’s offshore wind, onshore wind, and solar generation, CT-PatchTST outperforms existing methods in both accuracy and robustness. By enabling predictive, data-driven coordination of ESSs across integrated source-grid-load-storage systems, this work contributes to the design of more stable, responsive, and cost-efficient power networks.

[380] Complexity-aware fine-tuning

Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev

Main category: cs.LG

TL;DR: Proposes entropy-based data complexity categorization for efficient LLM fine-tuning, using reasoning only for complex data to achieve better performance with 81% less data.

DetailsMotivation: Standard SFT often underperforms, while distillation with chain-of-thought reasoning requires expensive calls and large data volumes. Need efficient fine-tuning that balances performance and resource usage.

Method: Split training data into complexity categories using single token answer entropy (ROC AUC 0.73). Fine-tune LLMs via SFT and distillation, applying reasoning only to complex data identified by entropy.
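
A sketch of the entropy-based routing (the model, logits, and threshold below are placeholders): compute the entropy of the next-token distribution at the answer position and send only high-entropy items down the reasoning-distillation path.

```python
import torch
import torch.nn.functional as F

# Placeholders throughout: real logits would come from the model at the
# single answer-token position, and the threshold would be tuned on
# validation data.

def answer_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (nats) of the single-token answer distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    return float(-(log_probs.exp() * log_probs).sum())

vocab = 32000
confident = torch.zeros(vocab)
confident[42] = 20.0                 # sharply peaked -> low entropy -> "easy"
uniform = torch.zeros(vocab)         # flat -> high entropy -> "complex"

threshold = 2.0                      # hypothetical routing cutoff
for name, logits in [("confident", confident), ("uniform", uniform)]:
    h = answer_entropy(logits)
    route = "distill with reasoning" if h > threshold else "plain SFT"
    print(f"{name}: entropy={h:.2f} nats -> {route}")
```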

Result: Proposed pipeline significantly outperforms standard SFT (0.58 vs 0.45 average accuracy) and outperforms distillation approach (0.58 vs 0.56) while using 81% less data across three 3B parameter models.

Conclusion: Entropy-based complexity categorization enables efficient fine-tuning that achieves superior performance with substantially reduced data requirements, offering a practical blueprint for resource-efficient LLM adaptation.

Abstract: General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across three small open models ($\approx 3B$) we split the training data into complexity categories by a single token answer entropy (ROC AUC $0.73$), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach ($0.58$ vs $0.45$ average accuracy) and outperforms the distillation approach ($0.58$ vs $0.56$ average accuracy) while using $81\%$ less data.

[381] IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale derived from Instrumented Timed Up and Go test in stroke patients

Simone Macciò, Alessandro Carfì, Alessio Capitanelli, Peppino Tropea, Massimo Corbo, Fulvio Mastrogiovanni, Michela Picardi

Main category: cs.LG

TL;DR: Researchers developed IFRA, a machine learning-based fall risk assessment tool using Instrumented TUG test data, showing better identification of high-risk stroke patients than traditional scales.

DetailsMotivation: Falls are a major health concern for stroke survivors, but traditional fall risk assessment scales often miss important mobility measures. There's a need for more effective, automated screening tools that can capture subtle movement patterns.

Method: Two-step machine learning approach: 1) Identify predictive mobility features from Instrumented Timed Up and Go (ITUG) test data, 2) Create stratification strategy to classify patients into low-, medium-, or high-fall-risk categories. Study included 142 participants divided into training (with synthetic cases), validation, and testing sets (22 non-fallers, 10 fallers). Performance compared against traditional clinical scales using Fisher’s Exact test.
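
On synthetic data, the two-step idea looks roughly like the following; the study's real inputs are ITUG sensor features, and the probability cut points here are purely illustrative, not clinical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: columns play the role of acceleration / angular
# velocity features extracted from instrumented TUG recordings.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=120) > 0.8).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
prob = clf.predict_proba(X[:5])[:, 1]           # predicted fall probability

# Step 2: stratify predicted probabilities into risk bands.
risk = np.select([prob < 0.33, prob < 0.66], ["low", "medium"], default="high")
print(list(zip(prob.round(2), risk)))
```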

Result: Machine learning identified vertical/medio-lateral acceleration and angular velocity during walking and sit-to-walk transitions as key predictors. IFRA showed statistically significant association with fall status (p=0.004) and was the only scale to assign >50% of actual fallers to high-risk category, outperforming traditional scales (standard TUG and Mini-BESTest).

Conclusion: IFRA demonstrates potential as an automated, complementary approach for fall risk stratification in post-stroke patients. While showing promising discriminative capability (especially for high-risk identification), these preliminary findings require validation in larger cohorts before clinical implementation.

Abstract: Background/Objectives: Falls represent a major health concern for stroke survivors, necessitating effective risk assessment tools. This study proposes the Instrumented Fall Risk Assessment (IFRA) scale, a novel screening tool derived from Instrumented Timed Up and Go (ITUG) test data, designed to capture mobility measures often missed by traditional scales. Methods: We employed a two-step machine learning approach to develop the IFRA scale: first, identifying predictive mobility features from ITUG data and, second, creating a stratification strategy to classify patients into low-, medium-, or high-fall-risk categories. This study included 142 participants, who were divided into training (including synthetic cases), validation, and testing sets (comprising 22 non-fallers and 10 fallers). IFRA’s performance was compared against traditional clinical scales (e.g., standard TUG and Mini-BESTest) using Fisher’s Exact test. Results: Machine learning analysis identified specific features as key predictors, namely vertical and medio-lateral acceleration, and angular velocity during walking and sit-to-walk transitions. IFRA demonstrated a statistically significant association with fall status (Fisher’s Exact test p = 0.004) and was the only scale to assign more than half of the actual fallers to the high-risk category, outperforming the comparative clinical scales in this dataset. Conclusions: This proof-of-concept study demonstrates IFRA’s potential as an automated, complementary approach for fall risk stratification in post-stroke patients. While IFRA shows promising discriminative capability, particularly for identifying high-risk individuals, these preliminary findings require validation in larger cohorts before clinical implementation.

[382] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

Main category: cs.LG

TL;DR: RFT outperforms SFT for continual post-training by better preserving prior knowledge and maintaining general capabilities, with an implicit regularization mechanism from reward variance scaling.

DetailsMotivation: Existing CPT research focuses on techniques like data replay and model expansion, but overlooks the fundamental role of learning paradigms. This paper investigates how different post-training paradigms affect knowledge retention during continual learning.

Method: Comparative analysis of SFT vs RFT on seven diverse multimodal tasks using Qwen2.5-VL-7B-Instruct. Includes theoretical analysis of RFT’s implicit regularization via reward variance scaling, and proposes rollout-based instance filtering algorithm.
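
The claimed implicit regularizer and the rollout-based filter can be sketched under GRPO-style conventions (an assumption; the paper's exact estimator may differ): group-normalized advantages vanish when all rollouts for a prompt score alike, so mastered prompts barely move the policy, and such prompts can also be filtered outright for efficiency.

```python
import torch

# Per-prompt advantages standardized within the rollout group, plus a simple
# variance-based instance filter. Thresholds are illustrative.

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Rewards standardized within one prompt's rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_prompt(rewards: torch.Tensor, min_std: float = 0.05) -> bool:
    """Rollout-based filter: skip prompts with ~zero reward variance."""
    return rewards.std().item() > min_std

solved = torch.tensor([1.0, 1.0, 1.0, 1.0])   # already mastered -> filtered out
mixed = torch.tensor([1.0, 0.0, 1.0, 0.0])    # informative -> kept
print(keep_prompt(solved), keep_prompt(mixed))  # False True
print(group_advantages(mixed))                  # roughly +-0.87
```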

Result: 1) SFT causes catastrophic forgetting while RFT preserves prior knowledge comparable to multi-task training. 2) RFT protects/enhances general knowledge (MMMU, MMLU-Pro) while SFT degrades it. 3) Stability comes from implicit regularization, not explicit mechanisms like KL penalty.

Conclusion: RFT is superior to SFT as a robust paradigm for continual post-training due to its inherent knowledge preservation capabilities and implicit regularization mechanism.

Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT’s gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.

[383] Deceptive Sequential Decision-Making via Regularized Policy Optimization

Yerin Kim, Alexander Benvenuti, Bo Chen, Mustafa Karabag, Abhishek Kulkarni, Nathaniel D. Bastian, Ufuk Topcu, Matthew Hale

Main category: cs.LG

TL;DR: A deceptive sequential decision-making framework that conceals sensitive information and actively misleads adversaries about a system’s reward function using three deception strategies: diversionary, targeted, and equivocal deception.

DetailsMotivation: Autonomous systems operating in adversarial environments risk exposing sensitive information through observable behavior. Adversaries can use inverse reinforcement learning to infer reward functions, necessitating deception mechanisms to protect sensitive information.

Method: Model systems as Markov decision processes, with adversaries using inverse reinforcement learning. Introduce three regularization strategies for policy synthesis: diversionary deception (leads to any false conclusion), targeted deception (leads to specific false conclusion), and equivocal deception (makes both real and false rewards plausible).
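
A purely conceptual sketch of the regularized objective, using toy stand-ins for the paper's MDP and IRL machinery: the targeted variant adds a term rewarding behavior that also looks optimal under a decoy reward.

```python
import numpy as np

# Toy setting: "policy" is a softmax over three abstract behaviors, returns
# are dot products, and crude random search stands in for policy optimization.
true_reward = np.array([1.0, 0.0, 0.2])
decoy_reward = np.array([0.0, 1.0, 0.2])   # what the adversary should infer
lam = 0.5                                   # deception strength (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def objective(logits):
    policy = softmax(logits)
    # Targeted deception: earn true reward while *appearing* consistent with
    # the decoy reward that the adversary's IRL would recover from behavior.
    return policy @ true_reward + lam * (policy @ decoy_reward)

best = max((np.random.default_rng(i).normal(size=3) for i in range(200)),
           key=objective)
print(softmax(best).round(2), round(float(objective(best)), 3))
```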

Result: Analytical bounds on reward loss from deception. In multi-agent evaluation, all three deception strategies successfully steer adversaries to false beliefs while maintaining at least 98% of optimal non-deceptive reward performance.

Conclusion: The framework provides effective deception strategies that protect sensitive information while maintaining near-optimal performance, offering practical solutions for secure autonomous systems in adversarial environments.

Abstract: Autonomous systems are increasingly expected to operate in the presence of adversaries, though adversaries may infer sensitive information simply by observing a system. Therefore, we present a deceptive sequential decision-making framework that not only conceals sensitive information, but actively misleads adversaries about it. We model autonomous systems as Markov decision processes, with adversaries using inverse reinforcement learning to recover reward functions. To counter them, we present three regularization strategies for policy synthesis problems that actively deceive an adversary about a system’s reward. “Diversionary deception” leads an adversary to draw any false conclusion about the system’s reward function. “Targeted deception” leads an adversary to draw a specific false conclusion about the system’s reward function. “Equivocal deception” leads an adversary to infer that the real reward and a false reward both explain the system’s behavior. We show how each form of deception can be implemented in policy optimization problems and analytically bound the loss in total accumulated reward induced by deception. Next, we evaluate these developments in a multi-agent setting. We show that diversionary, targeted, and equivocal deception all steer the adversary to false beliefs while still attaining a total accumulated reward that is at least 98% of its optimal, non-deceptive value.

[384] Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

Main category: cs.LG

TL;DR: A semi-supervised learning approach for reward shaping that leverages both non-zero and zero-reward transitions with data augmentation to improve reward inference in sparse-reward environments.

DetailsMotivation: Real-world scenarios often have extremely sparse reward signals, making it difficult to learn effective reward functions for reward shaping. Traditional approaches struggle with the scarcity of non-zero reward transitions.

Method: Uses Semi-Supervised Learning (SSL) combined with novel data augmentation to learn trajectory space representations from both non-zero-reward transitions and the majority of zero-reward transitions. Introduces double entropy data augmentation to enhance performance.
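
A generic semi-supervised pattern along these lines is sketched below (the paper's double entropy augmentation and trajectory representation are its own): zero-reward transitions shape the encoder through a consistency loss between augmented views, while the rare non-zero rewards supervise the reward head.

```python
import torch
import torch.nn as nn

# Generic SSL sketch, not the paper's architecture: a shared encoder, a
# consistency loss on augmented views of the zero-reward majority, and a
# supervised loss on the few labeled (non-zero-reward) transitions.

encoder = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 8))
reward_head = nn.Linear(8, 1)

transitions = torch.randn(64, 6)                # mostly zero-reward transitions
view1 = transitions + 0.05 * torch.randn_like(transitions)  # augmentation 1
view2 = transitions + 0.05 * torch.randn_like(transitions)  # augmentation 2
consistency = ((encoder(view1) - encoder(view2)) ** 2).mean()

labeled = torch.randn(4, 6)                     # rare non-zero-reward samples
rewards = torch.ones(4, 1)
supervised = ((reward_head(encoder(labeled)) - rewards) ** 2).mean()

loss = supervised + 0.1 * consistency           # combined shaping objective
loss.backward()
```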

Result: Outperforms supervised-based approaches in reward inference, achieving higher agent scores. In more sparse-reward environments, achieves up to twice the peak scores compared to supervised baselines. Double entropy data augmentation shows 15.8% increase in best score over other augmentation methods.

Conclusion: The proposed SSL approach with data augmentation effectively addresses sparse reward challenges by leveraging both reward and non-reward transitions, significantly improving reward shaping performance in sparse-reward environments like Atari and robotic manipulation.

Abstract: In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.

[385] Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment

Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin

Main category: cs.LG

TL;DR: SGPO is a robust LLM alignment framework using Stackelberg game theory that reduces human annotation needs by 30x while maintaining strong performance against GPT-4.

DetailsMotivation: Traditional LLM alignment requires expensive, noise-prone human preference curation. There's a need for more data-efficient and robust alignment methods that can handle annotation noise.

Method: Models alignment as a two-player Stackelberg game between policy (leader) and worst-case preference distribution (follower). Uses distributionally robust reweighting of synthetic annotations with minimal human “seed” preferences, iteratively self-annotating new prompts.
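
For concreteness, here is a rough sketch of robust reweighting using KL-ball tilting, which has a simple closed form; the paper works with a Wasserstein ball instead, so this is an analogy rather than the actual update.

```python
import numpy as np

# Adversarial reweighting of synthetic preference pairs: the worst-case
# distribution inside a KL ball tilts weight toward high-loss pairs, and
# training against the tilted weights hedges the policy against annotation
# noise. Losses and the temperature are illustrative.

losses = np.array([0.2, 0.3, 2.5, 0.25])   # pair 2 looks like a noisy label
temperature = 1.0                           # radius proxy for the ball

weights = np.exp(losses / temperature)      # exponential tilting
weights /= weights.sum()
robust_loss = float(weights @ losses)       # worst-case reweighted objective
print(weights.round(3), round(robust_loss, 3))
```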

Result: Achieves strong win rates against GPT-4 across multiple benchmarks using only 2K seed preferences (1/30 of standard human labels) within three iterations, with formal O(ε)-bounded regret guarantees.

Conclusion: Principled Stackelberg formulation enables data-efficient LLM alignment, significantly reducing reliance on costly human annotations while maintaining robustness to annotation noise.

Abstract: Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a robust alignment framework that models alignment as a two-player Stackelberg game between a policy (leader) and a worst-case preference distribution (follower). The proposed SGPO guarantees $\mathcal{O}(\varepsilon)$-bounded regret within an $\varepsilon$-Wasserstein ball, offering formal robustness to (self-)annotation noise. We instantiate SGPO with Stackelberg Self-Annotated Preference Optimization (SSAPO), which uses minimal human-labeled “seed” preferences and iteratively self-annotates new prompts. In each iteration, SSAPO applies a distributionally robust reweighting of synthetic annotations, ensuring that noisy or biased self-labels do not derail training. Remarkably, using only 2K seed preferences – about 1/30 of standard human labels – SSAPO achieves strong win rates against GPT-4 across multiple benchmarks within three iterations. These results highlight that a principled Stackelberg formulation yields data-efficient alignment for LLMs, significantly reducing reliance on costly human annotations.

[386] Onboard Optimization and Learning: A Survey

Monirul Islam Pavel, Siyi Hu, Mahardhika Pratama, Ryszard Kowalczyk

Main category: cs.LG

TL;DR: Survey paper on onboard learning for edge AI, covering methodologies to address computational constraints, inference costs, and security vulnerabilities while enabling real-time processing on resource-constrained devices.

DetailsMotivation: Edge AI applications require low latency, enhanced privacy, and energy efficiency, but face challenges with limited computational resources, high inference costs, and security vulnerabilities on resource-constrained devices.

Method: Comprehensive survey methodology exploring techniques for optimizing model efficiency, accelerating inference, and supporting collaborative learning across distributed devices, including model complexity reduction, inference speed improvement, and privacy-preserving computation.

Result: Analysis of various approaches including hardware-software co-design, model compression, and decentralized learning strategies that enhance scalability and adaptability in dynamic edge environments.

Conclusion: The survey provides insights into current onboard learning advancements to enable robust, efficient, and secure AI deployment at the edge by bridging hardware-software co-design, model compression, and decentralized learning techniques.

Abstract: Onboard learning is a transformative approach in edge AI, enabling real-time data processing, decision-making, and adaptive model training directly on resource-constrained devices without relying on centralized servers. This paradigm is crucial for applications demanding low latency, enhanced privacy, and energy efficiency. However, onboard learning faces challenges such as limited computational resources, high inference costs, and security vulnerabilities. This survey explores a comprehensive range of methodologies that address these challenges, focusing on techniques that optimize model efficiency, accelerate inference, and support collaborative learning across distributed devices. Approaches for reducing model complexity, improving inference speed, and ensuring privacy-preserving computation are examined alongside emerging strategies that enhance scalability and adaptability in dynamic environments. By bridging advancements in hardware-software co-design, model compression, and decentralized learning, this survey provides insights into the current state of onboard learning to enable robust, efficient, and secure AI deployment at the edge.

[387] Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data

Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe

Main category: cs.LG

TL;DR: CompFlow: A flow matching framework for offline-to-online RL with dynamics mismatch, using Wasserstein distance for stable dynamics gap estimation and optimistic exploration.

DetailsMotivation: Existing offline RL methods struggle when offline and online transition dynamics differ, especially when their estimators (KL divergence, mutual information) fail due to support mismatch between dynamics distributions.

Method: Models online dynamics as conditional flow built on pretrained offline flow (composite structure), uses Wasserstein distance for stable dynamics gap estimation, and implements optimistic active data collection prioritizing high-gap regions.
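
Why Wasserstein helps is easy to see in one dimension: the empirical W1 distance between offline and online next-state samples remains well defined even when supports barely overlap, exactly where KL-based gaps blow up. A minimal sketch:

```python
import numpy as np

# Empirical 1-D Wasserstein-1 distance between offline and online next-state
# samples; high-gap regions would then be prioritized for exploration. The
# distributions below are toy stand-ins with nearly disjoint support.

rng = np.random.default_rng(0)
offline_next = rng.normal(loc=0.0, scale=1.0, size=256)
online_next = rng.normal(loc=3.0, scale=1.0, size=256)

def wasserstein1(a: np.ndarray, b: np.ndarray) -> float:
    """W1 between equal-size empirical 1-D samples = mean gap after sorting."""
    return float(np.abs(np.sort(a) - np.sort(b)).mean())

print(round(wasserstein1(offline_next, online_next), 2))  # ~3.0
```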

Result: Theoretical analysis shows reduced performance gap to optimal policy; empirical evaluation demonstrates consistent outperformance across RL benchmarks with shifted-dynamics data.

Conclusion: CompFlow provides principled solution to dynamics mismatch in offline-to-online RL through flow matching framework with stable Wasserstein-based gap estimation and optimistic exploration strategy.

Abstract: Incorporating pre-collected offline data can substantially improve the sample efficiency of reinforcement learning (RL), but its benefits can break down when the transition dynamics in the offline dataset differ from those encountered online. Existing approaches typically mitigate this issue by penalizing or filtering offline transitions in regions with large dynamics gap. However, their dynamics-gap estimators often rely on KL divergence or mutual information, which can be ill-defined when offline and online dynamics have mismatched support. To address this challenge, we propose CompFlow, a principled framework built on the theoretical connection between flow matching and optimal transport. Specifically, we model the online dynamics as a conditional flow built upon the output distribution of a pretrained offline flow, rather than learning it directly from a Gaussian prior. This composite structure provides two advantages: (1) improved generalization when learning online dynamics under limited interaction data, and (2) a well-defined and stable estimate of the dynamics gap via the Wasserstein distance between offline and online transitions. Building on this dynamics-gap estimator, we further develop an optimistic active data collection strategy that prioritizes exploration in high-gap regions, and show theoretically that it reduces the performance gap to the optimal policy. Empirically, CompFlow consistently outperforms strong baselines across a range of RL benchmarks with shifted-dynamics data.

[388] Recalibrating binary probabilistic classifiers

Dirk Tasche

Main category: cs.LG

TL;DR: This paper addresses the problem of recalibrating binary probabilistic classifiers to target prior probabilities, focusing on distribution shift assumptions and proposing two new methods (CSPD and QMM) that provide conservative results for concave evaluation functions like credit risk weights.

DetailsMotivation: Recalibration of binary probabilistic classifiers to target prior probabilities is important in applications like credit risk management, but the problem is not well-defined because multiple transformations can match the target. The paper aims to provide meaningful recalibration methods by analyzing distribution shift assumptions.

Method: The paper analyzes recalibration methods from a distribution shift perspective, linking assumptions to AUC properties. It proposes two new methods: parametric covariate shift with posterior drift (CSPD) and ROC-based quasi moment matching (QMM). These methods are tested alongside other approaches in example settings.
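
For orientation, the simplest recalibration baseline (prior shift, also called label shift) has a closed form: posterior odds are rescaled by the ratio of target to training prior odds. CSPD and QMM are designed for settings where this assumption is too crude.

```python
# Prior-shift recalibration of a binary posterior probability. Under the
# label-shift assumption, the class-conditional feature distributions are
# unchanged, so only the prior odds need adjusting.

def prior_shift_recalibrate(p: float, train_prior: float, target_prior: float) -> float:
    """Map posterior p (trained at train_prior) to the target prior."""
    odds = (p / (1 - p)) * (target_prior / train_prior) * ((1 - train_prior) / (1 - target_prior))
    return odds / (1 + odds)

# A 20% posterior learned at a 10% default rate, recalibrated to a 5% target.
print(round(prior_shift_recalibrate(0.20, 0.10, 0.05), 4))  # 0.1059
```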

Result: The test outcomes suggest that the QMM methods discussed in the paper can provide appropriately conservative results in evaluations with concave functions, such as risk weight functions for credit risk applications.

Conclusion: Distribution shift assumptions linked to AUC properties are useful for designing meaningful recalibration methods. The proposed QMM methods offer conservative results suitable for concave evaluation functions in applications like credit risk management.

Abstract: Recalibration of binary probabilistic classifiers to a target prior probability is an important task in areas like credit risk management. However, recalibration of a classifier learned on a training dataset to a target on a test dataset in general is not a well-defined problem because there might be more than one way to transform the original posterior probabilities such that the target is matched. In this paper, methods for recalibration are analysed from a distribution shift perspective. Distribution shift assumptions linked to the area under the curve (AUC) of a probabilistic classifier are found to be useful for the design of meaningful recalibration methods. Two new methods called parametric covariate shift with posterior drift (CSPD) and ROC-based quasi moment matching (QMM) are proposed and tested together with some other methods in an example setting. The outcomes of the test suggest that the QMM methods discussed in the paper can provide appropriately conservative results in evaluations with concave functions such as risk weight functions for credit risk.

[389] Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng

Main category: cs.LG

TL;DR: A multi-agent multi-armed bandit framework with strategic probing for fair resource allocation, achieving good fairness-efficiency tradeoffs with provable guarantees.

DetailsMotivation: To address the challenge of ensuring fair outcomes across multiple agents while maximizing overall system performance in multi-agent bandit settings, particularly when agents have limited information about arm rewards.

Method: Introduces a novel probing framework that strategically gathers information about selected arms before allocation. For offline setting (known reward distributions), uses submodular properties to design greedy probing algorithm. For online setting, develops algorithm achieving sublinear regret while maintaining fairness.
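
The offline greedy probing step can be sketched with a toy coverage-style submodular objective (a stand-in for the paper's fairness-aware value): add the arm with the largest marginal gain until the probing budget runs out, which is where the classic monotone-submodular guarantee comes from.

```python
# Greedy probe-set selection for a monotone submodular set value. The
# "information regions" objective below is a toy surrogate, not the paper's.

def greedy_probe(arms, budget, value):
    """Select up to `budget` arms by largest marginal gain."""
    chosen = set()
    for _ in range(budget):
        best = max((a for a in arms if a not in chosen),
                   key=lambda a: value(chosen | {a}) - value(chosen),
                   default=None)
        if best is None:
            break
        chosen.add(best)
    return chosen

regions = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4}}
value = lambda s: len(set().union(*(regions[a] for a in s))) if s else 0
print(greedy_probe(regions.keys(), budget=2, value=value))  # e.g. {0, 1}
```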

Result: Extensive experiments on synthetic and real-world datasets show the approach outperforms baseline methods, achieving better fairness and efficiency. The offline algorithm has provable performance bounds, while the online algorithm achieves sublinear regret.

Conclusion: The proposed MA-MAB framework with strategic probing effectively balances fairness and efficiency in multi-agent resource allocation problems, with strong theoretical guarantees and empirical performance.

Abstract: We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

[390] PDE-aware Optimizer for Physics-informed Neural Networks

Vismay Churiwala, Hardik Shukla, Manurag Khullar

Main category: cs.LG

TL;DR: Proposed a PDE-aware optimizer for PINNs that adapts parameter updates based on PDE residual gradient variance, achieving smoother convergence and lower errors than Adam and SOAP on stiff PDEs.

DetailsMotivation: Standard optimizers like Adam struggle to balance competing loss terms in PINNs, especially for stiff or ill-conditioned PDE systems, leading to convergence issues.

Method: Developed a PDE-aware optimizer that adapts parameter updates based on the variance of per-sample PDE residual gradients, addressing gradient misalignment without the computational cost of second-order methods like SOAP.
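
A hedged sketch of the update rule's flavor follows (the paper's exact preconditioning may differ): compute per-sample residual gradients and damp each parameter's step by their cross-sample variance, so directions the samples disagree on move cautiously.

```python
import torch

# Toy "network" and residual: the point is the per-sample gradient statistics,
# not the model. Collocation points stand in for PDE sample locations.

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)          # toy parameters
xs = torch.randn(8, 3)                          # collocation points

def residual(w, x):
    """Stand-in PDE residual at one collocation point."""
    return (x * w).sum() ** 2

grads = torch.stack([torch.autograd.grad(residual(w, x), w)[0] for x in xs])
mean_g, var_g = grads.mean(0), grads.var(0)     # cross-sample statistics

lr = 1e-2
with torch.no_grad():
    w -= lr * mean_g / (var_g.sqrt() + 1e-8)    # variance-aware damping
```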

Result: Benchmarked on 1D Burgers’, Allen-Cahn, and KdV equations, showing smoother convergence and lower absolute errors compared to Adam and SOAP, particularly in regions with sharp gradients.

Conclusion: PDE residual-aware adaptivity effectively enhances PINNs training stability, though scaling to larger architectures and hardware accelerators requires future research.

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical constraints into the loss function. However, standard optimizers such as Adam often struggle to balance competing loss terms, particularly in stiff or ill-conditioned systems. In this work, we propose a PDE-aware optimizer that adapts parameter updates based on the variance of per-sample PDE residual gradients. This method addresses gradient misalignment without incurring the heavy computational costs of second-order optimizers such as SOAP. We benchmark the PDE-aware optimizer against Adam and SOAP on 1D Burgers’, Allen-Cahn and Korteweg-de Vries (KdV) equations. Across all three PDEs, the PDE-aware optimizer achieves smoother convergence and lower absolute errors, particularly in regions with sharp gradients. Our results demonstrate the effectiveness of PDE residual-aware adaptivity in enhancing stability in PINN training. While promising, further scaling on larger architectures and hardware accelerators remains an important direction for future research.

[391] U-PINet: Physics-Informed Hierarchical Learning for Radar Cross Section Prediction via 3D Electromagnetic Scattering Reconstruction

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Peng Wang, Wenbo Wang, Wei Xiang

Main category: cs.LG

TL;DR: U-PINet: A physics-informed hierarchical neural network for efficient 3D electromagnetic scattering reconstruction and RCS prediction with orders-of-magnitude speedup while maintaining EM-solver-level accuracy.

DetailsMotivation: Conventional CEM solvers are computationally expensive for repeated queries and large-scale 3D scenarios, while purely data-driven networks bypass scattering mechanisms, compromising physical consistency and generalization.

Method: U-PINet uses a physics-informed hierarchical network with physics-guided graph neural network to capture electromagnetic coupling, modeling local coupling and long-range radiation effects through hierarchical operator design inspired by near-far field decomposition. It embeds governing equations as residual constraints.

Result: Achieves EM-solver-level RCS accuracy and 3D object reconstruction with orders-of-magnitude speedups, generalizes well to unseen geometries under limited training data.

Conclusion: U-PINet bridges the gap between conventional solvers and data-driven methods by providing physically consistent, efficient RCS prediction through explicit learning of scattering representations.

Abstract: Conventional computational electromagnetics (CEM) solvers can deliver high fidelity radar cross section (RCS) signatures by first solving the induced surface currents on 3-dimensional (3D) targets and then evaluating the scattered fields via radiation integrals. However, their computational cost becomes prohibitive for repeated queries and large-scale 3D scenarios. Recent purely data-driven networks improve efficiency, yet they often bypass this scattering mechanism, which may compromise physical consistency and generalization. To bridge this gap, in this paper, we propose U-PINet, a fully end-to-end, physics-informed hierarchical network for efficient RCS prediction via 3D electromagnetic scattering reconstruction. Once the scattering quantities are reconstructed, scattered fields and RCS can be evaluated for arbitrary observation directions via the radiation integral. U-PINet explicitly learns physics-consistent intermediate scattering representations by modeling local electromagnetic coupling and long-range radiation effects through a hierarchical operator design inspired by near-far field decomposition in fast solvers. A physics-guided graph neural network is incorporated to capture self- and mutual-coupling among mesh elements of complex targets, enabling physically interpretable intermediate representations. By embedding governing equations as residual constraints, U-PINet enables accurate object reconstruction of scattering quantities and consequently reliable RCS prediction across observation directions, while significantly reducing runtime. Extensive numerical experiments demonstrate that U-PINet achieves EM-solver-level RCS accuracy and 3D object reconstruction with orders-of-magnitude speedups, and generalizes well to unseen geometries under limited training data.

[392] Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss

Antoine Oriou, Philipp Krah, Julian Koellermeier

Main category: cs.LG

TL;DR: IDEA is an autoencoder that estimates intrinsic dimension of datasets on linear/nonlinear manifolds while also reconstructing data using re-weighted double CancelOut layers and projected reconstruction loss.

DetailsMotivation: Need for method that can both estimate intrinsic dimension of datasets (especially those on nonlinear manifolds) AND reconstruct original data from the identified latent space, going beyond existing intrinsic dimension estimators.

Method: Intrinsic Dimension Estimating Autoencoder with re-weighted double CancelOut layers and projected reconstruction loss that continuously assesses reconstruction quality under removal of additional latent dimensions.

Result: Good accuracy and high versatility on theoretical benchmarks, successful application to complex fluid dynamics data from numerical solution of free-surface flow, able to estimate intrinsic dimension and reconstruct original solution.

Conclusion: IDEA provides robust intrinsic dimension estimation with reconstruction capability, validated on benchmarks and applied to complex scientific data, demonstrating practical utility beyond theoretical estimation.

Abstract: This paper introduces the Intrinsic Dimension Estimating Autoencoder (IDEA), which identifies the underlying intrinsic dimension of a wide range of datasets whose samples lie on either linear or nonlinear manifolds. Beyond estimating the intrinsic dimension, IDEA is also able to reconstruct the original dataset after projecting it onto the corresponding latent space, which is structured using re-weighted double CancelOut layers. Our key contribution is the introduction of the projected reconstruction loss term, guiding the training of the model by continuously assessing the reconstruction quality under the removal of an additional latent dimension. We first assess the performance of IDEA on a series of theoretical benchmarks to validate its robustness. These experiments allow us to test its reconstruction ability and compare its performance with state-of-the-art intrinsic dimension estimators. The benchmarks show good accuracy and high versatility of our approach. Subsequently, we apply our model to data generated from the numerical solution of a vertically resolved one-dimensional free-surface flow, following a pointwise discretization of the vertical velocity profile in the horizontal direction, vertical direction, and time. IDEA succeeds in estimating the dataset’s intrinsic dimension and then reconstructs the original solution by working directly within the projection space identified by the network.
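
A minimal sketch of the two ingredients named above, a CancelOut-style gate and a projected reconstruction loss, assuming PyTorch; the layer sizes, gating form, and loss weighting are our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CancelOut(nn.Module):
    """Learned per-dimension gate over the latent space."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))

    def forward(self, z):
        return z * torch.sigmoid(self.gate)

enc = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 6))
dec = nn.Sequential(nn.Linear(6, 32), nn.Tanh(), nn.Linear(32, 10))
cancel = CancelOut(6)

x = torch.randn(256, 10)
z = cancel(enc(x))

# Projected loss: additionally reconstruct with the weakest still-active latent
# dimension zeroed, probing whether that dimension is truly needed.
weakest = torch.sigmoid(cancel.gate).argmin()
z_proj = z.clone()
z_proj[:, weakest] = 0.0
loss = nn.functional.mse_loss(dec(z), x) + nn.functional.mse_loss(dec(z_proj), x)
loss.backward()
print("combined reconstruction + projected loss:", float(loss))
```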

[393] Cost-Free Personalization via Information-Geometric Projection in Bayesian Federated Learning

Nour Jamoussi, Giuseppe Serra, Photios A. Stavrou, Marios Kountouris

Main category: cs.LG

TL;DR: Proposes an information-geometric projection framework for personalized Bayesian Federated Learning that enables tunable trade-off between global generalization and local specialization with minimal computational overhead.

DetailsMotivation: Bayesian Federated Learning (BFL) needs better personalization mechanisms to handle data heterogeneity and privacy constraints. Existing approaches using MCMC or variational inference often lack efficient personalization methods that balance global and local performance effectively.

Method: Information-geometric projection framework that projects the global model onto a neighborhood of the user’s local model. This projection is equivalent to computing a barycenter on the statistical manifold, enabling closed-form solutions and cost-free personalization. Applied to variational learning using IVON optimizer and extended to general BFL aggregation schemes.

Result: Empirical evaluations under heterogeneous data distributions show the method effectively balances global and local performance with minimal computational overhead. The framework enables tunable trade-off between generalization and specialization.

Conclusion: The proposed information-geometric projection framework provides an efficient, mathematically grounded approach for personalization in Bayesian Federated Learning, achieving optimal balance between global and local model performance without significant computational cost.

Abstract: Bayesian Federated Learning (BFL) combines uncertainty modeling with decentralized training, enabling the development of personalized and reliable models under data heterogeneity and privacy constraints. Existing approaches typically rely on Markov Chain Monte Carlo (MCMC) sampling or variational inference, often incorporating personalization mechanisms to better adapt to local data distributions. In this work, we propose an information-geometric projection framework for personalization in parametric BFL. By projecting the global model onto a neighborhood of the user’s local model, our method enables a tunable trade-off between global generalization and local specialization. Under mild assumptions, we show that this projection step is equivalent to computing a barycenter on the statistical manifold, allowing us to derive closed-form solutions and achieve cost-free personalization. We apply the proposed approach to a variational learning setup using the Improved Variational Online Newton (IVON) optimizer and extend its application to general aggregation schemes in BFL. Empirical evaluations under heterogeneous data distributions confirm that our method effectively balances global and local performance with minimal computational overhead.
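
For intuition, a closed-form sketch in the simplest case: with diagonal-Gaussian posteriors, a barycenter-style personalization step can interpolate natural parameters (precision and precision-weighted mean). This is our simplification under assumed Gaussian posteriors, not the paper's exact derivation:

```python
import numpy as np

def personalize(mu_g, var_g, mu_l, var_l, lam):
    """Blend global and local Gaussian posteriors; lam tunes the trade-off."""
    prec = lam / var_g + (1 - lam) / var_l            # blended precision
    mean = (lam * mu_g / var_g + (1 - lam) * mu_l / var_l) / prec
    return mean, 1.0 / prec

mu_g, var_g = np.array([0.0, 1.0]), np.array([1.0, 0.5])   # global posterior
mu_l, var_l = np.array([2.0, -1.0]), np.array([0.2, 0.3])  # local posterior

for lam in (0.0, 0.5, 1.0):                                # local -> global sweep
    m, v = personalize(mu_g, var_g, mu_l, var_l, lam)
    print(f"lam={lam}: mean={np.round(m, 3)}, var={np.round(v, 3)}")
```

Because the blend is closed form, the personalization step adds essentially no cost on top of aggregation, which is the sense in which the TL;DR calls it cost-free.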

[394] PPGFlowECG: Latent Rectified Flow with Cross-Modal Encoding for PPG-Guided ECG Generation and Cardiovascular Disease Detection

Xiaocheng Fang, Jiarui Jin, Haoyu Wang, Che Liu, Jieyi Cai, Yujie Xiao, Guangkun Nie, Bo Liu, Shun Huang, Hongyan Li, Shenda Hong

Main category: cs.LG

TL;DR: PPGFlowECG: A two-stage framework that translates PPG signals into clinically useful ECG signals using latent space alignment and rectified flow synthesis for wearable CVD screening.

DetailsMotivation: ECG is the gold standard for CVD assessment but requires dedicated hardware and trained personnel for continuous monitoring. PPG is ubiquitous in wearables but lacks diagnostic reliability. Existing generative methods for PPG-to-ECG translation are limited by physiological semantic misalignment and high-dimensional signal complexity.

Method: Two-stage framework: 1) CardioAlign Encoder aligns PPG and ECG in a shared latent space, 2) latent rectified flow synthesizes ECGs from aligned representations. Formal analysis shows CardioAlign Encoder is necessary for stable and semantically consistent ECG synthesis.

Result: Extensive experiments on four datasets demonstrate improved synthesis fidelity and downstream diagnostic utility compared to existing approaches.

Conclusion: PPGFlowECG enables scalable, wearable-first CVD screening when standard ECG acquisition is unavailable, bridging the gap between ubiquitous PPG monitoring and clinical-grade ECG diagnostics.

Abstract: Electrocardiography (ECG) is the clinical gold standard for cardiovascular disease (CVD) assessment, yet continuous monitoring is constrained by the need for dedicated hardware and trained personnel. Photoplethysmography (PPG) is ubiquitous in wearable devices and readily scalable, but it lacks electrophysiological specificity, limiting diagnostic reliability. While generative methods aim to translate PPG into clinically useful ECG signals, existing approaches are limited by the misalignment of physiological semantics in generative models and the complexity of modeling in high-dimensional signals. To address these limitations, we propose PPGFlowECG, a two-stage framework that aligns PPG and ECG in a shared latent space using the CardioAlign Encoder and then synthesizes ECGs with latent rectified flow. We further provide a formal analysis of this coupling, showing that the CardioAlign Encoder is necessary to guarantee stable and semantically consistent ECG synthesis under our formulation. Extensive experiments on four datasets demonstrate improved synthesis fidelity and downstream diagnostic utility. These results indicate that PPGFlowECG supports scalable, wearable-first CVD screening when standard ECG acquisition is unavailable.
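
The second stage can be illustrated with the standard rectified-flow objective on toy latents (tensor shapes and the tiny velocity network are assumptions; the paper's CardioAlign encoder is not reproduced), assuming PyTorch:

```python
import torch
import torch.nn as nn

v_net = nn.Sequential(nn.Linear(17, 64), nn.SiLU(), nn.Linear(64, 16))

z_ppg = torch.randn(128, 16)        # stand-in for aligned PPG latents (source)
z_ecg = torch.randn(128, 16)        # stand-in for aligned ECG latents (target)

t = torch.rand(128, 1)
z_t = (1 - t) * z_ppg + t * z_ecg   # point on the straight source-to-target path
target_v = z_ecg - z_ppg            # ideal constant velocity along that path
pred_v = v_net(torch.cat([z_t, t], dim=1))  # velocity field conditioned on time

loss = nn.functional.mse_loss(pred_v, target_v)
loss.backward()
print("rectified-flow training loss:", float(loss))
```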

[395] Manifold-based Sampling for In-Context Hallucination Detection in Large Language Models

Bodla Krishna Vamshi, Rohan Bhatnagar, Haizhao Yang

Main category: cs.LG

TL;DR: MB-ICL: A manifold-based demonstration sampling framework for hallucination detection that selects in-context examples using latent representations and class-aware prototypes, outperforming similarity-based methods.

DetailsMotivation: LLMs often generate factually incorrect content (hallucinations). Existing ICL demonstration selection methods rely on surface-level similarity heuristics and lack robustness across tasks and models.

Method: MB-ICL leverages latent representations from frozen LLMs to model local manifold structure and class-aware prototype geometry. It selects demonstrations based on proximity to learned prototypes rather than lexical or embedding similarity alone.

Result: Outperforms standard ICL selection baselines on factual verification (FEVER) and hallucination detection (HaluEval) benchmarks, with strong gains on dialogue and summarization tasks. Shows robustness under temperature perturbations and model variation.

Conclusion: Manifold-based prototype selection provides a reliable, training-light approach for hallucination detection without modifying LLM parameters, offering a principled direction for improved ICL demonstration selection.

Abstract: Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior work has explored decoding strategies, retrieval augmentation, and supervised fine-tuning for hallucination detection, while recent studies show that in-context learning (ICL) can substantially influence factual reliability. However, existing ICL demonstration selection methods often rely on surface-level similarity heuristics and exhibit limited robustness across tasks and models. We propose MB-ICL, a manifold-based demonstration sampling framework for selecting in-context demonstrations that leverages latent representations extracted from frozen LLMs. By jointly modeling local manifold structure and class-aware prototype geometry, MB-ICL selects demonstrations based on their proximity to learned prototypes rather than lexical or embedding similarity alone. Across factual verification (FEVER) and hallucination detection (HaluEval) benchmarks, MB-ICL outperforms standard ICL selection baselines in the majority of evaluated settings, with particularly strong gains on dialogue and summarization tasks. The method remains robust under temperature perturbations and model variation, indicating improved stability compared to heuristic retrieval strategies. While lexical retrieval can remain competitive in certain question-answering regimes, our results demonstrate that manifold-based prototype selection provides a reliable and training-light approach for hallucination detection without modifying LLM parameters, offering a principled direction for improved ICL demonstration selection.
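
A minimal sketch of prototype-proximity selection on synthetic embeddings (the embeddings, labels, and selection budget are stand-ins; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 16))           # stand-in for frozen-LLM embeddings
labels = rng.integers(0, 2, size=100)      # e.g. faithful vs. hallucinated

def select_demos(emb, labels, k_per_class=4):
    chosen = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        proto = emb[idx].mean(axis=0)                    # class-aware prototype
        d = np.linalg.norm(emb[idx] - proto, axis=1)     # proximity to prototype,
        chosen.extend(idx[np.argsort(d)[:k_per_class]])  # not lexical similarity
    return chosen

print("selected demonstration indices:", select_demos(emb, labels))
```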

[396] Robust Barycenters of Persistence Diagrams

Keanu Sisouk, Eloi Tanguy, Julie Delon, Julien Tierny

Main category: cs.LG

TL;DR: General approach for computing robust Wasserstein barycenters of persistence diagrams that works for transportation costs beyond q=2, particularly robust q ∈ (1,2) distances.

DetailsMotivation: Classical methods for computing Wasserstein barycenters of persistence diagrams only work for q=2 Wasserstein distances, limiting robustness to outliers. Need methods for generic transportation costs (q>1) that are more robust.

Method: Adapt an alternative fixed-point method to compute barycenter diagrams for generic transportation costs (q>1), particularly robust q ∈ (1,2) distances that are less sensitive to outliers.

Result: Successfully applied to two scenarios: (i) clustering persistence diagrams and (ii) dictionary encoding of persistence diagrams, demonstrating added robustness to outliers compared to classical q=2 methods.

Conclusion: The proposed framework generalizes Wasserstein barycenter computation to robust transportation costs, providing practical benefits for topological data analysis applications where outlier robustness is important.

Abstract: This short paper presents a general approach for computing robust Wasserstein barycenters of persistence diagrams. The classical method consists in computing assignment arithmetic means after finding the optimal transport plans between the barycenter and the persistence diagrams. However, this procedure only works for the transportation cost related to the $q$-Wasserstein distance $W_q$ when $q=2$. We adapt an alternative fixed-point method to compute a barycenter diagram for generic transportation costs ($q > 1$), in particular those robust to outliers, $q \in (1,2)$. We show the utility of our work in two applications: (i) the clustering of persistence diagrams on their metric space and (ii) the dictionary encoding of persistence diagrams. In both scenarios, we demonstrate the added robustness to outliers provided by our generalized framework. Our Python implementation is available at https://github.com/Keanu-Sisouk/RobustBarycenter.
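
To see the fixed-point idea at its core, consider the per-point subproblem once matchings are fixed: minimizing a sum of |b - x_i|^q costs yields a Weiszfeld-type weighted mean whose weights damp outliers for q in (1,2). The sketch below shows only this step (the persistence-diagram matching itself is omitted, and the toy points are assumptions):

```python
import numpy as np

def fixed_point_barycenter(points, q=1.5, iters=100, eps=1e-9):
    b = points.mean(axis=0)                          # q=2 solution as the start
    for _ in range(iters):
        # weights |b - x_i|^(q-2); for q in (1,2) distant points count less
        w = (np.linalg.norm(points - b, axis=1) + eps) ** (q - 2)
        b = (w[:, None] * points).sum(axis=0) / w.sum()
    return b

pts = np.array([[1.0, 2.0], [1.2, 2.1], [0.9, 1.9], [8.0, 9.0]])  # one outlier
print("q=2 arithmetic mean:", pts.mean(axis=0))
print("robust q=1.5 barycenter:", np.round(fixed_point_barycenter(pts), 3))
```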

[397] Targeted Fine-Tuning of DNN-Based Receivers via Influence Functions

Marko Tuononen, Heikki Penttinen, Ville Hautamäki

Main category: cs.LG

TL;DR: First application of influence functions to deep learning-based wireless receivers (DeepRx) for interpretability and targeted fine-tuning, showing improved bit error rate performance through loss-relative influence analysis.

DetailsMotivation: To enhance interpretability and enable efficient adaptation of deep learning-based wireless receivers by identifying which training samples influence specific bit predictions, allowing for targeted fine-tuning of poorly performing cases.

Method: Applied influence functions to DeepRx (fully convolutional receiver) using loss-relative influence analysis with capacity-like binary cross-entropy loss. Used first-order updates on beneficial training samples for targeted fine-tuning, and proposed second-order influence-aligned update strategy.

Result: Loss-relative influence with first-order updates on beneficial samples consistently improved bit error rate toward genie-aided performance, outperforming random fine-tuning in single-target scenarios. Multi-target adaptation was less effective, revealing open challenges.

Conclusion: Influence functions serve as both an interpretability tool for deep learning-based wireless receivers and a basis for efficient receiver adaptation, establishing their practical utility in wireless communications.

Abstract: We present the first use of influence functions for deep learning-based wireless receivers. Applied to DeepRx, a fully convolutional receiver, influence analysis reveals which training samples drive bit predictions, enabling targeted fine-tuning of poorly performing cases. We show that loss-relative influence with capacity-like binary cross-entropy loss and first-order updates on beneficial samples most consistently improves bit error rate toward genie-aided performance, outperforming random fine-tuning in single-target scenarios. Multi-target adaptation proved less effective, underscoring open challenges. Beyond experiments, we connect influence to self-influence corrections and propose a second-order, influence-aligned update strategy. Our results establish influence functions as both an interpretability tool and a basis for efficient receiver adaptation.
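
A first-order caricature of the selection step on a linear toy model (deliberately ignoring the Hessian term of the full influence function; DeepRx itself and its loss are not reproduced, and all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
w = np.zeros(5)                                 # a deliberately poor model

def grad(w, x, yi):                             # squared-error gradient
    return (x @ w - yi) * x

x_t, y_t = X[0], y[0]                           # a poorly served target case
g_test = grad(w, x_t, y_t)
# First-order benefit: training samples whose gradients align with the target's
benefit = np.array([g_test @ grad(w, X[i], y[i]) for i in range(len(X))])

loss_before = float((x_t @ w - y_t) ** 2)
for i in np.argsort(benefit)[-20:]:             # fine-tune on beneficial samples
    w -= 0.01 * grad(w, X[i], y[i])
print("target loss:", round(loss_before, 3), "->",
      round(float((x_t @ w - y_t) ** 2), 3))
```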

[398] Achilles’ Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data

Tianyi Chen, Pengxiao Lin, Zhiwei Wang, Zhi-Qin John Xu

Main category: cs.LG

TL;DR: Mamba’s nonlinear convolution creates asymmetry bias, impairing symmetrical pattern recognition and reversed sequence matching, despite SSMs’ linear complexity advantages.

DetailsMotivation: While Mamba shows promise as an alternative to Transformers with linear complexity, fundamental differences between architectures remain poorly understood. The paper aims to systematically identify Mamba's inherent limitations through controlled experiments.

Method: Used carefully designed synthetic tasks including composite function and inverse sequence matching tasks to isolate Mamba’s limitations. Analyzed performance on symmetrical patterns and relationships to identify architectural biases.

Result: Mamba exhibits strong asymmetry bias due to its nonlinear convolution, impairing symmetrical pattern recognition and reversed sequence matching. The bias favors compositional solutions over symmetrical ones, and the limitation stems specifically from the nonlinear convolution rather than the SSM module itself.

Conclusion: Mamba’s architectural constraints are rooted in its nonlinear convolution’s asymmetric token fusion. These insights provide understanding of Mamba’s limitations and suggest concrete improvements for future sequence models.

Abstract: State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms, with the Mamba architecture demonstrating impressive performance and linear complexity for processing long sequences. However, the fundamental differences between Mamba and Transformer architectures remain incompletely understood. In this work, we use carefully designed synthetic tasks to reveal Mamba’s inherent limitations. Through experiments, we identify that Mamba’s nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns and relationships. Using composite function and inverse sequence matching tasks, we demonstrate that Mamba strongly favors compositional solutions over symmetrical ones and struggles with tasks requiring the matching of reversed sequences. We show these limitations stem not from the SSM module itself but from the nonlinear convolution preceding it, which fuses token information asymmetrically. These insights provide a new understanding of Mamba’s constraints and suggest concrete architectural improvements for future sequence models.
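
A sketch of how such a synthetic probe can be built (the exact task construction in the paper may differ; this is our minimal reading of "inverse sequence matching"):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_example(length=8, vocab=16):
    """Label 1 iff the second half is exactly the reversal of the first half."""
    first = rng.integers(0, vocab, size=length)
    second = first[::-1].copy()
    label = 1
    if rng.random() < 0.5:                       # negative: corrupt one token
        i = rng.integers(0, length)
        second[i] = (second[i] + 1) % vocab
        label = 0
    return np.concatenate([first, second]), label

seq, label = make_example()
print("sequence:", seq, "| is-reversal:", label)
```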

[399] Scaffold-Aware Generative Augmentation and Reranking for Enhanced Virtual Screening

Xin Wang, Yu Wang, Yunchao Liu, Jens Meiler, Tyler Derr

Main category: cs.LG

TL;DR: ScaffAug is a scaffold-aware virtual screening framework that addresses class imbalance, structural imbalance, and diversity needs in drug discovery through generative augmentation, self-training, and reranking modules.

DetailsMotivation: Virtual screening faces three major challenges: class imbalance (low active rate), structural imbalance (certain scaffolds dominate), and the need to identify structurally diverse active compounds for novel drug development.

Method: Three-module framework: 1) Augmentation module using graph diffusion models to generate synthetic data conditioned on scaffolds with scaffold-aware sampling to address underrepresented scaffolds; 2) Model-agnostic self-training to integrate synthetic and original data; 3) Reranking module to enhance scaffold diversity in top recommendations while maintaining performance.

Result: Comprehensive computational experiments across five target classes show ScaffAug outperforms existing baseline methods on multiple evaluation metrics, with ablation studies confirming the effectiveness of its components.

Conclusion: ScaffAug introduces novel perspectives on enhancing virtual screening by leveraging generative augmentations, reranking, and scaffold-awareness to address key challenges in drug discovery.

Abstract: Ligand-based virtual screening (VS) is an essential step in drug discovery that evaluates large chemical libraries to identify compounds that potentially bind to a therapeutic target. However, VS faces three major challenges: class imbalance due to the low active rate, structural imbalance among active molecules where certain scaffolds dominate, and the need to identify structurally diverse active compounds for novel drug development. We introduce ScaffAug, a scaffold-aware VS framework that addresses these challenges through three modules. The augmentation module first generates synthetic data conditioned on scaffolds of actual hits using generative models, specifically a graph diffusion model. This helps mitigate the class imbalance and furthermore the structural imbalance, due to our proposed scaffold-aware sampling algorithm, designed to produce more samples for active molecules with underrepresented scaffolds. A model-agnostic self-training module is then used to safely integrate the generated synthetic data from our augmentation module with the original labeled data. Lastly, we introduce a reranking module that improves VS by enhancing scaffold diversity in the top recommended set of molecules, while still maintaining and even enhancing the overall general performance of identifying novel, active compounds. We conduct comprehensive computational experiments across five target classes, comparing ScaffAug against existing baseline methods by reporting the performance of multiple evaluation metrics and performing ablation studies on ScaffAug. Overall, this work introduces novel perspectives on effectively enhancing VS by leveraging generative augmentations, reranking, and general scaffold-awareness.
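
The scaffold-aware sampling idea can be sketched as an inverse-frequency budget split (the exact weighting rule in the paper is not specified here, so this allocation is an assumption):

```python
from collections import Counter
import numpy as np

scaffolds = ["benzene", "benzene", "benzene", "indole", "pyridine"]  # toy hits
counts = Counter(scaffolds)

budget = 30                                    # synthetic molecules to generate
weights = np.array([1.0 / counts[s] for s in scaffolds])  # rarer scaffold -> more
weights /= weights.sum()
alloc = np.floor(budget * weights).astype(int)
for s, n in zip(scaffolds, alloc):
    print(f"generate {n} analogs conditioned on scaffold '{s}'")
```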

[400] NeuroClean: A Generalized Machine-Learning Approach to Neural Time-Series Conditioning

Manuel A. Hernandez Alonso, Michael Depass, Stephan Quessy, Ali Falaki, Soraya Rahimi, Numa Dancause, Ignasi Cos

Main category: cs.LG

TL;DR: NeuroClean is an unsupervised EEG/LFP preprocessing pipeline that automatically removes artifacts while preserving task-relevant information, achieving 97% accuracy in motor task classification compared to 74% with raw data.

DetailsMotivation: EEG and LFP recordings suffer from various artifacts and noise, requiring proper conditioning. Current preprocessing often involves human intervention, causing reproducibility issues and biases. There's a need for fully automated, unsupervised preprocessing methods.

Method: Five-step pipeline including bandpass and line noise filtering, bad channel rejection, and efficient independent component analysis with automatic component rejection using a clustering algorithm. Uses machine learning classifier to ensure task-relevant information preservation.

Result: NeuroClean successfully removed common artifacts and achieved 97% accuracy in motor task classification using Multinomial Logistic Regression (vs. 33.3% chance level), compared to 74% accuracy with raw data.

Conclusion: NeuroClean is a promising unsupervised preprocessing pipeline that improves machine learning performance and generalization for EEG/LFP studies while addressing reproducibility issues from human intervention.

Abstract: Electroencephalography (EEG) and local field potentials (LFP) are two widely used techniques to record electrical activity from the brain. These signals are used in both the clinical and research domains for multiple applications. However, most brain data recordings suffer from a myriad of artifacts and noise sources other than the brain itself. Thus, a major requirement for their use is proper and, given current data volumes, fully automated conditioning. To this end, we introduce an unsupervised, multipurpose EEG/LFP preprocessing method, the NeuroClean pipeline. In addition to its completeness and reliability, NeuroClean is an unsupervised series of algorithms intended to mitigate reproducibility issues and biases caused by human intervention. The pipeline is designed as a five-step process, including the common bandpass and line noise filtering, and bad channel rejection. However, it incorporates an efficient independent component analysis with an automatic component rejection based on a clustering algorithm. This machine learning classifier is used to ensure that task-relevant information is preserved after each step of the cleaning process. We used several data sets to validate the pipeline. NeuroClean removed several common types of artifacts from the signal. Moreover, in the context of motor tasks of varying complexity, it yielded more than 97% accuracy (vs. a chance level of 33.3%) in an optimized Multinomial Logistic Regression model after cleaning the data, compared to 74% accuracy on the raw data. These results show that NeuroClean is a promising pipeline and workflow that can be applied in future studies to achieve better generalization and performance in machine learning pipelines.
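
The first two pipeline stages are standard filters and can be sketched directly, assuming SciPy is available (sampling rate, cutoffs, and the toy signal are assumptions; the ICA and automatic component-rejection stages are not reproduced):

```python
import numpy as np
from scipy import signal

fs = 500.0                                    # sampling rate (assumption)
t = np.arange(0, 2.0, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)  # 10 Hz + mains

b, a = signal.butter(4, [1.0, 40.0], btype="bandpass", fs=fs)
clean = signal.filtfilt(b, a, eeg)            # zero-phase 1-40 Hz bandpass

bn, an = signal.iirnotch(60.0, Q=30.0, fs=fs)
clean = signal.filtfilt(bn, an, clean)        # notch out residual 60 Hz line noise

bin60 = int(60 * len(t) / fs)                 # FFT bin holding the 60 Hz component
print("60 Hz power before/after:",
      round(float(abs(np.fft.rfft(eeg)[bin60]) ** 2), 1),
      round(float(abs(np.fft.rfft(clean)[bin60]) ** 2), 6))
```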

[401] Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Yuan Cheng, Yuan Qi, Yaodong Yang

Main category: cs.LG

TL;DR: AI system PackingStar uses game-theoretic reinforcement learning to solve high-dimensional kissing number problems, breaking 50-year-old records and discovering thousands of new configurations.

DetailsMotivation: The kissing number problem has remained unsolved for centuries, with existing methods limited by high-dimensional geometric irregularities and combinatorial complexity beyond 8 dimensions. There's a need for scalable approaches to explore high-dimensional spaces beyond human intuition.

Method: Model the kissing number problem as a two-player matrix completion game. One player fills matrix entries (pairwise cosines of sphere center vectors) while another corrects suboptimal ones, jointly maximizing matrix size. This cooperative game-theoretic reinforcement learning system (PackingStar) is fully parallelizable for large-scale exploration.

Result: PackingStar reproduces known configurations and surpasses all human-known records from dimensions 25 to 31. It achieves the first breakthrough beyond rational structures from 1971 in dimension 13, discovers over 6000 new structures in dimension 14 and other dimensions, and establishes new records for generalized kissing configurations under various angular constraints.

Conclusion: The results demonstrate AI’s power to explore high-dimensional spaces beyond human intuition, opening new pathways for solving the kissing number problem and broader geometry problems through game-theoretic reinforcement learning approaches.

Abstract: Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert’s 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of the game of Go, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game that can be fully parallelized at large scale, and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, which corresponds to the kissing number. These cooperative dynamics substantially improve sample quality, making the extremely large search spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions, discovers over 6000 new structures in 14 and other dimensions, and establishes new records for generalized kissing configurations under various angular constraints. These results demonstrate AI’s power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
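
The game's validity condition is easy to state and check: unit center vectors form a kissing configuration exactly when every pairwise cosine is at most 1/2, since spheres separated by at least 60 degrees do not overlap. A minimal checker, shown on the classical 2D case (the learned search itself is not reproduced):

```python
import numpy as np

def is_valid_kissing(vectors, tol=1e-9):
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = V @ V.T                          # pairwise cosines: the game's matrix
    np.fill_diagonal(cos, -1.0)            # ignore self-similarity
    return cos.max() <= 0.5 + tol

angles6 = np.arange(6) * np.pi / 3
hexagon = np.stack([np.cos(angles6), np.sin(angles6)], axis=1)
angles7 = np.arange(7) * 2 * np.pi / 7
heptagon = np.stack([np.cos(angles7), np.sin(angles7)], axis=1)
print("6 spheres in 2D:", is_valid_kissing(hexagon))    # True: plane's kissing number
print("7 spheres in 2D:", is_valid_kissing(heptagon))   # False: neighbors too close
```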

[402] Towards Causal Market Simulators

Dennis Thumm, Luis Ontaneda Mijares

Main category: cs.LG

TL;DR: TNCM-VAE combines VAE with structural causal models to generate counterfactual financial time series with causal reasoning for stress testing and scenario analysis.

DetailsMotivation: Existing market generators using deep generative models lack causal reasoning capabilities essential for counterfactual analysis and risk assessment in finance.

Method: Combines variational autoencoders with structural causal models, enforces causal constraints through DAGs in decoder architecture, and uses causal Wasserstein distance for training.

Result: Validated on synthetic autoregressive models inspired by Ornstein-Uhlenbeck process, achieving superior counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth.

Conclusion: The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.

Abstract: Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.

[403] Intrinsic Dimensionality as a Model-Free Measure of Class Imbalance

Çağrı Eser, Zeynep Sonat Baltacı, Emre Akbaş, Sinan Kalkan

Main category: cs.LG

TL;DR: The paper proposes using Intrinsic Dimensionality (ID) as a model-free measure of class imbalance that outperforms traditional cardinality-based methods and can be combined with them for better performance.

DetailsMotivation: Traditional cardinality-based imbalance measures ignore redundant examples and inherent class learning difficulties, while complex measures like training loss require model training. There's a need for an easy-to-compute, model-free imbalance measure.

Method: Proposes using data Intrinsic Dimensionality (ID) as an imbalance measure that can be incorporated into various imbalance mitigation methods. ID captures class complexity and learning difficulty without requiring model training.

Result: Across five datasets with diverse imbalance ratios, ID consistently outperforms cardinality-based re-weighting and re-sampling techniques. Combining ID with cardinality further improves performance.

Conclusion: Intrinsic Dimensionality is an effective, model-free measure for class imbalance that addresses limitations of cardinality-based approaches and can enhance existing imbalance mitigation methods.

Abstract: Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Our code and models are available at https://github.com/cagries/IDIM.
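
A minimal sketch of the measure itself: estimate a TwoNN-style intrinsic dimension per class, then derive class weights from it (the weighting rule here is our assumption; the paper's integration into specific mitigation methods is not reproduced):

```python
import numpy as np

def two_nn_id(X):
    """TwoNN-style maximum-likelihood ID estimate from 1st/2nd neighbor ratios."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    r = np.sort(D, axis=1)[:, :2]          # nearest and second-nearest distances
    mu = r[:, 1] / r[:, 0]
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(0)
easy = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))  # ~2-dim manifold class
hard = rng.normal(size=(200, 10))                            # ~10-dim class

ids = np.array([two_nn_id(easy), two_nn_id(hard)])
weights = ids / ids.sum()          # higher-ID (harder) class gets more weight
print("per-class ID:", np.round(ids, 2), "| class weights:", np.round(weights, 2))
```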

[404] GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes

Mohammad Pivezhandi, Mahdi Banisharif, Saeed Bakhshan, Abusayeed Saifullah, Ali Jannesari

Main category: cs.LG

TL;DR: GraphPerf-RT: A graph neural network surrogate for real-time risk-aware scheduling of autonomous AI agents on embedded platforms, achieving deep learning accuracy at heuristic speeds with calibrated uncertainty.

DetailsMotivation: Autonomous AI agents on embedded platforms need real-time, risk-aware scheduling under resource and thermal constraints. Existing approaches have limitations: classical heuristics struggle with workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating.

Method: GraphPerf-RT uses a graph neural network surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph with typed edges. It employs evidential regression with Normal-Inverse-Gamma priors for calibrated uncertainty.

Result: Achieves R^2 = 0.81 on log-transformed makespan with Spearman rho = 0.95 and conservative uncertainty calibration (PICP = 99.9% at 95% confidence). Integration with RL methods shows 66% makespan reduction and 82% energy reduction versus model-free baselines with zero thermal violations.

Conclusion: GraphPerf-RT enables efficient, risk-aware scheduling for autonomous AI agents on embedded platforms by combining graph neural networks with evidential uncertainty quantification, outperforming existing approaches while maintaining thermal safety.

Abstract: Autonomous AI agents on embedded platforms require real-time, risk-aware scheduling under resource and thermal constraints. Classical heuristics struggle with workload irregularity, tabular regressors discard structural information, and model-free reinforcement learning (RL) risks overheating. We introduce GraphPerf-RT, a graph neural network surrogate achieving deep learning accuracy at heuristic speeds (2-7ms). GraphPerf-RT is, to our knowledge, the first to unify task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph with typed edges encoding precedence, placement, and contention. Evidential regression with Normal-Inverse-Gamma priors provides calibrated uncertainty; we validate on makespan prediction for risk-aware scheduling. Experiments on three ARM platforms (Jetson TX2, Orin NX, RUBIK Pi) achieve R^2 = 0.81 on log-transformed makespan with Spearman rho = 0.95 and conservative uncertainty calibration (PICP = 99.9% at 95% confidence). Integration with four RL methods demonstrates that multi-agent model-based RL with GraphPerf-RT as the world model achieves 66% makespan reduction and 82% energy reduction versus model-free baselines, with zero thermal violations.

[405] DiEC: Diffusion Embedded Clustering

Haidong Hu, Xiaoyu Zheng, Jin Zhou, Yingxu Wang, Rui Wang, Pei Dong, Shiyuan Han, Lin Wang, C. L. Philip Chen, Tong Zhang, Yuehui Chen

Main category: cs.LG

TL;DR: DiEC is an unsupervised clustering framework that leverages optimal intermediate representations from pretrained diffusion models by systematically searching for clustering-friendly representations across network layers and noise timesteps.

DetailsMotivation: Traditional deep clustering methods use single representations, while diffusion models offer abundant multi-scale representations across layers and timesteps. The challenge is efficiently identifying the most clustering-friendly representation in the layer-timestep space.

Method: DiEC systematically evaluates clusterability of representations along network depth and noise timesteps, uses unsupervised search to identify Clustering-optimal Layer (COL) and Clustering-optimal Timestep (COT), fine-tunes with structure-preserving KL-divergence objective at fixed COL+COT, and maintains generative capability with random-timestep diffusion denoising objective.

Result: DiEC achieves excellent clustering performance across multiple benchmark datasets without relying on augmentation-based consistency constraints or contrastive learning.

Conclusion: Pretrained diffusion models provide rich multi-scale representations that can be effectively leveraged for clustering by systematically identifying optimal intermediate representations, offering a novel approach to unsupervised clustering with strong performance.

Abstract: Deep clustering methods typically rely on a single, well-defined representation for clustering. In contrast, pretrained diffusion models provide abundant and diverse multi-scale representations across network layers and noise timesteps. However, a key challenge is how to efficiently identify the most clustering-friendly representation in the layer-timestep space. To address this issue, we propose Diffusion Embedded Clustering (DiEC), an unsupervised framework that performs clustering by leveraging optimal intermediate representations from pretrained diffusion models. DiEC systematically evaluates the clusterability of representations along the trajectory of network depth and noise timesteps. Meanwhile, an unsupervised search strategy is designed for recognizing the Clustering-optimal Layer (COL) and Clustering-optimal Timestep (COT) in the layer-timestep space of pretrained diffusion models, aiming to promote clustering performance and reduce computational overhead. DiEC is fine-tuned primarily with a structure-preserving DEC-style KL-divergence objective at the fixed COL + COT, together with a random-timestep diffusion denoising objective to maintain the generative capability of the pretrained model. Without relying on augmentation-based consistency constraints or contrastive learning, DiEC achieves excellent clustering performance across multiple benchmark datasets.
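
The search step can be pictured as a grid scan over (layer, timestep) candidates scored by clusterability, assuming scikit-learn; the synthetic feature generator below stands in for real diffusion activations, and the silhouette criterion is our choice of clusterability score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

def fake_repr(layer, t):
    """Stand-in features: separation peaks near an (unknown) best layer/timestep."""
    sep = 3.0 * np.exp(-((layer - 2) ** 2 + (t - 5) ** 2) / 8.0)
    centers = sep * np.array([[1, 0], [-1, 0], [0, 1]])
    return np.vstack([c + rng.normal(0, 1, size=(60, 2)) for c in centers])

best, best_score = None, -1.0
for layer in range(4):                         # scan the layer-timestep grid
    for t in range(0, 10, 2):
        X = fake_repr(layer, t)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        s = silhouette_score(X, labels)
        if s > best_score:
            best, best_score = (layer, t), s
print("selected (COL, COT):", best, "| silhouette:", round(best_score, 3))
```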

[406] Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

Deepit Sapru

Main category: cs.LG

TL;DR: A dynamic pricing framework for subscription services that combines demand forecasting, price elasticity, and churn prediction to optimize revenue while enforcing business guardrails.

DetailsMotivation: Traditional static pricing tiers and uniform price uplifts fail to capture customer heterogeneity in willingness-to-pay, often leaving revenue on the table or eroding customer trust through inappropriate pricing. There's a need for a systematic approach that dynamically adjusts prices while maintaining ethical boundaries and business constraints.

Method: Blends seasonal time-series models with tree-based learners for multivariate demand forecasting, segment-level price elasticity, and churn propensity. Uses Monte Carlo scenario testing to map risk envelopes and solves constrained optimization with business guardrails (customer experience, margin floors, allowable churn). Designed as modular APIs for real-time recalibration with model explainability features.

Result: Validated across heterogeneous SaaS portfolios, consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts.

Conclusion: The framework serves as a strategy playbook enabling companies to shift from flat to dynamic pricing, align pricing with CLV and MRR targets, and embed ethical guardrails for durable growth without eroding customer trust.

Abstract: This paper presents a marketing analytics framework that operationalizes subscription pricing as a dynamic, guardrailed decision system, uniting multivariate demand forecasting, segment-level price elasticity, and churn propensity to optimize revenue, margin, and retention. The approach blends seasonal time-series models with tree-based learners, runs Monte Carlo scenario tests to map risk envelopes, and solves a constrained optimization that enforces business guardrails on customer experience, margin floors, and allowable churn. Validated across heterogeneous SaaS portfolios, the method consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts. The system is designed for real-time recalibration via modular APIs and includes model explainability for governance and compliance. Managerially, the framework functions as a strategy playbook that clarifies when to shift from flat to dynamic pricing, how to align pricing with CLV and MRR targets, and how to embed ethical guardrails, enabling durable growth without eroding customer trust.

[407] PersonaLedger: Generating Realistic Financial Transactions with Persona Conditioned LLMs and Rule Grounded Feedback

Dehao Yuan, Tyler Farnan, Stefan Tesliuc, Doron L Bergman, Yulun Wu, Xiaoyu Liu, Minghui Liu, James Montgomery, Nam H Nguyen, C. Bayan Bruss, Furong Huang

Main category: cs.LG

TL;DR: PersonaLedger: An LLM-driven synthetic transaction generator that combines behavioral diversity from personas with programmatic financial rule enforcement to create realistic, privacy-preserving financial datasets.

DetailsMotivation: Strict privacy regulations limit access to real transaction data, hindering financial AI research. Existing synthetic data generators fail to achieve both behavioral diversity and logical groundedness - rule-based simulators lack human richness, while learning-based approaches violate financial constraints and require private training data.

Method: Uses a large language model conditioned on rich user personas to generate diverse transaction streams, coupled with an expert-configurable programmatic engine that maintains financial correctness. The LLM and engine interact in a closed loop: after each event, the engine updates user state, enforces financial rules, and returns a context-aware “nextprompt” to guide the LLM toward feasible next actions.

Result: Created a public dataset of 30 million transactions from 23,000 users and a benchmark suite with two tasks: illiquidity classification and identity theft segmentation. Provides a realistic, privacy-preserving resource with code, rules, and generation logs.

Conclusion: PersonaLedger offers a realistic, privacy-preserving resource that supports rigorous evaluation of forecasting and anomaly detection models, accelerating innovation in financial AI and enabling reproducible research without privacy concerns.

Abstract: Strict privacy regulations limit access to real transaction data, slowing open research in financial AI. Synthetic data can bridge this gap, but existing generators do not jointly achieve behavioral diversity and logical groundedness. Rule-driven simulators rely on hand-crafted workflows and shallow stochasticity, which miss the richness of human behavior. Learning-based generators such as GANs capture correlations yet often violate hard financial constraints and still require training on private data. We introduce PersonaLedger, a generation engine that uses a large language model conditioned on rich user personas to produce diverse transaction streams, coupled with an expert-configurable programmatic engine that maintains correctness. The LLM and engine interact in a closed loop: after each event, the engine updates the user state, enforces financial rules, and returns a context-aware “nextprompt” that guides the LLM toward feasible next actions. With this engine, we create a public dataset of 30 million transactions from 23,000 users and a benchmark suite with two tasks, illiquidity classification and identity theft segmentation. PersonaLedger offers the community a rich, realistic, and privacy-preserving resource – complete with code, rules, and generation logs – that supports rigorous, reproducible evaluation of forecasting and anomaly detection models and accelerates innovation in financial AI.
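
The closed loop can be caricatured with the LLM replaced by a stub (everything below, including the single no-overdraft rule and the prompt strings, is an assumption; it only shows the engine-in-the-loop control flow):

```python
import random

def llm_stub(prompt):          # stand-in for the persona-conditioned LLM
    return {"type": random.choice(["coffee", "rent", "salary"]),
            "amount": round(random.uniform(3, 1500), 2)}

balance, log = 1000.0, []
prompt = "start of day"
for _ in range(10):
    event = llm_stub(prompt)
    delta = event["amount"] if event["type"] == "salary" else -event["amount"]
    if balance + delta < 0:    # rule check fails: reject and re-prompt
        prompt = f"balance {balance:.2f}; choose an affordable next action"
        continue
    balance += delta           # engine updates user state
    log.append(event)
    prompt = f"balance {balance:.2f}; continue the persona's day"
print(len(log), "valid events | final balance:", round(balance, 2))
```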

[408] Causal Data Augmentation for Robust Fine-Tuning of Tabular Foundation Models

Magnus Bühler, Lennart Purucker, Frank Hutter

Main category: cs.LG

TL;DR: CausalMixFT improves fine-tuning of tabular foundation models under data scarcity by generating causally-informed synthetic samples, boosting performance and enabling more reliable early stopping.

DetailsMotivation: Fine-tuning tabular foundation models (TFMs) with limited data is challenging because early stopping on scarce validation data often fails to accurately assess generalization performance, leading to unreliable model selection.

Method: CausalMixFT generates structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on target datasets. It augments limited real data with causally-informed synthetic examples that preserve feature dependencies while expanding training diversity.

Result: Evaluated across 33 classification datasets and 2300+ fine-tuning runs, CausalMixFT improved median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming statistical generators like CTGAN, TabEBM, and TableAugment. It also reduced the validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable early stopping.

Conclusion: Incorporating causal structure into data augmentation provides an effective and principled approach for fine-tuning tabular foundation models in low-data regimes, improving both performance and fine-tuning stability.

Abstract: Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.
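
A minimal sketch of the augmentation idea under an assumed linear-Gaussian SCM and a known parent structure (the paper's SCM fitting procedure may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + 0.1 * rng.normal(size=n)
x2 = -x1 + 0.5 * x0 + 0.1 * rng.normal(size=n)
X = np.stack([x0, x1, x2], axis=1)
parents = {0: [], 1: [0], 2: [0, 1]}             # assumed causal graph

coefs, noise = {}, {}
for j in range(3):                               # fit SCM in topological order
    if parents[j]:
        A = X[:, parents[j]]
        coefs[j], *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        noise[j] = np.std(X[:, j] - A @ coefs[j])
    else:
        coefs[j], noise[j] = np.zeros(0), np.std(X[:, j])

m = 200                                          # structurally consistent samples
S = np.zeros((m, 3))
for j in range(3):
    mean = S[:, parents[j]] @ coefs[j] if parents[j] else 0.0
    S[:, j] = mean + noise[j] * rng.normal(size=m)
print("synthetic column means:", np.round(S.mean(axis=0), 2))
```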

[409] Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems

Mohammed Azeez Khan, Aaron D’Souza, Vijay Choyal

Main category: cs.LG

TL;DR: Active learning framework for efficient MLIP training using uncertainty and diversity sampling strategies from materials databases, achieving comparable accuracy with 5-13% fewer labeled samples.

DetailsMotivation: To reduce costly first-principles calculations for training machine-learned interatomic potentials by developing an efficient active learning framework that minimizes the need for expensive labeled data.

Method: Active learning framework that iteratively selects informative structures from Materials Project and OQMD using compositional/property descriptors with neural network ensemble. Compares four strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and hybrid approach.

Result: Diversity sampling achieves competitive or superior performance with 10.9% improvement on TiO2. Approach achieves equivalent accuracy with 5-13% fewer labeled samples than random baselines. Complete pipeline runs on Google Colab in under 4 hours per system using <8 GB RAM.

Conclusion: The framework democratizes MLIP development for resource-limited researchers and provides practical guidelines for data-efficient MLIP training. Integration with symmetry-aware architectures is identified as a promising future direction.

Abstract: Efficient materials discovery requires reducing costly first-principles calculations for training machine-learned interatomic potentials (MLIPs). We develop an active learning (AL) framework that iteratively selects informative structures from the Materials Project and Open Quantum Materials Database (OQMD) using compositional and property-based descriptors with a neural network ensemble model. Query-by-Committee enables real-time uncertainty quantification. We compare four strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach. Experiments across four material systems (C, Si, Fe, and TiO2) with 5 random seeds demonstrate that diversity sampling achieves competitive or superior performance, with 10.9% improvement on TiO2. Our approach achieves equivalent accuracy with 5-13% fewer labeled samples than random baselines. The complete pipeline executes on Google Colab in under 4 hours per system using less than 8 GB RAM, democratizing MLIP development for resource-limited researchers. Open-source code and configurations are available on GitHub. This multi-system evaluation provides practical guidelines for data-efficient MLIP training and highlights integration with symmetry-aware architectures as a promising future direction.
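
The diversity strategy can be sketched end to end in a few lines (a simplification with plain NumPy k-means; the descriptors, pool, and batch size are stand-ins):

```python
import numpy as np

def diversity_select(X, n_pick, n_clusters=4, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(iters):                       # plain k-means on descriptors
        assign = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(1)
        centers = np.stack([X[assign == c].mean(0) if (assign == c).any()
                            else centers[c] for c in range(n_clusters)])
    # seed picks: the pool point nearest each cluster center
    picks = [int(np.linalg.norm(X - c, axis=1).argmin()) for c in centers]
    while len(picks) < n_pick:                   # farthest-point refinement
        d = np.linalg.norm(X[:, None] - X[picks][None], axis=-1).min(axis=1)
        picks.append(int(d.argmax()))
    return picks[:n_pick]

pool = np.random.default_rng(1).normal(size=(300, 8))   # stand-in descriptors
print("structures to label next:", diversity_select(pool, n_pick=8))
```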

[410] When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei

Main category: cs.LG

TL;DR: Ensembling diffusion models improves score-matching loss but doesn’t consistently enhance perceptual quality metrics like FID on image datasets, with theoretical insights provided on model composition techniques.

DetailsMotivation: To investigate whether ensembling, a well-known technique for improving supervised models, provides tangible benefits for unconditional score-based diffusion models in generative modeling.

Method: Evaluated ensembling across various aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ datasets, investigated the link between score estimation and image quality, examined tabular data through random forests, and provided theoretical analysis of score model composition.

Result: While ensembling scores improves score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics like FID on image datasets. One aggregation strategy outperformed others on tabular data.

Conclusion: Ensembling diffusion models doesn’t reliably improve perceptual quality despite improving theoretical metrics, with theoretical insights shedding light on both ensembling and other model composition techniques like guidance.

Abstract: Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).
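
The basic operation being studied, aggregating several score estimates inside a sampler, fits in a short 1D toy (the biased "models", mean aggregation, and Langevin sampler below are our stand-ins for the paper's setups):

```python
import numpy as np

rng = np.random.default_rng(0)
# True score of N(0, 1) is -x; each ensemble member is biased differently.
score_models = [lambda x, b=b: -x + b for b in rng.normal(0, 0.3, size=5)]

def ensemble_score(x):
    return np.mean([s(x) for s in score_models], axis=0)   # aggregation rule: mean

x = rng.normal(0, 3, size=2000)              # start far from the target density
step = 0.05
for _ in range(500):                         # Langevin sampling with the avg score
    x += step * ensemble_score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
print("sample mean/std:", round(float(x.mean()), 2), round(float(x.std()), 2))
```

Averaging reduces the members' score error, mirroring the improved score-matching loss, but by itself says nothing about perceptual metrics, which is the gap the paper probes.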

[411] Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability

Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang

Main category: cs.LG

TL;DR: The paper addresses learning under unobservable feedback reliability, proposing metacognitive regulation with a Monitor-Trust-Regulator framework and self-diagnosis to infer experience credibility from internal dynamics.

DetailsMotivation: Standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs when feedback reliability is unobservable. This creates a fundamental challenge where systems must decide whether to learn from experiences, not just how to learn stably.

Method: Proposes metacognitive regulation with a Monitor-Trust-Regulator (MTR) decomposition, instantiated with self-diagnosis that maintains slowly varying experience-trust variables to softly modulate learning updates without needing exogenous reliability labels or explicit corruption models.

Result: Self-diagnosis improves epistemic identifiability in EIUR regimes. In RL, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it reveals that performance recovery doesn’t imply epistemic recovery - accuracy can rebound while beliefs remain locked-in by early misleading data.

Conclusion: MTR and self-diagnosis provide an organizing abstraction and concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability, addressing the critical challenge of epistemic identifiability.

Abstract: Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner’s own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner’s internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model. Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.
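
One minimal reading of the Monitor-Trust-Regulator loop, on a toy regression stream with periodically corrupted feedback (every constant and the gating form below are assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
w, trust, avg_err = 0.0, 1.0, 1.0
lr, beta = 0.5, 0.02                         # beta: slow trust adaptation

for step in range(2000):
    x = rng.normal()
    corrupted = (step % 5 == 0)              # reliability is never observed
    y = 3.0 * x + (rng.normal(0.0, 8.0) if corrupted else rng.normal(0.0, 0.1))
    err = w * x - y
    surprise = abs(err) / (avg_err + 1e-8)   # Monitor: endogenous evidence
    gate = np.exp(-surprise)                 # per-experience credibility proxy
    trust = (1 - beta) * trust + beta * gate # Trust: slowly varying state
    avg_err = 0.99 * avg_err + 0.01 * abs(err)
    w -= lr * trust * gate * err * x         # Regulator: credibility-scaled update
print("learned slope (true value 3.0):", round(float(w), 2))
```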

[412] Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Piyush Singh Pasi

Main category: cs.LG

TL;DR: M2M is a lightweight alignment method that uses only English text to map multilingual text embeddings into multimodal space, achieving strong zero-shot transfer across 11 languages for text-to-image retrieval.

DetailsMotivation: Multimodal models perform well in English but poorly in other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized.

Method: M2M learns only a few linear layers using English text alone to map multilingual text embeddings into multimodal space. It’s a lightweight alignment approach that transforms embedding geometry rather than performing trivial rotations.
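
As a rough illustration of the idea, under stated assumptions (synthetic embeddings and a single least-squares linear map; the actual method trains a few layers on real English text paired with multimodal embeddings):

```python
import numpy as np

# Sketch of M2M-style alignment on synthetic data (illustrative only).
rng = np.random.default_rng(9)
d_text, d_mm, n = 384, 512, 2000
T_en = rng.normal(size=(n, d_text))       # multilingual-encoder embeddings of English text
target = rng.normal(size=(d_text, d_mm))  # unknown geometry of the multimodal space
M_mm = T_en @ target + 0.01 * rng.normal(size=(n, d_mm))  # paired multimodal embeddings

# Fit the linear alignment layer with least squares, using English pairs only.
W, *_ = np.linalg.lstsq(T_en, M_mm, rcond=None)

# Zero-shot use: the same map is applied to non-English text embeddings,
# which share the multilingual encoder's geometry.
T_other = rng.normal(size=(5, d_text))
aligned = T_other @ W
```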

Result: M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. It also demonstrates robustness across datasets and tasks including Audio-Text retrieval and Text-to-Image generation.

Conclusion: M2M provides an effective lightweight solution for multilingual multimodal alignment that leverages existing multilingual text models without requiring extensive multimodal training data in multiple languages. The method shows promising zero-shot transfer capabilities and the released datasets facilitate further research.

Abstract: Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers–using English text alone–to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, M2M demonstrates robustness across datasets and tasks, extending to Audio-Text retrieval and Text-to-Image generation. We release code and checkpoints (https://github.com/piyushsinghpasi/M2M) along with multilingual evaluation datasets: MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual).

[413] jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation

Ho Fung Tsoi, Dylan Rankin

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.11719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[414] ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Aryan Karmore

Main category: cs.LG

TL;DR: ButterflyMoE reduces memory scaling from O(N·d²) to O(d² + N·d log d) by treating experts as geometric reorientations of a shared quantized substrate instead of independent weight matrices.

DetailsMotivation: Current MoE models require O(N·d²) memory for N experts, which exceeds edge device memory budgets. Existing compression methods (quantization, pruning, low-rank) only reduce constant factors but don't solve the fundamental linear scaling bottleneck.

Method: Experts are treated as geometric reorientations of a unified shared quantized substrate. Diversity comes from viewing different angles of shared capacity via learned rotations applied to a shared ternary prototype, rather than storing redundant weight matrices.
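
A minimal sketch of this parameterization follows, assuming one rotation angle per 2x2 butterfly pair; shapes, names, and training details are illustrative, not the paper’s code.

```python
import numpy as np

# ButterflyMoE idea (assumed form): every expert shares one ternary prototype
# and differs only by a learned butterfly rotation, so storage is O(d^2) for
# the prototype plus O(d log d) per expert.

d = 8  # model dimension (power of two for a clean butterfly)
rng = np.random.default_rng(0)
prototype = rng.choice([-1.0, 0.0, 1.0], size=(d, d))  # shared ternary substrate

def butterfly_factors(d, rng):
    """One set of log2(d) butterfly factors; each factor stores O(d) params."""
    factors = []
    for level in range(int(np.log2(d))):
        block = 2 ** (level + 1)
        theta = rng.normal(size=d // 2)  # one rotation angle per 2x2 pair
        factors.append((block, theta))
    return factors

def apply_butterfly(x, factors):
    """Apply the product of sparse butterfly rotations to vector x."""
    x = x.copy()
    for block, theta in factors:
        half = block // 2
        k = 0
        for start in range(0, len(x), block):
            for i in range(half):
                a, b = start + i, start + i + half
                c, s = np.cos(theta[k]), np.sin(theta[k])
                x[a], x[b] = c * x[a] - s * x[b], s * x[a] + c * x[b]
                k += 1
    return x

# Expert forward pass: rotate the input, then apply the shared prototype.
expert_factors = [butterfly_factors(d, rng) for _ in range(4)]  # 4 experts
x = rng.normal(size=d)
outputs = [prototype @ apply_butterfly(x, f) for f in expert_factors]
```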

Result: Achieves 150× memory reduction at 256 experts with negligible accuracy loss across language modeling benchmarks. Enables multiple experts to fit on edge-constrained devices.

Conclusion: Geometric parameterization breaks linear memory scaling in MoE models, showing that treating experts as rotations of shared capacity rather than independent matrices enables sub-linear memory scaling suitable for edge devices.

Abstract: Mixture-of-Experts models store $N$ independent expert weight matrices, requiring $\mathcal{O}(N \cdot d^2)$ memory, which exceeds the memory budgets of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, the full expert set requires only $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight: training these rotations with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.

[415] DRGW: Learning Disentangled Representations for Robust Graph Watermarking

Jiasen Li, Yanwei Liu, Zhuoyi Shang, Xiaoyan Gu, Weiping Wang

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.13569: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13569&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[416] Online Statistical Inference for Contextual Bandits via Stochastic Gradient Descent

Xiangyu Chang, Xi Chen, Zehua Lai, He Li, Zhihong Liu, Yichen Zhang

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2212.14883: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2212.14883&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[417] A Finite Expression Method for Solving High-Dimensional Committor Problems

Zezheng Song, Maria K. Cameron, Haizhao Yang

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2306.12268: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2306.12268&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[418] Learning minimal representations of stochastic processes with variational autoencoders

Gabriel Fernández-Fernández, Carlo Manzo, Maciej Lewenstein, Alexandre Dauphin, Gorka Muñoz-Gil

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2307.11608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2307.11608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[419] Dynamic angular synchronization under smoothness constraints

Ernesto Araya, Mihai Cucuringu, Hemant Tyagi

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2406.04071: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.04071&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[420] P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction

Yeasir Rayhan, Walid G. Aref

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2411.02933: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.02933&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[421] Last-iterate Convergence for Symmetric, General-sum, $2 \times 2$ Games Under The Exponential Weights Dynamic

Guanghui Wang, Krishna Acharya, Lokranjan Lakshmikanthan, Juba Ziani, Vidya Muthukumar

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2502.08063: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.08063&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[422] Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios

Siyi Wang, Alexandre Leblanc, Paul D. McNicholas

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2505.09516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[423] New Perspectives on the Polyak Stepsize: Surrogate Functions and Negative Results

Francesco Orabona, Ryan D’Orazio

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2505.20219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[424] Whitening Spherical Gaussian Mixtures in the Large-Dimensional Regime

Mohammed Racim Moussa Boudjemaa, Alper Kalle, Xiaoyi Mai, José Henrique de Morais Goulart, Cédric Févotte

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2509.17636: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.17636&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[425] A Configuration-First Framework for Reproducible, Low-Code Localization

Tim Strnad, Blaž Bertalanič, Carolina Fortuna

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2510.25692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[426] Which Similarity-Sensitive Entropy (S-entropy)?

Phuc Nguyen, Josiah Couch, Rahul Bansal, Alexandra Morgan, Chris Tam, Miao Li, Rima Arnaout, Ramy Arnaout

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2511.03849: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03849&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[427] Causal Regime Detection in Energy Markets With Augmented Time Series Structural Causal Models

Dennis Thumm

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2511.04361: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04361&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[428] Context-aware Learned Mesh-based Simulation via Trajectory-Level Meta-Learning

Philipp Dahlinger, Niklas Freymuth, Tai Hoang, Tobias Würth, Michael Volpp, Luise Kärger, Gerhard Neumann

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2511.05234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[429] Cluster-Based Generalized Additive Models Informed by Random Fourier Features

Xin Huang, Jia Li, Jun Yu

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2512.19373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[430] Physics-Informed Singular-Value Learning for Cross-Covariances Forecasting in Financial Markets

Efstratios Manolakis, Christian Bongiorno, Rosario Nunzio Mantegna

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.07687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[431] RMBRec: Robust Multi-Behavior Recommendation towards Target Behaviors

Miaomiao Cai, Zhijie Zhang, Junfeng Fang, Zhiyong Cheng, Xiang Wang, Meng Wang

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.08705: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08705&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[432] Adversarial Drift-Aware Predictive Transfer: Toward Durable Clinical AI

Xin Xiong, Zijian Guo, Haobo Zhu, Chuan Hong, Jordan W Smoller, Tianxi Cai, Molei Liu

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.11860: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11860&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[433] On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Sharan Sahu, Cameron J. Hogan, Martin T. Wells

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.12238: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12238&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[434] Deterministic and probabilistic neural surrogates of global hybrid-Vlasov simulations

Daniel Holmberg, Ivan Zaitsev, Markku Alho, Ioanna Bouri, Fanni Franssila, Haewon Jeong, Minna Palmroth, Teemu Roos

Main category: cs.LG

TL;DR: Summary unavailable. The arXiv API request returned HTTP 429 (rate limited); retry after the limit resets or access the paper directly on arXiv.

Abstract: Failed to fetch summary for 2601.12614: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12614&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[435] From Agent Simulation to Social Simulator: A Comprehensive Review (Part 2)

Xiao Xue, Deyu Zhou, Ming Zhang, Xiangning Yu, Fei-Yue Wang

Main category: cs.MA

TL;DR: Computational experiments address ABM’s limitations by enabling counterfactual analysis through systematic variable manipulation for causal inference in complex systems.

DetailsMotivation: Traditional Agent-Based Modeling (ABM) emphasizes simulation over experimentation, limiting its ability to uncover governing operational principles and provide causal explanations for system complexity.

Method: Computational experiments using counterfactual analysis - creating parallel worlds with alternative evolutionary paths by systematically adjusting input variables and observing output changes.

Result: Provides robust causal inference tools that overcome ABM limitations, offering deeper insights into system dynamics and complexity.

Conclusion: Computational experiments complement ABM by enabling systematic causal analysis of complex systems, laying a foundation for understanding system evolution through counterfactual experimentation.

Abstract: The study of system complexity primarily has two objectives: to explore underlying patterns and to develop theoretical explanations. Pattern exploration seeks to clarify the mechanisms behind the emergence of system complexity, while theoretical explanations aim to identify the fundamental causes of this complexity. Laws are generally defined as mappings between variables, whereas theories offer causal explanations of system behavior. Agent-Based Modeling (ABM) is an important approach for studying complex systems, but it tends to emphasize simulation over experimentation. As a result, ABM often struggles to deeply uncover the governing operational principles. Unlike conventional scenario analysis that relies on human reasoning, computational experiments emphasize counterfactual experiments, that is, creating parallel worlds that simulate alternative “evolutionary paths” of real-world events. By systematically adjusting input variables and observing the resulting changes in output variables, computational experiments provide a robust tool for causal inference, thereby addressing the limitations of traditional ABM. Together, these methods offer causal insights into the dynamic evolution of systems. This part helps readers gain a preliminary understanding of the computational experiment method as a whole, laying the foundation for subsequent study.

[436] Predicting Long-Term Self-Rated Health in Small Areas Using Ordinal Regression and Microsimulation

Seán Caulfield Curley, Karl Mason, Patrick Mannion

Main category: cs.MA

TL;DR: Microsimulation model predicts future self-rated health in Ireland using socio-economic characteristics at Electoral Division level, with ordinal regression and alignment techniques to match real data.

DetailsMotivation: To enable local authorities to predict and address future health issues by modeling health status at granular geographical scales, particularly important given Ireland's ageing population and its potential impact on overall health outcomes.

Method: Open-source microsimulation projects Ireland’s population with demographic/socio-economic characteristics at Electoral Division level. Uses ordinal regression to predict self-rated health based on socio-economic factors, with alignment technique to correct for differences between health microdata and national data distributions.
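
A hedged sketch of the modeling step on synthetic data follows; the covariates, category definitions, and alignment rule below are assumptions for illustration, not the paper’s specification.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Illustrative sketch (synthetic data): fit an ordinal regression of
# self-rated health on socio-economic covariates, then align predicted
# category probabilities to a known national distribution.

rng = np.random.default_rng(1)
n = 2000
X = pd.DataFrame({"age": rng.integers(18, 90, n),
                  "income": rng.normal(40.0, 12.0, n)})
latent = -0.03 * X["age"] + 0.05 * X["income"] + rng.logistic(size=n)
health = pd.cut(latent, bins=5, labels=False).to_numpy() + 1  # 1 = very bad .. 5 = very good

res = OrderedModel(health, X, distr="logit").fit(method="bfgs", disp=False)
probs = np.asarray(res.predict(X))  # n x 5 matrix of category probabilities

# Assumed alignment step: rescale each category so the simulated marginal
# matches national targets, then renormalise per individual.
national = np.array([0.03, 0.07, 0.20, 0.40, 0.30])  # hypothetical targets
adjusted = probs * (national / probs.mean(axis=0))
adjusted /= adjusted.sum(axis=1, keepdims=True)
predicted = adjusted.argmax(axis=1) + 1
```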

Result: Ordinal regression method matches well to Ireland’s 2022 health status distribution. For one future population scenario, ageing effects may outweigh socio-economic improvements, slightly worsening Ireland’s mean self-rated health.

Conclusion: Granular health modeling at Electoral Division level provides valuable tool for local authorities to predict and combat future health issues, with ageing population identified as a key factor potentially offsetting socio-economic improvements.

Abstract: This paper presents an approach for predicting the self-rated health of individuals in a future population utilising the individuals’ socio-economic characteristics. An open-source microsimulation is used to project Ireland’s population into the future where each individual is defined by a number of demographic and socio-economic characteristics. The model is disaggregated spatially at the Electoral Division level, allowing for analysis of results at that, or any broader, geographical scale. Ordinal regression is utilised to predict an individual’s self-rated health based on their socio-economic characteristics, and this method is shown to match well to Ireland’s 2022 distribution of health statuses. Due to differences in the health status distributions of the health microdata and the national data, an alignment technique is proposed to bring predictions closer to real values. For one potential future population, it is illustrated that the effects of an ageing population may outweigh other improvements in socio-economic outcomes, slightly worsening Ireland’s mean self-rated health. Health modelling at this kind of granular scale could offer local authorities a chance to predict and combat health issues which may arise in their local populations in the future.

[437] MARBLE: Multi-Agent Reasoning for Bioinformatics Learning and Evolution

Sunghyun Kim, Seokwoo Yun, Youngseo Yun, Youngrak Lee, Sangsoo Lim

Main category: cs.MA

TL;DR: MARBLE is an autonomous framework that uses multi-agent debate and performance-grounded reasoning to iteratively improve bioinformatics models, achieving stable, sustained performance gains across multiple refinement cycles.

DetailsMotivation: Traditional bioinformatics model development is slow, labor-intensive, and difficult to reproduce due to manual cycles of hypothesis formulation, architectural redesign, and empirical validation. Existing LLM-based assistants lack performance-grounded reasoning and stability-aware mechanisms needed for reliable, iterative model improvement.

Method: MARBLE combines literature-aware reference selection with structured, debate-driven architectural reasoning among role-specialized agents, followed by autonomous execution, evaluation, and memory updates explicitly grounded in empirical performance metrics.

Result: MARBLE consistently achieves sustained performance improvements over strong baselines across spatial transcriptomics domain segmentation, drug-target interaction prediction, and drug response prediction tasks, while maintaining high execution robustness and low regression rates across multiple refinement cycles.

Conclusion: Structured debate, balanced evidence selection, and performance-grounded memory are critical for stable, repeatable model evolution rather than single-run or brittle gains, demonstrating MARBLE’s effectiveness as an autonomous model refinement framework for bioinformatics.

Abstract: Motivation: Developing high-performing bioinformatics models typically requires repeated cycles of hypothesis formulation, architectural redesign, and empirical validation, making progress slow, labor-intensive, and difficult to reproduce. Although recent LLM-based assistants can automate isolated steps, they lack performance-grounded reasoning and stability-aware mechanisms required for reliable, iterative model improvement in bioinformatics workflows. Results: We introduce MARBLE, an execution-stable autonomous model refinement framework for bioinformatics models. MARBLE couples literature-aware reference selection with structured, debate-driven architectural reasoning among role-specialized agents, followed by autonomous execution, evaluation, and memory updates explicitly grounded in empirical performance. Across spatial transcriptomics domain segmentation, drug-target interaction prediction, and drug response prediction, MARBLE consistently achieves sustained performance improvements over strong baselines across multiple refinement cycles, while maintaining high execution robustness and low regression rates. Framework-level analyses demonstrate that structured debate, balanced evidence selection, and performance-grounded memory are critical for stable, repeatable model evolution, rather than single-run or brittle gains. Availability: Source code, data and Supplementary Information are available at https://github.com/PRISM-DGU/MARBLE.

[438] If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence

Gopal Vijayaraghavan, Prasanth Jayachandran, Arun Murthy, Sunil Govindan, Vivek Subramanian

Main category: cs.MA

TL;DR: Teams of rival AI agents with strict roles and opposing incentives catch errors before they reach users, achieving 90%+ error interception with acceptable latency tradeoffs.

DetailsMotivation: AI agents are fallible like humans - they make mistakes, have biases, and lack transparency. Instead of seeking perfect AI components, the paper proposes creating safe, productive working environments for AI agents by applying corporate organizational principles to AI teams.

Method: Architecture with specialized agent teams (planners, executors, critics, experts) organized into rival teams with clear goals. Uses remote code execution to separate data transformations/tool invocations from reasoning models. Agents write code that executes remotely, with only relevant summaries returning to agent context, preventing raw data contamination of context windows.
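
The separation the authors describe might look like the following sketch (assumed design, not their implementation): the agent authors a script, an executor runs it out of process, and only the printed summary re-enters the agent’s context.

```python
import subprocess
import sys
import textwrap

def run_remotely(agent_code: str, timeout_s: int = 30) -> str:
    """Execute agent-authored code out of process; only stdout returns."""
    result = subprocess.run([sys.executable, "-c", agent_code],
                            capture_output=True, text=True, timeout=timeout_s)
    return result.stdout

agent_code = textwrap.dedent("""
    # The agent's script does the heavy lifting and prints ONLY a summary,
    # so raw data never contaminates the agent's context window.
    rows = [{"price": p} for p in range(1000)]  # stand-in for a big API response
    mean = sum(r["price"] for r in rows) / len(rows)
    print(f"fetched {len(rows)} rows; mean price = {mean:.2f}")
""")

summary = run_remotely(agent_code)  # the only text returned to the agent
```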

Result: Achieves over 90% internal error interception before user exposure while maintaining acceptable latency tradeoffs. A survey of system traces shows the approach trades cost and latency for correctness while incrementally expanding capabilities without impacting existing ones.

Conclusion: Reliability can be achieved through careful orchestration of imperfect AI components rather than seeking perfect ones. The team-of-rivals approach with strict role separation and remote execution architecture provides a practical framework for safe, productive AI systems.

Abstract: AI Agents can perform complex operations at great speed, but just like all the humans we have ever hired, their intelligence remains fallible. Miscommunications aren’t noticed, systemic biases have no counter-action, and inner monologues are rarely written down. We did not come to fire them for their mistakes, but to hire them and provide a safe productive working environment. We posit that we can reuse a common corporate organizational structure: teams of independent AI agents with strict role boundaries can work with common goals, but opposing incentives. Multiple models serving as a team of rivals can catch and minimize errors within the final product at a small cost to the velocity of actions. In this paper we demonstrate that we can achieve reliability without acquiring perfect components, but through careful orchestration of imperfect ones. This paper describes the architecture of such a system in practice: specialized agent teams (planners, executors, critics, experts), organized into an organization with clear goals, coordinated through a remote code executor that keeps data transformations and tool invocations separate from reasoning models. Rather than agents directly calling tools and ingesting full responses, they write code that executes remotely; only relevant summaries return to agent context. By preventing raw data and tool outputs from contaminating context windows, the system maintains clean separation between perception (brains that plan and reason) and execution (hands that perform heavy data transformations and API calls). We demonstrate the approach achieves over 90% internal error interception prior to user exposure while maintaining acceptable latency tradeoffs. A survey from our traces shows that we only trade off cost and latency to achieve correctness and incrementally expand capabilities without impacting existing ones.

[439] Agent Identity URI Scheme: Topology-Independent Naming and Capability-Based Discovery for Multi-Agent Systems

Roland R. Rodriguez

Main category: cs.MA

TL;DR: The paper proposes the agent:// URI scheme that decouples agent identity from network location, enabling stable references, capability-based discovery, and cross-provider migration without breaking references.

DetailsMotivation: Current multi-agent systems have a fundamental flaw where agent identity is bound to network location (URI-based identity). When agents migrate between providers, scale across instances, or federate across organizations, this breaks references, fragments audit trails, and requires centralized coordination.

Method: The agent:// URI scheme with three orthogonal components: 1) trust root establishing organizational authority, 2) hierarchical capability path for semantic discovery, and 3) sortable unique identifier for stable reference. Uses DHT key derivation for capability-based discovery, trust-root scoping to prevent cross-organization pollution, and cryptographic attestation via PASETO tokens to bind capability claims to agent identity.
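
A sketch of how such identifiers could be parsed and turned into discovery keys follows; the exact grammar, key-derivation function, and identifier format are assumptions here, not the published specification.

```python
import hashlib
from urllib.parse import urlparse

# Assumed layout: agent://<trust-root>/<capability/path>/<sortable-id>

def parse_agent_uri(uri: str):
    parsed = urlparse(uri)
    assert parsed.scheme == "agent"
    parts = parsed.path.strip("/").split("/")
    trust_root = parsed.netloc               # organizational authority
    capability_path = "/".join(parts[:-1])   # hierarchical capability
    agent_id = parts[-1]                     # sortable unique identifier
    return trust_root, capability_path, agent_id

def dht_key(trust_root: str, capability_path: str) -> bytes:
    # Scoping the hash by trust root keeps one organization's capability
    # namespace from polluting another's, as described above.
    return hashlib.sha256(f"{trust_root}|{capability_path}".encode()).digest()

root, cap, aid = parse_agent_uri("agent://example.org/finance/invoicing/01HZX4D2M8Q9")
key = dht_key(root, cap)  # lookup key for agents advertising this capability
```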

Result: Evaluation shows: capability expressiveness (100% coverage on 369 production tools with zero collision), discovery precision (F1=1.0 across 10,000 agents), identity stability (formal proofs of migration invariance), and performance (all operations under 5 microseconds).

Conclusion: The agent:// URI scheme provides a formally-specified, practically-evaluated foundation for decentralized agent identity and capability-based discovery that solves the fundamental architectural flaw of binding identity to network location.

Abstract: Multi-agent systems face a fundamental architectural flaw: agent identity is bound to network location. When agents migrate between providers, scale across instances, or federate across organizations, URI-based identity schemes break references, fragment audit trails, and require centralized coordination. We propose the agent:// URI scheme, which decouples identity from topology through three orthogonal components: a trust root establishing organizational authority, a hierarchical capability path enabling semantic discovery, and a sortable unique identifier providing stable reference. The scheme enables capability-based discovery through DHT key derivation, where queries return agents by what they do rather than where they are. Trust-root scoping prevents cross-organization pollution while permitting federation when desired. Cryptographic attestation via PASETO tokens binds capability claims to agent identity, enabling verification without real-time contact with the issuing authority. We evaluate the scheme across four dimensions: capability expressiveness (100% coverage on 369 production tools with zero collision), discovery precision (F1=1.0 across 10,000 agents), identity stability (formal proofs of migration invariance), and performance (all operations under 5 microseconds). The agent:// URI scheme provides a formally-specified, practically-evaluated foundation for decentralized agent identity and capability-based discovery.

[440] INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu, Junchi Yan, Jing Shao

Main category: cs.MA

TL;DR: INFA-Guard is a novel defense framework for LLM-based Multi-Agent Systems that identifies and handles infected agents (benign agents converted by attackers) as a distinct threat category, reducing Attack Success Rate by 33% on average.

DetailsMotivation: Current LLM-based Multi-Agent Systems have security vulnerabilities where malicious influence spreads virally through inter-agent communication. Existing defenses use binary paradigms that only distinguish between benign and attack agents, failing to account for infected agents that have been converted by attackers.

Method: INFA-Guard uses infection-aware detection and topological constraints to accurately localize attack sources and infected ranges. During remediation, it replaces attackers and rehabilitates infected agents, preventing malicious propagation while preserving topological integrity.
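
A toy sketch of the localize-and-remediate step follows; the detection signal and graph mechanics are assumed for illustration only.

```python
from collections import deque

# Assumed mechanics: once an attack agent is flagged, walk the communication
# topology downstream to bound the infected range, then replace the attacker
# and rehabilitate (reset) infected agents.

edges = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}  # who talks to whom
state = {agent: "benign" for agent in edges}

def localize_and_remediate(attacker):
    state[attacker] = "replaced"
    frontier = deque(edges[attacker])
    while frontier:                    # topological constraint: only agents
        agent = frontier.popleft()     # downstream of the attacker can be infected
        if state[agent] == "benign":
            state[agent] = "rehabilitated"
            frontier.extend(edges[agent])

localize_and_remediate("A")
print(state)  # {'A': 'replaced', 'B': 'rehabilitated', 'C': 'rehabilitated', 'D': 'rehabilitated'}
```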

Result: Extensive experiments show INFA-Guard achieves state-of-the-art performance, reducing Attack Success Rate by an average of 33%. It also demonstrates cross-model robustness, superior topological generalization, and high cost-effectiveness.

Conclusion: INFA-Guard provides an effective defense framework for LLM-based Multi-Agent Systems by explicitly addressing the infected agent threat category, offering improved security through infection-aware detection and remediation while maintaining system integrity.

Abstract: The rapid advancement of Large Language Model (LLM)-based Multi-Agent Systems (MAS) has introduced significant security vulnerabilities, where malicious influence can propagate virally through inter-agent communication. Conventional safeguards often rely on a binary paradigm that strictly distinguishes between benign and attack agents, failing to account for infected agents, i.e., benign entities converted by attack agents. In this paper, we propose Infection-Aware Guard (INFA-Guard), a novel defense framework that explicitly identifies and addresses infected agents as a distinct threat category. By leveraging infection-aware detection and topological constraints, INFA-Guard accurately localizes attack sources and infected ranges. During remediation, INFA-Guard replaces attackers and rehabilitates infected ones, avoiding malicious propagation while preserving topological integrity. Extensive experiments demonstrate that INFA-Guard achieves state-of-the-art performance, reducing the Attack Success Rate (ASR) by an average of 33%, while exhibiting cross-model robustness, superior topological generalization, and high cost-effectiveness.

[441] Game-Theoretic Lens on LLM-based Multi-Agent Systems

Jianing Hao, Han Ding, Yuanjian Xu, Tianze Sun, Ran Chen, Wanbo Zhang, Guang Zhang, Siguang Li

Main category: cs.MA

TL;DR: This paper provides a comprehensive survey of LLM-based multi-agent systems through a game-theoretic framework, organizing research around players, strategies, payoffs, and information to establish a systematic foundation for the field.

DetailsMotivation: While LLMs show strong capabilities as autonomous agents, multi-agent systems composed of interacting LLMs have emerged as a powerful paradigm for studying social dynamics and strategic behaviors. However, current research is fragmented and lacks a unifying theoretical foundation.

Method: The authors conduct a comprehensive survey of LLM-based multi-agent systems using a game-theoretic lens. They organize existing studies around the four key elements of game theory: players, strategies, payoffs, and information.

Result: The paper establishes a systematic framework for understanding, comparing, and guiding future research on the design and analysis of LLM-based multi-agent systems, providing a unifying theoretical foundation for the fragmented field.

Conclusion: The game-theoretic framework offers a powerful approach to study LLM-based multi-agent systems, enabling better understanding of social dynamics and strategic behaviors while providing a structured foundation for future research in this emerging paradigm.

Abstract: Large language models (LLMs) have demonstrated strong reasoning, planning, and communication abilities, enabling them to operate as autonomous agents in open environments. While single-agent systems remain limited in adaptability and coordination, recent progress has shifted attention toward multi-agent systems (MAS) composed of interacting LLMs that pursue cooperative, competitive, or mixed objectives. This emerging paradigm provides a powerful testbed for studying social dynamics and strategic behaviors among intelligent agents. However, current research remains fragmented and lacks a unifying theoretical foundation. To address this gap, we present a comprehensive survey of LLM-based multi-agent systems through a game-theoretic lens. By organizing existing studies around the four key elements of game theory: players, strategies, payoffs, and information, we establish a systematic framework for understanding, comparing, and guiding future research on the design and analysis of LLM-based MAS.

[442] From Who They Are to How They Act: Behavioral Traits in Generative Agent-Based Models of Social Media

Valerio La Gatta, Gian Marco Orlando, Marco Perillo, Ferdinando Tammaro, Vincenzo Moscato

Main category: cs.MA

TL;DR: GABM needs behavioral traits beyond demographics/personality to create realistic social media agent engagement patterns and content propagation dynamics.

DetailsMotivation: Current GABM frameworks lack mechanisms to encode agents' behavioral dispositions toward platform actions, causing homogeneous engagement patterns rather than the differentiated participation styles observed on real social media platforms.

Method: Introduce behavioral traits as an explicit characterization layer to regulate agents’ propensities across posting, re-sharing, commenting, reacting, and inactivity. Conduct large-scale simulations with 980 agents and validate against real-world social media data.
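
A minimal sketch of such a characterization layer, with assumed trait names and propensity values:

```python
import numpy as np

# Assumed parameterization of a behavioral-traits layer: each agent carries
# per-action propensities that re-weight the platform actions it may take
# at every simulation step.

ACTIONS = ["post", "reshare", "comment", "react", "inactive"]

def sample_action(traits: dict, rng) -> str:
    """Draw one platform action from trait-weighted propensities."""
    weights = np.array([traits[a] for a in ACTIONS], dtype=float)
    return rng.choice(ACTIONS, p=weights / weights.sum())

rng = np.random.default_rng(42)
lurker = {"post": 0.05, "reshare": 0.10, "comment": 0.10, "react": 0.25, "inactive": 0.50}
amplifier = {"post": 0.20, "reshare": 0.50, "comment": 0.10, "react": 0.15, "inactive": 0.05}

print([sample_action(lurker, rng) for _ in range(5)])     # mostly reactions/inactivity
print([sample_action(amplifier, rng) for _ in range(5)])  # mostly posts/reshares
```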

Result: Behavioral traits are essential to sustain heterogeneous, profile-consistent participation patterns and enable realistic content propagation dynamics through the interplay of amplification- and interaction-oriented profiles.

Conclusion: Modeling how agents act—not only who they are—is necessary for advancing GABM as a tool for studying social media phenomena.

Abstract: Generative Agent-Based Modeling (GABM) leverages Large Language Models to create autonomous agents that simulate human behavior in social media environments, demonstrating potential for modeling information propagation, influence processes, and network phenomena. While existing frameworks characterize agents through demographic attributes, personality traits, and interests, they lack mechanisms to encode behavioral dispositions toward platform actions, causing agents to exhibit homogeneous engagement patterns rather than the differentiated participation styles observed on real platforms. In this paper, we investigate the role of behavioral traits as an explicit characterization layer to regulate agents’ propensities across posting, re-sharing, commenting, reacting, and inactivity. Through large-scale simulations involving 980 agents and validation against real-world social media data, we demonstrate that behavioral traits are essential to sustain heterogeneous, profile-consistent participation patterns and enable realistic content propagation dynamics through the interplay of amplification- and interaction-oriented profiles. Our findings establish that modeling how agents act, not only who they are, is necessary for advancing GABM as a tool for studying social media phenomena.

[443] Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

Erica Coppolillo, Luca Luceri, Emilio Ferrara

Main category: cs.MA

TL;DR: LLM agents on AI social platforms adopt toxic behavior when exposed to harmful content, with cumulative exposure increasing toxicity likelihood, enabling prediction from exposure patterns.

DetailsMotivation: While LLM toxicity generation is known, little is understood about how exposure to harmful content shapes agent behavior over time in AI-only social environments, especially as these agents may interact with humans.

Method: Large-scale empirical analysis of LLM-driven agents on Chirper.ai, modeling interactions as stimuli (posts) and responses (comments), examining toxicity relationships, cumulative exposure effects, and predictive modeling of toxic behavior.
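
The prediction-from-exposure finding can be illustrated with a simple sketch on synthetic data; the counts, link function, and classifier below are assumptions, not the paper’s analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the stimulus/response framing: predict whether an agent
# eventually produces toxic content from the raw count of toxic stimuli
# it has been exposed to.

rng = np.random.default_rng(7)
n_agents = 1000
toxic_stimuli = rng.poisson(lam=5, size=n_agents)        # exposure counts
p_toxic = 1 / (1 + np.exp(-(0.4 * toxic_stimuli - 3)))   # synthetic link
became_toxic = rng.random(n_agents) < p_toxic

clf = LogisticRegression().fit(toxic_stimuli.reshape(-1, 1), became_toxic)
print(clf.predict_proba([[2], [10]])[:, 1])  # risk at low vs. high exposure
```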

Result: Toxic responses more likely after toxic stimuli; cumulative toxic exposure significantly increases toxic response probability; strong negative correlation between induced and spontaneous toxicity; toxic stimuli count alone accurately predicts eventual toxic content production.

Conclusion: Exposure is a critical risk factor for LLM agent deployment, potentially triggering hate-speech propagation and cyberbullying; monitoring toxic exposure provides a lightweight mechanism for auditing and mitigating harmful behavior.

Abstract: Large Language Models (LLMs) are increasingly embedded in autonomous agents that engage, converse, and co-evolve in online social platforms. While prior work has documented the generation of toxic content by LLMs, far less is known about how exposure to harmful content shapes agent behavior over time, particularly in environments composed entirely of interacting AI agents. In this work, we study toxicity adoption of LLM-driven agents on Chirper.ai, a fully AI-driven social platform. Specifically, we model interactions in terms of stimuli (posts) and responses (comments). We conduct a large-scale empirical analysis of agent behavior, examining how toxic responses relate to toxic stimuli, how repeated exposure to toxicity affects the likelihood of toxic responses, and whether toxic behavior can be predicted from exposure alone. Our findings show that toxic responses are more likely following toxic stimuli, and, at the same time, cumulative toxic exposure (repeated over time) significantly increases the probability of toxic responding. We further introduce two influence metrics, revealing a strong negative correlation between induced and spontaneous toxicity. Finally, we show that the number of toxic stimuli alone enables accurate prediction of whether an agent will eventually produce toxic content. These results highlight exposure as a critical risk factor in the deployment of LLM agents, particularly as such agents operate in online environments where they may engage not only with other AI chatbots, but also with human counterparts. This could trigger unwanted and pernicious phenomena, such as hate-speech propagation and cyberbullying. In an effort to reduce such risks, monitoring exposure to toxic content may provide a lightweight yet effective mechanism for auditing and mitigating harmful behavior in the wild.

cs.MM

[444] Structured Image-based Coding for Efficient Gaussian Splatting Compression

Pedro Martin, Antonio Rodrigues, Joao Ascenso, Maria Paula Queluz

Main category: cs.MM

TL;DR: GSICO is a novel compression method for Gaussian Splatting models that maps GS parameters into structured images for efficient encoding using conventional image codecs, achieving 20.2x compression with minimal quality loss.

DetailsMotivation: Gaussian Splatting models require storing millions of parameters, leading to large file sizes that limit their practical use in multimedia systems. There's a need for efficient compression while preserving visual fidelity.

Method: GSICO uses a mapping procedure to arrange GS parameters into structured images, guided by a novel algorithm that enhances spatial coherence. These parameter images are then encoded using conventional image codecs.
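
A rough sketch of the mapping step follows, with an assumed sort key and channel layout standing in for the paper’s coherence-enhancing algorithm.

```python
import numpy as np
from PIL import Image

# Illustrative parameter-to-image mapping: quantize each Gaussian attribute
# to 8 bits and tile the values into a 2D image so a conventional image
# codec can exploit spatial coherence. Ordering and layout are assumptions.

n = 4096  # number of Gaussians (square count for a clean tile)
rng = np.random.default_rng(3)
positions = rng.normal(size=(n, 3)).astype(np.float32)

# Sort by a simple spatial key (primary x, then y, then z) as a stand-in
# for the paper's ordering, so nearby Gaussians land on nearby pixels.
order = np.lexsort((positions[:, 2], positions[:, 1], positions[:, 0]))
sorted_pos = positions[order]

lo, hi = sorted_pos.min(axis=0), sorted_pos.max(axis=0)
quantized = np.round(255 * (sorted_pos - lo) / (hi - lo)).astype(np.uint8)

side = int(np.sqrt(n))
param_image = quantized.reshape(side, side, 3)      # one channel per x/y/z
Image.fromarray(param_image).save("positions.png")  # any image codec works here
```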

Result: On Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets, GSICO achieves average compression factors of 20.2x with minimal loss in visual quality (measured by PSNR, SSIM, and LPIPS). It consistently yields superior rate-distortion trade-offs compared to state-of-the-art GS compression methods.

Conclusion: GSICO provides an effective solution for compressing Gaussian Splatting models, enabling practical deployment in multimedia systems by significantly reducing storage requirements while maintaining high visual fidelity.

Abstract: Gaussian Splatting (GS) has recently emerged as a state-of-the-art representation for radiance fields, combining real-time rendering with high visual fidelity. However, GS models require storing millions of parameters, leading to large file sizes that impair their use in practical multimedia systems. To address this limitation, this paper introduces GS Image-based Compression (GSICO), a novel GS codec that efficiently compresses pre-trained GS models while preserving perceptual fidelity. The core contribution lies in a mapping procedure that arranges GS parameters into structured images, guided by a novel algorithm that enhances spatial coherence. These GS parameter images are then encoded using a conventional image codec. Experimental evaluations on Tanks and Temples, Deep Blending, and Mip-NeRF360 datasets show that GSICO achieves average compression factors of 20.2x with minimal loss in visual quality, as measured by PSNR, SSIM, and LPIPS. Compared with state-of-the-art GS compression methods, the proposed codec consistently yields superior rate-distortion (RD) trade-offs.

[445] HCVR Scene Generation: High Compatibility Virtual Reality Environment Generation for Extended Redirected Walking

Yiran Zhang, Xingpeng Sun, Aniket Bera

Main category: cs.MM

TL;DR: HCVR is a framework that generates virtual reality scenes optimized for Redirected Walking by ensuring physical-virtual environment compatibility, reducing collisions by 22.78x compared to LLM-based generation.

DetailsMotivation: Current virtual scene generation methods focus on aesthetics and object relationships but neglect physical compatibility with real-world spaces, which is crucial for effective Redirected Walking (RDW) that prevents collisions. When physical and virtual environments diverge geometrically, RDW fails, leading to unavoidable collisions.

Method: HCVR introduces ENI++, a boundary-sensitive metric to evaluate physical-virtual incompatibility using rotation-sensitive visibility polygons. It uses LLMs for context-aware 3D asset retrieval and initial layout generation, then strategically adjusts object selection, scaling, and placement to maximize coverage of incompatible regions, guiding users toward RDW-feasible paths.

Result: HCVR-generated scenes resulted in 22.78 times fewer physical collisions and 35.89% lower ENI++ incompatibility scores compared to LLM-based generation with RDW. Users also gave 12.5% higher scores for layout design quality.

Conclusion: HCVR successfully addresses the physical compatibility problem in VR scene generation by optimizing for RDW controllers, significantly reducing collisions while maintaining good layout quality, making natural walking in large virtual scenes more feasible.

Abstract: Natural walking enhances immersion in virtual environments (VEs), but physical space limitations and obstacles hinder exploration, especially in large virtual scenes. Redirected Walking (RDW) techniques mitigate this by subtly manipulating the virtual camera to guide users away from physical collisions within pre-defined VEs. However, RDW efficacy diminishes significantly when substantial geometric divergence exists between the physical and virtual environments, leading to unavoidable collisions. Existing scene generation methods primarily focus on object relationships or layout aesthetics, often neglecting the crucial aspect of physical compatibility required for effective RDW. To address this, we introduce HCVR (High Compatibility Virtual Reality Environment Generation), a novel framework that generates virtual scenes inherently optimized for alignment-based RDW controllers. HCVR first employs ENI++, a novel, boundary-sensitive metric to evaluate the incompatibility between physical and virtual spaces by comparing rotation-sensitive visibility polygons. Guided by the ENI++ compatibility map and user prompts, HCVR utilizes a Large Language Model (LLM) for context-aware 3D asset retrieval and initial layout generation. The framework then strategically adjusts object selection, scaling, and placement to maximize coverage of virtually incompatible regions, effectively guiding users towards RDW-feasible paths. User studies evaluating physical collisions and layout quality demonstrate HCVR’s effectiveness: HCVR-generated scenes result in 22.78 times fewer physical collisions and 35.89% lower ENI++ scores than LLM-based generation with RDW, while also receiving 12.5% higher user-feedback scores for layout design.

[446] Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok

Mingyue Zha, Ho-Chun Herbert Chang

Main category: cs.MM

TL;DR: Researchers developed a multimodal analysis pipeline combining automated feature extraction with Shapley values to study how text, visuals, and audio jointly influence engagement on TikTok, focusing on social anxiety disorder content.

DetailsMotivation: Existing research analyzes text, visuals, and audio modalities in isolation on short-form video platforms, lacking scalable frameworks to interpret their joint contributions to communication and engagement.

Method: A pipeline combining automated multimodal feature extraction with Shapley value-based interpretability was applied to 162,965 TikTok videos and 814,825 images about social anxiety disorder to analyze how different modalities jointly influence engagement.
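
With only three modality groups, Shapley attributions can be computed exactly by enumerating all 2^3 coalitions. The sketch below shows the textbook computation; the value-function numbers are illustrative placeholders (e.g., held-out R^2 of an engagement predictor restricted to those modalities), not results from the paper.

```python
from itertools import combinations
from math import factorial

# toy value function: predictor performance using only these modality groups
# (numbers are illustrative placeholders, not from the paper)
v = {
    frozenset(): 0.00,
    frozenset({"text"}): 0.10,
    frozenset({"visual"}): 0.18,
    frozenset({"audio"}): 0.06,
    frozenset({"text", "visual"}): 0.30,
    frozenset({"text", "audio"}): 0.17,
    frozenset({"visual", "audio"}): 0.26,
    frozenset({"text", "visual", "audio"}): 0.40,
}

players = ["text", "visual", "audio"]
n = len(players)

def shapley(player):
    """Exact Shapley value: weighted marginal contribution over all coalitions."""
    total = 0.0
    others = [p for p in players if p != player]
    for k in range(n):
        for subset in combinations(others, k):
            s = frozenset(subset)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (v[s | {player}] - v[s])
    return total

for p in players:
    print(p, round(shapley(p), 4))  # the three values sum to v(all) - v(empty)
```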

Result: Facial expressions outperformed textual sentiment in predicting viewership; informational content drove more attention than emotional support; and cross-modal synergies exhibited threshold-dependent effects, revealing interaction patterns invisible to single-modality approaches.

Conclusion: The study provides both methodological contributions (reproducible framework for interpretable multimodal research) and substantive advances (understanding of mental health communication in algorithmically mediated environments through multimodal analysis).

Abstract: Short-form video platforms integrate text, visuals, and audio into complex communicative acts, yet existing research analyzes these modalities in isolation, lacking scalable frameworks to interpret their joint contributions. This study introduces a pipeline combining automated multimodal feature extraction with Shapley value-based interpretability to analyze how text, visuals, and audio jointly influence engagement. Applying this framework to 162,965 TikTok videos and 814,825 images about social anxiety disorder (SAD), we find that facial expressions outperform textual sentiment in predicting viewership, informational content drives more attention than emotional support, and cross-modal synergies exhibit threshold-dependent effects. These findings demonstrate how multimodal analysis reveals interaction patterns invisible to single-modality approaches. Methodologically, we contribute a reproducible framework for interpretable multimodal research applicable across domains; substantively, we advance understanding of mental health communication in algorithmically mediated environments.

[447] Point Cloud Streaming with Latency-Driven Implicit Adaptation using MoQ

Andrew Freeman, Michael Rudolph, Amr Rizk

Main category: cs.MM

TL;DR: Using Media Over QUIC’s delivery timeout feature for implicit server-side adaptation to trade off latency vs quality in point cloud video streaming

DetailsMotivation: Point clouds are promising for VR/AR but their high bitrate limits practicality of live streaming systems

Method: Leverage delivery timeout feature within Media Over QUIC protocol to perform implicit server-side adaptation based on application’s latency target
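
In MoQ terms, each object carries a delivery timeout derived from the application's latency target: objects that outlive it are never sent, so quality degrades implicitly as the target tightens. A hypothetical server-side loop illustrating the idea (the types and the `send` callback are assumptions for the sketch, not a real MoQ library API):

```python
import time
from dataclasses import dataclass

@dataclass
class PointCloudObject:
    group: int          # MoQ group (e.g., one frame)
    quality: int        # e.g., point-cloud level-of-detail layer
    created_at: float   # monotonic timestamp at encoding time

def deliver(queue, send, latency_target_s):
    """Hypothetical publisher loop: objects older than the delivery timeout are
    dropped instead of sent, so tight latency targets implicitly shed quality."""
    for obj in queue:
        age = time.monotonic() - obj.created_at
        if age > latency_target_s:
            continue        # timeout expired: skip it rather than delay newer data
        send(obj)
```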

Result: System enables per-client trade-off: lower latency requirements get lower-quality video, relaxed latency requirements get higher-quality video

Conclusion: The approach unlocks unique quality-latency trade-offs for point cloud streaming on a per-client basis

Abstract: Point clouds are a promising video representation for virtual and augmented reality. Their high bitrate, however, has so far limited the practicality of live streaming systems. In this work, we leverage the delivery timeout feature within the Media Over QUIC protocol to perform implicit server-side adaptation based on an application’s latency target. Through experimentation with several publisher and network configurations, we demonstrate that our system unlocks a unique trade-off on a per-client basis: applications with lower latency requirements will receive lower-quality video, while applications with more relaxed latency requirements will receive higher-quality video.

[448] Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, Haijun Zhang

Main category: cs.MM

TL;DR: V-Skip is a token pruning method for multimodal LLMs that addresses visual amnesia by using visual-anchored information bottleneck optimization, achieving 2.9× speedup with minimal accuracy loss.

DetailsMotivation: Current CoT reasoning in MLLMs suffers from high latency due to autoregressive nature. Existing token compression methods fail in multimodal contexts by blindly applying text-centric metrics, causing visual amnesia where visually important tokens are erroneously pruned.

Method: V-Skip reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. It uses a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow to preserve visually salient anchors.
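
A rough sketch of the dual-path scoring under stated assumptions: linguistic surprisal comes from the model's own token log-probabilities, visual importance from precomputed cross-modal attention mass, and the two are blended before top-k pruning. The blending weight, normalization, and keep ratio below are hypothetical, not the paper's exact gating.

```python
import torch

def vskip_keep_mask(logits, token_ids, cross_attn, alpha=0.5, keep_ratio=0.6):
    """Illustrative dual-path importance for CoT token pruning.
    logits:     (T, V) logits over the vocabulary at each CoT position
    token_ids:  (T,)   generated CoT tokens
    cross_attn: (T,)   attention mass linking each text token to visual tokens
                       (assumed precomputed from the MLLM's attention maps)
    """
    logp = torch.log_softmax(logits, dim=-1)
    surprisal = -logp.gather(1, token_ids.unsqueeze(1)).squeeze(1)      # (T,)
    # standardize both signals, then blend: visually salient anchors survive
    s = (surprisal - surprisal.mean()) / (surprisal.std() + 1e-6)
    a = (cross_attn - cross_attn.mean()) / (cross_attn.std() + 1e-6)
    score = alpha * s + (1 - alpha) * a
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.zeros_like(score, dtype=torch.bool)
    keep[score.topk(k).indices] = True
    return keep
```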

Result: Achieves 2.9× speedup with negligible accuracy loss. Preserves fine-grained visual details, outperforming other baselines by over 30% on DocVQA. Tested on Qwen2-VL and Llama-3.2 families.

Conclusion: V-Skip effectively addresses visual amnesia in multimodal token pruning by balancing linguistic and visual information, enabling significant speed improvements while maintaining visual reasoning accuracy.

Abstract: While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip, which reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines by over 30% on DocVQA.

eess.AS

[449] Towards noise-robust speech inversion through multi-task learning with speech enhancement

Saba Tabatabaee, Carol Espy-Wilson

Main category: eess.AS

TL;DR: A unified framework combining speech enhancement and speech inversion using shared SSL representations to handle noisy speech scenarios.

DetailsMotivation: Real-world speech inversion faces challenges due to pervasive background noise, despite SSL representations showing effectiveness for clean speech.

Method: Joint training framework integrating speech enhancement and speech inversion modules through shared SSL-based speech representations, where SSL model supports noise suppression while producing informative representations for articulation estimation.
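
At its simplest, joint training of this kind reduces to a weighted sum of the two task losses computed from the shared SSL representation. A minimal sketch, assuming an L1 enhancement loss and an MSE regression loss on articulatory (tract-variable) targets; the weighting λ and loss choices are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def joint_se_si_loss(se_out, clean_speech, si_out, tv_targets, lam=0.5):
    """Multi-task objective over a shared SSL encoder (sketch):
    enhancement reconstruction loss plus weighted speech-inversion loss."""
    l_se = F.l1_loss(se_out, clean_speech)     # speech enhancement branch
    l_si = F.mse_loss(si_out, tv_targets)      # articulatory regression branch
    return l_se + lam * l_si                   # both gradients reach the SSL model
```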

Result: At -5 dB SNR, achieves 80.95% relative improvement over baseline under babble noise and 38.98% under non-babble noise, measured by average Pearson correlation across estimated parameters.

Conclusion: The unified framework successfully addresses noise challenges in speech inversion by leveraging joint training of enhancement and inversion modules through shared SSL representations.

Abstract: Recent studies demonstrate the effectiveness of Self-Supervised Learning (SSL) speech representations for Speech Inversion (SI). However, applying SI in real-world scenarios remains challenging due to the pervasive presence of background noise. We propose a unified framework that integrates Speech Enhancement (SE) and SI models through shared SSL-based speech representations. In this framework, the SSL model is trained not only to support the SE module in suppressing noise but also to produce representations that are more informative for the SI task, allowing both modules to benefit from joint training. At a Signal-to-Noise Ratio of -5 dB, our method for the SI task achieves relative improvements over the baseline of 80.95% under babble noise and 38.98% under non-babble noise, as measured by the average Pearson product-moment correlation across all estimated parameters.

[450] Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Wenda Zhang, Hongyu Jin, Siyi Wang, Zhiqiang Wei, Ting Dang

Main category: eess.AS

TL;DR: ALMs can generate synthetic emotion annotations to address scarcity in ambiguous emotion recognition, improving distribution reliability for low-ambiguity cases but struggling with highly ambiguous emotions.

DetailsMotivation: Traditional speech emotion recognition uses single categorical labels, ignoring emotion ambiguity. Ambiguous emotion recognition needs probability distributions, but ground-truth distributions are unreliable due to sparse human annotations.

Method: Proposes framework using Large Audio-Language Models to create Synthetic Perceptual Proxies that augment human annotations. Introduces DiME-Aug (Distribution-aware Multimodal Emotion Augmentation) to address class imbalance and enable unbiased evaluation.

Result: Synthetic annotations enhance emotion distribution quality, especially in low-ambiguity regions with high annotation agreement. Benefits diminish for highly ambiguous emotions with greater human disagreement.

Conclusion: First evidence that ALMs can address annotation scarcity in ambiguous emotion recognition, but highlights need for more advanced prompting/generation strategies for highly ambiguous cases.

Abstract: Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution, especially in low-ambiguity regions where annotation agreement is high. However, benefits diminish for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.

[451] Triage knowledge distillation for speaker verification

Ju-ho Kim, Youngmoon Jung, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho

Main category: eess.AS

TL;DR: TRKD (Triage KD) improves knowledge distillation for speaker verification by partitioning teacher predictions into target, confusion-set, and background groups, using a curriculum-based approach to focus on most confusable classes.

DetailsMotivation: Deploying speaker verification on resource-constrained devices is challenging due to computational costs. Classical KD has limitations in transferring relational information, and decoupled KD treats non-targets uniformly and is vulnerable to low-probability classes in large-class settings.

Method: TRKD introduces a cumulative-probability cutoff τ to assess difficulty and partition teacher posterior into three groups: target class, high-probability non-target confusion-set, and background-set. It distills the confusion-set conditional distribution while discarding background, transfers three-mass distribution, and uses a curriculum on τ that starts large for broad context then progressively decreases to focus on most confusable classes.
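
The triage partition itself is easy to make concrete. The sketch below is an illustrative reconstruction (the paper's exact normalization may differ): non-target probabilities are sorted, the smallest prefix whose cumulative mass reaches τ forms the confusion-set, the remainder is background, and the three-mass summary stacks the target, confusion, and background masses. The curriculum then simply re-runs this with a shrinking τ.

```python
import torch

def triage_partition(teacher_probs, target_idx, tau=0.9):
    """Split a teacher posterior (C,) into target / confusion-set / background
    via a cumulative-probability cutoff tau over the non-target classes."""
    probs = teacher_probs.clone()
    probs[target_idx] = 0.0                        # rank non-targets only
    vals, idx = probs.sort(descending=True)
    cum = vals.cumsum(0) / vals.sum()              # cumulative non-target mass
    cut = int((cum < tau).sum().item()) + 1        # smallest prefix reaching tau
    confusion, background = idx[:cut], idx[cut:]
    three_mass = torch.stack([
        teacher_probs[target_idx],
        teacher_probs[confusion].sum(),
        teacher_probs[background].sum(),
    ])
    return confusion, background, three_mass
```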

Result: In extensive experiments on VoxCeleb1 with both homogeneous and heterogeneous teacher-student pairs, TRKD was consistently superior to recent KD variants and attained the lowest EER across all protocols.

Conclusion: TRKD effectively addresses limitations of existing KD methods for speaker verification by operationalizing assess-prioritize-focus through triage-based partitioning and curriculum learning, achieving state-of-the-art performance.

Abstract: Deploying speaker verification on resource-constrained devices remains challenging due to the computational cost of high-capacity models; knowledge distillation (KD) offers a remedy. Classical KD entangles target confidence with non-target structure in a Kullback-Leibler term, limiting the transfer of relational information. Decoupled KD separates these signals into target and non-target terms, yet treats non-targets uniformly and remains vulnerable to the long tail of low-probability classes in large-class settings. We introduce Triage KD (TRKD), a distillation scheme that operationalizes assess-prioritize-focus. TRKD introduces a cumulative-probability cutoff $τ$ to assess per-example difficulty and partition the teacher posterior into three groups: the target class, a high-probability non-target confusion-set, and a background-set. To prioritize informative signals, TRKD distills the confusion-set conditional distribution and discards the background. Concurrently, it transfers a three-mass (target/confusion/background) distribution that captures sample difficulty and inter-class confusion. Finally, TRKD focuses learning via a curriculum on $τ$: training begins with a larger $τ$ to convey broad non-target context, then $τ$ is progressively decreased to shrink the confusion-set, concentrating supervision on the most confusable classes. In extensive experiments on VoxCeleb1 with both homogeneous and heterogeneous teacher-student pairs, TRKD was consistently superior to recent KD variants and attained the lowest EER across all protocols.

[452] NLP-Based Review for Toxic Comment Detection Tailored to the Chinese Cyberspace

Ruixing Ren, Junhui Zhao, Xiaoke Sun, Qiuping Li

Main category: eess.AS

TL;DR: This paper reviews toxic comment detection in Chinese cyberspace, analyzing challenges like cultural specificity and complex linguistic forms, proposing a new classification framework, and discussing model evolution from traditional to deep learning approaches.

DetailsMotivation: The explosive growth of user-generated content in Chinese cyberspace has led to widespread toxic comments that harm mental health, community atmosphere, and social trust. Traditional detection methods struggle with Chinese cyber language's context dependence, cultural specificity, and complex forms like homophones and metaphors.

Method: The review systematically analyzes research progress by: 1) defining Chinese toxic comments and analyzing platform ecology, 2) reviewing existing datasets and proposing a novel fine-grained classification framework with data annotation strategies, 3) summarizing model evolution from traditional to deep learning approaches with emphasis on interpretability.

Result: The paper provides a comprehensive analysis of the field, identifies limitations in current approaches, and proposes a new framework for toxic comment definition and classification that addresses the unique challenges of Chinese cyber language.

Conclusion: The review thoroughly discusses open challenges in Chinese toxic comment detection and provides forward-looking suggestions for future research directions, emphasizing the need for culturally-aware, interpretable models that can handle the complex linguistic features of Chinese cyber language.

Abstract: With the in-depth integration of mobile Internet and widespread adoption of social platforms, user-generated content in the Chinese cyberspace has witnessed explosive growth. Among this content, the proliferation of toxic comments poses severe challenges to individual mental health, community atmosphere and social trust. Owing to the strong context dependence, cultural specificity and rapid evolution of Chinese cyber language, toxic expressions are often conveyed through complex forms such as homophones and metaphors, imposing notable limitations on traditional detection methods. To address this issue, this review focuses on the core topic of natural language processing based toxic comment detection in the Chinese cyberspace, systematically collating and critically analyzing the research progress and key challenges in this field. This review first defines the connotation and characteristics of Chinese toxic comments, and analyzes the platform ecology and transmission mechanisms they rely on. It then comprehensively reviews the construction methods and limitations of existing public datasets, and proposes a novel fine-grained and scalable framework for toxic comment definition and classification, along with corresponding data annotation and quality assessment strategies. We systematically summarize the evolutionary path of detection models from traditional methods to deep learning, with special emphasis on the importance of interpretability in model design. Finally, we thoroughly discuss the open challenges faced by current research and provide forward-looking suggestions for future research directions.

[453] AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee

Main category: eess.AS

TL;DR: AQAScore is a new evaluation framework for text-to-audio generation that uses audio-aware LLMs to assess semantic alignment through probabilistic verification, outperforming existing similarity-based metrics.

DetailsMotivation: Current evaluation metrics for text-to-audio generation (like CLAPScore) are limited in fine-grained semantic alignment and compositional reasoning, despite significant progress in generation quality.

Method: AQAScore reformulates assessment as a probabilistic semantic verification task using audio-aware LLMs. Instead of open-ended generation, it computes exact log-probability of “Yes” answers to targeted semantic queries about audio-text alignment.
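
The core scoring step reduces to reading one entry of the next-token distribution rather than sampling text. A sketch assuming a hypothetical HuggingFace-style audio-LLM interface (`model(audio=...)` and its `.logits` field are assumptions; only the idea of scoring log P("Yes") is from the paper):

```python
import torch

def yes_logprob(model, tokenizer, audio, question):
    """Return log P('Yes' | audio, question) under an audio-aware LLM.
    The model/tokenizer interface here is hypothetical; the key step is
    reading the probability of the 'Yes' token from the next-token logits."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        logits = model(audio=audio, **inputs).logits[0, -1]  # next-token logits
    logp = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    return logp[yes_id].item()

# usage idea: score each targeted semantic query about an audio-text pair,
# then aggregate the per-question scores into AQAScore
```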

Result: AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines across multiple benchmarks including relevance, pairwise comparison, and compositional reasoning tasks.

Conclusion: AQAScore effectively captures subtle semantic inconsistencies and scales with the capability of underlying audio-aware LLMs, providing a more accurate evaluation framework for text-to-audio generation.

Abstract: Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely-adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a “Yes” answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, showing its effectiveness in capturing subtle semantic inconsistencies and scaling with the capability of underlying ALLMs.

[454] Inverse-Hessian Regularization for Continual Learning in ASR

Steven Vander Eeckt, Hugo Van hamme

Main category: eess.AS

TL;DR: IHR: A memory-free continual learning method for ASR that uses inverse Hessian regularization during model merging to reduce forgetting while maintaining adaptability.

DetailsMotivation: Catastrophic forgetting is a major challenge in continual learning for ASR. Existing weight averaging methods are heuristic and ignore loss landscape information, limiting their effectiveness.

Method: Proposes Inverse Hessian Regularization (IHR) - a memory-free approach that incorporates curvature information using Kronecker-factored inverse Hessian approximation. After fine-tuning on new tasks, the adaptation is adjusted to move primarily in directions less harmful to past performance.
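
For a single linear layer with Kronecker-factored curvature H ≈ A ⊗ B (K-FAC style, with A the input covariance and B the gradient covariance), applying the inverse Hessian to an update ΔW has the closed form B⁻¹ ΔW A⁻¹. A hedged sketch of that adjustment; the damping value and how the factors are estimated are assumptions, and this is not claimed to be the paper's exact update rule.

```python
import torch

def ihr_adjust(delta_W, A, B, damping=1e-3):
    """Regularize a fine-tuning update with a Kronecker-factored inverse Hessian.
    delta_W: (out, in) weight update from fine-tuning on the new task
    A:       (in, in)  K-FAC input-activation factor from the previous task
    B:       (out, out) K-FAC gradient factor from the previous task
    Since H ~= A (x) B, H^{-1} vec(dW) = vec(B^{-1} dW A^{-1})."""
    A_inv = torch.linalg.inv(A + damping * torch.eye(A.shape[0]))
    B_inv = torch.linalg.inv(B + damping * torch.eye(B.shape[0]))
    # shrinks movement along high-curvature (harmful) directions of the old task
    return B_inv @ delta_W @ A_inv
```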

Result: IHR significantly outperforms state-of-the-art baselines on two CL benchmarks, reducing forgetting while improving adaptability. Ablation studies confirm its effectiveness.

Conclusion: IHR provides an effective, lightweight solution for continual learning in ASR by incorporating curvature information into the merging process, addressing limitations of heuristic weight averaging methods.

Abstract: Catastrophic forgetting remains a major challenge for continual learning (CL) in automatic speech recognition (ASR), where models must adapt to new domains without losing performance on previously learned conditions. Several CL methods have been proposed for ASR, and, recently, weight averaging - where models are averaged in a merging step after fine-tuning - has proven effective as a simple memory-free strategy. However, it is heuristic in nature and ignores the underlying loss landscapes of the tasks, hindering adaptability. In this work, we propose Inverse Hessian Regularization (IHR), a memory-free approach for CL in ASR that incorporates curvature information into the merging step. After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight. We evaluate IHR on two CL benchmarks and show that it significantly outperforms state-of-the-art baselines, reducing forgetting while improving adaptability. Ablation studies and analyses further confirm its effectiveness.

[455] Test-Time Adaptation For Speech Enhancement Via Mask Polarization

Tobias Raichle, Erfan Amini, Bin Yang

Main category: eess.AS

TL;DR: MPol is a lightweight test-time adaptation method for speech enhancement that restores mask bimodality using Wasserstein distance to address model degradation under domain shifts.

DetailsMotivation: Speech enhancement models degrade in unseen environments, and test-time adaptation for SE is under-explored due to lack of understanding of how models degrade under domain shifts. The authors observed that mask-based SE models lose confidence under domain shifts, with predicted masks becoming flattened and losing decisive speech preservation and noise suppression.

Method: Mask polarization (MPol) - a lightweight TTA method that restores mask bimodality through distribution comparison using the Wasserstein distance. It requires no additional parameters beyond the trained model, making it suitable for resource-constrained edge deployments.
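
One way to realize "restoring bimodality via the Wasserstein distance" is to match sorted samples of the predicted mask against a bimodal reference, since the 1-D Wasserstein-2 distance between equal-size empirical distributions is the mean squared gap between their order statistics. The reference construction below is an assumption for illustration, not the paper's exact target distribution.

```python
import torch

def mask_polarization_loss(pred_mask, n_ref=4096):
    """Sketch: 1-D Wasserstein-2 distance between predicted mask values and a
    bimodal reference concentrated near 0 and 1 (illustrative reference)."""
    m = pred_mask.flatten()
    ref = torch.cat([torch.zeros(n_ref // 2), torch.ones(n_ref // 2)])
    ref = (ref + 0.02 * torch.randn_like(ref)).clamp(0, 1)  # softened modes
    # subsample mask values to match the reference size, compare order statistics
    idx = torch.randint(0, m.numel(), (n_ref,))
    diff = m[idx].sort().values - ref.sort().values
    return (diff ** 2).mean()
```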

Result: Experimental results across diverse domain shifts and architectures demonstrate that MPol achieves very consistent gains that are competitive with significantly more complex approaches.

Conclusion: MPol provides an effective, lightweight solution for test-time adaptation of speech enhancement models that addresses the observed degradation of mask confidence under domain shifts, making it practical for real-world deployments.

Abstract: Adapting speech enhancement (SE) models to unseen environments is crucial for practical deployments, yet test-time adaptation (TTA) for SE remains largely under-explored due to a lack of understanding of how SE models degrade under domain shifts. We observe that mask-based SE models lose confidence under domain shifts, with predicted masks becoming flattened and losing decisive speech preservation and noise suppression. Based on this insight, we propose mask polarization (MPol), a lightweight TTA method that restores mask bimodality through distribution comparison using the Wasserstein distance. MPol requires no additional parameters beyond the trained model, making it suitable for resource-constrained edge deployments. Experimental results across diverse domain shifts and architectures demonstrate that MPol achieves very consistent gains that are competitive with significantly more complex approaches.

[456] Fast-ULCNet: A fast and ultra low complexity network for single-channel speech enhancement

Nicolás Arrieta Larraza, Niels de Koeijer

Main category: eess.AS

TL;DR: Fast-ULCNet adapts ULCNet by replacing GRUs with FastGRNNs to reduce latency and complexity for embedded speech enhancement, while addressing state drifting with a trainable complementary filter.

DetailsMotivation: Need low-latency, low-complexity speech enhancement for resource-constrained embedded devices; existing state-of-the-art ULCNet uses GRUs which are computationally expensive.

Method: Replace GRU layers in ULCNet with FastGRNNs to reduce complexity; address FastGRNN state drifting in long audio with a novel trainable complementary filter approach.
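
For reference, a FastGRNN cell replaces the GRU's three gates with a single gate plus two trainable scalar residual weights, sharing one input/hidden transform between gate and candidate, which is where the size and latency savings come from. A minimal PyTorch cell following the published FastGRNN formulation (initializations are illustrative):

```python
import torch
import torch.nn as nn

class FastGRNNCell(nn.Module):
    """Minimal FastGRNN cell (Kusupati et al., 2018):
    z_t = sigmoid(Wx + Uh + b_z), h~ = tanh(Wx + Uh + b_h),
    h_t = (zeta * (1 - z_t) + nu) * h~ + z_t * h_{t-1}."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.b_z = nn.Parameter(torch.zeros(hidden_size))
        self.b_h = nn.Parameter(torch.zeros(hidden_size))
        self.zeta = nn.Parameter(torch.tensor(1.0))   # scalar residual weights
        self.nu = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x, h):
        pre = self.W(x) + self.U(h)                   # shared pre-activation
        z = torch.sigmoid(pre + self.b_z)
        h_tilde = torch.tanh(pre + self.b_h)
        scale = torch.sigmoid(self.zeta) * (1 - z) + torch.sigmoid(self.nu)
        return scale * h_tilde + z * h
```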

Result: Fast-ULCNet performs on par with original ULCNet, reduces model size by >50%, and decreases latency by 34% on average.

Conclusion: Fast-ULCNet achieves state-of-the-art performance with significantly reduced computational requirements, making it suitable for embedded speech enhancement applications.

Abstract: Single-channel speech enhancement algorithms are often used in resource-constrained embedded devices, where low latency and low complexity designs gain more importance. In recent years, researchers have proposed a wide variety of novel solutions to this problem. In particular, a recent deep learning model named ULCNet is among the state-of-the-art approaches in this domain. This paper proposes an adaptation of ULCNet, by replacing its GRU layers with FastGRNNs, to reduce both computational latency and complexity. Furthermore, this paper shows empirical evidence on the performance decay of FastGRNNs in long audio signals during inference due to internal state drifting, and proposes a novel approach based on a trainable complementary filter to mitigate it. The resulting model, Fast-ULCNet, performs on par with the state-of-the-art original ULCNet architecture on a speech enhancement task, while reducing its model size by more than half and decreasing its latency by 34% on average.

[457] Unsupervised Variational Acoustic Clustering

Luan Vinícius Fiorio, Bruno Defraene, Johan David, Frans Widdershoven, Wim van Houtum, Ronald M. Aarts

Main category: eess.AS

TL;DR: Unsupervised variational acoustic clustering model using convolutional-recurrent VAE with GMM prior for audio time-frequency data, showing improved clustering accuracy on spoken digits.

DetailsMotivation: Need for better unsupervised clustering methods for audio data in time-frequency domain that can capture complex audio patterns without labeled data.

Method: Variational inference extended to autoencoder framework with Gaussian mixture model prior, using convolutional-recurrent variational autoencoder specifically designed for time-frequency audio processing.

Result: Significant improvement in accuracy and clustering performance on spoken digits dataset compared to traditional methods.

Conclusion: The proposed variational acoustic clustering model effectively captures complex audio patterns and outperforms traditional clustering approaches for audio data.

Abstract: We propose an unsupervised variational acoustic clustering model for clustering audio data in the time-frequency domain. The model leverages variational inference, extended to an autoencoder framework, with a Gaussian mixture model as a prior for the latent space. Specifically designed for audio applications, we introduce a convolutional-recurrent variational autoencoder optimized for efficient time-frequency processing. Our experimental results considering a spoken digits dataset demonstrate a significant improvement in accuracy and clustering performance compared to traditional methods, showcasing the model’s enhanced ability to capture complex audio patterns.

[458] Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches

Changhao Pan, Dongyu Yao, Yu Zhang, Wenxiang Guo, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao

Main category: eess.AS

TL;DR: A comprehensive survey paper on deep learning-based singing voice synthesis (SVS) systems, categorizing existing approaches, analyzing core technologies, and reviewing datasets/tools.

DetailsMotivation: The field lacks a systematic survey of deep-learning-based singing voice synthesis systems despite recent advances with large language models and generative paradigms. There's a need to organize and analyze the current state of SVS technologies.

Method: Categorizes existing SVS systems by task type, organizes architectures into cascaded and end-to-end paradigms, provides in-depth analysis of core technologies (singing modeling and control techniques), and reviews datasets, annotation tools, and evaluation benchmarks.

Result: Provides an up-to-date review of SVS literature, offering a comprehensive reference that systematically analyzes deep-learning-based singing voice synthesis systems and their enabling technologies.

Conclusion: This survey serves as a useful reference for both researchers and engineers in the field of singing voice synthesis, addressing the gap in comprehensive systematic analysis and providing organized insights into current architectures, technologies, and resources.

Abstract: Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high-fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep-learning-based singing voice synthesis systems and their enabling technologies. To address the aforementioned issue, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. Moreover, we provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review relevant datasets, annotation tools, and evaluation benchmarks that support training and assessment. In appendix, we introduce training strategies and further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models, which would be a useful reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers.

[459] Categorical Unsupervised Variational Acoustic Clustering

Luan Vinícius Fiorio, Ivana Nikoloska, Ronald M. Aarts

Main category: eess.AS

TL;DR: Unsupervised variational acoustic clustering using categorical distribution with Gumbel-Softmax approximation for overlapping audio data in time-frequency domain.

DetailsMotivation: Urban acoustic scenes often have data points that strongly overlap in time and frequency, making traditional clustering challenging. A categorical approach is needed to enforce sharper clustering despite this overlap.

Method: Proposes categorical distribution for unsupervised variational acoustic clustering, using Gumbel-Softmax distribution as a soft approximation to enable backpropagation training. The softmax temperature parameter tunes clustering performance.
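
The Gumbel-Softmax trick is what makes the categorical latent trainable: add Gumbel noise to the cluster logits and take a temperature-controlled softmax, which approaches a one-hot sample as the temperature drops. A minimal sketch (PyTorch also ships this as `torch.nn.functional.gumbel_softmax`):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature):
    """Differentiable approximate sample from a categorical distribution.
    Lower temperature -> sharper (more one-hot) cluster assignments."""
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)    # Gumbel(0, 1) noise
    return F.softmax((logits + g) / temperature, dim=-1)
```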

Result: The model achieves impressive clustering performance across all considered datasets, even when data points strongly overlap in time and frequency.

Conclusion: Categorical approach with Gumbel-Softmax approximation effectively handles overlapping acoustic data and enables sharp clustering in challenging urban acoustic scenes.

Abstract: We propose a categorical approach for unsupervised variational acoustic clustering of audio data in the time-frequency domain. The consideration of a categorical distribution enforces sharper clustering even when data points strongly overlap in time and frequency, which is the case for most datasets of urban acoustic scenes. To this end, we use a Gumbel-Softmax distribution as a soft approximation to the categorical distribution, allowing for training via backpropagation. In this setting, the softmax temperature serves as the main mechanism to tune clustering performance. The results show that the proposed model can obtain impressive clustering performance for all considered datasets, even when data points strongly overlap in time and frequency.

[460] Acoustic Non-Stationarity Objective Assessment with Hard Label Criteria for Supervised Learning Models

Guilherme Zucatelli, Ricardo Barioni, Gabriela Dantas

Main category: eess.AS

TL;DR: Proposes Hard Label Criteria (HLC) algorithm to generate non-stationarity labels for acoustic signals, enabling supervised learning of stationarity estimators, and introduces NANSA network that achieves 99% accuracy while solving computational limitations of traditional measures.

DetailsMotivation: Existing objective non-stationarity measures are resource-intensive and impose critical limitations for real-time processing solutions, creating a need for more efficient approaches.

Method: Proposes Hard Label Criteria (HLC) algorithm to generate global non-stationarity labels for acoustic signals, enabling supervised learning strategies. Evaluates HLC on state-of-the-art acoustic models, then introduces NANSA (Network for Acoustic Non-Stationarity Assessment) based on HLC.

Result: HLC demonstrates that existing acoustic models capture stationarity information. NANSA models outperform competing approaches, achieving up to 99% classification accuracy while solving computational infeasibility of traditional objective measures.

Conclusion: The proposed HLC algorithm and NANSA network provide an effective solution for acoustic non-stationarity assessment, overcoming computational limitations of traditional methods and enabling real-time processing with high accuracy.

Abstract: Objective non-stationarity measures are resource intensive and impose critical limitations for real-time processing solutions. In this paper, a novel Hard Label Criteria (HLC) algorithm is proposed to generate a global non-stationarity label for acoustic signals, enabling supervised learning strategies to be trained as stationarity estimators. The HLC is first evaluated on state-of-the-art general-purpose acoustic models, demonstrating that these models capture stationarity information. Furthermore, the first-of-its-kind HLC-based Network for Acoustic Non-Stationarity Assessment (NANSA) is proposed. NANSA models outperform competing approaches, achieving up to 99% classification accuracy, while solving the computational infeasibility of traditional objective measures.

[461] Rec-RIR: Monaural Blind Room Impulse Response Identification via DNN-based Reverberant Speech Reconstruction in STFT Domain

Pengyu Wang, Xiaofei Li

Main category: eess.AS

TL;DR: Rec-RIR is a monaural blind room impulse response identification method using deep neural networks with cross-band and narrow-band blocks to estimate convolutive transfer functions, achieving state-of-the-art performance.

DetailsMotivation: The paper addresses the challenge of blind room impulse response (RIR) identification from monaural audio, which is important for various audio processing applications like speech enhancement, dereverberation, and acoustic parameter estimation.

Method: The method uses convolutive transfer function (CTF) approximation to model reverberation in narrow-band filter banks. A DNN with cross-band and narrow-band blocks estimates CTF filters by reconstructing noise-free reverberant speech spectra. A pseudo intrusive measurement process then converts CTF estimates into RIRs.
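
A plausible rendering of the CTF model being estimated (notation assumed, not copied from the paper): the reverberant STFT is a short per-band convolution of the source with the CTF filter,

$$X(t,f) \;\approx\; \sum_{p=0}^{P-1} H(p,f)\, S(t-p,\, f) + N(t,f),$$

where $X$ is the observed spectrogram, $S$ the dry source, $H(\cdot,f)$ the length-$P$ CTF filter for band $f$, and $N$ additive noise. The DNN predicts $H$, and training reconstructs the noise-free reverberant term, i.e., the convolution sum.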

Result: Rec-RIR achieves state-of-the-art performance in both RIR identification and acoustic parameter estimation, with open-source code available for reproducibility.

Conclusion: The proposed Rec-RIR method effectively solves monaural blind RIR identification using CTF approximation and deep learning, providing a stable supervised training approach and demonstrating superior performance over existing methods.

Abstract: This paper presents Rec-RIR for monaural blind room impulse response (RIR) identification. Rec-RIR is developed based on the convolutive transfer function (CTF) approximation, which models reverberation effect within narrow-band filter banks in the short-time Fourier transform domain. Specifically, we propose a deep neural network (DNN) with cross-band and narrow-band blocks to estimate the CTF filter. The DNN is trained through reconstructing the noise-free reverberant speech spectra. This objective enables stable and straightforward supervised training. Subsequently, a pseudo intrusive measurement process is employed to convert the CTF filter estimate into RIR by simulating a common intrusive RIR measurement procedure. Experimental results demonstrate that Rec-RIR achieves state-of-the-art performance in both RIR identification and acoustic parameter estimation. Open-source codes are available online at https://github.com/Audio-WestlakeU/Rec-RIR.

[462] Clustering of Acoustic Environments with Variational Autoencoders for Hearing Devices

Luan Vinícius Fiorio, Ivana Nikoloska, Wim van Houtum, Ronald M. Aarts

Main category: eess.AS

TL;DR: VAE-based unsupervised clustering for acoustic environments outperforms traditional methods, especially for complex urban soundscapes with categorical latent representations.

DetailsMotivation: Traditional acoustic classification has limitations: classical signal processing can't handle high-dimensional data well, and supervised learning depends on labeled data which may not reflect true acoustic scene structure. There's a need for unsupervised methods that can discover natural groupings in acoustic environments.

Method: Uses variational autoencoders (VAEs) with categorical latent clustering using Gumbel-Softmax reparameterization. Includes time-context windowing for lower memory requirements (suitable for hearing devices), and proposes general VAE architecture adaptations for audio clustering.

Result: All variational methods succeeded on spoken digits clustering (simpler task with meaningful labels), but only the proposed categorical VAE model achieved effective clustering performance on complex urban soundscapes with overlapping time-frequency characteristics.

Conclusion: Categorical VAE with Gumbel-Softmax is effective for unsupervised clustering of acoustic environments, particularly for complex real-world scenarios like urban soundscapes where traditional methods fail.

Abstract: Traditional acoustic environment classification relies on: i) classical signal processing algorithms, which are unable to extract meaningful representations of high-dimensional data; or on ii) supervised learning, limited by the availability of labels. Knowing that human-imposed labels do not always reflect the true structure of acoustic scenes, we explore the potential of (unsupervised) clustering of acoustic environments using variational autoencoders (VAEs). We employ a VAE model for categorical latent clustering with a Gumbel-Softmax reparameterization which can operate with a time-context windowing scheme for lower memory requirements, tailored for real-world hearing device scenarios. Additionally, general adaptations on VAE architectures for audio clustering are also proposed. The approaches are validated through the clustering of spoken digits, a simpler task where labels are meaningful, and urban soundscapes, where the recordings present strong overlap in time and frequency. While all variational methods succeeded when clustering spoken digits, only the proposed model achieved effective clustering performance on urban acoustic scenes, given its categorical nature.

[463] Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic

Main category: eess.AS

TL;DR: Omni-AVSR is a unified audio-visual LLM that supports ASR, VSR, and AVSR tasks with elastic inference, using multi-granularity training and parameter-efficient adaptation to reduce resource use while maintaining performance.

DetailsMotivation: Current LLM-based speech recognition approaches train separate models for ASR, VSR, and AVSR tasks, increasing computational and deployment costs while missing cross-task synergies. Fixed-rate token compression also limits flexibility in balancing accuracy with efficiency.

Method: Adapts matryoshka representation learning for efficient multi-granularity training across audio and visual modalities. Explores three LoRA-based strategies for parameter-efficient adaptation of the backbone LLM to balance shared and task-specific specialization.

Result: Achieves comparable or superior accuracy to state-of-the-art baselines on LRS2 and LRS3 datasets while training only a single model with substantially lower training and deployment resource use. Model remains robust under acoustic noise and shows insights into performance-efficiency trade-offs with LLM scaling.

Conclusion: Omni-AVSR provides a unified framework for audio-visual speech recognition that enables elastic inference while reducing resource requirements, demonstrating the viability of multi-task learning with parameter-efficient adaptation in speech recognition LLMs.

Abstract: Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.

[464] Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes

Main category: eess.AS

TL;DR: PCG accelerates speech LLM generation using acoustic similarity groups instead of exact token matching for speculative decoding, increasing acceptance rates while maintaining quality.

DetailsMotivation: Standard speculative decoding for speech LLMs suffers from low acceptance rates because exact token matching is too restrictive - many discrete tokens are acoustically or semantically interchangeable, limiting speedups.

Method: Introduces Principled Coarse-Graining (PCG) that verifies proposals at Acoustic Similarity Groups (ASGs) derived from target model’s embedding space. Splits token probability mass across overlapping groups and performs rejection sampling on group variable, allowing accepted draft tokens to stand in for any group member.
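
Stripped to its essence, and simplified here to disjoint groups (the paper handles overlapping groups by splitting each token's mass), group-level acceptance is token-level speculative rejection sampling applied to the group variable:

```python
import torch

def accept_group(draft_probs, target_probs, groups, proposed_token):
    """Sketch of coarse-grained speculative acceptance.
    draft_probs, target_probs: (V,) next-token distributions
    groups: (V,) long tensor mapping each token id to its acoustic group id
    The proposed draft token's group is accepted with probability
    min(1, p_target(group) / p_draft(group))."""
    g = groups[proposed_token]
    in_g = (groups == g)
    p_t = target_probs[in_g].sum()
    p_d = draft_probs[in_g].sum()
    return torch.rand(()) < torch.clamp(p_t / p_d, max=1.0)
    # on acceptance, the draft token stands in for any member of its group
```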

Result: On LibriTTS, PCG increases acceptance rates and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity.

Conclusion: Acoustically aware, group-level acceptance provides a simple and general way to accelerate speech token generation while maintaining speech quality.

Abstract: Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model’s embedding space. By splitting each token’s probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.

[465] Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Yifan Yang, Bing Han, Hui Wang, Wei Wang, Ziyang Ma, Long Zhou, Zengrui Jin, Guanrou Yang, Tianrui Wang, Xu Tan, Xie Chen

Main category: eess.AS

TL;DR: FCaps dataset provides 47k hours of speech with 19M fine-grained style captions via direct audio grounding, enabling CLSP model to learn multi-granular speech-text representations for various zero-shot tasks.

DetailsMotivation: Existing speech-text models lack fine-grained style modeling due to coarse captions, task-specific supervision, and lack of scalable fine-grained annotations. Current annotation pipelines suffer from error propagation through LLM-based rewriting.

Method: Created FCaps dataset using novel end-to-end pipeline that directly grounds detailed captions in audio (avoiding cascaded LLM rewriting). Built CLSP model with contrastive language-speech pre-training integrating global and fine-grained supervision for multi-granular representations.

Result: FCaps annotations surpass existing cascaded annotations in correctness, coverage, and naturalness (evaluated via LLM-as-a-judge). CLSP learns fine-grained multi-granular representations that perform well on global/fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring.

Conclusion: The FCaps dataset and CLSP model enable effective fine-grained speaking style modeling through direct audio grounding and multi-granular contrastive learning, achieving strong performance across diverse speech-text tasks with alignment to human judgments.

Abstract: Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.

[466] Adaptive Rotary Steering with Joint Autoregression for Robust Extraction of Closely Moving Speakers in Dynamic Scenarios

Jakob Kienegger, Timo Gerkmann

Main category: eess.AS

TL;DR: Proposes joint autoregressive framework combining automated rotary steering with multi-channel enhancement for dynamic multi-speaker scenarios, improving tracking and enhancement of closely spaced speakers.

DetailsMotivation: Existing deep spatial filtering methods work well for stationary speakers but struggle with dynamic acoustic conditions, especially when speakers are nearby or crossing, making robust tracking difficult and spatial cues less effective.

Method: Novel joint autoregressive framework that automates rotary steering using interleaved tracking algorithm conditioned on target’s initial direction, and incorporates processed recording as additional guide into both algorithms to leverage temporal-spectral correlations of speech.

Result: Significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on synthetic dataset. Real-world recordings show effectiveness in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.

Conclusion: The proposed joint autoregressive framework successfully addresses challenges in dynamic multi-speaker scenarios by leveraging temporal-spectral correlations to resolve spatially challenging speaker constellations, demonstrating superior performance over existing methods.

Abstract: Latest advances in deep spatial filtering for Ambisonics demonstrate strong performance in stationary multi-speaker scenarios by rotating the sound field toward a target speaker prior to multi-channel enhancement. For applicability in dynamic acoustic conditions with moving speakers, we propose to automate this rotary steering using an interleaved tracking algorithm conditioned on the target’s initial direction. However, for nearby or crossing speakers, robust tracking becomes difficult and spatial cues less effective for enhancement. By incorporating the processed recording as additional guide into both algorithms, our novel joint autoregressive framework leverages temporal-spectral correlations of speech to resolve spatially challenging speaker constellations. Consequently, our proposed method significantly improves tracking and enhancement of closely spaced speakers, consistently outperforming comparable non-autoregressive methods on a synthetic dataset. Real-world recordings complement these findings in complex scenarios with multiple speaker crossings and varying speaker-to-array distances.

eess.IV

[467] Self-Supervised Score-Based Despeckling for SAR Imagery via Log-Domain Transformation

Junhyuk Heo

Main category: eess.IV

TL;DR: Novel self-supervised framework for SAR image despeckling using score-based generative models in log domain transformation.

DetailsMotivation: SAR speckle noise is multiplicative and Gamma-distributed, degrading image quality and complicating analysis. Existing methods face challenges in effectively despeckling SAR imagery due to the complex noise characteristics.

Method: Transform SAR data into log-domain to convert multiplicative speckle into approximately additive Gaussian noise. Apply score-based generative models trained in transformed domain using self-supervised objective that learns from further corrupted versions of input data.
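
The enabling observation is classical: fully developed speckle is multiplicative and Gamma-distributed with unit mean, so taking a log turns it into additive noise with a fixed, signal-independent distribution (approximately Gaussian for moderate look counts). A toy numpy demonstration of the transformation, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.5, 2.0, size=(256, 256))   # toy clean SAR intensity
L = 4                                            # number of looks
speckle = rng.gamma(shape=L, scale=1.0 / L, size=clean.shape)  # mean 1
noisy = clean * speckle                          # multiplicative model

log_noisy = np.log(noisy)                        # = log(clean) + log(speckle)
# log(speckle) no longer depends on the signal, so additive-noise score-based
# denoisers can be trained in this domain
print(log_noisy.mean(), log_noisy.std())
```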

Result: Method achieves significantly shorter inference times compared to existing self-supervised techniques while providing robust SAR image restoration.

Conclusion: The proposed self-supervised score-based framework offers a practical and efficient solution for SAR image despeckling by effectively handling multiplicative speckle noise through domain transformation and generative modeling.

Abstract: The speckle noise inherent in Synthetic Aperture Radar (SAR) imagery significantly degrades image quality and complicates subsequent analysis. Given that SAR speckle is multiplicative and Gamma-distributed, effectively despeckling SAR imagery remains challenging. This paper introduces a novel self-supervised framework for SAR image despeckling based on score-based generative models operating in the transformed log domain. We first transform the data into the log-domain and then convert the speckle noise residuals into an approximately additive Gaussian distribution. This step enables the application of score-based models, which are trained in the transformed domain using a self-supervised objective. This objective allows our model to learn the clean underlying signal by training on further corrupted versions of the input data itself. Consequently, our method exhibits significantly shorter inference times compared to many existing self-supervised techniques, offering a robust and practical solution for SAR image restoration.

[468] Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition

Zhengyong Huang, Xingwen Sun, Xuting Chang, Ning Jiang, Yao Wang, Jianfei Sun, Hongbin Han, Yao Sui

Main category: eess.IV

TL;DR: LGANet++ is a novel unsupervised deformable image registration framework using local-global attention mechanism that outperforms state-of-the-art methods across cross-patient, cross-time, and cross-modal CT-MR registration tasks.

DetailsMotivation: Traditional deformable image registration methods are computationally intensive and lack generalizability. While deep learning attention mechanisms improve feature alignment, they still struggle with regions of high anatomical variability.

Method: Proposed LGANet++ framework with novel local-global attention mechanism integrated with unique feature interaction and fusion techniques for unsupervised deformable image registration.

Result: Outperformed state-of-the-art methods on five public datasets across three scenarios: improved accuracy by 1.39% (cross-patient), 0.71% (cross-time), and 6.12% (cross-modal CT-MR registration).

Conclusion: LGANet++ demonstrates superior registration accuracy, robustness, and generalizability, showing potential to support clinical workflows requiring reliable and efficient image registration.

Abstract: Deformable image registration is a critical technology in medical image analysis, with broad applications in clinical practice such as disease diagnosis, multi-modal fusion, and surgical navigation. Traditional methods often rely on iterative optimization, which is computationally intensive and lacks generalizability. Recent advances in deep learning have introduced attention-based mechanisms that improve feature alignment, yet accurately registering regions with high anatomical variability remains challenging. In this study, we proposed a novel unsupervised deformable image registration framework, LGANet++, which employs a novel local-global attention mechanism integrated with a unique technique for feature interaction and fusion to enhance registration accuracy, robustness, and generalizability. We evaluated our approach using five publicly available datasets, representing three distinct registration scenarios: cross-patient, cross-time, and cross-modal CT-MR registration. The results demonstrated that our approach consistently outperforms several state-of-the-art registration methods, improving registration accuracy by 1.39% in cross-patient registration, 0.71% in cross-time registration, and 6.12% in cross-modal CT-MR registration tasks. These results underscore the potential of LGANet++ to support clinical workflows requiring reliable and efficient image registration. The source code is available at https://github.com/huangzyong/LGANet-Registration.

[469] Partial Decoder Attention Network with Contour-weighted Loss Function for Data-Imbalance Medical Image Segmentation

Zhengyong Huang, Ning Jiang, Xingwen Sun, Lihua Zhang, Peng Chen, Jens Domke, Yao Sui

Main category: eess.IV

TL;DR: Proposed PDANet with contour-weighted segmentation to address data imbalance in medical images, improving segmentation of small/underrepresented structures across multiple anatomical tasks.

DetailsMotivation: Medical image segmentation suffers from data imbalance issues where large volume disparities among organs/tissues and uneven sample distributions bias models toward larger/more frequent structures, overlooking smaller/less represented ones, affecting accuracy and robustness.

Method: Developed PDANet, a lightweight segmentation network based on partial decoder mechanism, with novel contour-weighted segmentation approach to improve representation of small and underrepresented structures.

Result: Outperformed nine state-of-the-art methods across three datasets (abdominal organs, brain tumors, pelvic bone fragments). The contour-weighted strategy also improved the other comparison methods, raising their average Dice scores by 2.32%, 1.67%, and 3.60% on the respective datasets.

Conclusion: Contour-weighted segmentation method surpasses current approaches in accuracy and robustness. As a model-independent strategy, it can seamlessly fit various segmentation frameworks, highlighting practical importance and broad application potential in medical image analysis.

Abstract: Image segmentation is pivotal in medical image analysis, facilitating clinical diagnosis, treatment planning, and disease evaluation. Deep learning has significantly advanced automatic segmentation methodologies by providing superior modeling capability for complex structures and fine-grained anatomical regions. However, medical images often suffer from data imbalance issues, such as large volume disparities among organs or tissues, and uneven sample distributions across different anatomical structures. This imbalance tends to bias the model toward larger organs or more frequently represented structures, while overlooking smaller or less represented structures, thereby affecting the segmentation accuracy and robustness. To address these challenges, we proposed a novel contour-weighted segmentation approach, which improves the model’s capability to represent small and underrepresented structures. We developed PDANet, a lightweight and efficient segmentation network based on a partial decoder mechanism. We evaluated our method using three prominent public datasets. The experimental results show that our methodology excelled in three distinct tasks: segmenting multiple abdominal organs, brain tumors, and pelvic bone fragments with injuries. It consistently outperformed nine state-of-the-art methods. Moreover, the proposed contour-weighted strategy improved segmentation for other comparison methods across the three datasets, yielding average enhancements in Dice scores of 2.32%, 1.67%, and 3.60%, respectively. These results demonstrate that our contour-weighted segmentation method surpassed current leading approaches in both accuracy and robustness. As a model-independent strategy, it can seamlessly fit various segmentation frameworks, enhancing their performance. This flexibility highlighted its practical importance and potential for broad use in medical image analysis.
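
A contour-weighted loss can be sketched compactly: derive a boundary band from the ground-truth mask with a morphological gradient and up-weight it in the loss. The snippet below is a minimal 2D illustration of that idea for a binary mask; the band width and boost factor are hypothetical, and the paper's exact weighting may differ.

```python
import torch
import torch.nn.functional as F

def contour_weight_map(mask: torch.Tensor, width: int = 3, boost: float = 4.0) -> torch.Tensor:
    """Up-weight pixels near the boundary of a binary (B, 1, H, W) mask.

    The boundary band is a morphological gradient (dilation minus erosion via
    max-pooling); `width` and `boost` are hypothetical hyperparameters.
    """
    pad = width // 2
    dilated = F.max_pool2d(mask, width, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, width, stride=1, padding=pad)
    contour = (dilated - eroded).clamp(0.0, 1.0)
    return 1.0 + boost * contour

def contour_weighted_bce(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Plain BCE, but each pixel's contribution is scaled by its contour weight.
    return F.binary_cross_entropy_with_logits(logits, mask, weight=contour_weight_map(mask))

# Toy usage on a random batch.
mask = (torch.rand(2, 1, 64, 64) > 0.7).float()
logits = torch.randn(2, 1, 64, 64)
print(contour_weighted_bce(logits, mask).item())
```

Because the weighting is computed from the label alone, it plugs into any segmentation loss and framework, which is consistent with the paper's claim of model independence.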

[470] LiNUS: Lightweight Automatic Segmentation of Deep Brain Nuclei for Real-Time DBS Surgery

Shuo Zhang, Zihua Wang, Changgeng He, Chunhua Hu

Main category: eess.IV

TL;DR: LiNUS is a lightweight deep learning framework for automatic STN segmentation in DBS surgery, achieving high Dice scores with fast inference time.

DetailsMotivation: Address challenges of small target volume and class imbalance in MRI data for STN segmentation in DBS surgery, improving upon existing methods.

Method: Improves U-Net architecture with spectral normalization constraints, bilinear interpolation upsampling, and multi-scale feature fusion mechanism.

Result: Achieves Dice coefficient of 0.679 on Tsinghua DBS dataset with 0.05s inference time per subject, and 0.89 Dice on high-resolution data, outperforming traditional methods.

Conclusion: LiNUS provides accurate, fast STN segmentation with a dedicated GUI for real-time clinical application in DBS surgery.

Abstract: This paper proposes LiNUS, a lightweight deep learning framework for the automatic segmentation of the Subthalamic Nucleus (STN) in Deep Brain Stimulation (DBS) surgery. Addressing the challenges of small target volume and class imbalance in MRI data, LiNUS improves upon the U-Net architecture by introducing spectral normalization constraints, bilinear interpolation upsampling, and a multi-scale feature fusion mechanism. Experimental results on the Tsinghua DBS dataset (TT14) demonstrate that LiNUS achieves a Dice coefficient of 0.679 with an inference time of only 0.05 seconds per subject, significantly outperforming traditional manual and registration-based methods. Further validation on high-resolution data confirms the model’s robustness, achieving a Dice score of 0.89. A dedicated Graphical User Interface (GUI) was also developed to facilitate real-time clinical application.
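
The three architectural changes the abstract names (spectral normalization, bilinear upsampling, and multi-scale fusion) can be combined in a single decoder block, roughly as below. This is a generic sketch under assumed layer sizes, not LiNUS's published code.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class UpBlock(nn.Module):
    """Decoder block combining the three named changes (assumed layer sizes)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # Bilinear upsampling instead of a transposed convolution.
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # Spectral normalization constrains the convolutions' Lipschitz constant.
        self.conv1 = spectral_norm(nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1))
        self.conv2 = spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # multi-scale fusion via the skip path
        return self.act(self.conv2(self.act(self.conv1(x))))

x, skip = torch.randn(1, 64, 16, 16), torch.randn(1, 32, 32, 32)
print(UpBlock(64, 32, 32)(x, skip).shape)   # torch.Size([1, 32, 32, 32])
```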

[471] Filtered 2D Contour-Based Reconstruction of 3D STL Model from CT-DICOM Images

K. Punnam Chandar, Y. Ravi Kumar

Main category: eess.IV

TL;DR: A method for reconstructing 3D STL models from 2D DICOM contours using filtering to improve accuracy by removing segmentation outliers.

DetailsMotivation: Segmentation of low-resolution CT images produces 2D contour data with outliers, causing reconstructed 3D STL models to deviate from actual geometry. Need to improve 3D reconstruction accuracy.

Method: Process CT images (contrast enhancement, noise reduction, smoothing), segment using thresholding, extract 2D contours, filter outliers, perform Delaunay triangulation on the filtered points of each slice, and stack the triangulated slices layer by layer to reconstruct the 3D STL model.

Result: Verified on basic shapes and human pelvic bone ROI. Filtered 2D contour points produced improved 3D STL geometry compared to unfiltered reconstruction.

Conclusion: Filtering 2D contour data points before 3D reconstruction improves STL model geometry accuracy by addressing segmentation imperfections from low-resolution images.

Abstract: Reconstructing a 3D Stereolithography (STL) model from 2D contours of a scanned structure in Digital Imaging and Communications in Medicine (DICOM) images is crucial for understanding geometry and deformity. Computed Tomography (CT) images are processed to enhance contrast and reduce noise, followed by smoothing. The processed CT images are segmented using a thresholding technique. 2D contour data points are extracted from the segmented CT images and used to construct 3D STL models. These contour points may contain outliers, a consequence of segmenting low-resolution images, so the geometry of the constructed 3D structure deviates from the actual anatomy. To cope with these imperfections in the segmentation process, in this work we propose to use filtered 2D contour data points to reconstruct the 3D STL model. The filtered 2D contour points of each image are Delaunay-triangulated and joined layer by layer to reconstruct the 3D STL model. The reconstruction is verified on i) 2D data points of basic shapes and ii) a region of interest (ROI) of the human pelvic bone, presented as case studies. The 3D STL models constructed from the 2D contour data points of the segmented pelvic bone ROI, with and without filtering, are compared; the model reconstructed from the filtered points shows improved geometry over the unfiltered reconstruction.
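
A minimal NumPy/SciPy sketch of the per-slice pipeline: filter contour outliers with a robust radial rule, then Delaunay-triangulate the surviving points. The circle-plus-outliers slice and the MAD-based filter are illustrative assumptions; the paper's actual filter may differ.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)

# Hypothetical contour of one slice: a noisy circle plus a few segmentation outliers.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
contour = np.c_[np.cos(theta), np.sin(theta)] * 40.0 + 100.0
contour += rng.normal(0.0, 0.5, contour.shape)
outliers = rng.uniform(20.0, 180.0, (6, 2))
points = np.vstack([contour, outliers])

# Robust filter: discard points whose distance from the centroid deviates
# strongly from the median radius (the paper's exact filter may differ).
centroid = np.median(points, axis=0)
r = np.linalg.norm(points - centroid, axis=1)
med = np.median(r)
mad = np.median(np.abs(r - med))
keep = np.abs(r - med) < 3.0 * 1.4826 * mad
filtered = points[keep]

tri = Delaunay(filtered)   # triangulate the cleaned slice
print(f"kept {keep.sum()}/{len(points)} points, {len(tri.simplices)} triangles")
# Joining the per-slice triangulations layer by layer yields the 3D STL surface.
```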

[472] Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani, Md Sharif Mollah

Main category: eess.IV

TL;DR: Researchers developed two hybrid CNN-transformer models for PCOS detection from ultrasound images, with the final DenConREST model achieving 98.23% accuracy.

DetailsMotivation: PCOS is a common endocrine disorder affecting many Bangladeshi women, and there's a need for accurate vision-based medical image analysis techniques to improve diagnostic accuracy and reduce errors in PCOS detection from ultrasound images.

Method: Developed two novel hybrid models combining convolutional and transformer-based approaches. The first model ‘DenConST’ integrated DenseNet121, Swin Transformer, and ConvNeXt. The final optimized model ‘DenConREST’ incorporated Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2. Used ultrasound images categorized as “infected” (PCOS-positive) and “noninfected” (healthy ovaries).

Result: DenConST achieved 85.69% accuracy, while the optimized DenConREST demonstrated superior performance with 98.23% accuracy, showing the best performance among all evaluated models.

Conclusion: The research presents an efficient solution for PCOS detection from ultrasound images that significantly improves diagnostic accuracy while reducing detection errors, highlighting the effectiveness of hybrid CNN-transformer architectures for medical image analysis.

Abstract: Polycystic Ovary Syndrome (PCOS) is the most common endocrine disorder in women of reproductive age, and many Bangladeshi women continue to suffer from it into older age. The aim of our research is to identify effective vision-based medical image analysis techniques and evaluate hybrid models for the accurate detection of PCOS. We introduce two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: “infected” (PCOS-positive) and “noninfected” (healthy ovaries). In the initial stage, our first hybrid model, ‘DenConST’ (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, ‘DenConREST’ (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy, the best among all evaluated models. This research highlights an efficient solution for PCOS detection from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.
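
Feature-level fusion of CNN and transformer backbones can be prototyped in a few lines with torchvision. The sketch below combines only two of DenConREST's five backbones (DenseNet121 and Swin-T) and concatenates their pooled features before a shared classifier head; the fusion strategy is an assumption, since the abstract does not state how the backbones are combined.

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridNet(nn.Module):
    """Two-backbone version of the CNN+transformer fusion idea (DenConREST uses
    five backbones; the concatenation fusion here is an assumption)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.cnn = models.densenet121(weights=None)
        self.cnn.classifier = nn.Identity()   # expose 1024-dim pooled CNN features
        self.vit = models.swin_t(weights=None)
        self.vit.head = nn.Identity()         # expose 768-dim transformer features
        self.head = nn.Linear(1024 + 768, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([self.cnn(x), self.vit(x)], dim=1))

x = torch.randn(1, 3, 224, 224)               # an ultrasound frame resized to 224x224
print(HybridNet()(x).shape)                   # torch.Size([1, 2])
```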

[473] A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis

Wei Yang, Yiran Zhu, Yan su, Zesheng Li, Chengchang Pan, Honggang Qi

Main category: eess.IV

TL;DR: DyPro is a deep learning framework that predicts colorectal cancer liver metastasis recurrence and survival by modeling postoperative disease evolution through latent trajectory inference.

DetailsMotivation: Current prognostic approaches for colorectal cancer liver metastasis are limited by static, single-timepoint analysis that fails to capture tumor spatial distribution, longitudinal disease dynamics, and multimodal clinical information, resulting in poor predictive accuracy for heterogeneous outcomes.

Method: DyPro uses residual dynamic evolution to infer postoperative latent trajectories. It starts with an initial patient representation and generates a 12-step sequence of trajectory snapshots through autoregressive residual updates, then integrates these to predict recurrence and survival outcomes.

Result: On the MSKCC CRLM dataset, DyPro achieved strong discrimination with C-index of 0.755 for overall survival and 0.714 for disease-free survival, with OS AUC@1y of 0.920 and OS IBS of 0.143 under repeated stratified 5-fold cross-validation.

Conclusion: DyPro provides quantitative risk assessment to support adjuvant therapy planning and follow-up scheduling by capturing dynamic disease evolution, offering improved prognostic accuracy over static approaches for colorectal cancer liver metastasis management.

Abstract: Colorectal cancer liver metastasis (CRLM) exhibits high postoperative recurrence and pronounced prognostic heterogeneity, challenging individualized management. Existing prognostic approaches often rely on static representations from a single postoperative snapshot, and fail to jointly capture tumor spatial distribution, longitudinal disease dynamics, and multimodal clinical information, limiting predictive accuracy. We propose DyPro, a deep learning framework that infers postoperative latent trajectories via residual dynamic evolution. Starting from an initial patient representation, DyPro generates a 12-step sequence of trajectory snapshots through autoregressive residual updates and integrates them to predict recurrence and survival outcomes. On the MSKCC CRLM dataset, DyPro achieves strong discrimination under repeated stratified 5-fold cross-validation, reaching a C-index of 0.755 for OS and 0.714 for DFS, with OS AUC@1y of 0.920 and OS IBS of 0.143. DyPro provides quantitative risk cues to support adjuvant therapy planning and follow-up scheduling.
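
The residual dynamic evolution idea reduces to a small loop: repeatedly add a learned residual to the latent state, keep the 12 snapshots, and pool them for prediction. The sketch below shows that skeleton; the MLP update, latent size, and mean-pooling integration are all assumptions, not DyPro's published design.

```python
import torch
import torch.nn as nn

class ResidualTrajectory(nn.Module):
    """Skeleton of residual dynamic evolution (update MLP, latent size, and
    mean-pooling integration are all assumptions)."""
    def __init__(self, dim: int = 64, steps: int = 12):
        super().__init__()
        self.steps = steps
        self.update = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.risk = nn.Linear(dim, 1)          # e.g. a scalar survival risk score

    def forward(self, z0: torch.Tensor) -> torch.Tensor:
        z, snapshots = z0, []
        for _ in range(self.steps):
            z = z + self.update(z)             # autoregressive residual update
            snapshots.append(z)
        traj = torch.stack(snapshots, dim=1)   # (B, steps, dim) latent trajectory
        return self.risk(traj.mean(dim=1))     # integrate snapshots into one score

z0 = torch.randn(4, 64)                        # initial patient representations
print(ResidualTrajectory()(z0).shape)          # torch.Size([4, 1])
```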

[474] Weakly-supervised segmentation using inherently-explainable classification models and their application to brain tumour classification

Soumick Chatterjee, Hadya Yassin, Florian Dubost, Andreas Nürnberger, Oliver Speck

Main category: eess.IV

TL;DR: This paper introduces inherently explainable classifiers (GP-UNet, GP-ShuffleUNet, GP-ReconResNet) that generate localization heatmaps for brain tumor classification and weakly-supervised segmentation using only image-level labels.

DetailsMotivation: Address two key challenges in medical imaging: 1) opacity of "black-box" deep learning models hindering clinical trust, and 2) need for laborious pixel-wise annotations for segmentation tasks.

Method: Proposes three inherently explainable classifiers with global pooling mechanism that generate localization heatmaps directly influencing classification decisions. These heatmaps are thresholded for weakly-supervised segmentation using only image-level classification labels.

Result: Achieved peak F1-score of 0.93 for multi-class brain tumor classification, median Dice score of 0.728 for weakly-supervised segmentation, and 98.7% accuracy on tumor-only images (outperforming state-of-the-art glioma grading binary classifiers).

Conclusion: The framework successfully combines high diagnostic accuracy with essential transparency, offering trustworthy clinical decision support by providing inherent interpretability without unreliable post-hoc methods and enabling weakly-supervised segmentation.

Abstract: Deep learning has demonstrated significant potential in medical imaging; however, the opacity of “black-box” models hinders clinical trust, while segmentation tasks typically necessitate laborious, hard-to-obtain pixel-wise annotations. To address these challenges simultaneously, this paper introduces a framework for three inherently explainable classifiers (GP-UNet, GP-ShuffleUNet, and GP-ReconResNet). By integrating a global pooling mechanism, these networks generate localisation heatmaps that directly influence classification decisions, offering inherent interpretability without relying on potentially unreliable post-hoc methods. These heatmaps are subsequently thresholded to achieve weakly-supervised segmentation, requiring only image-level classification labels for training. Validated on two datasets for multi-class brain tumour classification, the proposed models achieved a peak F1-score of 0.93. For the weakly-supervised segmentation task, a median Dice score of 0.728 (95% CI 0.715-0.739) was recorded. Notably, on a subset of tumour-only images, the best model achieved an accuracy of 98.7%, outperforming state-of-the-art glioma grading binary classifiers. Furthermore, comparative Precision-Recall analysis validated the framework’s robustness against severe class imbalance, establishing a direct correlation between diagnostic confidence and segmentation fidelity. These results demonstrate that the proposed framework successfully combines high diagnostic accuracy with essential transparency, offering a promising direction for trustworthy clinical decision support. Code is available on GitHub: https://github.com/soumickmj/GPModels
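
The global-pooling trick that makes these classifiers inherently explainable is compact: a 1x1 convolution emits one evidence map per class, and global average pooling turns those maps into the classification logits, so the map that drives the decision doubles as a localisation map. The toy model below illustrates the mechanism with a stand-in encoder; it is not GP-UNet itself, and the mean-based segmentation threshold is an assumption.

```python
import torch
import torch.nn as nn

class GPClassifier(nn.Module):
    """Global-pooling classifier: the heatmap *is* the decision. The tiny
    encoder is a stand-in, not GP-UNet itself."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.heatmap = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor):
        maps = self.heatmap(self.encoder(x))   # (B, K, H, W) per-class evidence maps
        logits = maps.mean(dim=(2, 3))         # global average pooling -> class logits
        return logits, maps

x = torch.randn(1, 1, 128, 128)
logits, maps = GPClassifier()(x)
pred = logits.argmax(dim=1)
# Weakly-supervised segmentation: threshold the winning class's heatmap
# (a mean threshold is an assumption; the paper tunes this step).
seg = (maps[0, pred] > maps[0, pred].mean()).float()
print(logits.shape, seg.shape)
```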

[475] Ultra-Strong Gradient Diffusion MRI with Self-Supervised Learning for Prostate Cancer Characterization

Tanishq Patil, Snigdha Sen, Kieran G. Foley, Fabrizio Fasano, Chantal M. W. Tax, Derek K. Jones, Mara Cercignani, Marco Palombo, Paddy J. Slator, Eleftheria Panagiotaki

Main category: eess.IV

TL;DR: Physics-informed self-supervised VERDICT (ssVERDICT) with ultra-strong gradients enhances prostate microstructure characterization, outperforming conventional methods with 47% CNR boost and 52% reduced inter-patient variation.

DetailsMotivation: Conventional dMRI metrics like Apparent Diffusion Coefficient provide mixed tissue features rather than distinct histologic characteristics. Clinical gradient systems (40-80 mT/m) suffer from poor SNR at strong diffusion weightings due to prolonged echo times.

Method: Developed enhanced ssVERDICT fitting approaches using dense multilayer perceptron and convolutional U-Net architectures. Compared against non-linear least-squares VERDICT fitting, original ssVERDICT, and Diffusion Kurtosis Imaging across clinical to ultra-strong gradient systems.

Result: Dense ssVERDICT outperformed NLLS VERDICT with ultra-strong gradients: 47% boost in median CNR, 52% reduction in inter-patient Coefficient of Variation, and 50% reduction in pooled f_ic variation. Delivered highest CNR, most stable parameter estimates, and clearest tumour-normal contrast.

Conclusion: Meaningful gains in non-invasive prostate cancer characterization arise from combining advanced gradient systems with deep learning-based modelling. Ultra-strong gradients (300 mT/m) mitigate SNR limitations while physics-informed self-supervised fitting enhances microstructural insights.

Abstract: Diffusion MRI (dMRI) enables non-invasive assessment of prostate microstructure, but conventional dMRI metrics, such as the Apparent Diffusion Coefficient used in multiparametric MRI, reflect a mixture of underlying tissue features rather than distinct histologic characteristics. Integrating dMRI with the compartment-based biophysical VERDICT (Vascular, Extracellular, and Restricted Diffusion for Cytometry in Tumours) framework offers richer microstructural insights, though clinical gradient systems (40-80 mT/m) often suffer from poor signal-to-noise ratio at stronger diffusion weightings due to prolonged echo times. Ultra-strong gradients (e.g., 300 mT/m) can mitigate these limitations by improving SNR and contrast-to-noise ratios. This study investigates whether physics-informed self-supervised VERDICT (ssVERDICT) fitting, when combined with ultra-strong gradient data, enhances prostate microstructural characterization relative to current fitting approaches and clinical gradient systems. We developed enhanced ssVERDICT fitting approaches using dense multilayer perceptron and convolutional U-Net architectures, comparing them against non-linear least-squares (NLLS) VERDICT fitting, the original ssVERDICT implementation, and Diffusion Kurtosis Imaging across clinical- to ultra-strong gradient systems. For the same ultra-strong gradient data, Dense ssVERDICT outperformed NLLS VERDICT, boosting median CNR by 47%, cutting inter-patient Coefficient of Variation by 52%, and reducing pooled $f_{ic}$ variation by 50%. Overall, Dense ssVERDICT delivered the highest CNR, the most stable parameter estimates, and the clearest tumour-normal contrast compared with conventional fitting methods and clinical gradient systems. These findings underscore that meaningful gains in non-invasive prostate cancer characterization arise from the combination of advanced gradient systems and deep learning-based modelling.
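
Self-supervised model fitting in the ssVERDICT style needs no parameter labels: a network predicts tissue parameters, a differentiable forward model re-synthesises the signal, and the loss compares synthesis to measurement. The sketch below uses a biexponential stand-in for the full VERDICT model and an assumed b-value protocol; it is a minimal illustration of the training loop, not the paper's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bvals = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])   # assumed b-value protocol

def forward_model(f, d_fast=3.0, d_slow=0.5):
    # Biexponential stand-in for the VERDICT forward model (fixed diffusivities).
    return f * torch.exp(-bvals * d_fast) + (1.0 - f) * torch.exp(-bvals * d_slow)

# Synthetic "measured" signals with noise; f_true is never shown to the network.
f_true = torch.rand(512, 1)
signal = forward_model(f_true) + 0.02 * torch.randn(512, len(bvals))

net = nn.Sequential(nn.Linear(len(bvals), 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(500):
    f_hat = net(signal)                                  # predicted volume fraction
    loss = ((forward_model(f_hat) - signal) ** 2).mean() # signal-vs-signal loss
    opt.zero_grad()
    loss.backward()
    opt.step()

err = (net(signal) - f_true).abs().mean().item()
print(f"final loss {loss.item():.5f}, mean |f_hat - f_true| {err:.4f}")
```

Because the loss is defined in signal space, the network can be trained on the study data itself, which is what allows the fast amortised inference the ssVERDICT line of work reports.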

Last updated: 2026-01-27