Daily arXiv Papers - 2026-01-29

AI-enhanced summaries of 23 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

Shinwoo Park, Yo-Sub Han

Main category: cs.CL

TL;DR: Expert detection of AI-generated Korean text improves from 60% to 100% accuracy through structured calibration using a language-specific rubric, outperforming automated detectors.

Motivation: Distinguishing human-written Korean text from fluent LLM outputs is difficult even for trained linguists, who may over-trust surface well-formedness. The paper investigates whether expert detection can be treated as a learnable skill that can be improved through structured calibration.

Method: Introduces LREAD, a rubric derived from Korean writing standards targeting micro-level artifacts (punctuation optionality, spacing behavior, register shifts). Uses three-phase longitudinal blind protocol with Korean linguistics majors: Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays.

Result: Majority-vote accuracy increases from 60% to 100% across phases. Inter-annotator agreement improves dramatically (Fleiss’ kappa: -0.09 to 0.82). Calibrated humans rely more on language-specific micro-diagnostics not captured by coarse discourse priors used by state-of-the-art LLM detectors.
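
To make the agreement statistic concrete, here is a minimal sketch of computing Fleiss’ kappa with statsmodels; the ratings below are invented for illustration, not taken from the study.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical annotations: rows = texts, columns = annotators,
# values = 0 (human-written) or 1 (LLM-generated).
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])

table, _ = aggregate_raters(ratings)  # subjects x categories count table
print(fleiss_kappa(table, method="fleiss"))
```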

Conclusion: Rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings. The paper releases the full rubric and a taxonomy of calibrated detection signatures for Korean text analysis.

Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays. Across phases, majority-vote accuracy increases from 60% to 100%, accompanied by stronger inter-annotator agreement (Fleiss’ kappa: -0.09 → 0.82). Compared to state-of-the-art LLM detectors, calibrated humans rely more on language-specific micro-diagnostics that are not well captured by coarse discourse priors. Our findings suggest that rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings, and we release the full rubric and a taxonomy of calibrated detection signatures.

[2] Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

Maxwell Crouse, Ibrahim Abdelaziz, Kshitij Fadnis, Siva Sankalp Patel, Kinjal Basu, Chulaka Gunasekara, Sadhana Kumaravel, Asim Munawar, Pavan Kapanipathi

Main category: cs.CL

TL;DR: DiGiT-TC: A synthetic data generation method for multi-turn tool calling conversations that works without stateful execution environments, addressing real-world constraints like data security and distributed tool specifications.

Motivation: Existing synthetic data generation frameworks assume stateful execution environments to validate tool interactions, but this doesn't work in real-world settings with data security concerns or distributed tool specifications where such environments aren't available.

Method: DiGiT-TC generates tool calling conversations with characteristics similar to those from stateful environments using a novel generation pattern that implicitly represents certain tool calls in user requests, enabling validation without execution state.

Result: The approach shows strong performance gains on standard tool calling benchmarks, even in stateful problem settings, demonstrating effectiveness without requiring execution environments.

Conclusion: DiGiT-TC successfully addresses the gap in synthetic data generation for tool calling by enabling realistic conversation generation without stateful execution environments, making it applicable to real-world constrained settings.

Abstract: Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.

[3] Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

Paul Tarau

Main category: cs.CL

TL;DR: Arrow Language Model: A neural architecture derived from intuitionistic logic where tokens are encoded as left-nested implication chains, next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension via Curry-Howard correspondence.

Motivation: To develop an alternative neural architecture to Transformers and state-space models by grounding next-token prediction in intuitionistic logic principles, moving beyond additive embeddings and attention mechanisms to a more logically-founded approach.

Method: Encode prefixes as left-nested implication chains preserving order through non-commutative composition. Use Curry-Howard correspondence to treat sequence processing as constructive proof extension. Implement practical low-rank neural realization and validate properties with Prolog-based specialized theorem provers.
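
A hypothetical sketch of the token-as-operator idea under a low-rank assumption: each token selects a low-rank linear operator applied to the running state, so composition is non-commutative and order-preserving. This is one plausible reading of the construction, not the authors' exact parameterization.

```python
import torch

class TokenOperatorRNN(torch.nn.Module):
    """Each token acts as an operator A(x) = U diag(e_x) V applied to the
    state, giving a multiplicative (non-commutative) recurrence."""
    def __init__(self, vocab_size, d_state, rank):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, rank)  # token -> diagonal factor
        self.U = torch.nn.Parameter(torch.randn(d_state, rank) / rank**0.5)
        self.V = torch.nn.Parameter(torch.randn(rank, d_state) / d_state**0.5)

    def forward(self, tokens, h):
        for t in tokens:                      # left-nested: operators applied in order
            h = self.U @ (self.emb(t) * (self.V @ h))
        return h                              # final state can score the next token

model = TokenOperatorRNN(vocab_size=100, d_state=64, rank=16)
h = model(torch.tensor([3, 7, 2]), torch.randn(64))
print(h.shape)  # torch.Size([64])
```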

Result: Shows that a neural architecture equivalent to multiplicative RNNs emerges naturally from proof-theoretic interpretation of next-token prediction as nested intuitionistic implication. Validates relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices.

Conclusion: The Arrow Language Model provides a logic-based alternative to Transformer architectures, positioning itself as a novel approach to foundational models with connections to intuitionistic logic, state-space models, and multiplicative RNNs.

Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry–Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, among which are relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication; we present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.

[4] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

Songjun Tu, Yiwen Ma, Jiahao Lin, Qichao Zhang, Xiangyuan Lan, Junfeng Li, Nan Xu, Linjing Li, Dongbin Zhao

Main category: cs.CL

TL;DR: PaperAudit-Bench: A benchmark and framework for evaluating LLMs’ ability to detect subtle, distributed errors in academic papers through structured error detection and evidence-aware review generation.

Motivation: Large language models can generate fluent peer reviews but often lack critical rigor when dealing with subtle, distributed errors across papers. Current automated review systems struggle with identifying complex errors that require cross-section reasoning.

Method: Introduces PaperAudit-Bench with two components: (1) PaperAudit-Dataset - an error dataset covering both section-specific and cross-section errors for controlled evaluation in long-context settings; (2) PaperAudit-Review - an automated review framework combining structured error detection with evidence-aware review generation.

Result: Experiments show large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors. Incorporating explicit error detection produces systematically stricter and more discriminative evaluations compared to baseline methods. The dataset supports training lightweight LLM detectors via SFT and RL for effective error detection at reduced cost.

Conclusion: The PaperAudit-Bench framework demonstrates that structured error detection integrated into review workflows improves critical assessment quality for peer review, and the dataset enables efficient training of specialized error detection models.

Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.

[5] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao

Main category: cs.CL

TL;DR: PILOT is a framework that uses a lightweight hyper-network to generate latent guidance vectors, enabling smaller LLMs to internalize strategic planning capabilities from larger teacher models without runtime dependency.

Motivation: Smaller LLMs struggle with strategic planning for multi-step reasoning, leading to error propagation in long-horizon tasks. While they can perform better when conditioned on explicit plans from larger teacher models, runtime reliance on external guidance is impractical due to latency and availability constraints.

Method: PILOT employs a lightweight hyper-network that synthesizes query-conditioned latent guidance vectors. These vectors act as internal steering mechanisms to guide the model’s representations toward optimal reasoning paths without altering the backbone model weights.
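
The steering mechanism can be pictured as follows: a small hyper-network maps a pooled query embedding to a guidance vector added to the frozen backbone's hidden states. This is a hedged sketch of the idea; the layer sizes, the pooling, and the additive injection point are assumptions, not the paper's exact design.

```python
import torch

class GuidanceHyperNet(torch.nn.Module):
    """Maps a pooled query embedding to a query-conditioned latent
    guidance (steering) vector."""
    def __init__(self, d_query, d_hidden):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_query, d_hidden),
            torch.nn.Tanh(),
            torch.nn.Linear(d_hidden, d_hidden),
        )

    def forward(self, query_emb):
        return self.net(query_emb)

hyper = GuidanceHyperNet(d_query=768, d_hidden=768)
query_emb = torch.randn(1, 768)            # pooled embedding of the user query
guidance = hyper(query_emb)                # latent guidance vector
hidden = torch.randn(1, 10, 768)           # hidden states of the frozen backbone
steered = hidden + guidance.unsqueeze(1)   # steer representations, weights untouched
```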

Result: Extensive experiments on mathematical and coding benchmarks show PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

Conclusion: PILOT successfully bridges the gap between needing external strategic guidance and practical deployment by internalizing planning capabilities into smaller models, enabling them to perform better on complex reasoning tasks without runtime dependencies on larger models.

Abstract: Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model’s representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

[6] Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

Yitong Qiao, Licheng Pan, Yu Mi, Lei Liu, Yue Shen, Fei Sun, Zhixuan Chu

Main category: cs.CL

TL;DR: Proposes LSC (Lowest Span Confidence), a zero-shot hallucination detection metric that uses sliding window n-gram probabilities to identify factual inconsistencies with minimal resource requirements.

Motivation: Existing hallucination detection methods require unrealistic assumptions such as intensive sampling or white-box LLM access, making them inefficient for API-based scenarios; efficient detection under minimal resource constraints is needed.

Method: LSC evaluates joint likelihood of semantically coherent spans via sliding window mechanism. Identifies regions of lowest marginal confidence across variable-length n-grams to capture local uncertainty patterns correlated with factual inconsistency.
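
From this description, an LSC-style score can be sketched in a few lines: take the lowest length-normalized joint log-likelihood over all sliding n-gram windows. The window bounds and normalization here are assumptions, not the paper's exact definition.

```python
import math

def lowest_span_confidence(token_logprobs, min_n=2, max_n=5):
    """Lowest mean token log-probability over any contiguous n-gram span;
    lower values flag likely hallucinated regions."""
    worst = math.inf
    T = len(token_logprobs)
    for n in range(min_n, max_n + 1):
        for i in range(T - n + 1):
            span = token_logprobs[i:i + n]
            worst = min(worst, sum(span) / n)  # length-normalized joint likelihood
    return worst

# Token log-probs from a single forward pass of a black-box API
print(lowest_span_confidence([-0.1, -0.2, -3.5, -4.1, -0.3]))
```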

Result: LSC consistently outperforms existing zero-shot baselines across multiple SOTA LLMs and diverse benchmarks. Mitigates dilution effect of perplexity and noise sensitivity of minimum token probability, offering robust factual uncertainty estimation.

Conclusion: LSC provides efficient, resource-constrained hallucination detection using only single forward pass with output probabilities, addressing limitations of existing methods in practical deployment scenarios.

Abstract: Hallucinations in Large Language Models (LLMs), i.e., the tendency to generate plausible but non-factual content, pose a significant challenge for their reliable deployment in high-stakes environments. However, existing hallucination detection methods generally operate under unrealistic assumptions, i.e., either requiring expensive intensive sampling strategies for consistency checks or white-box LLM states, which are unavailable or inefficient in common API-based scenarios. To this end, we propose a novel efficient zero-shot metric called Lowest Span Confidence (LSC) for hallucination detection under minimal resource assumptions, only requiring a single forward with output probabilities. Concretely, LSC evaluates the joint likelihood of semantically coherent spans via a sliding window mechanism. By identifying regions of lowest marginal confidence across variable-length n-grams, LSC could well capture local uncertainty patterns strongly correlated with factual inconsistency. Importantly, LSC can mitigate the dilution effect of perplexity and the noise sensitivity of minimum token probability, offering a more robust estimate of factual uncertainty. Extensive experiments across multiple state-of-the-art (SOTA) LLMs and diverse benchmarks show that LSC consistently outperforms existing zero-shot baselines, delivering strong detection performance even under resource-constrained conditions.

[7] FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

Junseok Lee, Nahoon Kim, Sangyong Lee, Chang-Jae Chun

Main category: cs.CL

TL;DR: ASKD adaptively reduces teacher dependence to improve student generalization, applied to create FastWhisper which outperforms Whisper with 1.07% lower WER and 5x faster inference.

Motivation: Standard knowledge distillation can cause students to inherit teacher shortcomings, reducing generalization capacity; this must be mitigated while maintaining the benefits of compression.

Method: Propose Adaptive Self-Knowledge Distillation (ASKD) that dynamically reduces teacher dependence and uses self-knowledge distillation to improve student generalization. Applied to distill Whisper into FastWhisper.
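
One way the objective might look, sketched under stated assumptions: a teacher-KD term whose weight decays over training while a self-distillation term (here, against the student's own earlier or EMA predictions, `self_logits`) takes over. The linear schedule and the exact self-distillation target are guesses, not the paper's formulation.

```python
import torch.nn.functional as F

def askd_loss(student_logits, teacher_logits, self_logits, labels,
              step, total_steps, tau=2.0):
    """ASKD-style objective sketch: cross-entropy plus KD terms whose
    balance shifts from teacher to self over training."""
    alpha = 1.0 - step / total_steps  # shrinking dependence on the teacher
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    self_kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                       F.softmax(self_logits / tau, dim=-1),
                       reduction="batchmean") * tau ** 2
    return ce + alpha * kd + (1.0 - alpha) * self_kd
```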

Result: FastWhisper achieved a word error rate 1.07% lower than the teacher Whisper, with 5x faster inference in the post-training setting.

Conclusion: ASKD effectively addresses teacher shortcomings inheritance in distillation, enabling creation of compressed models (FastWhisper) that outperform teachers in both accuracy and speed.

Abstract: Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on having the student model effectively learn the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces dependence on the teacher model to improve self-training capacity and applies self-knowledge distillation to improve the generalization capacity of the student model. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than that of the teacher model Whisper, with inference 5 times faster.

[8] A Dialectic Pipeline for Improving LLM Robustness

Sara Candussio

Main category: cs.CL

TL;DR: A dialectic pipeline using self-dialogue enables LLMs to reflect on and correct their own wrong answers without requiring fine-tuning or separate verifiers, preserving generalization while improving output quality.

Motivation: Current methods to reduce LLM hallucinations (fine-tuning, separate verifiers) require heavy computational resources and constrain models to specific domains, making them impractical for many user applications while limiting generalization.

Method: A dialectic pipeline where LLMs engage in self-dialogue to reflect upon and correct tentative wrong answers, enriched with relevant context in an oracle-RAG setting, with studies on context summarization and filtering.
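
The loop itself is simple to sketch. `llm(prompt)` below is a hypothetical completion function, and the prompt wording is invented; only the draft-critique-revise structure and the oracle-RAG context come from the summary.

```python
def dialectic_answer(llm, question, context, rounds=2):
    """Self-dialogue: draft an answer, critique it, revise it."""
    answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    for _ in range(rounds):
        critique = llm(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            f"Tentative answer: {answer}\n"
            "List any factual errors or unsupported claims:"
        )
        answer = llm(
            f"Context:\n{context}\n\nQuestion: {question}\n"
            f"Tentative answer: {answer}\nCritique: {critique}\n"
            "Write a corrected final answer:"
        )
    return answer
```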

Result: The dialectic pipeline significantly outperforms standard model answers and consistently achieves higher performance than Chain-of-Thought prompting across different datasets and model families.

Conclusion: Self-dialogue through a dialectic pipeline effectively reduces hallucinations and improves answer quality while preserving LLMs’ generalization abilities, offering a practical alternative to resource-intensive methods.

Abstract: Assessing ways in which Language Models can reduce their hallucinations and improve the quality of their outputs is crucial to ensure their large-scale use. However, methods such as fine-tuning on domain-specific data or the training of a separate ad hoc verifier require demanding computational resources (not feasible for many user applications) and constrain the models to specific fields of knowledge. In this thesis, we propose a dialectic pipeline that preserves LLMs’ generalization abilities while improving the quality of their answers via self-dialogue, enabling them to reflect upon and correct tentative wrong answers. We experimented with different pipeline settings, testing our proposed method on different datasets and on different families of models. All the pipeline stages are enriched with the relevant context (in an oracle-RAG setting), and a study on the impact of its summarization or filtering is conducted. We find that our proposed dialectic pipeline outperforms the standard model answers by significant margins and consistently achieves higher performance than Chain-of-Thought-only prompting.

[9] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

Main category: cs.CL

TL;DR: Simple modifications to multi-agent debate (MAD) - diversity-aware initialization and confidence-modulated updates - significantly improve LLM performance over vanilla MAD and majority voting across reasoning benchmarks.

Motivation: Vanilla multi-agent debate often underperforms simple majority vote despite higher computational cost, and theoretical analysis shows it cannot reliably improve outcomes under homogeneous agents and uniform belief updates. The paper aims to bridge insights from human deliberation to improve LLM-based debate.

Method: Two lightweight interventions: 1) Diversity-aware initialization that selects more diverse candidate answers to increase likelihood of correct hypothesis at debate start. 2) Confidence-modulated debate protocol where agents express calibrated confidence and condition updates on others’ confidence levels.
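
As a toy illustration of the second intervention, a confidence-weighted tally is shown below; the actual protocol conditions each agent's belief update on peers' stated confidence rather than aggregating once, so treat this as a simplified stand-in.

```python
from collections import Counter

def confidence_weighted_consensus(answers, confidences):
    """Weight each agent's answer by its calibrated confidence."""
    scores = Counter()
    for ans, conf in zip(answers, confidences):
        scores[ans] += conf
    return scores.most_common(1)[0][0]

# One confident dissenter can outweigh two hesitant agents.
print(confidence_weighted_consensus(["A", "B", "A"], [0.3, 0.9, 0.3]))  # "B"
```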

Result: Theoretical analysis shows diversity-aware initialization improves prior probability of MAD success, while confidence-modulated updates enable systematic drift to correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, the methods consistently outperform vanilla MAD and majority vote.

Conclusion: Simple, principled modifications inspired by human deliberation mechanisms (diversity of viewpoints and calibrated confidence communication) can substantially enhance debate effectiveness in LLM-based multi-agent systems, connecting insights from human collective decision-making to AI debate protocols.

Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others’ confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.

[10] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee

Main category: cs.CL

TL;DR: HEART framework compares humans and LLMs on emotional support conversations using blinded human raters and LLM-as-judge evaluators across five interpersonal dimensions.

Motivation: Current language models lack clear evaluation of interpersonal skills like emotional support, empathy, and conversational adaptation compared to humans. There's a need for a standardized framework to assess these affective capabilities.

Method: HEART framework pairs human and model responses to the same dialogue histories, evaluates them through blinded human raters and an ensemble of LLM-as-judge evaluators using a rubric based on interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following.

Result: Several frontier models approach or surpass average human responses in perceived empathy and consistency. Humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, especially in adversarial turns. Human and LLM-as-judge preferences align on about 80% of comparisons, matching inter-human agreement.

Conclusion: HEART reframes supportive dialogue as a distinct capability axis separate from general reasoning or linguistic fluency, providing empirical foundation for understanding where model-generated support aligns with human social judgment and how affective conversational competence scales with model size.

Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.

[11] Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

Main category: cs.CL

TL;DR: Table-BiEval: A human-free self-supervised framework for evaluating LLMs’ ability to translate natural language into structured formats and tabular data into machine-readable specifications, using deterministic Intermediate Representations to measure structural fidelity.

Motivation: Current evaluations lack effective methodologies to measure structural fidelity in LLM outputs without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs when LLMs translate natural language to structured formats for tool invocation and tabular data processing.

Method: Proposes Table-BiEval, a human-free self-supervised evaluation framework that leverages deterministic Intermediate Representations to calculate Content Semantic Accuracy and Normalized Tree Edit Distance, decoupling structure from content. Empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables.
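
Normalized Tree Edit Distance can be illustrated with the zss (Zhang-Shasha) package; normalizing by the larger tree's node count is an assumption here, and the toy intermediate representations are invented.

```python
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children)

def normalized_ted(t1, t2):
    """Tree edit distance scaled by the larger tree's size (assumed norm)."""
    return simple_distance(t1, t2) / max(tree_size(t1), tree_size(t2))

# Hypothetical intermediate representations of a gold and a generated table
gold = Node("table").addkid(Node("row").addkid(Node("cell")))
pred = Node("table").addkid(Node("row"))
print(normalized_ted(gold, pred))  # 1 deletion / 3 nodes ~ 0.33
```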

Result: Results reveal substantial variability in LLM performance, showing that mid-sized models can surprisingly outperform larger counterparts in structural efficiency. Deep recursive nesting remains a universal bottleneck for current architectures across all evaluated models.

Conclusion: Table-BiEval provides an effective, automated framework for evaluating structural fidelity in LLM outputs, revealing important insights about model capabilities and limitations in handling structured data transformations, with implications for developing more robust autonomous agents.

Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised evaluation framework, to assess LLM performance quantitatively. By leveraging deterministic Intermediate Representations, our framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Also, it empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.

[12] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge

Main category: cs.CL

TL;DR: OPT-ENGINE is a benchmark framework to evaluate LLMs on optimization modeling with scalable difficulty, revealing that tool-integrated reasoning with external solvers outperforms pure-text reasoning as complexity increases, and constraint formulation is the main bottleneck.

Motivation: While LLMs show progress in optimization modeling, their capabilities in automated formulation and problem-solving for complex real-world tasks remain poorly understood. There's a need to systematically evaluate LLM performance boundaries in optimization tasks with controllable difficulty levels.

Method: Proposed OPT-ENGINE, an extensible benchmark framework with 10 canonical tasks across operations research (5 Linear Programming and 5 Mixed-Integer Programming). Used this framework to study LLM reasoning capabilities, focusing on generalization to out-of-distribution tasks and identifying performance bottlenecks across problem interpretation to solution generation stages.
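
To ground what "tool-integrated reasoning with an external solver" means in practice, here is a toy Linear Program of the kind a model would emit and execute, using PuLP with its bundled CBC solver; the problem itself is illustrative and unrelated to the benchmark tasks.

```python
import pulp  # interface to external LP/MIP solvers

prob = pulp.LpProblem("toy_production", pulp.LpMaximize)
x = pulp.LpVariable("x", lowBound=0)
y = pulp.LpVariable("y", lowBound=0)
prob += 3 * x + 2 * y        # objective: maximize profit
prob += x + y <= 4           # resource constraint 1
prob += x + 3 * y <= 6       # resource constraint 2
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(prob.objective), x.value(), y.value())  # 12.0 4.0 0.0
```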

Result: Two key findings: 1) Tool-integrated reasoning with external solvers shows significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a performance ceiling; 2) Automated formulation of constraints constitutes the primary performance bottleneck in optimization modeling.

Conclusion: The findings provide actionable guidance for developing next-generation LLMs for advanced optimization, highlighting the importance of tool integration and addressing constraint formulation challenges. The OPT-ENGINE framework enables systematic evaluation of LLM capabilities in optimization modeling.

Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs’ reasoning capabilities, addressing two critical questions: 1) Does LLM performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at https://github.com/Cardinal-Operations/OPTEngine.

[13] Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

Yinuo Liu, Emre Sezgin, Eric A. Youngstrom

Main category: cs.CL

TL;DR: LLMs (ChatGPT-5, Gemini-3-Pro, Claude-Sonnet-4.5) show moderate agreement with human reviewers on objective abstract evaluation criteria but weaker performance on subjective dimensions, suggesting AI can assist but not replace human expertise in scientific review.

Motivation: To investigate whether large language models can reliably assess complex academic content and assist in scientific review processes by evaluating their consistency and reliability compared to human reviewers.

Method: Evaluated 160 conference abstracts using three LLMs (ChatGPT-5, Gemini-3-Pro, Claude-Sonnet-4.5) and 14 human reviewers with a standardized rubric. Analyzed composite score distributions, calculated inter-rater reliability using ICCs, and examined Bland-Altman plots for agreement patterns and systematic bias.
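
A minimal sketch of the concordance computation with the pingouin package; the long-format scores below are invented for illustration.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: each abstract scored by each rater.
df = pd.DataFrame({
    "abstract": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":    ["llm", "human"] * 4,
    "score":    [7.5, 8.0, 5.0, 4.5, 9.0, 8.5, 6.0, 6.5],
})
icc = pg.intraclass_corr(data=df, targets="abstract",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```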

Result: LLMs showed good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude achieved moderate agreement with humans on objective criteria (ICCs ~.45-.60) but only fair agreement on subjective dimensions (ICCs: 0.23-0.38). Gemini showed fair agreement on half criteria and no reliability on impact/applicability. All LLMs had acceptable mean differences from human scores.

Conclusion: LLMs can process abstracts in batches with moderate agreement on objective criteria and apply rubrics consistently at scale, but weaker performance on subjective dimensions indicates AI should serve a complementary role while human expertise remains essential for scientific evaluation.

Abstract: Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLM’s potential in assisting scientific review, this study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5’s consistency and reliability in evaluating abstracts compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs ~.45-.60 for composite, impression, clarity, objective, and results. They exhibited fair agreement on subjective dimensions, with ICC ranging from 0.23-0.38 for impact, engagement, and applicability. Gemini showed fair agreement on half criteria and no reliability on impact and applicability. Three LLMs showed acceptable or negligible mean difference (ChatGPT=0.24, Gemini=0.42, Claude=-0.02) from the human mean composite scores. Discussion: LLMs could process abstracts in batches with moderate agreement with human experts on overall quality and objective criteria. With appropriate process architecture, they can apply a rubric consistently across volumes of abstracts exceeding feasibility for a human rater. The weaker performance on subjective dimensions indicates that AI should serve a complementary role in evaluation, while human expertise remains essential.

[14] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

Zilai Wang, Natarajan Balaji Shankar, Kaiyuan Zhang, Zihan Wang, Abeer Alwan

Main category: cs.CL

TL;DR: Delta SSL embeddings (differences between finetuned and pretrained embeddings) improve child ASR when fused with other SSL models, achieving state-of-the-art results on MyST corpus.

Motivation: Child ASR remains challenging due to limited data and domain mismatch in SSL pretraining. Fine-tuning SSL models on child speech causes representation shifts, and the authors hypothesize that delta embeddings encode task-specific information that can complement other SSL features.

Method: Propose using delta SSL embeddings (differences between embeddings from finetuned and pretrained models) and evaluate multiple fusion strategies on the MyST children’s corpus with different SSL models (HuBERT, W2V2, WavLM).
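
The core operation is a subtraction followed by fusion; a sketch under stated assumptions (concatenation is just one of the fusion strategies evaluated, and the frame and feature dimensions are invented):

```python
import torch

def delta_fusion(wavlm_feats, w2v2_finetuned, w2v2_pretrained):
    """Delta SSL embeddings (finetuned minus pretrained, same audio)
    concatenated with another SSL model's finetuned features."""
    delta = w2v2_finetuned - w2v2_pretrained  # task-specific representation shift
    return torch.cat([wavlm_feats, delta], dim=-1)

T, d = 200, 768  # frames x feature dim (assumed)
fused = delta_fusion(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d))
print(fused.shape)  # torch.Size([200, 1536])
```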

Result: Delta embedding fusion with WavLM yields up to 10% relative WER reduction for HuBERT and 4.4% reduction for W2V2 compared to finetuned embedding fusion. Fusing WavLM with delta W2V2 embeddings achieves WER of 9.64, setting new SOTA among SSL models on MyST corpus.

Conclusion: Delta embeddings are effective for child ASR, and feature fusion represents a promising direction for advancing child speech recognition performance.

Abstract: Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a finetuned model and those from its pretrained counterpart, encode task-specific information that complements finetuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to finetuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.

[15] Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

Husein Zolkepli

Main category: cs.CL

TL;DR: Improved X-Codec-2.0 by reducing latent rate from 50Hz to 25Hz and increasing output sampling rate from 16kHz to 24kHz through simple pooling and decoder hop size modifications, achieving better efficiency and audio quality.

Motivation: X-Codec-2.0's original configuration at 50Hz latent rate and 16kHz sampling rate limits temporal efficiency and audio fidelity, creating a need for improvements in both efficiency and perceptual quality.

Method: Introduces additional pooling and increases decoder hop size to reduce latent rate from 50Hz to 25Hz while simultaneously raising output sampling rate from 16kHz to 24kHz, without altering the core architecture.
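
One plausible reading of the "additional pooling", sketched in PyTorch; the kernel and stride are assumptions, and the released checkpoints define the actual operator.

```python
import torch
import torch.nn.functional as F

# Halve the latent rate by average-pooling adjacent frames (50 Hz -> 25 Hz);
# a doubled decoder hop size then reconstructs audio at 24 kHz.
latents_50hz = torch.randn(1, 512, 100)  # (batch, channels, frames)
latents_25hz = F.avg_pool1d(latents_50hz, kernel_size=2, stride=2)
print(latents_25hz.shape)  # torch.Size([1, 512, 50])
```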

Result: Achieves 0.29 MOS improvement over original X-Codec-2.0 baseline on multilingual Common Voice 17 test set using UTMOSv2, and attains best reported performance among all codecs operating at 25Hz.

Conclusion: Simple architectural modifications can significantly improve both efficiency (lower latent rate) and perceptual quality (higher sampling rate) in neural audio codecs, demonstrating a practical approach to enhancing existing models.

Abstract: X-Codec-2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits temporal efficiency and audio fidelity. In this work, we explore a simple and effective modification by introducing additional pooling and increasing the decoder hop size. This reduces the latent rate from 50 Hz to 25 Hz and simultaneously raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration achieves a 0.29 MOS improvement over the original X-Codec-2.0 baseline based on UTMOSv2, and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and generation comparisons are released at https://huggingface.co/Scicom-intl/xcodec2-25TPS-24k.

[16] The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

Main category: cs.CL

TL;DR: Systematic review of 337 papers on Transformer language models’ syntactic abilities shows over-focus on English, BERT, and easy phenomena like part-of-speech, with weaker performance on syntax-semantics interface tasks.

Motivation: To systematically evaluate the current state of research on Transformer-based language models' syntactic abilities, identify biases and limitations in existing studies, and provide recommendations for more comprehensive future research.

Method: Conducted systematic review of 337 articles analyzing 1,015 model results across various syntactic phenomena and interpretability methods, examining patterns in languages studied, models used, and phenomena investigated.

Result: Current research shows healthy methodological variety but over-focuses on English, BERT, and easy phenomena (part-of-speech, agreement). Models perform well on form-oriented phenomena but show variable/weaker performance on syntax-semantics interface tasks like binding and filler-gap dependencies.

Conclusion: Future work should report complete data, better align theoretical constructs across studies, increase mechanistic methods, and broaden scope to include more languages and complex linguistic phenomena beyond the current narrow focus.

Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). Results also suggest that TLMs capture these form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface, like binding or filler-gap dependencies. We provide recommendations for future work, in particular reporting complete data, better aligning theoretical constructs and methods across studies, increasing the use of mechanistic methods, and broadening the empirical scope regarding languages and linguistic phenomena.

[17] MiLorE-SSL: Scaling Multilingual Capabilities in Self-Supervised Models without Forgetting

Jing Xu, Minglin Wu, Xueyuan Chen, Xixin Wu, Helen Meng

Main category: cs.CL

TL;DR: MiLorE-SSL: A lightweight framework combining LoRA modules with soft mixture-of-experts for efficient continual multilingual SSL training, using limited replay data to mitigate forgetting.

Motivation: Multilingual SSL models are limited to languages seen during pretraining, and retraining from scratch for new languages is computationally expensive. Sequential training often causes catastrophic forgetting of previously learned languages.

Method: Combines LoRA (low-rank adaptation) modules with a soft mixture-of-experts mechanism for efficient continual multilingual training. Uses limited replay data from existing languages to mitigate forgetting without needing large historical corpora.
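
A sketch of the combination on a single frozen linear layer, under stated assumptions (the router input, expert count, and rank are invented; the paper's wiring may differ):

```python
import torch

class SoftMoELoRA(torch.nn.Module):
    """Frozen base layer plus a soft mixture of LoRA experts: a router
    produces soft weights and the low-rank updates are blended."""
    def __init__(self, base: torch.nn.Linear, n_experts=4, rank=8):
        super().__init__()
        d_in, d_out = base.in_features, base.out_features
        self.base = base.requires_grad_(False)  # pretrained weights stay frozen
        self.A = torch.nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = torch.nn.Linear(d_in, n_experts)

    def forward(self, x):
        w = torch.softmax(self.router(x), dim=-1)          # (batch, experts)
        low_rank = torch.einsum("bd,edr,ero->beo", x, self.A, self.B)
        return self.base(x) + (w.unsqueeze(-1) * low_rank).sum(dim=1)

layer = SoftMoELoRA(torch.nn.Linear(256, 256))
print(layer(torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```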

Result: Achieves strong performance on new languages and improves ability in existing ones with only 2.14% trainable parameters, as demonstrated on ML-SUPERB benchmark.

Conclusion: MiLorE-SSL provides an efficient solution for continual multilingual SSL training that minimizes forgetting while maintaining strong performance across languages with minimal parameter overhead.

Abstract: Self-supervised learning (SSL) has greatly advanced speech representation learning, but multilingual SSL models remain constrained to languages encountered during pretraining. Retraining from scratch to incorporate new languages is computationally expensive, while sequential training without mitigation strategies often leads to catastrophic forgetting. To address this, we propose MiLorE-SSL, a lightweight framework that combines LoRA modules with a soft mixture-of-experts (MoE) mechanism for efficient continual multilingual training. LoRA provides efficient low-rank adaptation, while soft MoE promotes flexible expert sharing across languages, reducing cross-lingual interference. To further mitigate forgetting, we introduce limited replay data from existing languages, avoiding reliance on large historical corpora. Experiments on ML-SUPERB demonstrate that MiLorE-SSL achieves strong performance in new languages and improves performance in existing ones with only 2.14% trainable parameters.

[18] Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

Yuqing Zhao, Ziyao Liu, Yongsen Zheng, Kwok-Yan Lam

Main category: cs.CL

TL;DR: Survey paper on attribution-based techniques for mitigating hallucinations in Retrieval-Augmented Generation (RAG) systems, providing taxonomy, unified pipeline, and comparative analysis.

Motivation: LLM-based QA systems suffer from hallucinations, and RAG frameworks introduce new hallucination types. Current attribution techniques lack unified taxonomy, systematic comparison, and clear guidelines for practitioners.

Method: Survey methodology: (1) outlines taxonomy of hallucination types in RAG systems, (2) presents unified pipeline for attribution techniques, (3) reviews techniques based on targeted hallucinations, (4) discusses strengths/weaknesses with practical guidelines.

Result: Provides comprehensive framework for understanding and addressing hallucinations in RAG systems through attribution techniques, enabling better selection of solutions based on hallucination types and application context.

Conclusion: The survey fills gaps in current research by offering systematic taxonomy, unified pipeline, and practical guidelines for attribution techniques in RAG systems, providing valuable insights for both future research and practical applications.

Abstract: Large Language Models (LLMs)-based question answering (QA) systems play a critical role in modern AI, demonstrating strong performance across various tasks. However, LLM-generated responses often suffer from hallucinations, unfaithful statements lacking reliable references. Retrieval-Augmented Generation (RAG) frameworks enhance LLM responses by incorporating external references but also introduce new forms of hallucination due to complex interactions between the retriever and generator. To address these challenges, researchers have explored attribution-based techniques that ensure responses are verifiably supported by retrieved content. Despite progress, a unified pipeline for these techniques, along with a clear taxonomy and systematic comparison of their strengths and weaknesses, remains lacking. A well-defined taxonomy is essential for identifying specific failure modes within RAG systems, while comparative analysis helps practitioners choose appropriate solutions based on hallucination types and application context. This survey investigates how attribution-based techniques are used within RAG systems to mitigate hallucinations and addresses the gap by: (i) outlining a taxonomy of hallucination types in RAG systems, (ii) presenting a unified pipeline for attribution techniques, (iii) reviewing techniques based on the hallucinations they target, and (iv) discussing strengths and weaknesses with practical guidelines. This work offers insights for future research and practical use of attribution techniques in RAG systems.

[19] Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures

Yi Hu, Jiaqi Gu, Ruxin Wang, Zijun Yao, Hao Peng, Xiaobao Wu, Jianhui Chen, Muhan Zhang, Liangming Pan

Main category: cs.CL

TL;DR: A comprehensive survey paper on mechanistic understanding of Large Reasoning Models (LRMs), covering training dynamics, reasoning mechanisms, and unintended behaviors to bridge black-box performance with mechanistic transparency.

Motivation: While RL-powered Large Reasoning Models have achieved impressive performance, understanding their internal mechanisms has become a critical research frontier. The paper aims to bridge the gap between black-box performance and mechanistic transparency by synthesizing recent findings.

Method: The paper organizes recent mechanistic understanding research into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. It synthesizes findings across these areas through comprehensive survey methodology.

Result: The survey provides a structured framework for understanding LRM mechanisms, identifies key research findings across the three dimensions, and highlights current limitations in mechanistic understanding of these models.

Conclusion: The paper outlines a roadmap for future mechanistic studies, emphasizing the need for applied interpretability, improved methodologies, and a unified theoretical framework to advance understanding of Large Reasoning Models.

Abstract: Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. Finally, we discuss under-explored challenges to outline a roadmap for future mechanistic studies, including the need for applied interpretability, improved methodologies, and a unified theoretical framework.

[20] Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

David Linus Ostby

Main category: cs.CL

TL;DR: Stingy Context achieves 18:1 compression for LLM coding tasks using hierarchical tree-based compression, reducing 239k tokens to 11k while maintaining 94-97% success on real-world issues.

Motivation: Large language models face context window limitations for auto-coding tasks, especially with large codebases. Current methods suffer from lost-in-the-middle effects and inefficient context usage, requiring better compression techniques.

Method: Hierarchical tree-based compression scheme using TREEFRAG exploit decomposition. The method organizes code into a tree structure and applies compression to reduce token count while preserving essential information for coding tasks.

Result: Achieves 18:1 compression ratio, reducing a 239k token codebase to 11k tokens. Across 12 Frontier models, maintains 94-97% success rate on 40 real-world issues, outperforming flat compression methods and mitigating lost-in-the-middle effects.

Conclusion: Stingy Context provides an effective hierarchical compression approach for LLM auto-coding, enabling efficient context usage while preserving task fidelity, making large-scale code analysis more practical and cost-effective.

Abstract: We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 Frontier models show 94 to 97% success on 40 real-world issues at low cost, outperforming flat methods and mitigating lost-in-the-middle effects.

[21] SDUs DAISY: A Benchmark for Danish Culture

Jacob Nielsen, Stine L. Beltoft, Peter Schneider-Kamp, Lukas Galke Poech

Main category: cs.CL

TL;DR: Daisy is a new Danish cultural heritage benchmark based on the Danish Culture Canon 2006, featuring 741 human-verified question-answer pairs covering artifacts from 1300 BC to contemporary times.

Motivation: To create a comprehensive benchmark for evaluating cultural knowledge of Danish heritage, addressing both mainstream and in-depth aspects of Danish culture as defined by the official Culture Canon.

Method: Used the Danish Culture Canon 2006 as source, queried Wikipedia pages for each artifact, had language models generate random questions, implemented sampling strategy with mix of central/peripheral questions, and conducted human verification/correction of all question-answer pairs.

Result: Created Daisy dataset with 741 close-ended question-answer pairs covering 1300 BC archaeological findings, 1700s poems and musical pieces, to contemporary pop music, design, and architecture from 130 artifacts.

Conclusion: Daisy provides a valuable benchmark for assessing cultural knowledge of Danish heritage, spanning historical periods and cultural domains, with human-verified quality and balanced coverage of mainstream and niche cultural aspects.

Abstract: We introduce Daisy, a new benchmark for Danish culture via cultural heritage, based on the curated topics of the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy with a mix of central and peripheral questions for each work, capturing not only mainstream information but also the in-depth cornerstones of Danish cultural heritage as defined by the Canon committee. Each question-answer pair is human-approved or corrected in the final dataset, which consists of 741 close-ended question-answer pairs covering topics from archaeological findings from 1300 BC and poems and musical pieces from the 1700s to contemporary pop music and Danish design and architecture.

[22] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity

Sebastien Kawada, Dylan Holyoak

Main category: cs.CL

TL;DR: Hybrid neuro-symbolic system for narrative story similarity that combines neural self-consistency voting with symbolic tiebreaker ensemble for ambiguous cases.

DetailsMotivation: To address genuinely ambiguous narrative comparisons where neural models struggle, by creating a system that can selectively defer to symbolic methods when neural predictions are uncertain.

Method: Cascade architecture: 1) Neural component uses LLM with multiple parallel votes and supermajority threshold, 2) Escalates uncertain cases to additional voting rounds, 3) For perfect ties, uses symbolic ensemble with five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, narrative tension curves).
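
A minimal sketch of the cascade logic (vote counts, the threshold, and the tiebreaker stub are assumptions, not the authors' code):

```python
import random
from collections import Counter

def symbolic_tiebreak() -> str:
    # Stand-in for the five-signal symbolic ensemble (lexical overlap,
    # embeddings, story grammar, event chains, tension curves).
    return "story_a"

def cascade_decide(vote_fn, votes_per_round=9, supermajority=2/3, max_rounds=2):
    """Self-consistency voting with escalation; defer to the symbolic
    ensemble only on a perfect tie."""
    votes = []
    for _ in range(max_rounds):
        votes += [vote_fn() for _ in range(votes_per_round)]
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= supermajority:
            return label                     # confident neural decision
    top = Counter(votes).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return symbolic_tiebreak()           # perfect tie -> symbolic path
    return top[0][0]

print(cascade_decide(lambda: random.choice(["story_a", "story_b"])))
```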

Result: Achieves 81% accuracy on the development set for SemEval-2026 Task 4 on Narrative Story Similarity.

Conclusion: Selective deferral to symbolic methods can enhance neural predictions on genuinely ambiguous narrative comparisons, demonstrating the value of hybrid neuro-symbolic approaches for complex narrative analysis tasks.

Abstract: We present a hybrid neuro-symbolic system for the SemEval-2026 Task 4 on Narrative Story Similarity. Our approach combines neural self-consistency voting with a novel Multi-Scale Narrative Analysis Ensemble that operates as a symbolic tiebreaker. The neural network component uses a large language model with multiple parallel votes, applying a supermajority threshold for confident decisions and escalating uncertain cases to additional voting rounds. When votes result in a perfect tie, a symbolic ensemble combining five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) provides the final decision. Our cascade architecture achieves 81% accuracy on the development set, demonstrating that selective deferral to symbolic methods can enhance neural predictions on genuinely ambiguous narrative comparisons.

[23] “Newspaper Eat” Means “Not Tasty”: A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

Ruyuan Wan, Changye Li, Ting-Hao ‘Kenneth’ Huang

Main category: cs.CL

TL;DR: CodedLang: A dataset of 7,744 Chinese Google Maps reviews with 900 span-level coded language annotations, introducing a taxonomy and benchmarks showing current language models struggle with coded language understanding.

DetailsMotivation: Coded language is important in human communication but current language models handle it poorly. Progress has been limited by lack of real-world datasets and clear taxonomies for studying coded language phenomena.

Method: Created CodedLang dataset from Chinese Google Maps reviews with span-level annotations. Developed seven-class taxonomy capturing encoding strategies like phonetic, orthographic, and cross-lingual substitutions. Benchmarked models on detection, classification, and rating prediction tasks.

Result: Even strong language models fail to identify or understand coded language. Many coded expressions rely on pronunciation-based strategies, as revealed by phonetic analysis comparing coded and decoded forms.

Conclusion: Coded language represents an important and underexplored challenge for real-world NLP systems, highlighting the need for better understanding and handling of intentional meaning encoding in human communication.

Abstract: Coded language is an important part of human communication. It refers to cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood. Current language models handle coded language poorly. Progress has been limited by the lack of real-world datasets and clear taxonomies. This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. We benchmarked language models on coded language detection, classification, and review rating prediction. Results show that even strong models can fail to identify or understand coded language. Because many coded expressions rely on pronunciation-based strategies, we further conducted a phonetic analysis of coded and decoded forms. Together, our results highlight coded language as an important and underexplored challenge for real-world NLP systems.

[24] AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee

Main category: cs.CL

TL;DR: AuditoryBench++ is a benchmark for evaluating auditory knowledge and reasoning in text-only settings, with AIR-CoT method that enhances LLMs’ auditory imagination through span detection and knowledge injection.

DetailsMotivation: Humans can reason about auditory properties without hearing sounds, but language models lack this auditory commonsense, limiting their multimodal interaction capabilities.

Method: Created AuditoryBench++ benchmark for auditory knowledge evaluation, and introduced AIR-CoT method that uses span detection with special tokens and knowledge injection to generate and integrate auditory information during inference.

Result: AIR-CoT generally outperforms both off-the-shelf LLMs/Multimodal LLMs and those augmented with auditory knowledge in extensive experiments.

Conclusion: The work addresses the gap in auditory reasoning capabilities of language models and provides a benchmark and method for improving auditory commonsense in text-only settings.

Abstract: Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.

[25] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

Kei Saito

Main category: cs.CL

TL;DR: This paper introduces a text-to-state mapping function φ that transforms linguistic input into superposition states within the Non-Resolution Reasoning framework, enabling ambiguity preservation in language processing.

DetailsMotivation: The paper addresses the critical gap in how natural language maps to the mathematical structures of Non-Resolution Reasoning (NRR). While NRR provides a framework for maintaining semantic ambiguity, there was no established method to transform raw text into the formal state spaces where NRR operators can act.

Method: The authors formalize the Contradiction-Preservation Principle requiring ambiguous expressions to maintain non-zero entropy in state representations. They develop extraction protocols using existing Large Language Models as interpretation generators to create the text-to-state mapping function φ that transforms linguistic input into superposition states within the NRR framework.
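
A worked sketch of the entropy check behind the Contradiction-Preservation Principle (the example weights are hypothetical; in the paper an LLM supplies the interpretations):

```python
import math

def shannon_entropy(weights) -> float:
    """H(S) in bits over interpretation weights (normalized first)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return max(0.0, -sum(p * math.log2(p) for p in probs if p > 0))

# An ambiguous sentence mapped to two near-equal readings:
state = {"bank = riverside": 0.55, "bank = financial": 0.45}
print(shannon_entropy(state.values()))   # ~0.99 bits: ambiguity preserved

# A single-interpretation baseline collapses to one reading:
print(shannon_entropy([1.0]))            # 0.0 bits: ambiguity lost
```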

Result: Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity shows that the mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs, while baseline single-interpretation approaches yield H(S) = 0.000. This demonstrates effective ambiguity preservation.

Conclusion: The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference and supporting the preservation of semantic ambiguity rather than forcing premature interpretation collapse.

Abstract: Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function φ that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs while baseline single-interpretation approaches yield H(S) = 0.000. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.

[26] Quantifying non deterministic drift in large language models

Claire Nicholson

Main category: cs.CL

TL;DR: LLMs show persistent output variability (behavioral drift) even at temperature 0.0, with patterns varying by model size, deployment type, and prompt category.

DetailsMotivation: LLMs are widely used but identical prompts don't always produce identical outputs even with fixed parameters. Need to empirically quantify baseline behavioral drift to understand output variability under operator-free conditions.

Method: Repeated-run experiments with gpt-4o-mini and llama3.1-8b across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures 0.0 and 0.7. Drift measured using unique output fractions, lexical similarity, and word count statistics.
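
A small sketch of the drift metrics (the similarity measure and example runs are assumptions; the paper's exact implementation is not shown):

```python
from difflib import SequenceMatcher

def drift_stats(outputs):
    """Unique-output fraction, mean pairwise lexical similarity, and
    word-count range over repeated runs of one prompt."""
    unique_fraction = len(set(outputs)) / len(outputs)
    sims = [SequenceMatcher(None, a, b).ratio()
            for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    mean_similarity = sum(sims) / len(sims) if sims else 1.0
    counts = [len(o.split()) for o in outputs]
    return unique_fraction, mean_similarity, (min(counts), max(counts))

# Ten runs of the same prompt at temperature 0.0 (illustrative outputs):
runs = ["The answer is 42."] * 8 + ["The answer is 42!", "It is 42."]
print(drift_stats(runs))   # nonzero drift despite fixed decoding
```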

Result: Nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment type, and prompt type. Different models show different drift characteristics.

Conclusion: Establishes systematic empirical baseline for behavioral drift without stabilization techniques, providing reference point for evaluating future drift mitigation methods. Highlights limitations of lexical metrics and emerging semantic approaches.

Abstract: Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.

[27] Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Yiting Shen, Kun Li, Wei Zhou, Songlin Hu

Main category: cs.CL

TL;DR: Mem2ActBench is a new benchmark for evaluating LLM agents’ ability to actively apply long-term memory to execute tool-based tasks, rather than just passively retrieve facts.

DetailsMotivation: Existing benchmarks only test passive fact retrieval, but real-world assistant usage requires agents to proactively leverage memory to execute actions by selecting appropriate tools and grounding parameters based on previously established preferences and task states across interrupted interactions.

Method: Created an automated pipeline merging heterogeneous sources (ToolACE, BFCL, Oasst1), resolving conflicts via consistency modeling to synthesize 2,029 sessions. Used reverse-generation to produce 400 tool-use tasks from memory chains, with human evaluation confirming memory dependency.

Result: Human evaluation confirmed 91.3% of tasks are strongly memory-dependent. Experiments on seven memory frameworks show current systems remain inadequate at actively utilizing memory for parameter grounding in task execution.

Conclusion: Current LLM agent memory systems are insufficient for actively applying memory to task execution, highlighting the need for more effective approaches to evaluate and improve memory application capabilities in real-world assistant scenarios.

Abstract: Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user–assistant–tool turns on average. From these memory chains, a reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.

[28] Benchmarking ASR Models in the German Medical Context: A Performance Analysis Based on Medical History Interviews

Thomas Schuster, Julius Trögele, Nico Döring, Robin Krüger, Matthieu Hoffmann, Holger Friedrich

Main category: cs.CL

TL;DR: This paper presents a German medical ASR benchmark evaluating 29 models on simulated doctor-patient conversations, revealing significant performance gaps especially for medical terminology and dialects.

DetailsMotivation: While ASR can reduce medical documentation workload, there's a lack of German medical ASR benchmarks, particularly for dialect-inclusive evaluation.

Method: Created curated dataset of simulated German doctor-patient conversations, evaluated 29 ASR models including open-weights (Whisper, Voxtral, Wav2Vec2) and commercial APIs (AssemblyAI, Deepgram) using WER, CER, and BLEU metrics.
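
For reference, the primary metric can be computed with a standard word-level edit distance (a textbook implementation, not the authors' tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("der Patient klagt über Kopfschmerzen",
          "der Patient klagt über Kopf schmerzen"))  # 0.4: one term split
```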

Result: Significant performance differences: best systems achieve WER below 3%, but others show much higher error rates, especially for medical terminology and dialect-influenced speech.

Conclusion: German medical ASR needs improvement for dialect and medical terminology handling; the benchmark provides crucial evaluation framework for this specialized domain.

Abstract: Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The test field encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families as well as commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER), in some cases below 3%, the error rates of other models are considerably higher, especially for medical terminology or dialect-influenced speech.


[29] On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska

Main category: cs.CL

TL;DR: This paper presents a comprehensive study on AI-generated text detection using large-scale corpora and novel training strategies, achieving up to 99.6% token-level accuracy across 21 LLMs.

DetailsMotivation: The rapid advancement of large language models has made AI-generated text increasingly indistinguishable from human writing, creating significant challenges for authenticity verification in education, publishing, and digital security. This necessitates robust detection methods to address both technical and ethical concerns.

Method: The authors introduce two large-scale corpora: a 1-billion-token human-authored corpus spanning multiple genres, and a 1.9-billion-token AI-generated corpus produced by prompting various LLMs across diverse domains. They develop and evaluate numerous detection models using two novel training paradigms: “Per LLM” and “Per LLM family” fine-tuning approaches.

Result: Across a 100-million-token benchmark covering 21 large language models, the best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.

Conclusion: The study demonstrates that comprehensive corpora and specialized fine-tuning strategies can significantly improve AI-generated text detection accuracy, providing effective solutions for authenticity verification challenges across various domains.

Abstract: The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.

[30] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov

Main category: cs.CL

TL;DR: The paper identifies language control failures in multilingual LLMs, proposes evaluation protocols to surface these issues, uses interpretability methods to understand the internal structure, and introduces selective fine-tuning of final layers for efficient multilingual adaptation.

DetailsMotivation: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control - the ability to respond in the intended language. The paper aims to address two key failure modes: multilingual transfer bottleneck (correct language but incorrect task response) and language consistency bottleneck (correct task response but wrong language).

Method: 1) Designed a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks to systematically surface language control issues. 2) Extended logit lens analysis to track language probabilities layer by layer and computed cross-lingual semantic similarity of hidden states for interpretability. 3) Introduced selective fine-tuning of only the final layers responsible for language control based on insights from interpretability analysis.
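
A minimal PyTorch sketch of the selective fine-tuning step (the block-list argument and layer count are assumptions; many Hugging Face decoders expose their transformer blocks as model.model.layers):

```python
import torch.nn as nn

def freeze_all_but_final_layers(model: nn.Module, blocks, n_trainable: int = 3):
    """Freeze every parameter, then unfreeze only the last n_trainable
    transformer blocks, i.e. the layers the logit-lens analysis ties to
    language-specific generation."""
    for p in model.parameters():
        p.requires_grad = False
    for block in blocks[-n_trainable:]:
        for p in block.parameters():
            p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {100 * trainable / total:.1f}% of parameters")
```

Training then proceeds as usual; only the small unfrozen tail of the network receives gradient updates, which is how the paper reaches its 3-5% parameter budget.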

Result: The interpretability analysis revealed a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Selective fine-tuning of final layers on Qwen-3-32B and Bloom-7.1B achieved over 98% language consistency across six languages while fine-tuning only 3-5% of parameters, without sacrificing task accuracy. This performance was nearly identical to full-scope fine-tuning but used far fewer computational resources.

Conclusion: The paper presents the first approach to leverage layer-localization of language control for efficient multilingual adaptation. By identifying that language control is concentrated in the final layers and selectively fine-tuning only those layers, the method achieves high language consistency with minimal parameter updates, offering a computationally efficient solution for multilingual adaptation of large language models.

Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.

[31] Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

Pragatheeswaran Vipulanandan, Kamal Premaratne, Dilip Sarkar

Main category: cs.CL

TL;DR: A quantum tensor network-based uncertainty quantification framework for detecting hallucinations in LLMs, using semantic equivalence clustering and entropy maximization to identify unreliable outputs.

DetailsMotivation: LLMs generate fluent but unreliable outputs (confabulations) that vary arbitrarily even with identical prompts, creating a need for principled hallucination detection methods.

Method: Quantum tensor network pipeline for uncertainty quantification that accounts for aleatoric uncertainty in token probabilities, semantic equivalence clustering of LLM generations, and entropy maximization strategy to prioritize high-certainty outputs.
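
The clustering-based entropy at the heart of the method can be sketched as follows (the cluster-assignment function is assumed, e.g. an NLI-based equivalence test; the tensor-network machinery itself is not shown):

```python
import math
from collections import defaultdict

def cluster_entropy(generations, logprobs, cluster_of) -> float:
    """Pool sequence-probability mass (the aleatoric term) within
    semantic-equivalence clusters, then compute entropy over clusters."""
    mass = defaultdict(float)
    for text, lp in zip(generations, logprobs):
        mass[cluster_of(text)] += math.exp(lp)
    total = sum(mass.values())
    probs = [m / total for m in mass.values()]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Two phrasings of one answer plus a contradicting one (toy numbers):
gens = ["Paris is the capital.", "The capital is Paris.", "It is Lyon."]
lps = [-1.0, -1.2, -3.0]
print(cluster_entropy(gens, lps, lambda t: "paris" if "Paris" in t else "lyon"))
```

A high cluster entropy flags exactly the region where the paper recommends human oversight.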

Result: 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across 8 LLM architectures show consistent improvements in AUROC and AURAC over state-of-the-art baselines, with robustness across different generation lengths and quantization levels.

Conclusion: The quantum physics-inspired framework provides a principled, interpretable scheme for hallucination detection that remains reliable even in resource-constrained deployments, offering practical guidelines for when human oversight is warranted.

Abstract: Large language models (LLMs) exhibit strong generative capabilities but remain vulnerable to confabulations, fluent yet unreliable outputs that vary arbitrarily even under identical prompts. Leveraging a quantum tensor network-based pipeline, we propose a quantum physics-inspired uncertainty quantification framework that accounts for aleatoric uncertainty in token sequence probability for semantic equivalence-based clustering of LLM generations. This offers a principled and interpretable scheme for hallucination detection. We further introduce an entropy maximization strategy that prioritizes high-certainty, semantically coherent outputs and highlights entropy regions where LLM decisions are likely to be unreliable, offering practical guidelines for when human oversight is warranted. We evaluate the robustness of our scheme under different generation lengths and quantization levels, dimensions overlooked in prior studies, demonstrating that our approach remains reliable even in resource-constrained deployments. A total of 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across multiple architectures including Mistral-7B, Mistral-7B-instruct, Falcon-rw-1b, LLaMA-3.2-1b, LLaMA-2-13b-chat, LLaMA-2-7b-chat, LLaMA-2-13b, and LLaMA-2-7b show consistent improvements in AUROC and AURAC over state-of-the-art baselines.

[32] TAIGR: Towards Modeling Influencer Content on Social Media via Structured, Pragmatic Inference

Nishanth Sridhar Nakshatri, Eylon Caplan, Rajkumar Pujari, Dan Goldwasser

Main category: cs.CL

TL;DR: TAIGR is a framework for analyzing health influencer discourse by extracting takeaways, building argumentation graphs, and performing probabilistic inference for validation.

DetailsMotivation: Health influencers shape public beliefs through conversational narratives and rhetorical strategies, making claim-centric verification methods inadequate for capturing pragmatic meaning.

Method: Three-stage framework: (1) identify core influencer recommendation (takeaway), (2) construct argumentation graph capturing justification, (3) perform factor graph-based probabilistic inference to validate takeaway.

Result: Evaluated on health influencer video transcripts, showing that accurate validation requires modeling discourse’s pragmatic and argumentative structure rather than treating transcripts as flat claim collections.

Conclusion: TAIGR provides a structured approach to analyze influencer discourse that goes beyond traditional claim verification by capturing argumentative structure and pragmatic meaning.

Abstract: Health influencers play a growing role in shaping public beliefs, yet their content is often conveyed through conversational narratives and rhetorical strategies rather than explicit factual claims. As a result, claim-centric verification methods struggle to capture the pragmatic meaning of influencer discourse. In this paper, we propose TAIGR (Takeaway Argumentation Inference with Grounded References), a structured framework designed to analyze influencer discourse, which operates in three stages: (1) identifying the core influencer recommendation (the takeaway); (2) constructing an argumentation graph that captures the influencer's justification for the takeaway; (3) performing factor graph-based probabilistic inference to validate the takeaway. We evaluate TAIGR on a content validation task over health influencer video transcripts, showing that accurate validation requires modeling the discourse's pragmatic and argumentative structure rather than treating transcripts as flat collections of claims.

[33] VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless

Main category: cs.CL

TL;DR: VERGE: A neurosymbolic framework combining LLMs with SMT solvers for verification-guided answer refinement through iterative logical consistency checking.

DetailsMotivation: Despite LLMs' syntactic fluency, ensuring logical correctness in high-stakes domains remains challenging. Current approaches lack formal verification and precise error localization.

Method: Decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies consistency using SMT solvers. Uses multi-model consensus, semantic routing, and Minimal Correction Subsets for error localization.
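
A toy consistency check in the same spirit, using the z3-solver Python bindings (the claims and tracker scheme are illustrative; the paper's Minimal Correction Subsets refine the conflict localization further):

```python
from z3 import Bool, Implies, Not, Solver, unsat

# Three autoformalized atomic claims (toy example):
#   c1: "if x is a bird, x can fly"   c2: "x is a bird"   c3: "x cannot fly"
bird, flies = Bool("bird"), Bool("flies")
claims = {"c1": Implies(bird, flies), "c2": bird, "c3": Not(flies)}

s = Solver()
trackers = {name: Bool(f"track_{name}") for name in claims}
for name, formula in claims.items():
    s.add(Implies(trackers[name], formula))    # tracked assertions

if s.check(*trackers.values()) == unsat:
    print("inconsistent; conflicting subset:", s.unsat_core())
```

An unsat core localizes a conflicting subset of claims; a Minimal Correction Subset goes further, identifying exactly which claims to revise so that the remainder becomes satisfiable, which is what turns a binary failure signal into actionable feedback.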

Result: With GPT-OSS-120B model, VERGE achieves 18.7% average performance uplift at convergence across reasoning benchmarks compared to single-pass approaches.

Conclusion: The hybrid neurosymbolic approach provides formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI through verification-guided refinement.

Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.

[34] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

Amirhossein Haji Mohammad Rezaei, Zahra Shakeri

Main category: cs.CL

TL;DR: Researchers created a counterfactual benchmark to test how cultural information affects medical AI diagnostic accuracy, finding that cultural cues significantly degrade performance, especially when identifiers and context co-occur.

DetailsMotivation: To ensure equitable healthcare, medical language models should not alter clinically correct diagnoses when presented with non-decisive cultural information. Current models may be biased by cultural cues that shouldn't affect medical decisions.

Method: Created a benchmark expanding 150 MedQA test items into 1650 variants by inserting culture-related identifier tokens, contextual cues, or their combination for three cultural groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus neutral controls. Evaluated GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma under different prompting strategies.
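
A sketch of the variant expansion (all wording below is hypothetical; the actual clinician-verified templates are released by the authors):

```python
def make_variants(stem: str, identifier: str, context: str, neutral: str):
    """Expand one MedQA stem into the three cultural variants plus a
    length-matched neutral control; the gold answer must stay invariant."""
    return {
        "identifier": f"{identifier} {stem}",
        "context":    f"{stem} {context}",
        "combined":   f"{identifier} {stem} {context}",
        "neutral":    f"{stem} {neutral}",
    }

variants = make_variants(
    stem="A 45-year-old presents with chest pain radiating to the left arm.",
    identifier="The patient is a Middle-Eastern Muslim man.",
    context="He mentions he is currently fasting for Ramadan.",
    neutral="He mentions he recently repainted his kitchen.",
)
print(variants["combined"])
```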

Result: Cultural cues significantly affect accuracy (p<10^-14), with largest degradation when identifier and context co-occur (3-7 percentage point drops). Neutral edits produce smaller, non-systematic changes. Over half of culturally grounded explanations end in incorrect answers, linking culture-referential reasoning to diagnostic failure.

Conclusion: Medical AI models show significant sensitivity to non-decisive cultural information, potentially leading to diagnostic errors. The released benchmark supports evaluation and mitigation of culturally induced diagnostic errors in healthcare AI systems.

Abstract: Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran's Q, $p < 10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa = 0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.

[35] FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language

Faezeh Hosseini, Mohammadali Yousefzadeh, Yadollah Yaghoobzadeh

Main category: cs.CL

TL;DR: FFEHallu benchmark reveals LLMs struggle with figurative language, showing systematic weaknesses in distinguishing real from fabricated fixed figurative expressions and frequent figurative hallucinations in translation tasks.

DetailsMotivation: Fixed figurative expressions (FFEs) like idioms and proverbs are culturally grounded, non-compositional, and conventionally fixed, making them vulnerable to figurative hallucination in LLMs. There's a need for comprehensive benchmarks to evaluate this issue, especially for underrepresented languages like Persian.

Method: Created FFEHallu benchmark with 600 instances across three tasks: 1) FFE generation from meaning, 2) detection of fabricated FFEs across four construction categories, and 3) FFE-to-FFE translation from English to Persian. Evaluated six state-of-the-art multilingual LLMs.

Result: Models show systematic weaknesses in figurative competence and cultural grounding. While GPT4.1 performs relatively well in rejecting fabricated FFEs and retrieving authentic ones, most models struggle to distinguish real expressions from high-quality fabrications and frequently hallucinate during cross-lingual translation.

Conclusion: Substantial gaps exist in current LLMs’ handling of figurative language, highlighting the need for targeted benchmarks like FFEHallu to assess and mitigate figurative hallucination, especially for culturally grounded expressions in underrepresented languages.

Abstract: Figurative language, particularly fixed figurative expressions (FFEs) such as idioms and proverbs, poses persistent challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded, largely non-compositional, and conventionally fixed, making them especially vulnerable to figurative hallucination. We define figurative hallucination as the generation or endorsement of expressions that sound idiomatic and plausible but do not exist as authentic figurative expressions in the target language. We introduce FFEHallu, the first comprehensive benchmark for evaluating figurative hallucination in LLMs, with a focus on Persian, a linguistically rich yet underrepresented language. FFEHallu consists of 600 carefully curated instances spanning three complementary tasks: (i) FFE generation from meaning, (ii) detection of fabricated FFEs across four controlled construction categories, and (iii) FFE-to-FFE translation from English to Persian. Evaluating six state-of-the-art multilingual LLMs, we find systematic weaknesses in figurative competence and cultural grounding. While models such as GPT4.1 demonstrate relatively strong performance in rejecting fabricated FFEs and retrieving authentic ones, most models struggle to reliably distinguish real expressions from high-quality fabrications and frequently hallucinate during cross-lingual translation. These findings reveal substantial gaps in current LLMs' handling of figurative language and underscore the need for targeted benchmarks to assess and mitigate figurative hallucination.

[36] Rewarding Intellectual Humility: Learning When Not To Answer In Large Language Models

Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot

Main category: cs.CL

TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) trains LLMs to abstain (“I don’t know”) alongside correctness, reducing hallucinations without severe accuracy loss on multiple-choice tasks.

DetailsMotivation: LLMs often produce hallucinated or unverifiable content, undermining reliability in factual domains. Need explicit training to promote intellectual humility through abstention.

Method: Fine-tuned Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on MedMCQA and Hendrycks Math benchmarks using ternary reward structure (-1, r_abs, 1) with varying abstention rewards. Combined RLVR with supervised fine-tuning strategies teaching abstention prior to RL.
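
The reward design reduces to a few lines (the abstention-detection rule and default r_abs are assumptions for illustration):

```python
def rlvr_reward(answer: str, gold: str, r_abs: float = 0.0) -> float:
    """Ternary verifiable reward: +1 correct, r_abs for an explicit
    abstention, -1 for a confident wrong answer."""
    if answer.strip().lower() in {"i don't know", "idk", "abstain"}:
        return r_abs
    return 1.0 if answer.strip() == gold.strip() else -1.0

print(rlvr_reward("B", "B"))              #  1.0
print(rlvr_reward("I don't know", "B"))   #  0.0 (r_abs)
print(rlvr_reward("C", "B"))              # -1.0
```

Under this scheme, guessing with success probability p has expected reward 2p - 1, so abstaining is strictly preferable whenever 2p - 1 < r_abs, which is why moderate r_abs values discourage low-confidence guesses without suppressing confident answers.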

Result: Moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks. Larger models show greater robustness to abstention incentives. Open-ended QA has limitations due to insufficient exploration, partially mitigated by supervised abstention training.

Conclusion: Verifiable reward design is a feasible and flexible practical approach for hallucination mitigation in language models, with RLVR effectively promoting intellectual humility through abstention training.

Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (-1, r_abs, 1) under varying abstention rewards. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.

[37] BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

Akif Islam, Sujan Kumar Roy, Md. Ekramul Hamid

Main category: cs.CL

TL;DR: BengaliSent140: A large-scale binary sentiment dataset for Bengali created by consolidating 7 existing datasets into 139,792 samples (68,548 hate, 71,244 not-hate) with harmonized annotations.

DetailsMotivation: Bengali sentiment analysis research is constrained by scarce large-scale annotated datasets. Existing datasets are limited in size or confined to single domains (like social media), insufficient for training modern deep learning models that need large, diverse data for robust representations.

Method: Consolidated seven existing Bengali text datasets into a unified corpus. Heterogeneous annotation schemes were systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1).
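
The harmonization step amounts to a per-source label map plus deduplication (source names and schemes below are hypothetical; the corpus merges seven real datasets):

```python
# Hypothetical per-source maps; the real sources use varied schemes.
LABEL_MAPS = {
    "source_a": {"hate": 1, "offensive": 1, "neutral": 0, "positive": 0},
    "source_b": {"toxic": 1, "not_toxic": 0},
}

def harmonize(source: str, label: str) -> int:
    """Collapse heterogeneous annotations into 0 = Not Hate, 1 = Hate."""
    return LABEL_MAPS[source][label]

def merge(records):
    """Deduplicate texts across sources and emit binary-labeled rows."""
    seen, rows = set(), []
    for source, text, label in records:
        if text not in seen:
            seen.add(text)
            rows.append((text, harmonize(source, label)))
    return rows
```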

Result: Created BengaliSent140 with 139,792 unique text samples (68,548 hate, 71,244 not-hate), offering broader linguistic and contextual coverage than existing datasets. Baseline experimental results demonstrate practical usability.

Conclusion: BengaliSent140 provides a large-scale, balanced binary sentiment dataset for Bengali that enables training and benchmarking of deep learning models, addressing the data scarcity problem in Bengali NLP research.

Abstract: Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning-based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at https://www.kaggle.com/datasets/akifislam/bengalisent140/

[38] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Jiri Gesi, Xianfeng Tang, Chen Luo, Yisi Sang, Hanqing Lu, Manling Li, Dakuo Wang

Main category: cs.CL

TL;DR: Trajectory2Task: A pipeline for generating verifiable tool-use tasks covering realistic user scenarios (ambiguous, changing, infeasible intents) to bridge the gap between idealized agent studies and real-world applications.

DetailsMotivation: Real-world tool-calling agents face complex user scenarios (ambiguous, changing, or infeasible intents) that are underrepresented in current training/evaluation data, while most studies focus on idealized settings with general, fixed, well-specified tasks.

Method: Trajectory2Task pipeline: (1) multi-turn exploration to produce valid tool-call trajectories, (2) conversion of trajectories into user-facing tasks with controlled intent adaptations, yielding verifiable tasks for closed-loop evaluation and training.

Result: Benchmarking 7 SOTA LLMs shows frequent failures on generated complex scenarios. Fine-tuning lightweight LLMs with successful trajectories yields consistent improvements across all three conditions and better generalization to unseen tool-use domains.

Conclusion: The pipeline addresses the data gap for realistic tool-use scenarios, reveals current LLM limitations in handling complex user intents, and demonstrates that trajectory-based fine-tuning enhances general tool-calling ability and domain generalization.

Abstract: Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data that cover these diverse, complex interaction patterns remain under-represented. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intents. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.

[39] Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

Shuoxin Wang, Chang Liu, Gowen Loo, Lifan Zheng, Kaiwen Wei, Xinyi Zeng, Jingyuan Zhang, Yu Tian

Main category: cs.CL

TL;DR: Me-Agent: A learnable personalized mobile agent with two-level user habit learning (prompt-level preference learning with Personal Reward Model and memory-level Hierarchical Preference Memory) that addresses LLM-based agents’ inability to handle ambiguous instructions, learn from history, and handle personalized needs.

DetailsMotivation: Current LLM-based mobile agents follow explicit instructions but overlook personalized needs, failing to interpret ambiguous instructions, learn from user interaction history, and handle personalized instructions without personalized context.

Method: Two-level user habit learning: (1) Prompt-level: user preference learning strategy enhanced with Personal Reward Model; (2) Memory-level: Hierarchical Preference Memory storing users’ long-term memory and app-specific memory in different levels.

Result: Me-Agent achieves state-of-the-art performance in personalization on the new User FingerTip benchmark (featuring ambiguous daily life instructions) while maintaining competitive instruction execution performance on general benchmarks.

Conclusion: Me-Agent successfully addresses personalization limitations in mobile agents through learnable and memorable architecture, demonstrating superior personalization capabilities while preserving general instruction execution performance.

Abstract: Large Language Model (LLM)-based mobile agents have made significant performance advancements. However, these agents often follow explicit user instructions while overlooking personalized needs, leading to significant limitations for real users, particularly without personalized context: (1) inability to interpret ambiguous instructions, (2) lack of learning from user interaction history, and (3) failure to handle personalized instructions. To alleviate the above challenges, we propose Me-Agent, a learnable and memorable personalized mobile agent. Specifically, Me-Agent incorporates a two-level user habit learning approach. At the prompt level, we design a user preference learning strategy enhanced with a Personal Reward Model to improve personalization performance. At the memory level, we design a Hierarchical Preference Memory, which stores users' long-term memory and app-specific memory at different levels. To validate the personalization capabilities of mobile agents, we introduce User FingerTip, a new benchmark featuring numerous ambiguous instructions for daily life. Extensive experiments on User FingerTip and general benchmarks demonstrate that Me-Agent achieves state-of-the-art performance in personalization while maintaining competitive instruction execution performance.

[40] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

Haoyuan Yu, Yuxuan Chen, Minjie Cai

Main category: cs.CL

TL;DR: A full-duplex dialogue framework decomposes conversations into minimal units for independent processing and transition prediction, implemented as a semi-cascaded system using multimodal LLM with auxiliary modules, achieving second place in a challenge.

DetailsMotivation: Full-duplex voice interaction is essential for natural human-computer interaction, but current systems struggle with complex dialogue processing and seamless transitions between conversational units.

Method: Framework decomposes dialogue into minimal conversational units for independent processing, implemented as semi-cascaded full-duplex system using multimodal LLM with auxiliary modules (VAD, TTS) in train-free, plug-and-play manner.

Result: The system ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction) using the HumDial dataset, demonstrating the framework's effectiveness.

Conclusion: The proposed framework successfully enables natural full-duplex interaction by decomposing complex dialogue into manageable units, with competitive performance validated through challenge participation.

Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.

[41] Automated Benchmark Generation from Domain Guidelines Informed by Bloom’s Taxonomy

Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

Main category: cs.CL

TL;DR: A framework for generating automated benchmarks from expert guidelines using Bloom’s Taxonomy to evaluate LLMs’ contextual reasoning in practice-based domains through violation-based scenarios, MCQs, and dialogues.

DetailsMotivation: Existing LLM benchmarks rely on pre-existing human exam datasets that are often unavailable in practice-based domains where knowledge is procedural and grounded in professional judgment. There's a need to evaluate models' ability to perform contextualized reasoning beyond factual recall in real-world settings.

Method: Introduces a framework that converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels of Bloom’s Taxonomy (Remember, Analyze, etc.). This enables deterministic, reproducible, and scalable evaluation.

Result: Applied to teaching, dietetics, and caregiving domains, the framework reveals non-intuitive model behaviors: LLMs sometimes perform better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). The approach produces large-scale, psychometrically informed benchmarks.

Conclusion: The framework successfully surfaces nuanced differences between model and human-like reasoning in practice-based domains, enabling evaluation of contextualized reasoning in real-world settings through automated, scalable benchmark generation from expert guidelines.

Abstract: Open-ended question answering (QA) evaluates a model’s ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom’s Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.

[42] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Xuanyu Su, Diana Inkpen, Nathalie Japkowicz

Main category: cs.CL

TL;DR: SoftHateBench is a benchmark for detecting “soft hate speech” - discourse that appears reasonable but uses framing and value arguments to steer audiences toward blaming/excluding target groups, which current moderation systems optimized for surface toxicity often miss.

DetailsMotivation: Current moderation systems are largely optimized for detecting overt slurs and threats (hard hate speech) but fail to detect subtle, reasoning-driven hostility that appears reasonable on the surface. Existing benchmarks don't systematically measure this gap in detection capabilities.

Method: The authors introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving underlying hostile standpoints. They integrate the Argumentum Model of Topics (AMT) for argument structure and Relevance Theory (RT) for logical coherence to rewrite explicit hateful standpoints into seemingly neutral discussions.

Result: The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations show consistent performance drops across encoder-based detectors, general-purpose LLMs, and safety models when moving from hard to soft hate tiers.

Conclusion: Current hate speech detection systems often fail when the same hostile stance is conveyed through subtle, reasoning-based language rather than explicit toxicity, highlighting a significant gap in moderation capabilities that needs to be addressed.

Abstract: Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer: contains offensive examples used solely for research.

[43] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

Elina Sigdel, Anastasia Panfilova

Main category: cs.CL

TL;DR: This paper presents a Russian adaptation of the LIWC methodology, creating a specialized dictionary and analyzer with 96 categories including syntactic, morphological, lexical features and LM predictions, implemented as the RusLICA web service.

DetailsMotivation: Existing psycholinguistic analysis tools like LIWC were originally developed for English and translated to other languages, which may not adequately capture Russian grammatical and cultural specificities. There's a need for a native Russian adaptation.

Method: Built a specialized dictionary for Russian using multiple lexicographic resources, semantic dictionaries and corpora (not direct translation). Created 96 categories integrating syntactic, morphological, lexical, statistical features and pre-trained language model predictions. Mapped lemmas to 42 psycholinguistic categories and implemented as RusLICA web service.
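
At its core, a LIWC-style analyzer is a lemma-to-category lookup followed by frequency counting. A minimal sketch of that mechanic, using a hypothetical toy lexicon (the actual RusLICA dictionary and its category inventory are not reproduced here):

```python
from collections import Counter

# Hypothetical toy lexicon; the real RusLICA dictionary maps Russian
# lemmas to 42 psycholinguistic categories built from lexicographic resources.
LEXICON = {
    "радость": {"positive_emotion", "affect"},   # "joy"
    "страх": {"negative_emotion", "affect"},     # "fear"
    "думать": {"cognitive_process"},             # "to think"
}

def liwc_profile(lemmas: list[str]) -> dict[str, float]:
    """Return per-category relative frequencies for a lemmatized text."""
    counts = Counter()
    for lemma in lemmas:
        for category in LEXICON.get(lemma, ()):
            counts[category] += 1
    total = len(lemmas) or 1  # avoid division by zero on empty input
    return {cat: n / total for cat, n in counts.items()}

print(liwc_profile(["радость", "думать", "страх", "идти"]))
```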

Result: Developed a comprehensive Russian LIWC adaptation with 96 categories, including 42 psycholinguistic categories, specifically designed for Russian language characteristics. Created the RusLICA web service implementation.

Conclusion: The paper successfully adapts LIWC methodology for Russian language by creating a native dictionary and analyzer that accounts for Russian grammatical and cultural specificities, providing a more accurate psycholinguistic analysis tool for Russian texts.

Abstract: Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in the field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and later translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, accounting for its grammatical and cultural specificities. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, and general statistical features, as well as predictions obtained from pre-trained language models (LMs) for text analysis. Rather than applying direct translation to existing thesauri, we built the dictionary specifically for the Russian language based on content from several lexicographic resources, semantic dictionaries, and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.

[44] Beyond the Needle’s Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Tianwei Lin, Zuyi Zhou, Xinda Zhao, Chenke Wang, Xiaohong Li, Yu Chen, Chuanrui Hu, Jian Pei, Yafeng Deng

Main category: cs.CL

TL;DR: The paper introduces EverMemBench-S (EMB-S), an adversarial benchmark for evaluating long-context LLM agents’ ability to access and use evidence from large environments, revealing that semantic discrimination (not just context length) is the key bottleneck.

DetailsMotivation: Current Needle-in-a-Haystack (NIAH) evaluations are insufficient as they mostly measure benign span localization with unique needles in irrelevant haystacks, failing to capture real-world challenges of evidence access and faithful usage in large environments.

Method: Developed EMB-S benchmark on a 326M-token MemoryBank with collision-tested near-miss hard negatives and gold evidence sets spanning multiple documents. Uses decoupled diagnostic protocol separating evidence access (document-ID localization) from end-to-end QA quality under full-context prompting.
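
The decoupled protocol effectively reports two independent numbers per system: how often the gold evidence documents are localized, and how good the final answer is. A minimal sketch of such a report, assuming per-example gold document-ID sets and a hypothetical answer scorer:

```python
def evidence_access_recall(pred_ids: set[str], gold_ids: set[str]) -> float:
    """Fraction of gold evidence documents the system localized."""
    return len(pred_ids & gold_ids) / len(gold_ids) if gold_ids else 0.0

def decoupled_report(examples: list[dict], answer_scorer) -> dict[str, float]:
    """Report evidence access separately from end-to-end QA quality.

    `answer_scorer` is a hypothetical callable scoring (prediction, reference);
    each example carries pred_ids / gold_ids and pred_answer / gold_answer.
    """
    access = [evidence_access_recall(ex["pred_ids"], ex["gold_ids"])
              for ex in examples]
    quality = [answer_scorer(ex["pred_answer"], ex["gold_answer"])
               for ex in examples]
    return {"evidence_access": sum(access) / len(access),
            "qa_quality": sum(quality) / len(quality)}

examples = [{"pred_ids": {"d1", "d7"}, "gold_ids": {"d1", "d2"},
             "pred_answer": "42", "gold_answer": "42"}]
print(decoupled_report(examples, lambda p, g: float(p == g)))
```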

Result: Systems that perform well on benign NIAH degrade sharply in evidence access under semantic interference. Semantic discrimination, not just context length, is the dominant bottleneck for long-context memory at scale.

Conclusion: The proposed EMB-S benchmark and diagnostic protocol reveal critical limitations in current long-context LLM agents, showing that semantic discrimination capabilities are more important than raw context length for effective evidence access in large-scale environments.

Abstract: Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model’s context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.

[45] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Kaiyuan Chen, Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang

Main category: cs.CL

TL;DR: SAPO introduces adaptive process supervision to close the reasoner-verifier gap in small language models, outperforming existing self-evolution methods on math and code tasks.

DetailsMotivation: Existing self-evolution methods overlook fine-grained reasoning steps, creating a reasoner-verifier gap. Monte Carlo process supervision is computationally inefficient, making it hard to mitigate this gap. Inspired by Error-Related Negativity (ERN) in neuroscience where reasoners can localize errors and make rapid adjustments.

Method: Self-Adaptive Process Optimization (SAPO) method that adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap, rather than relying on inefficient Monte Carlo estimations.

Result: Extensive experiments show SAPO outperforms most existing self-evolution methods on two challenging task types: mathematics and code. The work also introduces two new benchmarks for process reward models in mathematical and coding tasks.

Conclusion: SAPO effectively addresses the reasoner-verifier gap through adaptive process optimization, demonstrating superior performance on complex reasoning tasks while being more computationally efficient than Monte Carlo-based approaches.

Abstract: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by Error-Related Negativity (ERN), whereby the reasoner can localize errors following incorrect decisions, guiding rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO’s impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.

[46] Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning

Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan

Main category: cs.CL

TL;DR: KV caches can be reused as lightweight representations for downstream tasks without recomputing hidden states, enabling efficient Chain-of-Embedding and Fast/Slow Thinking Switching with minimal accuracy loss.

DetailsMotivation: KV caches are typically used only for autoregressive decoding but contain valuable contextual information that could be reused for downstream tasks at no extra computational cost, potentially eliminating the need to recompute or store full hidden states.

Method: Proposes treating KV cache as lightweight representation for two applications: (1) Chain-of-Embedding where KV-derived representations are used for downstream tasks, and (2) Fast/Slow Thinking Switching where they enable adaptive reasoning by switching between different reasoning modes.
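
One plausible reading of "KV cache as a lightweight representation" is to pool the cached keys and values that decoding produces anyway into a single vector. A minimal sketch with Hugging Face Transformers; the stand-in model and the mean-pooling scheme are illustrative assumptions, not the paper's exact recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model; the paper evaluates Llama/Qwen checkpoints
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def kv_embedding(text: str) -> torch.Tensor:
    """Pool the last layer's KV cache into one vector (illustrative scheme)."""
    out = model(**tok(text, return_tensors="pt"), use_cache=True)
    # Indexing the cache yields (key, value), each [batch, heads, seq, head_dim].
    key, value = out.past_key_values[-1]
    kv = torch.cat([key, value], dim=-1)   # concatenate keys and values
    return kv.mean(dim=(1, 2)).squeeze(0)  # mean-pool over heads and positions

emb = kv_embedding("KV caches encode contextual information.")
print(emb.shape)
```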

Result: KV-derived representations achieve competitive/superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct for Chain-of-Embedding, and enable up to 5.7× reduction in token generation with minimal accuracy loss for Fast/Slow Thinking Switching on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B.

Conclusion: KV caches serve as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference without additional computational overhead.

Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.

[47] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan

Main category: cs.CL

TL;DR: CE-RM-4B is a 4B parameter generative reward model trained with a two-stage rollout method and unified query-based criteria, achieving superior performance on reward benchmarks and better downstream RL improvements compared to existing LLM-as-a-Judge methods.

DetailsMotivation: There's a notable gap between the impressive benchmark performance of LLM-as-a-Judge paradigms and their actual effectiveness in RL practice, attributed to limitations like dominance of pairwise evaluation and inadequate optimization of evaluation criteria.

Method: Proposes CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method using unified query-based criteria, trained on only about 5.7K high-quality data curated from open-source preference datasets.

Result: CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice compared to existing methods.

Conclusion: The proposed approach addresses limitations of existing LLM-as-a-Judge methods by using pointwise evaluation with unified criteria and two-stage training, resulting in better benchmark performance and practical RL effectiveness.

Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.

[48] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

Zhuang Chen, Dazhen Wan, Zhangkai Zheng, Guanqun Bi, Xiyao Xiao, Binghang Li, Minlie Huang

Main category: cs.CL

TL;DR: PsychePass is a framework that uses trajectory-anchored tournaments to evaluate LLMs’ therapeutic competence, addressing instability in current evaluation methods through controlled client simulation and Swiss-system tournaments.

DetailsMotivation: Current evaluation methods for LLMs in mental healthcare suffer from instability due to unstructured, longitudinal counseling nature. Two key problems: process drift (uncontrolled client simulation) and standard drift (unreliable static scoring). Need more robust evaluation framework.

Method: 1) Anchor interaction trajectory in simulation: clients precisely control consultation process to probe multifaceted capabilities. 2) Anchor battle trajectory in judgments: use Swiss-system tournament with dynamic pairwise battles to compute robust Elo ratings. 3) Transform tournament trajectories into reward signals for on-policy reinforcement learning to enhance LLMs.
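
The rating stage reduces to standard Elo updates over pairwise battle outcomes from the Swiss-system pairings. A minimal sketch of that update rule; the K-factor and initial rating are conventional defaults, not values from the paper:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
print(ratings)
```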

Result: Extensive experiments validate PsychePass effectiveness and show strong consistency with human expert judgments. Framework provides reliable therapeutic competence evaluation and enables performance improvement through RL.

Conclusion: PsychePass addresses critical instability in LLM therapeutic evaluation through trajectory-anchored tournaments, offering robust assessment and enabling model improvement via tournament-derived reward signals.

Abstract: While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs’ performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.

[49] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan

Main category: cs.CL

TL;DR: MobileBench-OL is an online benchmark with 1080 tasks from 80 Chinese apps that evaluates mobile GUI agents’ task execution, complex reasoning, and noise robustness in real-world environments.

DetailsMotivation: Current mobile GUI agent benchmarks focus too much on task instruction-following while neglecting reasoning and exploration abilities, and don't account for real-world random noise, creating a gap between benchmarks and actual mobile environments.

Method: Proposed MobileBench-OL with 1080 tasks from 80 Chinese apps, organized into 5 subsets to measure different evaluation dimensions. Includes an auto-eval framework with reset mechanism for stable, repeatable real-world benchmarking.

Result: Evaluation of 12 leading GUI agents shows significant room for improvement to meet real-world requirements. Human evaluation confirms MobileBench-OL reliably measures agent performance in real environments.

Conclusion: MobileBench-OL addresses limitations of existing benchmarks by comprehensively evaluating mobile GUI agents’ task execution, reasoning, and noise robustness, providing a more realistic assessment for real-world deployment.

Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents’ task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

[50] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

Yangyi Shen, Tianjian Feng, Jiaqi Han, Wen Wang, Tianlang Chen, Chunhua Shen, Jure Leskovec, Stefano Ermon

Main category: cs.CL

TL;DR: Order-Token Search improves Diffusion Language Models by jointly searching over generation order and token values, outperforming baselines on reasoning and coding tasks.

DetailsMotivation: Current DLM decoding methods commit to a single trajectory, limiting exploration in trajectory space and failing to leverage the order-agnostic generation capability of diffusion models.

Method: Introduces Order-Token Search with a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories through joint search over generation order and token values.
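
Conceptually, each denoising action is a (position, token) pair, and the search keeps the highest-scoring partial trajectories under the likelihood estimator. A schematic beam search under that framing; `score_action` is a placeholder for the paper's estimator:

```python
import heapq
from typing import Callable

Seq = list  # token sequence, with None marking still-masked positions

def order_token_search(masked: Seq, vocab: list[str],
                       score_action: Callable[[Seq, int, str], float],
                       beam_width: int = 4) -> Seq:
    """Jointly search over which position to unmask next and which token to place."""
    beam = [(0.0, masked)]
    while any(tok is None for _, seq in beam for tok in seq):
        candidates = []
        for logp, seq in beam:
            for pos, tok in enumerate(seq):
                if tok is not None:
                    continue
                for cand in vocab:  # in practice, top-k tokens per position
                    new_seq = list(seq)
                    new_seq[pos] = cand
                    candidates.append((logp + score_action(seq, pos, cand), new_seq))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam[0][1]

# Toy run with a trivial scorer that prefers 'a' and earlier positions.
print(order_token_search([None, None], ["a", "b"],
                         lambda s, p, t: -p - (t != "a")))
```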

Result: Consistently outperforms baselines on GSM8K (3.1%), MATH500 (3.8%), Countdown (7.9%), and HumanEval (6.8%) absolute improvements, matching or surpassing diffu-GRPO post-trained d1-LLaDA.

Conclusion: Joint search over generation order and token values is a key component for advancing decoding in Diffusion Language Models, enabling better exploration of trajectory space.

Abstract: Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.

[51] Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Qihao Wang, Yue Hu, Mingzhe Lu, Jiayue Wu, Yanbing Liu, Yuanmin Tang

Main category: cs.CL

TL;DR: A framework based on Cognitive Load Theory for diagnosing LLM tool-use capabilities, moving beyond accuracy to identify cognitive bottlenecks through quantifiable intrinsic and extraneous load components.

DetailsMotivation: Current benchmarks only report final accuracy, which reveals what models can do but obscures the cognitive bottlenecks that define their true capability boundaries. There's a need to move from simple performance scoring to diagnostic tools that can identify these limitations.

Method: Introduces a framework grounded in Cognitive Load Theory that deconstructs task complexity into two quantifiable components: Intrinsic Load (inherent structural complexity formalized with Tool Interaction Graphs) and Extraneous Load (difficulty from ambiguous task presentation). Created ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load for controlled experiments.
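
One simple way to operationalize intrinsic load is as a function of the tool-dependency graph's size and depth. A hedged sketch with networkx; the specific weighting is an illustrative assumption, not the paper's formalization of the Tool Interaction Graph:

```python
import networkx as nx

def intrinsic_load(tool_deps: list[tuple[str, str]]) -> float:
    """Proxy for intrinsic load from (producer_tool, consumer_tool) edges.

    Combines node count, edge count, and longest dependency chain; the
    weighting below is illustrative, not the paper's formula.
    """
    g = nx.DiGraph(tool_deps)
    depth = (nx.dag_longest_path_length(g)
             if nx.is_directed_acyclic_graph(g) else len(g))
    return g.number_of_nodes() + g.number_of_edges() + 2 * depth

edges = [("search", "parse"), ("parse", "summarize"), ("search", "summarize")]
print(intrinsic_load(edges))  # deeper, denser graphs => higher load
```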

Result: Evaluation reveals distinct performance cliffs as cognitive load increases, allowing precise mapping of each model’s capability boundary. The framework’s predictions are highly calibrated with empirical results, establishing a principled methodology for understanding agent limits.

Conclusion: The framework provides a diagnostic tool for understanding LLM tool-use limitations and offers a practical foundation for building more efficient systems by identifying cognitive bottlenecks rather than just measuring final accuracy.

Abstract: The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model’s capability boundary. We validate that our framework’s predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent’s limits and a practical foundation for building more efficient systems.

[52] SpeechMapper: Speech-to-text Embedding Projector for LLMs

Biswesh Mohapatra, Marcely Zanon Boito, Ioan Calapodescu

Main category: cs.CL

TL;DR: SpeechMapper: Efficient speech-to-LLM training approach that reduces overfitting by pretraining without LLM first, then brief instruction tuning, outperforming existing methods with less data/compute.

DetailsMotivation: Current speech LLMs require expensive training of all components on speech instruction data, leading to computational intensity and susceptibility to task/prompt overfitting. Need more cost-efficient and generalizable approach.

Method: Two-stage approach: 1) Pretrain speech foundation model without LLM on inexpensive hardware, 2) Efficiently attach to target LLM via brief 1K-step instruction tuning. Tested with task-agnostic and task-specific instruction tuning strategies.
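
The bridge between a speech encoder and an LLM is typically a small projector that downsamples encoder frames and maps them into the LLM's embedding space; the contribution here is pretraining that block without the LLM attached. A schematic PyTorch module with illustrative dimensions and stride (not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Map speech-encoder frames into an LLM's embedding space (illustrative)."""
    def __init__(self, speech_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Temporal downsampling: speech frames are far denser than text tokens.
        self.down = nn.Conv1d(speech_dim, speech_dim,
                              kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(speech_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [batch, time, speech_dim] -> [batch, time // stride, llm_dim]
        x = self.down(frames.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

print(SpeechProjector()(torch.randn(2, 80, 1024)).shape)  # [2, 20, 4096]
```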

Result: In task-agnostic settings, SpeechMapper rivals best instruction-following speech LLM from IWSLT25 without training on those tasks. In task-specific settings, outperforms this model across many datasets despite requiring less data and compute.

Conclusion: SpeechMapper offers practical, scalable approach for efficient, generalizable speech-LLM integration without large-scale instruction tuning, addressing overfitting and computational cost issues of current methods.

Abstract: Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper’s pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train on the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25, despite never being trained on these tasks, while in task-specific settings, it outperforms this model across many datasets, despite requiring less data and compute. Overall, SpeechMapper offers a practical and scalable approach for efficient, generalizable speech-LLM integration without large-scale IT.

[53] Hopes and Fears – Emotion Distribution in the Topic Landscape of Finnish Parliamentary Speech 2000-2020

Anna Ristilä, Otto Tarkka, Veronika Laippala, Kimmo Elo

Main category: cs.CL

TL;DR: Analysis of emotion expression across different topics in Finnish parliamentary speeches from 2000-2020, revealing topic-specific emotional patterns and increasing positivity over time.

DetailsMotivation: Existing research treats parliamentary discourse as homogeneous, overlooking topic-specific emotional patterns. While intuitive assumptions exist about which topics evoke stronger emotions, there's little empirical research on emotions linked to different parliamentary topics.

Method: Used emotion analysis model to examine emotion expression in topics from both synchronic (cross-sectional) and diachronic (longitudinal) perspectives, analyzing parliamentary speeches from the Finnish Parliament (Eduskunta) between 2000-2020.

Result: Found evidence of increasing positivity in parliamentary speech over time and identified topic-specific patterns of emotion expression within parliamentary debates.

Conclusion: The study fills a research gap by providing empirical evidence of topic-specific emotion expression in parliamentary discourse, revealing both temporal trends and topic-based emotional variations in political speech.

Abstract: Existing research often treats parliamentary discourse as a homogeneous whole, overlooking topic-specific patterns. Parliamentary speeches address a wide range of topics, some of which evoke stronger emotions than others. While everyone has intuitive assumptions about what the most emotive topics in a parliament may be, there has been little research into the emotions typically linked to different topics. This paper strives to fill this gap by examining emotion expression among the topics of parliamentary speeches delivered in Eduskunta, the Finnish Parliament, between 2000 and 2020. An emotion analysis model is used to investigate emotion expression in topics, from both synchronic and diachronic perspectives. The results strengthen evidence of increasing positivity in parliamentary speech and provide further insights into topic-specific emotion expression within parliamentary debate.

[54] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu

Main category: cs.CL

TL;DR: PEARL is a novel framework that enhances LLM planning and execution for complex tool use through a two-stage approach combining offline exploration and online reinforcement learning with group Relative Policy Optimization.

DetailsMotivation: Large Language Models struggle with complex, multi-turn tool invocation due to weak planning, tool hallucination, erroneous parameter generation, and poor robust interaction. Current methods fail to address these planning challenges effectively.

Method: Two-stage framework: 1) Offline phase where agents explore tools to learn valid usage patterns and failure conditions; 2) Online reinforcement learning phase with a dedicated Planner trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function providing distinct signals for planning quality.
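
The defining step of GRPO is replacing a learned value baseline with group-relative reward normalization across rollouts of the same prompt. A minimal sketch of that normalization; PEARL's planning-specific reward shaping is the paper's contribution and is not reproduced here:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each rollout group.

    `group_rewards` has shape [num_groups, rollouts_per_group].
    """
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])  # one group of four rollouts
print(grpo_advantages(rewards))
```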

Result: PEARL significantly outperforms existing methods on ToolHop and T-Eval benchmarks, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate.

Conclusion: PEARL marks a key advance in addressing complex planning challenges for tool use, contributing to more robust and reliable LLM-based agents through its innovative two-stage approach combining exploration and reinforcement learning.

Abstract: Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.

[55] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues

Diandian Guo, Fangfang Yuan, Cong Cao, Xixun Lin, Chuan Zhou, Hao Peng, Yanan Cao, Yanbing Liu

Main category: cs.CL

TL;DR: MuVaC is a variational causal inference framework that jointly optimizes multimodal sarcasm detection and explanation by modeling their causal relationship, using alignment-then-fusion for robust feature integration and ensuring consistency between detection and explanation.

DetailsMotivation: Current research treats multimodal sarcasm detection (MSD) and multimodal sarcasm explanation (MuSE) as separate tasks or overlooks their causal dependency, despite detection being the result of reasoning that explains sarcasm. There's a need for a framework that mimics human cognitive mechanisms to jointly optimize both tasks.

Method: Proposes MuVaC: 1) Models MSD and MuSE using structural causal models with variational causal pathways for joint optimization objectives; 2) Uses alignment-then-fusion approach to integrate multimodal features; 3) Ensures consistency between detection results and explanations to enhance reasoning trustworthiness.

Result: Experimental results demonstrate superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.

Conclusion: MuVaC successfully bridges the gap by jointly optimizing sarcasm detection and explanation through causal inference, providing robust multimodal feature learning that mimics human cognitive mechanisms for sarcasm understanding.

Abstract: The prevalence of sarcasm in multimodal dialogues on social platforms presents a crucial yet challenging task for understanding the true intent behind online content. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.

[56] BMAM: Brain-inspired Multi-Agent Memory Framework

Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu

Main category: cs.CL

TL;DR: BMAM is a brain-inspired multi-agent memory architecture that prevents “soul erosion” by decomposing memory into specialized subsystems (episodic, semantic, salience-aware, control-oriented) operating at different time scales, achieving 78.45% accuracy on LoCoMo benchmark.

DetailsMotivation: Language-model-based agents struggle with preserving temporally grounded information and maintaining behavioral consistency across long interaction horizons, a problem termed "soul erosion." Current approaches use single unstructured memory stores that fail to handle temporal reasoning and consistency.

Method: BMAM decomposes agent memory into functionally specialized subsystems inspired by cognitive memory systems: episodic memory (hippocampus-inspired), semantic memory, salience-aware memory, and control-oriented memory. These components operate at complementary time scales. For long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals.
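
The fused retrieval idea can be pictured as a weighted score over timeline-ordered episodic entries combining relevance, recency, and salience. A sketch under assumed signals and weights (all illustrative; the paper's fusion scheme is not reproduced here):

```python
import math
import time
from dataclasses import dataclass

@dataclass
class Episode:
    text: str
    timestamp: float
    salience: float  # e.g., assigned by a salience-aware subsystem

def fused_score(ep: Episode, relevance: float, now: float,
                w_rel: float = 0.6, w_rec: float = 0.25, w_sal: float = 0.15,
                half_life: float = 86400.0) -> float:
    """Fuse relevance, recency, and salience into one retrieval score."""
    recency = math.exp(-(now - ep.timestamp) / half_life)  # decays along the timeline
    return w_rel * relevance + w_rec * recency + w_sal * ep.salience

timeline = [Episode("booked dentist for Friday", time.time() - 3600, salience=0.8),
            Episode("mentioned liking green tea", time.time() - 7 * 86400, salience=0.3)]
now = time.time()
ranked = sorted(timeline, key=lambda ep: fused_score(ep, relevance=0.9, now=now),
                reverse=True)
print(ranked[0].text)
```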

Result: BMAM achieves 78.45% accuracy on the LoCoMo benchmark under standard long-horizon evaluation setting. Ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.

Conclusion: BMAM’s brain-inspired multi-agent memory architecture effectively addresses soul erosion by modeling memory as specialized subsystems rather than a single store, enabling better temporal reasoning and behavioral consistency across extended interaction horizons.

Abstract: Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a general-purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience-aware, and control-oriented components that operate at complementary time scales. To support long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long-horizon evaluation setting, and ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.

[57] Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

Evanfiya Logacheva, Arto Hellas, Tsvetomila Mihaylova, Juha Sorva, Ava Heinonen, Juho Leinonen

Main category: cs.CL

TL;DR: Novel RST-based method improves AI diagram generation by reducing hallucinations and improving faithfulness to context, though quality varies due to LLM stochasticity.

DetailsMotivation: While generative AI is widely used in computing education, concerns exist about the quality of generated materials, particularly regarding alignment with user expectations and factual accuracy.

Method: Introduces a novel method for diagram code generation using in-context examples based on Rhetorical Structure Theory (RST) to align model outputs with user expectations. Evaluated by computer science educators assessing 150 LLM-generated diagrams across logical organization, connectivity, layout aesthetic, and AI hallucination.

Result: The method decreases factual hallucination rates and improves diagram faithfulness to provided context, but quality varies due to LLM stochasticity. Higher complexity text contexts lead to higher hallucination rates, and LLMs often fail to detect mistakes in their output.

Conclusion: The RST-based approach shows promise for improving AI-generated diagram quality in educational contexts, though challenges remain with complex content and LLM error detection capabilities.

Abstract: Generative artificial intelligence (AI) has found widespread use in computing education; at the same time, the quality of generated materials raises concerns among educators and students. This study addresses this issue by introducing a novel method for diagram code generation with in-context examples based on the Rhetorical Structure Theory (RST), which aims to improve diagram generation by aligning models’ output with user expectations. Our approach is evaluated by computer science educators, who assessed 150 diagrams generated with large language models (LLMs) for logical organization, connectivity, layout aesthetic, and AI hallucination. The assessment dataset is additionally investigated for its utility in automated diagram evaluation. The preliminary results suggest that our method decreases the rate of factual hallucination and improves diagram faithfulness to provided context; however, due to LLMs’ stochasticity, the quality of the generated diagrams varies. Additionally, we present an in-depth analysis and discussion on the connection between AI hallucination and the quality of generated diagrams, which reveals that text contexts of higher complexity lead to higher rates of hallucination and that LLMs often fail to detect mistakes in their output.

[58] Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models

Kumiko Nakajima, Jan Zuiderveld, Sandro Pezzelle

Main category: cs.CL

TL;DR: The paper critiques the Divergent Association Task (DAT) for LLM creativity evaluation, introduces a new Conditional Divergent Association Task (CDAT) that combines novelty and appropriateness, and finds smaller models often show more creativity than advanced models under this framework.

DetailsMotivation: Current assessments of LLM creative capabilities are weakly grounded in human creativity theory. The widely used DAT focuses only on novelty while ignoring appropriateness, which is a core component of creativity according to established theory.

Method: The authors introduce Conditional Divergent Association Task (CDAT), which evaluates novelty conditional on contextual appropriateness. This separates noise from creativity better than DAT while remaining simple and objective. They evaluate a range of state-of-the-art LLMs on both DAT and CDAT.
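
DAT-style novelty is typically the mean pairwise semantic distance over a word list; CDAT, as described, measures that novelty only over contextually appropriate responses. A minimal sketch with placeholder embedding and appropriateness callables (both assumed interfaces):

```python
import numpy as np

def dat_score(embeddings: np.ndarray) -> float:
    """DAT-style novelty: mean pairwise cosine distance between word embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = np.triu_indices(len(embeddings), k=1)
    return float((1.0 - sims[upper]).mean())

def cdat_score(words, embed, is_appropriate) -> float:
    """Novelty conditional on appropriateness: score only the appropriate words."""
    kept = [w for w in words if is_appropriate(w)]  # assumed appropriateness judge
    if len(kept) < 2:
        return 0.0  # too few appropriate responses to measure divergence
    return dat_score(np.stack([embed(w) for w in kept]))

# Toy run with random embeddings and a trivial appropriateness check.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["cat", "engine", "sonata", "xqzt"]}
print(cdat_score(list(vectors), vectors.get, lambda w: w != "xqzt"))
```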

Result: LLM scores on DAT are lower than two baselines without creative abilities, undermining DAT’s validity. Under CDAT, smaller model families often show the most creativity, while advanced families favor appropriateness at lower novelty. Training and alignment likely shift models toward more appropriate but less creative outputs.

Conclusion: CDAT provides a better grounded creativity evaluation framework than DAT by incorporating both novelty and appropriateness. The findings suggest that model advancement through training and alignment may come at the cost of creative capabilities, favoring appropriateness over novelty.

Abstract: Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities, undermining its validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT, while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment likely shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.

[59] Single-Nodal Spontaneous Symmetry Breaking in NLP Models

Shalom Rosner, Ronit D. Gross, Ella Koresh, Ido Kanter

Main category: cs.CL

TL;DR: The paper demonstrates spontaneous symmetry breaking in NLP models during pre-training and fine-tuning, occurring at individual attention heads and even single nodes, with a crossover in learning ability as node count increases.

DetailsMotivation: To show that spontaneous symmetry breaking phenomena from statistical mechanics can occur in NLP models, even with deterministic dynamics and finite architectures, and to understand how this manifests at different scales from individual nodes to attention heads.

Method: Theoretical framework analyzing symmetry breaking in NLP models using convex hull analysis to upper-bound nodal functions, with experimental demonstration using BERT-6 architecture pre-trained on Wikipedia and fine-tuned on FewRel classification task.

Result: Spontaneous symmetry breaking emerges in NLP models during both pre-training and fine-tuning, observable at individual attention heads and even single nodes. A crossover in learning ability occurs as node count increases, balancing random-guess decrease with nodal cooperation enhancement.

Conclusion: NLP models exhibit spontaneous symmetry breaking similar to statistical mechanics systems, but with nodal functions explicitly contributing to global tasks (unlike spin-glass systems), providing new insights into neural network learning mechanisms.

Abstract: Spontaneous symmetry breaking in statistical mechanics primarily occurs during phase transitions at the thermodynamic limit, where the Hamiltonian preserves inversion symmetry yet the low-temperature free energy exhibits reduced symmetry. Herein, we demonstrate the emergence of spontaneous symmetry breaking in natural language processing (NLP) models during both pre-training and fine-tuning, even under deterministic dynamics and within a finite training architecture. This phenomenon occurs at the level of individual attention heads, scales down to small subsets of their nodes, and remains valid at the single-node level, where nodes acquire the capacity to learn a limited set of tokens after pre-training or labels after fine-tuning for a specific classification task. As the number of nodes increases, a crossover in learning ability occurs, governed by the tradeoff between a decrease driven by random guessing among a larger set of possible outputs and an enhancement from nodal cooperation that exceeds the sum of individual nodal capabilities. In contrast to spin-glass systems, where a microscopic state of frozen spins cannot be directly linked to the free-energy minimization goal, each nodal function in this framework contributes explicitly to the global network task and can be upper-bounded using convex hull analysis. Results are demonstrated using a BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel classification task.

[60] A Computational Approach to Language Contact – A Case Study of Persian

Ali Basirat, Danial Namazifard, Navid Baradaran Hemmati

Main category: cs.CL

TL;DR: Monolingual language models show selective contact effects: universal syntax is contact-insensitive while morphology (Case/Gender) reflects language-specific structure.

DetailsMotivation: To understand how language contact history affects the intermediate representations of monolingual language models, using Persian as a contact-rich case study.

Method: Probe Persian-trained model representations when exposed to languages with varying contact histories with Persian. Quantify linguistic information in intermediate representations and assess distribution across model components for different morphosyntactic features.
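
Probing here means fitting a lightweight classifier on frozen intermediate representations and reading its accuracy as the amount of encoded information. A standard sketch with scikit-learn; the random features and labels are placeholders for real layer activations and Case/Gender annotations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(reps: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of a linear probe predicting a morphosyntactic label from
    frozen layer representations; higher accuracy means more encoded information."""
    x_tr, x_te, y_tr, y_te = train_test_split(reps, labels,
                                              test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# Illustrative: random 768-d "layer" vectors with fake Case labels.
reps = np.random.randn(200, 768)
labels = np.random.randint(0, 3, size=200)
print(probe_layer(reps, labels))  # ~chance here; real activations differ by layer
```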

Result: Universal syntactic information is largely insensitive to historical contact, while morphological features (Case and Gender) are strongly shaped by language-specific structure.

Conclusion: Contact effects in monolingual language models are selective and structurally constrained, with morphology showing stronger language-specific influence than universal syntax.

Abstract: We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.

[61] AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Kaiyuan Chen, Qimin Wu, Taiyu Hou, Tianhao Tang, Xueyu Hu, Yuchen Hou, Bikun Li, Chengming Qian, Guoyin Wang, Haolin Chen, Haotong Tian, Haoye Zhang, Haoyu Bian, Hongbing Pan, Hongkang Zhang, Hongyi Zhou, Jiaqi Cai, Jiewu Rao, Jiyuan Ren, Keduan Huang, Lucia Zhu Huang, Mingyu Yuan, Naixu Guo, Qicheng Tang, Qinyan Zhang, Shuai Chen, Siheng Chen, Ting Ting Li, Xiaoxing Guo, Yaocheng Zuo, Yaoqi Guo, Yinan Wang, Yinzhou Yu, Yize Wang, Yuan Jiang, Yuan Tian, Yuanshuo Zhang, Yuxuan Liu, Yvette Yan Zeng, Zenyu Shan, Zihan Yin, Xiaobo Hu, Yang Liu, Yixin Ren, Yuan Gong

Main category: cs.CL

TL;DR: AgentIF-OneDay benchmark evaluates AI agents on diverse daily tasks requiring natural language instructions, attachment understanding, and file-based outputs, revealing API-based and ChatGPT agents perform best.

DetailsMotivation: Current AI evaluations focus on increasing task difficulty but lack diversity to cover daily work, life, and learning activities for general users, limiting perception of advanced AI capabilities in everyday scenarios.

Method: Propose AgentIF-OneDay benchmark with 104 tasks (767 scoring points) across three categories: Open Workflow Execution (explicit workflows), Latent Instruction (inferring from attachments), and Iterative Refinement (modifying ongoing work). Uses instance-level rubrics and refined evaluation pipeline with LLM-based verification aligned with human judgment.

Result: Achieved 80.1% agreement rate between LLM-based verification (Gemini-3-Pro) and human judgment. Benchmarking shows API-based and ChatGPT agents built with agent RL perform best, indicating leading LLM APIs and open-source models have internalized agentic capabilities.

Conclusion: AgentIF-OneDay demonstrates that general users can utilize AI agents for diverse daily tasks, and that current LLM APIs enable development of cutting-edge agent products, bridging the gap between advanced AI capabilities and practical daily applications.

Abstract: The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or expanding upon ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. We benchmarked four leading general AI agents and found that agent products built on top of APIs and ChatGPT agents trained with agent RL jointly occupy the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge agent products.

[62] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun, Yi Wang, Xiaozhong Liu, Kun Kuang

Main category: cs.CL

TL;DR: P2S introduces probabilistic process supervision for LLM reasoning, using path faithfulness rewards from synthesized reference chains to provide dense step-by-step guidance, outperforming outcome-only methods on reading comprehension and medical QA.

DetailsMotivation: Existing RL methods for LLM reasoning (like RLPR) focus only on final answer probabilities, neglecting step-by-step supervision of the reasoning process itself, creating a gap in fine-grained process guidance.

Method: P2S synthesizes and filters high-quality reference reasoning chains (gold-CoT) during RL, then computes Path Faithfulness Reward (PFR) for each step based on conditional probability of generating the gold-CoT suffix given current reasoning prefix, combinable with outcome rewards.
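
The Path Faithfulness Reward is, in essence, a teacher-forced conditional log-probability of the gold chain's remaining steps given the model's current reasoning prefix. A minimal sketch with Hugging Face Transformers; the stand-in model and length normalization are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def path_faithfulness_reward(prefix: str, gold_suffix: str) -> float:
    """Mean log-prob of the gold-CoT suffix conditioned on the reasoning prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    suffix_ids = tok(gold_suffix, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    logits = model(ids).logits[:, :-1]       # position t predicts token t+1
    logprobs = logits.log_softmax(-1)
    n_prefix = prefix_ids.shape[1]
    targets = ids[:, n_prefix:]              # the gold suffix tokens
    step_lp = logprobs[:, n_prefix - 1:].gather(-1, targets.unsqueeze(-1))
    return step_lp.mean().item()             # length-normalized (assumed)

print(path_faithfulness_reward("First, compute 2+2.", " The sum is 4."))
```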

Result: Extensive experiments on reading comprehension and medical QA benchmarks show P2S significantly outperforms strong baselines, effectively addressing reward sparsity through dense process guidance.

Conclusion: P2S provides a novel self-supervision framework that enables fine-grained process rewards without separate reward models or human annotations, flexibly integrating with outcome rewards to improve LLM reasoning in general domains.

Abstract: While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT’s suffix, given the model’s current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical question answering benchmarks show that P2S significantly outperforms strong baselines.

[63] Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

Juan Jose Rubio Jan, Jack Wu, Julia Ive

Main category: cs.CL

TL;DR: LLMs show promise for EHR data tasks: structured querying with Python/Pandas and unstructured text extraction via RAG, evaluated on MIMIC III data with automated synthetic QA generation.

DetailsMotivation: To explore LLMs' potential for foundational EHR data science tasks - both structured data querying and unstructured clinical text extraction - which are critical for clinical workflows and analytics.

Method: Developed flexible evaluation framework with automated synthetic QA generation tailored to each dataset/task. Tested on MIMIC III subset (4 structured tables, 1 clinical note type) using local and API-based LLMs. Used RAG pipeline for text extraction and Python/Pandas for structured querying.
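
The structured-querying track can be pictured as: the LLM emits a Pandas expression, the harness executes it against the table, and the result is exact-matched against the gold answer. A schematic harness in which the generated query string is hard-coded for illustration (real outputs would come from the LLM):

```python
import pandas as pd

admissions = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY"],
})

def run_generated_query(df: pd.DataFrame, code: str):
    """Evaluate a model-generated Pandas expression over one table.

    Calling eval() on untrusted LLM output is unsafe outside a sandbox;
    this stripped-down environment is for illustration only.
    """
    return eval(code, {"__builtins__": {}}, {"df": df})

generated = "(df['admission_type'] == 'EMERGENCY').sum()"  # stand-in LLM output
pred, gold = run_generated_query(admissions, generated), 2
print("exact match:", pred == gold)
```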

Result: LLMs demonstrated potential for precise querying and accurate information extraction in clinical workflows, evaluated through exact-match metrics, semantic similarity, and human judgment.

Conclusion: LLMs can effectively support both structured EHR data querying and unstructured clinical text extraction, showing promise for integration into clinical data science workflows.

Abstract: This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using programmatic languages, Python/Pandas) and information extraction from unstructured clinical text via a Retrieval Augmented Generation (RAG) pipeline. We test the ability of LLMs to interact accurately with large structured datasets for analytics and the reliability of LLMs in extracting semantically correct information from free text health records when supported by RAG. To this end, we present a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC III (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.

[64] Efficient Multimodal Planning Agent for Visual Question-Answering

Zhuo Chen, Xinyu Geng, Xinyu Wang, Yong Jiang, Zhen Zhang, Pengjun Xie, Kewei Tu

Main category: cs.CL

TL;DR: Proposes a multimodal planning agent that dynamically decomposes mRAG pipelines for VQA to optimize efficiency-effectiveness trade-off, reducing search time by 60% while outperforming baselines.

DetailsMotivation: Current mRAG approaches for knowledge-intensive VQA tasks use multi-stage pipelines with inherent dependencies, leading to inefficiency issues. There's a need to maintain VQA performance while reducing computational overhead.

Method: Trains a multimodal planning agent that dynamically decomposes the mRAG pipeline, intelligently determining the necessity of each mRAG step to optimize the efficiency-effectiveness trade-off.

Result: Reduces redundant computations by cutting search time over 60% compared to existing methods, decreases costly tool calls, and outperforms all baselines (including Deep Research agent and prompt-based methods) across six datasets.

Conclusion: The proposed multimodal planning agent successfully optimizes the trade-off between efficiency and effectiveness in VQA tasks, demonstrating significant computational savings while maintaining superior performance compared to existing approaches.

Abstract: Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average across six diverse datasets. Code will be released.
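
The core mechanism is a planner that decides, per query, which mRAG stages are actually needed. A minimal sketch of such step-gating follows; the step names and the heuristic stand-in for the trained planner are illustrative assumptions.

```python
# Hedged sketch: gate each mRAG stage on a planner decision so that
# unnecessary retrieval (and its search time) is skipped entirely.
from typing import Callable

STEPS: dict[str, Callable[[dict], dict]] = {
    "image_search":   lambda s: {**s, "image_evidence": "<retrieved>"},
    "text_retrieval": lambda s: {**s, "text_evidence": "<retrieved>"},
}

def plan(query: str) -> list[str]:
    """Stand-in for the trained planning agent: return the subset of
    mRAG steps (possibly empty) deemed necessary for this query."""
    knowledge_cues = ("who", "when", "which year")
    if any(cue in query.lower() for cue in knowledge_cues):
        return ["text_retrieval"]
    return []

def answer_vqa(query: str, image: object) -> str:
    state = {"query": query, "image": image}
    for step in plan(query):
        state = STEPS[step](state)
    return f"<answer conditioned on {sorted(state)}>"

print(answer_vqa("Who designed the building in the photo?", image=None))
print(answer_vqa("What color is the car?", image=None))  # no retrieval needed
```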

[65] ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code

Mingqiao Mo, Yunlong Tan, Hao Zhang, Heng Zhang, Yangfan He

Main category: cs.CL

TL;DR: ShieldedCode is the first protection-aware framework that learns robust representations of VMP-protected code using hierarchical dependency modeling and contrastive learning, achieving significant improvements in VM code generation and binary similarity detection.

DetailsMotivation: LLMs have advanced code generation but their potential for software protection remains untapped. Traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis, while reverse engineering continues to threaten software security.

Method: Builds large-scale paired datasets of source code and normalized VM implementations, introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels, and jointly optimizes language modeling with functionality-aware and protection-aware contrastive objectives. Uses a two-stage continual pre-training and fine-tuning pipeline.

Result: Achieves 26.95% Pass@1 on L0 VM code generation compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10% over state-of-the-art methods like jTrans. Framework significantly improves robustness across diverse protection levels.

Conclusion: ShieldedCode opens a new research direction for learning-based software defense by enabling models to generate, compare, and reason over protected code, demonstrating the potential of protection-aware frameworks for software security.

Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. ShieldedCode achieves 26.95% Pass@1 on L0 VM code generation compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10% over state-of-the-art methods like jTrans.
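
One way to read the joint objective is language modeling plus two InfoNCE-style contrastive terms: one pulling a source program toward its VM implementation (functionality), one relating variants of differing protection strength. Treating both terms as InfoNCE over pooled embeddings, and the loss weights, are our assumptions.

```python
# Hedged sketch of the joint training objective described above.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.07):
    # anchor, positive: (batch, dim) pooled code embeddings; row i of
    # `positive` is the positive pair for row i of `anchor`.
    logits = anchor @ positive.T / tau
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def joint_loss(lm_loss, src_emb, vm_emb, weak_vm_emb, strong_vm_emb,
               w_func: float = 0.5, w_prot: float = 0.5):
    func = info_nce(F.normalize(src_emb, dim=-1),        # same functionality:
                    F.normalize(vm_emb, dim=-1))         # source <-> VM form
    prot = info_nce(F.normalize(weak_vm_emb, dim=-1),    # same source, varying
                    F.normalize(strong_vm_emb, dim=-1))  # protection strength
    return lm_loss + w_func * func + w_prot * prot
```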

[66] Online Density-Based Clustering for Real-Time Narrative Evolution Monitoring

Ostap Vykhopen, Viktoria Skorik, Maxim Tereschenko, Veronika Solopova

Main category: cs.CL

TL;DR: The paper investigates replacing batch HDBSCAN with online clustering methods for scalable social media narrative monitoring, evaluating algorithms on cluster quality, efficiency, and narrative metrics in a production pipeline.

DetailsMotivation: Traditional batch clustering algorithms like HDBSCAN face scalability challenges for continuous social media streams, requiring complete retraining for each time window, causing memory constraints, computational inefficiency, and inability to adapt to evolving narratives in real-time.

Method: Proposes a three-stage architecture (data collection, modeling, dashboard generation) and evaluates various online clustering algorithms using sliding-window simulations on historical Ukraine information space datasets. Introduces evaluation criteria balancing traditional clustering metrics (Silhouette Coefficient, Davies-Bouldin Index) with narrative-specific metrics (narrative distinctness, contingency and variance).

Result: The research provides comparative analysis of algorithmic trade-offs in realistic operational contexts, addressing the gap between batch-oriented topic modeling frameworks and streaming social media monitoring requirements.

Conclusion: This work bridges critical gaps for scalable narrative intelligence systems, with implications for computational social science, crisis informatics, and narrative surveillance systems by enabling real-time adaptation to evolving narratives.

Abstract: Automated narrative intelligence systems for social media monitoring face significant scalability challenges when processing continuous data streams using traditional batch clustering algorithms. We investigate the replacement of HDBSCAN (offline clustering) with online (streaming/incremental) clustering methods in a production narrative report generation pipeline. The proposed system employs a three-stage architecture (data collection, modeling, dashboard generation) that processes thousands of multilingual social media documents daily. While HDBSCAN excels at discovering hierarchical density-based clusters and handling noise, its batch-only nature necessitates complete retraining for each time window, resulting in memory constraints, computational inefficiency, and inability to adapt to evolving narratives in real-time. This work evaluates a range of online clustering algorithms across dimensions of cluster quality preservation, computational efficiency, memory footprint, and integration compatibility with existing workflows. We propose evaluation criteria that balance traditional clustering metrics (Silhouette Coefficient, Davies-Bouldin Index) with narrative metrics (narrative distinctness, contingency, and variance). Our methodology includes sliding-window simulations on historical datasets from the Ukraine information space, enabling comparative analysis of algorithmic trade-offs in realistic operational contexts. This research addresses a critical gap between batch-oriented topic modeling frameworks and the streaming nature of social media monitoring, with implications for computational social science, crisis informatics, and narrative surveillance systems.
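
A minimal sketch of the sliding-window setup: an incremental clusterer consumes one window of document embeddings at a time and is scored with a traditional metric after each update. MiniBatchKMeans as the online stand-in and a fixed cluster count are our assumptions; the paper compares a range of streaming algorithms.

```python
# Hedged sketch: incremental clustering over daily windows, scored with
# the Silhouette Coefficient after each partial update (no retraining).
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
stream = [rng.normal(size=(500, 32)) for _ in range(7)]  # 7 daily windows

model = MiniBatchKMeans(n_clusters=8, random_state=0)
for day, batch in enumerate(stream):
    model.partial_fit(batch)            # incremental update only
    labels = model.predict(batch)
    print(f"day {day}: silhouette = {silhouette_score(batch, labels):.3f}")
```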

[67] AgentLongBench: A Controllable Long Benchmark for Long-Context Agents via Environment Rollouts

Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Main category: cs.CL

TL;DR: AgentLongBench is a new benchmark that evaluates LLM agents through simulated environment rollouts using Lateral Thinking Puzzles, revealing that agents struggle with dynamic information synthesis despite handling static retrieval well.

DetailsMotivation: Current benchmarks for LLM agents are largely static and rely on passive retrieval tasks, failing to simulate the complexities of real agent-environment interactions like non-linear reasoning and iterative feedback.

Method: Introduces AgentLongBench which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles, generating interaction trajectories across knowledge-intensive and knowledge-free scenarios.

Result: Experiments with state-of-the-art models and memory systems (32K to 4M tokens) show agents struggle with dynamic information synthesis essential for workflows, despite being adept at static retrieval. The degradation is driven by the minimum tokens required to resolve queries.

Conclusion: High information density in massive tool responses poses a significantly greater challenge than memory fragmentation in long-turn dialogues, highlighting a critical weakness in current LLM agent capabilities for dynamic environments.

Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

[68] QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

Mae Sosto, Delfina Sol Martinez Pandiani, Laura Hollink

Main category: cs.CL

TL;DR: LLMs reproduce societal heterocisnormativity, showing measurable biases in text generation based on gender/sexuality categories, with different model types exhibiting varying patterns of representational harm.

DetailsMotivation: To investigate how Large Language Models reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in text generation across different gender and sexuality categories.

Method: Examined whether explicit gender/sexuality information influences LLM responses across three subject categories: queer-marked, non-queer-marked, and unmarked. Operationalized representational imbalances through measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity.

Result: Masked Language Models (MLMs) produced least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigated these patterns, while closed-access ARLMs tended to produce more harmful outputs for unmarked subjects.

Conclusion: LLMs reproduce normative social assumptions, but the form and degree of bias depend strongly on specific model characteristics, which may redistribute rather than eliminate representational harms.

Abstract: This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject’s gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized “unmarked” category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute, but not eliminate, representational harms.
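
Two of the four measured dimensions are easy to make concrete; a minimal sketch follows (regard and toxicity would need dedicated classifiers). The default sentiment pipeline and the distinct-unigram proxy for prediction diversity are our assumptions.

```python
# Hedged sketch: sentiment and prediction diversity for completions
# grouped by subject category.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English model

def distinct_1(completions: list[str]) -> float:
    tokens = [t for c in completions for t in c.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)

groups = {  # toy completions per subject category
    "queer-marked":     ["The lesbian doctor was praised by colleagues."],
    "non-queer-marked": ["The straight doctor was praised by colleagues."],
    "unmarked":         ["The doctor was praised by colleagues."],
}
for name, completions in groups.items():
    label = sentiment(completions)[0]["label"]
    print(name, label, round(distinct_1(completions), 3))
```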

[69] Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts

Elham Aghakhani, Rezvaneh Rezapour

Main category: cs.CL

TL;DR: People use AI for mental health support, evaluating it based on outcomes and trust rather than emotional bonds alone. Positive experiences relate to goal alignment, while companionship use carries risks like dependence.

DetailsMotivation: LLMs are increasingly used for emotional support outside clinical settings, but there's limited understanding of how people evaluate and relate to these systems in everyday use. The study aims to analyze real-world user experiences with AI for mental health support.

Method: Analyzed 5,126 Reddit posts from 47 mental health communities using a hybrid LLM-human annotation pipeline. Developed a theory-informed framework based on Technology Acceptance Model and therapeutic alliance theory to analyze evaluative language, adoption attitudes, and relational alignment.

Result: Engagement is shaped primarily by narrated outcomes, trust, and response quality rather than emotional bond alone. Positive sentiment correlates strongly with task and goal alignment. Companionship-oriented use often involves misaligned alliances and risks like dependence and symptom escalation.

Conclusion: Theory-grounded constructs can be effectively operationalized in large-scale discourse analysis. Studying how users interpret language technologies in sensitive real-world contexts is crucial, as different use patterns (goal-oriented vs. companionship) carry different risks and benefits.

Abstract: Large language models (LLMs) are increasingly used for emotional support and mental health-related interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. We analyze 5,126 Reddit posts from 47 mental health communities describing experiential or exploratory use of AI for emotional support or therapy. Grounded in the Technology Acceptance Model and therapeutic alliance theory, we develop a theory-informed annotation framework and apply a hybrid LLM-human pipeline to analyze evaluative language, adoption-related attitudes, and relational alignment at scale. Our results show that engagement is shaped primarily by narrated outcomes, trust, and response quality, rather than emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. Overall, this work demonstrates how theory-grounded constructs can be operationalized in large-scale discourse analysis and highlights the importance of studying how users interpret language technologies in sensitive, real-world contexts.

[70] Persona Prompting as a Lens on LLM Social Reasoning

Jing Yang, Moritz Hechtbauer, Elisabeth Khalilov, Evelyn Luise Brinkmann, Vera Schmitt, Nils Feldhus

Main category: cs.CL

TL;DR: Persona prompting improves hate speech classification but degrades rationale quality, fails to align with real demographics, and doesn’t mitigate model biases.

DetailsMotivation: To investigate how persona prompting affects LLM-generated rationales for socially sensitive tasks like hate speech detection, particularly examining alignment with different demographic groups and impact on model bias.

Method: Used datasets with word-level human rationales, simulated different demographic personas, measured agreement with human annotations from various demographic groups, and evaluated across three LLMs to assess persona prompting’s impact on bias and human alignment.

Result: 1) Persona prompting improves hate speech classification but degrades rationale quality; 2) Simulated personas don’t align with real demographic counterparts and models resist significant steering; 3) Models show consistent demographic biases and over-flag content as harmful regardless of persona prompting.

Conclusion: Persona prompting creates a critical trade-off: while it can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, requiring caution in its application.

Abstract: For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer models toward user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.
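
Word-level rationale agreement can be made concrete as F1 between binary token masks; the mask framing and F1 as the agreement measure are our assumptions about the evaluation, not the paper's stated metric.

```python
# Hedged sketch: agreement between a persona-conditioned model rationale
# and one demographic group's human rationale, both as 0/1 token masks.
def rationale_f1(pred_mask: list[int], gold_mask: list[int]) -> float:
    tp = sum(p and g for p, g in zip(pred_mask, gold_mask))
    fp = sum(p and not g for p, g in zip(pred_mask, gold_mask))
    fn = sum(g and not p for p, g in zip(pred_mask, gold_mask))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# tokens:  ["you", "people", "are", "awful"]
human   = [0, 1, 0, 1]   # annotator from one demographic group
persona = [0, 0, 0, 1]   # model rationale under a simulated persona
print(round(rationale_f1(persona, human), 3))  # 0.667
```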

[71] SERA: Soft-Verified Efficient Repository Agents

Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers

Main category: cs.CL

TL;DR: SERA enables efficient training of coding agents specialized to private codebases using supervised finetuning, achieving state-of-the-art performance at 26-57x lower cost than previous methods.

DetailsMotivation: Open-weight coding agents should theoretically excel at private codebase specialization, but training costs and complexity have prevented this advantage from being realized in practice.

Method: Soft Verified Generation (SVG) generates thousands of trajectories from a single code repository, enabling efficient supervised finetuning (SFT) for repository specialization.

Result: SERA achieves state-of-the-art results among fully open-source models while matching frontier open-weight models, with 26x cheaper training than RL and 57x cheaper than synthetic data methods.

Conclusion: This work makes private codebase specialization practical, accelerates open coding agent research, and demonstrates the advantage of open-source models that can adapt to specific repositories.

Abstract: Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training have kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2’s Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

[72] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

Main category: cs.CL

TL;DR: Transformers can learn multimodal associations from few in-context examples, with surprising efficiency: high-diversity primary modality data enables multimodal ICL with low secondary modality complexity.

DetailsMotivation: To understand how transformers learn to associate information across modalities from in-context examples, investigating the mechanistic foundations of multimodal in-context learning.

Method: Controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. Analysis includes unimodal ICL principles, RoPE effects, and multimodal learning asymmetry.

Result: RoPE increases data complexity threshold for ICL. Multimodal setting reveals fundamental learning asymmetry: high-diversity primary modality data enables multimodal ICL with surprisingly low secondary modality complexity. Both settings rely on induction-style mechanisms copying labels from matching exemplars.

Conclusion: The findings provide mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation of cross-modal associations.

Abstract: Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.

[73] Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco

Main category: cs.CL

TL;DR: Novel hybrid method for one-shot relation extraction combines syntactic-semantic similarity selection with LLM-generated examples to improve in-context learning performance.

DetailsMotivation: Need to automatically obtain additional examples for in-context learning in one-shot relation extraction, as single examples are insufficient for robust performance.

Method: Introduces syntactic-semantic structure similarity selection for choosing new examples, combined with LLM-generated examples to create a hybrid system that provides complementary perspectives.

Result: Hybrid method consistently outperforms alternative strategies, achieves state-of-the-art on FS-TACRED and strong gains on FewRel subset, transfers well across datasets and LLM families.

Conclusion: Combining syntactic-semantic similarity selection with LLM-generated examples creates a more holistic representation of relations, leading to superior performance in one-shot relation extraction.

Abstract: This paper presents several strategies to automatically obtain additional examples for in-context learning of one-shot relation extraction. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided one-shot example. We show that this method results in complementary word choices and sentence structures when compared to LLM-generated examples. When these strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid selection method consistently outperforms alternative strategies and achieves state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.
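
One plausible instantiation of the similarity signal is overlap between dependency-parse triples of the one-shot example and each candidate; spaCy and this particular triple representation are our assumptions, not the paper's exact structure encoding.

```python
# Hedged sketch: Jaccard similarity over (relation, head lemma, child
# lemma) dependency triples as a syntactic-semantic structure score.
import spacy

nlp = spacy.load("en_core_web_sm")  # model must be installed separately

def dep_triples(sentence: str) -> set[tuple[str, str, str]]:
    return {(tok.dep_, tok.head.lemma_, tok.lemma_) for tok in nlp(sentence)}

def structure_similarity(a: str, b: str) -> float:
    ta, tb = dep_triples(a), dep_triples(b)
    return len(ta & tb) / max(len(ta | tb), 1)

one_shot  = "Marie Curie was born in Warsaw."
candidate = "Alan Turing was born in London."
print(round(structure_similarity(one_shot, candidate), 3))
```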

[74] Linear representations in language models can change dramatically over a conversation

Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan

Main category: cs.CL

TL;DR: Language model representations evolve dynamically during conversations, with factual information changing based on context, challenging static interpretability methods.

DetailsMotivation: To understand how language model representations change during conversations and how this affects interpretability and steering approaches.

Method: Analyzed linear representation dynamics in simulated conversations across different model families and layers, using both on-policy conversations and script replay, and tested steering effects at different conversation points.

Result: Representations change dramatically during conversations (factual info can become non-factual and vice versa), changes are content-dependent and robust across models, occur even with script replay, but weaker with explicit framing like sci-fi stories.

Conclusion: Representations evolve in response to conversational roles, challenging static interpretability methods but opening new research directions for understanding contextual adaptation.

Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering – in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
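
The basic probe operation is simple: project a hidden state onto a fixed direction at each point in the conversation and watch the projection move. The model choice, layer, and a random stand-in direction are illustrative assumptions (a real probe direction would be learned from labeled examples).

```python
# Hedged sketch: track a linear "factuality-like" direction across turns.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in for the model families studied
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

direction = torch.randn(model.config.hidden_size)  # stand-in probe direction
direction /= direction.norm()

turns = ["The Eiffel Tower is in Paris.",
         "In our story, the Eiffel Tower was moved to Rome.",
         "Where is the Eiffel Tower?"]
context = ""
for turn in turns:
    context += turn + "\n"
    ids = tok(context, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).hidden_states[6][0, -1]   # mid-layer, last token
    print(round(float(h @ direction), 3))          # projection per turn
```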

[75] When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation

David Tan, Pinzhen Chen, Josef van Genabith, Koel Dutta Chowdhury

Main category: cs.CL

TL;DR: LLMs can be benchmark-contaminated, inflating scores and masking memorization as generalization. Using FLORES-200, the study shows contamination can be cross-directional and memorization persists despite source-side perturbations, though named entity replacement helps detect it.

DetailsMotivation: To investigate benchmark contamination in multilingual LLMs, which can artificially inflate performance scores and mask memorization as true generalization, particularly in multilingual settings where contamination can transfer to supposedly "uncontaminated" languages.

Method: Used FLORES-200 translation benchmark as diagnostic tool; studied two 7-8B instruction-tuned multilingual LLMs: Bloomz (contaminated, trained on FLORES) and Llama (uncontaminated control). Analyzed cross-directional contamination effects and tested memorization persistence through various source-side perturbations including paraphrasing and named entity replacement.

Result: Confirmed Bloomz’s FLORES contamination; demonstrated machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Memorized references often persist despite source-side perturbations, but named entity replacement consistently decreases BLEU scores, providing an effective probing method for detecting memorization.

Conclusion: Benchmark contamination in LLMs leads to inflated performance metrics that mask memorization as generalization. Cross-directional contamination effects exist, and while memorization is resilient to many perturbations, named entity replacement serves as an effective diagnostic tool for detecting contamination and memorization in multilingual translation models.

Abstract: Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to “uncontaminated” languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz’s FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
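
The named-entity probe is easy to reproduce in outline: translate the original and an entity-swapped source, then compare BLEU against the fixed reference. The translate() stub and the toy sentence pair are illustrative assumptions; a real run would use the contaminated model and FLORES-200 pairs.

```python
# Hedged sketch of the entity-replacement memorization probe.
import sacrebleu

def translate(src: str) -> str:
    return src  # stand-in for the model under test

refs = [["Angela Merkel visited Paris in 2007."]]            # fixed reference
original  = "Angela Merkel besuchte Paris im Jahr 2007."
perturbed = "Maria Schmidt besuchte Lissabon im Jahr 2007."  # entities swapped

for src in (original, perturbed):
    score = sacrebleu.corpus_bleu([translate(src)], refs).score
    print(f"{score:.2f}")
# A memorizing model stays high on the original but drops sharply on the
# perturbed source, because the memorized reference no longer matches.
```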

[76] An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases

Dylan Bouchard

Main category: cs.CL

TL;DR: The paper presents a decision framework for selecting appropriate bias and fairness evaluation metrics for LLMs based on deployment context, with an open-source library called LangFair.

DetailsMotivation: Existing approaches lack systematic guidance for selecting appropriate bias and fairness evaluation metrics for LLMs across different deployment contexts, as bias risks vary substantially depending on how models are used.

Method: A decision framework that maps LLM use cases (model + prompt population) to relevant bias/fairness metrics based on task type, presence of protected attributes in prompts, and stakeholder priorities. Introduces novel metrics using stereotype classifiers and counterfactual adaptations of text similarity measures that only require LLM outputs.

Result: Extensive experiments show fairness risks cannot be reliably assessed from benchmark performance alone - results on one prompt dataset often overstate or understate risks for another, emphasizing the need for context-specific evaluation.

Conclusion: Fairness evaluation must be grounded in specific deployment contexts, and the proposed framework with LangFair library provides practical tools for context-aware bias assessment in LLMs.

Abstract: Bias and fairness risks in Large Language Models (LLMs) vary substantially across deployment contexts, yet existing approaches lack systematic guidance for selecting appropriate evaluation metrics. We present a decision framework that maps LLM use cases, characterized by a model and population of prompts, to relevant bias and fairness metrics based on task type, whether prompts contain protected attribute mentions, and stakeholder priorities. Our framework addresses toxicity, stereotyping, counterfactual unfairness, and allocational harms, and introduces novel metrics based on stereotype classifiers and counterfactual adaptations of text similarity measures. All metrics require only LLM outputs for computation, simplifying implementation while avoiding embedding-based approaches that often correlate poorly with downstream harms. We provide an open-source Python library, LangFair, for practical adoption. Extensive experiments demonstrate that fairness risks cannot be reliably assessed from benchmark performance alone: results on one prompt dataset likely overstate or understate risks for another, underscoring that fairness evaluation must be grounded in the specific deployment context.

[77] LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick

Main category: cs.CL

TL;DR: LogogramNLP benchmark enables NLP analysis of ancient logographic languages using visual representations, showing visual processing outperforms text for some tasks.

DetailsMotivation: Ancient logographic writing systems lack transcribed data (mostly images), creating a bottleneck for NLP analysis since traditional NLP requires discrete token sequences.

Method: Created LogogramNLP benchmark with transcribed and visual datasets for four writing systems, comparing visual vs. text encoding strategies for classification, translation, and parsing tasks.

Result: Visual representations outperform textual representations for some tasks, suggesting visual processing can unlock cultural heritage data for NLP analysis.

Conclusion: Direct visual processing offers a viable solution for analyzing ancient logographic languages, potentially unlocking large amounts of untranscribed cultural heritage data.

Abstract: Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription – this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.

[78] Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs’ Memory

Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang

Main category: cs.CL

TL;DR: LLMs retain correct knowledge internally even when generating wrong answers; new Hits@k metric reveals more factual knowledge than standard QA accuracy shows; current prompting methods suppress correct answers.

DetailsMotivation: LLMs underperform on QA tasks despite having knowledge, with failures attributed to hallucinations/uncertainty. The authors investigate whether LLMs actually retain correct knowledge internally even when generating incorrect answers.

Method: Analyze token-level output distributions to find correct answers among high-probability candidates; propose Hits@k metric to evaluate latent knowledge retention independent of surface form; design experiments to measure suppression effect of prompting strategies that allow “unsure” outputs.

Result: LLMs possess significantly more factual knowledge than standard QA accuracy reflects; prompting strategies allowing “unsure” outputs inadvertently suppress correct answers by discouraging low-confidence generation.

Conclusion: Current evaluation methods underestimate LLM knowledge; new metrics and prompt design approaches are needed to better assess and utilize latent knowledge in knowledge-intensive tasks.

Abstract: Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model’s parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or “unsure” answers. By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy. Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow “unsure” outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.
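
A simplified version of the metric's core check fits in a few lines: does the gold answer sit among the k most probable continuations even when it is not the one generated? Scoring only the answer's first token is our simplification; the paper's metric is independent of surface form.

```python
# Hedged sketch of a Hits@k-style latent-knowledge check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")    # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def hits_at_k(prompt: str, gold: str, k: int = 10) -> bool:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]     # next-token distribution
    topk = torch.topk(logits, k).indices.tolist()
    gold_id = tok.encode(" " + gold)[0]         # answer's first token
    return gold_id in topk

print(hits_at_k("The capital of France is", "Paris"))
```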

[79] Summaries as Centroids for Interpretable and Scalable Text Clustering

Jairo Diaz-Rodriguez

Main category: cs.CL

TL;DR: k-NLPmeans and k-LLMmeans are text clustering variants of k-means that use textual summaries as centroids instead of numeric vectors, making clusters human-readable and auditable while maintaining k-means assignment efficiency.

DetailsMotivation: Traditional k-means clustering produces numeric centroids that are not human-interpretable for text data. There's a need for clustering methods that produce readable, auditable cluster prototypes while maintaining computational efficiency and avoiding expensive LLM calls for large datasets.

Method: Two variants: k-NLPmeans uses lightweight deterministic summarizers for offline, low-cost operation; k-LLMmeans uses LLMs for summaries under a fixed per-iteration budget (cost doesn’t scale with dataset size). Both periodically replace numeric centroids with textual summaries while retaining k-means assignments in embedding space. Also includes mini-batch extension for streaming text clustering.

Result: The approach consistently outperforms classical baselines across diverse datasets, embedding models, and summarization strategies. It approaches the accuracy of recent LLM-based clustering methods without requiring extensive LLM calls. The authors also provide a StackExchange-derived benchmark for evaluating streaming text clustering.

Conclusion: The summary-as-centroid approach enables human-readable, auditable text clustering that maintains computational efficiency. The LLM-optional design allows flexible deployment from low-cost deterministic methods to LLM-enhanced versions with controlled costs, making it suitable for both offline and real-time streaming applications.

Abstract: We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.
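
The summary-as-centroid loop is concise enough to sketch: assignments happen in embedding space exactly as in k-means, but each centroid is periodically replaced by the embedding of a textual summary of its cluster. Choosing the most central member document as the "summary" is our naive stand-in for the paper's deterministic summarizers.

```python
# Hedged sketch of k-NLPmeans with a most-central-document summarizer.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def k_nlpmeans(texts: list[str], k: int, iters: int = 5):
    X = encoder.encode(texts)
    summaries = list(texts[:k])                # seed with k documents
    for _ in range(iters):
        C = encoder.encode(summaries)          # textual centroids, embedded
        assign = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(1)
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            if members:                        # replace centroid with the
                center = X[members].mean(0)    # most central member text
                best = min(members, key=lambda i: np.linalg.norm(X[i] - center))
                summaries[j] = texts[best]
    return assign, summaries                   # summaries are human-readable
```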

[80] Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval

Aditya Sharma, Christopher J. Pal, Amal Zouaq

Main category: cs.CL

TL;DR: PGMR is a modular framework that separates SPARQL query generation into LLM-based placeholder creation and non-parametric URI retrieval, reducing hallucinations and improving robustness.

DetailsMotivation: LLMs often hallucinate KG elements like URIs when generating SPARQL queries, leading to inaccuracies and out-of-distribution errors due to reliance on opaque parametric knowledge.

Method: PGMR uses a two-step approach: 1) LLM generates intermediate query with natural language placeholders for URIs, 2) non-parametric memory module retrieves and resolves correct KG URIs from the knowledge graph.

Result: PGMR significantly improves query correctness across various LLMs, datasets, and distribution shifts, nearly eliminates URI hallucinations, and shows superior safety with retrieval confidence thresholds and resilience to memory noise.

Conclusion: The modular separation of parametric and non-parametric components in PGMR provides a robust, safe approach to SPARQL query generation that effectively addresses LLM hallucinations while maintaining performance under challenging conditions.

Abstract: The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KG). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when generating KG elements, such as Uniform Resource Identifiers (URIs), based on opaque internal parametric knowledge. We propose PGMR (Post-Generation Memory Retrieval), a modular framework where the LLM produces an intermediate query using natural language placeholders for URIs, and a non-parametric memory module is subsequently employed to retrieve and resolve the correct KG URIs. PGMR significantly enhances query correctness (SQM) across various LLMs, datasets, and distribution shifts, while achieving the near-complete suppression of URI hallucinations. Critically, we demonstrate PGMR’s superior safety and robustness: a retrieval confidence threshold enables PGMR to effectively refuse to answer queries that lack support, and the retriever proves highly resilient to memory noise, maintaining strong performance even when the non-parametric memory size is scaled up to 9 times with irrelevant, distracting entities.
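
The two stages separate cleanly: the LLM writes a query with natural-language placeholders, and a non-parametric memory grounds each placeholder in a KG URI. The bracket placeholder syntax and the toy label memory below are illustrative assumptions.

```python
# Hedged sketch of PGMR's post-generation URI resolution.
import re
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

memory = {  # label -> URI, normally harvested from the knowledge graph
    "Berlin":  "http://dbpedia.org/resource/Berlin",
    "Germany": "http://dbpedia.org/resource/Germany",
}
labels = list(memory)
label_emb = encoder.encode(labels, convert_to_tensor=True)

def resolve(intermediate_query: str) -> str:
    def swap(match: re.Match) -> str:
        q = encoder.encode(match.group(1), convert_to_tensor=True)
        best = int(util.cos_sim(q, label_emb).argmax())
        return f"<{memory[labels[best]]}>"
    return re.sub(r"\[([^\]]+)\]", swap, intermediate_query)

# Stage 1 output (LLM, with placeholders instead of URIs):
q = "SELECT ?p WHERE { [the city Berlin] ?p [the country Germany] . }"
print(resolve(q))  # Stage 2: placeholders grounded in the KG
```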

[81] Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

David Bani-Harouni, Chantal Pellegrini, Paul Stangel, Ege Özsoy, Kamilia Zaripova, Matthias Keicher, Nassir Navab

Main category: cs.CL

TL;DR: A novel RL method for fine-tuning LLMs to produce calibrated confidence estimates alongside factual answers, optimizing for alignment between confidence and actual accuracy.

DetailsMotivation: Safe and trustworthy use of LLMs requires accurate confidence expression in their answers, addressing the need for calibrated uncertainty estimation in factual question answering.

Method: Reinforcement Learning approach that directly fine-tunes LLMs using a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence, integrating confidence calibration into the generative process.

Result: Models trained with this approach show substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting emergence of general confidence awareness.

Conclusion: The proposed RL method successfully enables LLMs to produce calibrated confidence estimates, addressing a critical need for trustworthy AI systems through integrated confidence calibration during generation.

Abstract: A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness.
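
A consistent reading of the reward (notation ours): the model answers and verbalizes a confidence p in (0, 1), and correctness c in {0, 1} is checked against the gold answer.

```latex
% Hedged sketch of the logarithmic-scoring-rule reward.
R(p, c) \;=\; c \,\log p \;+\; (1 - c)\,\log(1 - p)
```

In expectation this is maximized when p equals the true probability that the answer is correct, which matches the claim that the optimal policy is perfectly calibrated; it also penalizes over-confidence (large p with c = 0) and under-confidence (small p with c = 1) symmetrically in log space.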

[82] NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

Ben Yao, Qiuchi Li, Yazhou Zhang, Siyu Yang, Bohan Zhang, Prayag Tiwari, Jing Qin

Main category: cs.CL

TL;DR: First benchmark for nursing value alignment evaluates LLMs on five core nursing values (Altruism, Human Dignity, Integrity, Justice, Professionalism) using 4,400 instances across easy and hard levels collected from real hospital settings.

DetailsMotivation: LLMs in clinical practice risk patients trusting AI over nurses, potentially intensifying nurse-patient conflicts. Need to evaluate if LLMs align with core nursing values upheld by human nurses.

Method: Created NurValues benchmark with five core nursing values from international codes. Two-level dataset: Easy-Level (2,200 instances from 5-month field study across three hospitals), Hard-Level (2,200 dialogue-based instances with contextual cues and subtle misleading signals). Evaluated 23 state-of-the-art LLMs.

Result: General LLMs outperform medical LLMs on nursing value alignment. Justice is the hardest value dimension for LLMs to align with. Provides insights into how LLMs navigate ethical challenges in clinician-patient interactions.

Conclusion: NurValues is the first real-world benchmark for healthcare value alignment, offering novel insights into LLM performance on nursing ethics and highlighting risks of LLM deployment in clinical settings where patients might trust AI over professional nursing judgment.

Abstract: While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses’ professional judgments, potentially intensifying nurse-patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse-patient conflicts. The Easy-Level dataset consists of 2,200 value-aligned and value-violating instances, which are collected through a five-month longitudinal field study across three hospitals of varying tiers; The Hard-Level dataset comprises 2,200 dialogue-based instances that embed contextual cues and subtle misleading signals, which increase adversarial complexity and better reflect the subjectivity and bias of narrators in the context of emerging nurse-patient conflicts. We evaluate a total of 23 SoTA LLMs on their ability to align with nursing values, and find that general LLMs outperform medical ones, and Justice is the hardest value dimension. As the first real-world benchmark for healthcare value alignment, NurValues provides novel insights into how LLMs navigate ethical challenges in clinician-patient interactions.

[83] Cochain: Balancing Insufficient and Excessive Collaboration in LLM Agent Workflows

Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Linguo Xie, Haoran Zhang

Main category: cs.CL

TL;DR: Cochain is a collaboration prompting framework that combines knowledge graphs and prompt trees to solve business workflow problems more efficiently than chain-of-thought or multi-agent approaches.

DetailsMotivation: Current approaches have limitations: single-agent chain-of-thought faces collaboration challenges due to complex cross-domain prompt design, while multi-agent systems consume excessive tokens and dilute the primary problem in business workflows.

Method: Cochain constructs an integrated knowledge graph with multi-stage knowledge and maintains/retrieves a prompts tree to obtain relevant prompt information across different business workflow stages.

Result: Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs across multiple datasets. Expert evaluation shows small models with Cochain can outperform GPT-4.

Conclusion: Cochain effectively solves business workflow collaboration problems by combining knowledge and prompts at reduced cost, offering a superior alternative to existing approaches.

Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. A single agent with chain-of-thought faces collaboration challenges due to the inherent complexity of designing cross-domain prompts. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves the business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompts tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.

[84] Multimodal Conversation Structure Understanding

Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman

Main category: cs.CL

TL;DR: Multimodal LLMs struggle with conversation structure understanding, especially when character identities are anonymized, and sociolinguistic analysis reveals gender disparities in conversational roles in TV dialogues.

DetailsMotivation: To investigate whether multimodal large language models can adequately parse conversation structure (roles and threading), which remains underexplored despite their dialogue capabilities.

Method: Introduce a suite of tasks and release TV-MMPC dataset for multimodal conversation structure understanding; evaluate multimodal LLMs against heuristic baseline; conduct sociolinguistic analysis of 350,842 utterances in TVQA dataset.

Result: All multimodal LLMs outperform heuristic baseline, but best-performing model experiences substantial performance drop when character identities are anonymized. Sociolinguistic analysis shows female characters initiate conversations proportionally to speaking time but are 1.2x more likely than men to be cast as addressee/side-participant, and side-participants shift conversational register from personal to social.

Conclusion: Multimodal LLMs have limitations in conversation structure understanding, particularly without character identity cues, and TV dialogues exhibit gender disparities in conversational roles with side-participants influencing conversational register.

Abstract: While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation – conversational roles and threading – remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.

[85] In-context Language Learning for Endangered Languages in Speech Recognition

Zhaolin Li, Jan Niehues

Main category: cs.CL

TL;DR: LLMs can learn unseen, low-resource languages for speech recognition through in-context learning, matching or surpassing dedicated language models while preserving original capabilities.

Motivation: Current LLMs support only a small subset of the world's ~7,000 languages, creating a need to explore whether LLMs can learn new languages without supervised data, particularly for speech recognition tasks.

Method: Investigates in-context learning (ICL) for language learning in LLMs, testing on four diverse endangered languages not previously trained on. Compares probability-based vs instruction-based approaches and evaluates performance on language modeling and ASR tasks.

Result: More relevant text samples improve performance in both language modeling and ASR. Probability-based approach outperforms instruction-based approach. ICL enables LLMs to achieve ASR performance comparable to or better than dedicated language models trained specifically for these languages, while preserving original LLM capabilities.

Conclusion: LLMs can effectively learn unseen, low-resource languages through in-context learning for speech recognition tasks, offering a promising approach to expand language support without requiring extensive supervised training data.

Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs. Our code is publicly available.
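
The probability-based approach the authors favor can be pictured as a rescoring loop: condition a causal LM on retrieved in-language text samples, then score each ASR hypothesis by its conditional log-probability. The sketch below is a minimal illustration of that idea, assuming a GPT-2 stand-in model and toy retrieved samples; it is not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def candidate_logprob(context: str, candidate: str) -> float:
    """Sum of token log-probs of `candidate` conditioned on `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cand_ids = tok(candidate, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    log_probs = logits.log_softmax(dim=-1)
    total = 0.0
    # Log-prob of each candidate token given everything before it.
    for pos in range(ctx_ids.shape[1], ids.shape[1]):
        total += log_probs[0, pos - 1, ids[0, pos]].item()
    return total

# Context = retrieved text samples in the target language (hypothetical).
context = "\n".join(["sample sentence one", "sample sentence two"]) + "\n"
nbest = ["hypothesis a", "hypothesis b"]  # e.g., an ASR n-best list
print(max(nbest, key=lambda c: candidate_logprob(context, c)))
```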

[86] Strategic Dialogue Assessment: The Crooked Path to Innocence

Anshun Asher Zheng, Junyi Jessy Li, David I. Beaver

Main category: cs.CL

TL;DR: The paper introduces SDA, a framework for assessing strategic language use in adversarial settings, and shows LLMs have limited understanding of strategic pragmatics despite scaling.

Motivation: Most pragmatics research focuses on cooperative communication, leaving a gap in understanding strategic language use in adversarial contexts like courtroom cross-examinations.

Method: Developed SDA framework combining Gricean and game-theoretic pragmatics, with commitment-based discourse taxonomy and estimable proxies for credibility. Created CPD dataset of annotated courtroom dialogues and three interpretable metrics (BAT, PAT, NRBAT).

Result: LLMs show limited pragmatic understanding of strategic language. Model size improves performance on metrics, but reasoning ability hurts performance by introducing overcomplication and confusion.

Conclusion: Strategic language assessment requires specialized frameworks beyond cooperative pragmatics. Current LLMs lack sophisticated understanding of adversarial communication, with reasoning capabilities actually impairing performance.

Abstract: Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperativity. This leaves a gap in the systematic understanding of strategic communication in adversarial settings. To address this, we introduce SDA (Strategic Dialogue Assessment), a framework grounded in Gricean and game-theoretic pragmatics to assess strategic use of language. It adapts the ME Game jury function to make it empirically estimable for analyzing dialogue. Our approach incorporates two key adaptations: a commitment-based taxonomy of discourse moves, which provides a finer-grained account of strategic effects, and the use of estimable proxies grounded in Gricean maxims to operationalize abstract constructs such as credibility. Together, these adaptations build on discourse theory by treating discourse as the strategic management of commitments, enabling systematic evaluation of how conversational moves advance or undermine discourse goals. We further derive three interpretable metrics, Benefit at Turn (BAT), Penalty at Turn (PAT), and Normalized Relative Benefit at Turn (NRBAT), to quantify the perceived strategic effects of discourse moves. We also present CPD (the Crooked Path Dataset), an annotated dataset of real courtroom cross-examinations, to demonstrate the framework's effectiveness. Using these tools, we evaluate a range of LLMs and show that LLMs generally exhibit limited pragmatic understanding of strategic language. While performance on our metrics improves with model size, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.

[87] Read as You See: Guiding Unimodal LLMs for Low-Resource Explainable Harmful Meme Detection

Fengjun Pan, Xiaobao Wu, Tho Quan, Anh Tuan Luu

Main category: cs.CL

TL;DR: U-CoT+ is a resource-efficient framework for harmful meme detection that uses lightweight LLMs with a meme-to-text pipeline and zero-shot Chain-of-Thought prompting, achieving comparable performance to resource-intensive methods while being explainable and adaptable.

Motivation: Existing harmful meme detection methods are resource-intensive, inflexible, and lack explainability, limiting their practical application in real-world content moderation. There's a need for accessible, transparent, and adaptable solutions.

Method: U-CoT+ decouples meme content recognition from harmfulness analysis using a high-fidelity meme-to-text pipeline that converts multimodal memes into natural language descriptions. It then uses unimodal LLMs with zero-shot Chain-of-Thought prompting guided by human-crafted guidelines for interpretable reasoning.

Result: Extensive experiments on seven benchmark datasets show that U-CoT+ achieves performance comparable to resource-intensive baselines, demonstrating effectiveness as a scalable, explainable, and low-resource solution.

Conclusion: U-CoT+ provides a resource-efficient, flexible, and transparent framework for harmful meme detection that leverages lightweight LLMs effectively, offering a practical solution for real-world content moderation with explainable reasoning and adaptability to diverse sociocultural criteria.

Abstract: Detecting harmful memes is crucial for safeguarding the integrity and harmony of online environments, yet existing detection methods are often resource-intensive, inflexible, and lacking in explainability, limiting their applicability in assisting real-world web content moderation. We propose U-CoT+, a resource-efficient framework that prioritizes accessibility, flexibility, and transparency in harmful meme detection by fully harnessing the capabilities of lightweight unimodal large language models (LLMs). Instead of directly prompting or fine-tuning large multimodal models (LMMs) as black-box classifiers, we avoid immediate reasoning over complex visual inputs and instead decouple meme content recognition from meme harmfulness analysis through a high-fidelity meme-to-text pipeline, in which lightweight LMMs and LLMs collaborate to convert multimodal memes into natural language descriptions that preserve critical visual information, thus enabling text-only LLMs to “see” memes by “reading”. Grounded in textual inputs, we further guide unimodal LLMs' reasoning under zero-shot Chain-of-Thought (CoT) prompting with targeted, interpretable, context-aware, and easily obtained human-crafted guidelines, thus providing accountable step-by-step rationales while enabling flexible and efficient adaptation to diverse sociocultural criteria of harmfulness. Extensive experiments on seven benchmark datasets show that U-CoT+ achieves performance comparable to resource-intensive baselines, highlighting its effectiveness and potential as a scalable, explainable, and low-resource solution to support harmful meme detection.
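
The two-stage design is easy to picture as a pipeline: a lightweight LMM first renders the meme as text, then a text-only LLM reasons over that description with guideline-driven CoT. Below is a minimal sketch under assumed interfaces; `describe_meme`, `call_llm`, and the guideline questions are hypothetical placeholders, not the paper's prompts.

```python
# Stage 1: meme -> text; Stage 2: text -> guided zero-shot CoT judgment.

GUIDELINES = [
    "Does the text/image combination target a protected group?",
    "Is apparent humor used to mask an attack or stereotype?",
]

def describe_meme(image_path: str) -> str:
    # In a real pipeline, a lightweight LMM would caption the image and
    # transcribe the overlaid text here.
    return "A cartoon figure with overlaid text: '<transcribed caption>'."

def build_prompt(description: str) -> str:
    steps = "\n".join(f"{i + 1}. {g}" for i, g in enumerate(GUIDELINES))
    return (
        "You are a content moderator. Meme description:\n"
        f"{description}\n\n"
        "Reason step by step through each question, then answer "
        "'harmful' or 'not harmful'.\n" + steps
    )

def call_llm(prompt: str) -> str:
    return "<model reasoning and verdict>"  # placeholder for a real LLM call

print(call_llm(build_prompt(describe_meme("meme.png"))))
```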

[88] Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, Michalis Vazirgiannis

Main category: cs.CL

TL;DR: Curriculum learning for LLM pretraining accelerates convergence by 18-45%, with compression ratio, lexical diversity, and readability as most effective difficulty metrics.

Motivation: Curriculum learning (organizing training data from easy to hard) has shown benefits in other ML domains but remains underexplored for language model pretraining, presenting an opportunity to improve training efficiency.

Method: Systematic investigation with 200+ models trained on up to 100B tokens using three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties.

Result: Curriculum learning consistently accelerates convergence in early/mid-training phases (18-45% reduction in training steps), yields sustained improvements up to 3.5% when used as warmup, with compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) identified as most effective difficulty signals.

Conclusion: Data ordering provides a practical, orthogonal mechanism to existing data selection methods for more efficient LLM pretraining, with curriculum learning offering significant convergence acceleration and performance improvements.

Abstract: Curriculum learning, organizing training data from easy to hard, has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18-45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements of up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
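
For concreteness, here is a minimal sketch of two of the three difficulty signals named above, compression ratio and Flesch Reading Ease. The syllable counter is a crude vowel-group heuristic, so treat the scores as approximations rather than the paper's exact implementation.

```python
import re
import zlib

def compression_ratio(text: str) -> float:
    """Higher ratio = less compressible = (roughly) more diverse/harder."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per contiguous vowel group.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

doc = "The cat sat on the mat. It was a sunny day."
print(compression_ratio(doc), flesch_reading_ease(doc))
# A curriculum would sort the corpus by such scores, easy to hard.
```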

[89] The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Dahua Lin, Kai Chen

Main category: cs.CL

TL;DR: TAIL (Turing MAchine Imitation Learning) improves LLM length generalization by synthesizing CoT data that imitates Turing Machine execution, enabling models to handle longer sequences than seen during training.

Motivation: Current data-driven approaches for length generalization are task-specific with limited performance. The paper seeks a more general solution by focusing on computable reasoning problems that algorithms can solve, inspired by Turing Machine capabilities.

Method: TAIL synthesizes chain-of-thought data that imitates Turing Machine execution using computer programs. It linearly expands reasoning steps into atomic states to prevent shortcut learning and implements explicit memory fetch mechanisms to reduce difficulties in dynamic, long-range data access.

Result: TAIL significantly improves length generalization and performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The model exhibits read-and-write behaviors consistent with Turing Machine properties in attention layers.

Conclusion: Key Turing Machine concepts (not just thinking styles) are essential for length generalization. TAIL provides a promising direction for LLM reasoning learning from synthetic data, with models demonstrating Turing Machine-like behaviors in their internal representations.

Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL synthesizes chain-of-thought (CoT) data that imitates the execution process of a Turing Machine via computer programs, linearly expanding the reasoning steps into atomic states to alleviate shortcut learning, and adds an explicit memory fetch mechanism to reduce the difficulty of dynamic, long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than the thinking styles, are indispensable to TAIL for length generalization; through these, the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
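
The core data-synthesis idea, emitting one atomic machine state per reasoning step with operands explicitly re-fetched, can be sketched for a single task such as multi-digit addition. The trace format below is an assumption for illustration; the paper covers 8 classes of algorithms and a richer state encoding.

```python
# Synthesize a CoT trace where every line is one atomic step and every step
# restates what was fetched (an explicit "memory fetch") and what was written.

def addition_trace(a: int, b: int) -> list[str]:
    xs, ys = str(a)[::-1], str(b)[::-1]
    steps, carry, digits = [], 0, []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        s = da + db + carry
        steps.append(
            f"pos={i}: fetch {da} and {db}, carry={carry} "
            f"-> write {s % 10}, new carry={s // 10}"
        )
        digits.append(str(s % 10))
        carry = s // 10
    if carry:
        steps.append(f"pos=end: write leftover carry {carry}")
        digits.append(str(carry))
    steps.append("answer: " + "".join(reversed(digits)))
    return steps

for line in addition_trace(478, 964):
    print(line)  # ends with "answer: 1442"
```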

[90] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng

Main category: cs.CL

TL;DR: Paper analyzes four creativity evaluation metrics (perplexity, LLM-as-a-Judge, Creativity Index, syntactic templates) across multiple creative domains, finding limited consistency and highlighting key limitations of each approach.

Motivation: To systematically evaluate and compare existing creativity assessment metrics across diverse creative domains to understand their reliability, consistency, and alignment with human judgments.

Method: Compiled datasets with human-aligned creative and uncreative examples across three domains (creative writing, unconventional problem-solving, research ideation). Evaluated four metrics’ ability to discriminate between creative and uncreative sets, analyzing their performance and limitations.

Result: Found limited consistency across domains and metrics: metrics that work in one domain fail in others, and different metrics often disagree on the same data. Identified specific limitations: perplexity measures fluency not novelty; LLM-as-a-Judge shows inconsistency and bias; CI measures lexical diversity with implementation sensitivity; syntactic templates fail with formulaic language.

Conclusion: Current creativity evaluation metrics lack robustness and generalizability across domains. There’s a critical need for more reliable, domain-agnostic evaluation frameworks that better align with human judgments of creativity.

Abstract: We examine, analyze, and compare four representative creativity measures – perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns) – across diverse creative domains, such as creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric's ability to discriminate between the two sets. Our analyses reveal limited consistency both across domains and metrics, as metrics that distinguish creativity in one domain fail in others (e.g., CI correctly distinguishes in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set to be more creative, while perplexity indicates the other set to be more creative). We highlight key limitations, such as perplexity reflecting fluency rather than novelty; LLM-as-a-Judge producing inconsistent judgments under minor prompt variations and exhibiting bias towards particular labels; CI primarily measuring lexical diversity, with high sensitivity to implementation choices; and syntactic templates being ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
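
As a concrete reference point, a CI-style measure can be approximated as the share of a text's n-grams that also occur in a reference corpus. The sketch below uses a tiny in-memory "corpus" purely for illustration; the actual Creativity Index matches against web-scale corpora and differs in detail.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def novelty_score(text: str, corpus: list[str], n: int = 3) -> float:
    """1.0 = no n-gram reuse (maximally 'novel'), 0.0 = full overlap."""
    corpus_ngrams = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc.lower().split(), n)
    text_ngrams = ngrams(text.lower().split(), n)
    if not text_ngrams:
        return 0.0
    return 1.0 - len(text_ngrams & corpus_ngrams) / len(text_ngrams)

corpus = ["the quick brown fox jumps over the lazy dog"]
print(novelty_score("the quick brown fox naps all day", corpus))  # 0.6
```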

[91] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking

Jian Chen, Jiabao Dou

Main category: cs.CL

TL;DR: ARCE is a knowledge distillation framework that uses LLMs to create synthetic task-oriented corpora for pre-training smaller models, achieving state-of-the-art performance on AEC domain information extraction while being more efficient than LLMs.

Motivation: There's a need for accurate information extraction in the AEC domain for automated rule checking, but LLMs are impractical for resource-constrained environments while standard efficient models struggle with domain gaps. Pre-training on human-curated corpora is labor-intensive and costly.

Method: Propose ARCE (Augmented RoBERTa with Contextualized Elucidations), a knowledge distillation framework that leverages LLMs to synthesize a task-oriented corpus called Cote for incremental pre-training of smaller models. Systematically explores optimal knowledge transfer strategies.

Result: ARCE achieves new state-of-the-art on benchmark AEC dataset with 77.20% Macro-F1 score, outperforming both domain-specific baselines and fine-tuned LLMs. Reveals “less is more” principle: simple direct explanations are more effective than complex role-based rationales for NER tasks.

Conclusion: ARCE provides an effective solution for domain adaptation in AEC by distilling LLM knowledge into efficient models, with practical implications for automated rule checking systems in resource-constrained environments.

Abstract: Accurate information extraction from specialized texts is a critical challenge for automated rule checking (ARC) in the architecture, engineering, and construction (AEC) domain. While large language models (LLMs) possess strong reasoning capabilities, their deployment in resource-constrained AEC environments is often impractical. Conversely, standard efficient models struggle with the significant domain gap. Although this gap can be mitigated by pre-training on large, human-curated corpora, such approaches are labor-intensive and costly. To address this, we propose ARCE (Augmented RoBERTa with Contextualized Elucidations), a novel knowledge distillation framework that leverages LLMs to synthesize a task-oriented corpus, termed Cote, for incrementally pre-training smaller models. ARCE systematically explores the optimal strategy for knowledge transfer. Our extensive experiments demonstrate that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20% and outperforming both domain-specific baselines and fine-tuned LLMs. Crucially, our study reveals a “less is more” principle: simple, direct explanations prove significantly more effective for domain adaptation than complex, role-based rationales in the NER task, which tend to introduce semantic noise. The source code will be made publicly available upon acceptance.

[92] Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

Main category: cs.CL

TL;DR: Mask-GCG improves jailbreak attacks on LLMs by using learnable token masking to identify and prune redundant tokens in attack suffixes, reducing computational overhead while maintaining attack success rates.

Motivation: Existing jailbreak attack methods like GCG and its variants use fixed-length suffixes, but the potential redundancy within these suffixes remains unexplored. This redundancy increases computational overhead unnecessarily.

Method: Mask-GCG employs learnable token masking to identify impactful tokens within jailbreak suffixes. It increases update probability for high-impact tokens while pruning low-impact ones, reducing gradient space size and computational cost.

Result: Experiments show most suffix tokens contribute significantly to attack success, but pruning a minority of low-impact tokens doesn’t affect loss values or attack success rates, revealing token redundancy in LLM prompts.

Conclusion: Mask-GCG provides a plug-and-play method for efficient jailbreak attacks that reduces computational overhead while maintaining effectiveness, offering insights for developing more efficient and interpretable LLMs from a security perspective.

Abstract: Research on jailbreak attacks against Large Language Models (LLMs) has demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that the models are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
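
The pruning step can be pictured as: score each suffix position by how much the attack loss moves when that token is dropped, discard the lowest-impact positions, and bias updates toward high-impact ones. The sketch below uses a placeholder loss in place of a victim model's target log-likelihood, so it only shows the selection logic, not the full attack.

```python
import random

def attack_loss(suffix: list[str]) -> float:
    # Placeholder: in real Mask-GCG this would be the victim LLM's loss on
    # the harmful target given prompt + suffix.
    return sum(len(t) for t in suffix) * 0.01

def position_impacts(suffix: list[str]) -> list[float]:
    base = attack_loss(suffix)
    return [abs(attack_loss(suffix[:i] + suffix[i + 1:]) - base)
            for i in range(len(suffix))]

def prune_and_sample(suffix: list[str], keep_frac: float = 0.8) -> list[str]:
    impacts = position_impacts(suffix)
    order = sorted(range(len(suffix)), key=lambda i: impacts[i], reverse=True)
    kept = sorted(order[: int(len(suffix) * keep_frac)])  # drop low impact
    pruned = [suffix[i] for i in kept]
    # High-impact positions get proportionally higher update probability.
    weights = [impacts[i] for i in kept]
    update_pos = random.choices(range(len(pruned)), weights=weights, k=1)[0]
    print(f"update position {update_pos} of pruned suffix {pruned}")
    return pruned

prune_and_sample(["!", "describing", "+", "similarly", "Now"])
```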

[93] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu

Main category: cs.CL

TL;DR: TtT is a unified audio-text framework that combines autoregressive text generation with non-autoregressive audio diffusion in a single Transformer, outperforming existing methods on multimodal speech tasks.

Motivation: Existing multimodal models for audio-text tasks use autoregressive methods that don't properly account for the different dependencies in text (target-target relations) versus audio (source-target relations). There's a need for a unified approach that handles these modality-specific characteristics effectively.

Method: Proposes Text-to-Talk (TtT) framework integrating AR text generation with NAR audio diffusion using absorbing discrete diffusion. Uses modality-aware attention mechanism for causal text decoding and bidirectional audio modeling. Implements three training strategies to reduce train-test discrepancies and employs block-wise diffusion for parallel audio synthesis during inference.

Result: TtT consistently surpasses strong AR and NAR baselines on Audio-QA, ASR, AAC and speech-to-speech benchmarks. Ablation studies confirm the contribution of each component, and the training strategies effectively reduce train-test discrepancies.

Conclusion: TtT provides an effective unified framework for multimodal audio-text generation that properly addresses modality-specific dependencies, achieving state-of-the-art performance while enabling parallel audio synthesis and flexible handling of variable-length outputs.

Abstract: Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on Audio-QA, ASR, AAC and speech-to-speech benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component. We will open-source our models, data and code to facilitate future research in this direction.
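
The modality-aware attention mechanism can be illustrated as a mask that is causal overall but bidirectional inside each contiguous audio span. The layout and span-labeling trick below are assumptions for illustration, not the paper's exact scheme.

```python
import torch

def modality_aware_mask(is_audio: torch.Tensor) -> torch.Tensor:
    """is_audio: bool tensor [seq]. Returns bool mask [seq, seq] where
    True means position i may attend to position j."""
    seq = is_audio.shape[0]
    idx = torch.arange(seq)
    causal = idx[:, None] >= idx[None, :]  # standard causal mask
    # Label contiguous audio runs so distinct spans stay separated.
    span_id = torch.cumsum((~is_audio).long(), 0)
    same_audio_span = (
        is_audio[:, None] & is_audio[None, :]
        & (span_id[:, None] == span_id[None, :])
    )
    return causal | same_audio_span

# Toy layout: 3 text tokens, a 4-token audio span, then 2 text tokens.
is_audio = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=torch.bool)
print(modality_aware_mask(is_audio).int())
# Rows 3-6 attend to each other bidirectionally; text stays causal.
```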

[94] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang

Main category: cs.CL

TL;DR: ReMemR1 improves long-context QA by integrating memory retrieval into memory updates and using multi-level rewards, outperforming SOTA with minimal overhead.

Motivation: Existing "memorize while reading" methods for long-context QA suffer from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals when dealing with evidence dispersed across millions of tokens.

Method: ReMemR1 integrates memory retrieval into the memory update process, enabling selective callback of historical memories for non-linear reasoning. It also uses a multi-level reward design combining final-answer rewards with dense, step-level signals to guide effective memory use.

Result: Extensive experiments show ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead.

Conclusion: ReMemR1 effectively mitigates information degradation, improves supervision, and supports complex multi-hop reasoning in long-context QA, trading marginal cost for robust long-context reasoning.

Abstract: Large language models face challenges in long-context question answering, where the key evidence for a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, an approach also known as “memorize while reading”. While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design, which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.
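
The "callback" idea can be sketched as a scan loop whose update step first retrieves related past notes instead of only overwriting a single buffer. The retrieval below is naive word overlap and the summarizer is a placeholder for an LLM call; only the control flow mirrors the described mechanism.

```python
def word_overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def summarize(chunk: str, recalled: list[str]) -> str:
    # Placeholder for an LLM that fuses the new chunk with recalled notes.
    return f"note({chunk[:30]}... | recalled={len(recalled)})"

def scan_document(chunks: list[str], query: str, k: int = 2) -> list[str]:
    memory: list[str] = []
    for chunk in chunks:
        # Callback: revisit the k most related past notes, not just the last.
        recalled = sorted(memory, key=lambda m: word_overlap(m, chunk),
                          reverse=True)[:k]
        memory.append(summarize(chunk, recalled))  # append, don't overwrite
    return sorted(memory, key=lambda m: word_overlap(m, query), reverse=True)

chunks = ["Alice moved to Paris in 2019.",
          "Years later Alice opened a bakery.",
          "The bakery in Paris won an award."]
print(scan_document(chunks, "Where is Alice's bakery?")[0])
```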

[95] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

David Anugraha, Shou-Yi Hung, Zilu Tang, Annie En-Shiun Lee, Derry Tanti Wijaya, Genta Indra Winata

Main category: cs.CL

TL;DR: mR3 is a massively multilingual reward reasoning model covering 72 languages that achieves state-of-the-art performance on multilingual evaluation benchmarks while being significantly smaller than larger models.

Motivation: LLM judges work well for English evaluation but don't generalize effectively to non-English settings, and there's limited understanding of what makes effective multilingual training for such evaluation models.

Method: Developed mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages. Conducted comprehensive study of data and curriculum selection strategies, including support for reasoning in target languages.

Result: Achieved SOTA performance on multilingual reward model benchmarks, surpassing much larger models (like GPT-OSS-120B) while being up to 9x smaller. Demonstrated effectiveness in off-policy preference optimization and validated quality through human studies across 12 languages.

Conclusion: mR3 provides an effective solution for multilingual evaluation, with broad language coverage and strong performance even for extremely low-resource languages unseen during training. The model, data, and code are open-sourced.

Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including support for reasoning in the target language. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Finally, we demonstrate the effectiveness of mR3 in off-policy preference optimization and validate the quality of its reasoning traces and rubric-based evaluations through human studies with 20 annotators across 12 languages, where mR3 models’ reasoning is preferred, including for extremely low-resource languages that are entirely unseen during training. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.

[96] TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy

Main category: cs.CL

TL;DR: TRIM is a token-centric framework that uses attention-based fingerprints instead of gradients to efficiently select high-quality coresets for instruction tuning, outperforming baselines by up to 9% with lower computational cost.

Motivation: Current methods for selecting instruction tuning coresets rely on computationally expensive gradient-based approaches that overlook fine-grained token-level features, creating a need for more efficient and sensitive selection methods.

Method: TRIM uses forward-only, token-centric framework with attention-based “fingerprints” from target samples to match underlying representational patterns, avoiding expensive backward passes and focusing on structural task features.

Result: TRIM-selected coresets outperform state-of-the-art baselines by up to 9% on downstream tasks, sometimes even surpassing full-data fine-tuning performance, while achieving this at a fraction of the computational cost.

Conclusion: TRIM establishes itself as a scalable and efficient alternative for building high-quality instruction-tuning datasets by using attention-based fingerprints instead of gradients, enabling better task alignment with lower computational overhead.

Abstract: Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based “fingerprints” from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
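
One way to picture an attention-derived fingerprint: pool the attention each token receives across layers and heads into a fixed-length profile, then rank candidate samples by similarity to a target profile. The pooling and interpolation choices below are assumptions for illustration, not TRIM's actual recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

def fingerprint(text: str, dim: int = 16) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        attn = model(**ids).attentions         # tuple of [1, heads, seq, seq]
    stacked = torch.stack(attn)                # [layers, 1, heads, seq, seq]
    received = stacked.mean(dim=(0, 1, 2)).sum(dim=0)  # attention per token
    # Interpolate the per-token profile to a fixed length for comparison.
    prof = torch.nn.functional.interpolate(
        received[None, None, :], size=dim, mode="linear", align_corners=False
    ).squeeze()
    return prof / prof.norm()

target = fingerprint("Translate the sentence into French.")
candidates = ["Translate this text into German.",
              "The weather was pleasant yesterday."]
scores = [(c, float(target @ fingerprint(c))) for c in candidates]
print(max(scores, key=lambda x: x[1]))  # forward passes only, no gradients
```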

[97] Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection

Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger

Main category: cs.CL

TL;DR: LLMs can inject emotional appeals into fallacious arguments, reducing human fallacy detection by 14.5% and making arguments more convincing.

Motivation: Logical fallacies are common in public communication and can mislead audiences. While fallacious arguments lack soundness, they can still appear convincing due to subjective factors like emotional framing. There's a need to understand how emotional appeals interact with fallacies and convincingness computationally.

Method: First computational study using LLMs to systematically change emotional appeals in fallacious arguments. Benchmark eight LLMs on injecting emotional appeal while preserving logical structures. Use best models to generate stimuli for human study measuring fallacy detection and convincingness.

Result: LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness. These three emotions (enjoyment, fear, sadness) correlate with significantly higher convincingness compared to neutral or other emotion states.

Conclusion: The work demonstrates that LLMs can effectively manipulate emotional framing in fallacious arguments, reducing human detection ability and increasing convincingness. This has implications for AI-driven emotional manipulation in fallacious argumentation contexts.

Abstract: Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.

[98] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

Main category: cs.CL

TL;DR: dLLMs learn new facts better than arLLMs without needing paraphrase augmentation, and masked fine-tuning gives arLLMs similar knowledge injection advantages.

Motivation: LLMs need to update factual knowledge as facts evolve, but current fine-tuning methods suffer from compute-heavy paraphrase augmentation and the reversal curse. Recent findings suggest dLLMs may learn new knowledge more easily than arLLMs.

Method: 1) Controlled knowledge fine-tuning experiments comparing dLLMs and arLLMs. 2) Proposed masked fine-tuning for arLLMs where models reconstruct original text from masked versions. 3) Applied demasking objective to supervised fine-tuning on math tasks.

Result: dLLMs achieve high QA accuracy without paraphrase augmentation, while arLLMs need paraphrases. Masked fine-tuning enables arLLMs to achieve similar knowledge injection advantages (no paraphrase needed, reversal curse resistant). Demasking objective also improves SFT on math tasks.

Conclusion: The demasking objective is key for effective knowledge injection, bridging the gap between dLLMs and arLLMs, with broader applicability beyond factual knowledge updates.

Abstract: Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrase augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate that the same demasking objective improves supervised fine-tuning (SFT) on math tasks over standard SFT, suggesting broader applicability of the demasking objective.
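
The masked fine-tuning recipe is straightforward to sketch: pair a masked copy of a passage (shown in context) with the original text as the training target for a standard causal LM. The prompt template and the 15% mask rate below are illustrative assumptions, not the paper's exact settings.

```python
import random

def make_masked_example(text: str, mask_rate: float = 0.15,
                        mask_token: str = "[MASK]") -> dict:
    words = text.split()
    masked = [mask_token if random.random() < mask_rate else w for w in words]
    prompt = ("Reconstruct the original text.\n"
              f"Masked: {' '.join(masked)}\nOriginal: ")
    return {"prompt": prompt, "completion": text}

random.seed(0)
fact = "The Eiffel Tower was completed in 1889 in Paris."
example = make_masked_example(fact)
print(example["prompt"] + example["completion"])
# Standard causal-LM fine-tuning on prompt+completion (loss on the
# completion) then teaches the model to restore masked spans from context.
```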

[99] A-IPO: Adaptive Intent-driven Preference Optimization

Wenqing Wang, Muhammad Asif Ali, Ali Shoker, Ruohan Yang, Junyang Chen, Ying Sha, Huan Wang

Main category: cs.CL

TL;DR: A-IPO is a new alignment method that infers latent user intent from prompts and incorporates it into preference optimization, addressing limitations of existing methods like DPO that overlook minority opinions.

Motivation: Existing alignment methods like DPO default to majority views, overlooking diverse and dynamic human preferences shaped by regional, cultural, and social factors, and fail to capture latent user intentions in prompts.

Method: A-IPO introduces an intention module that infers latent intent behind user prompts and explicitly incorporates this inferred intent into the reward function, creating stronger alignment between model responses and user intentions through an intention-response similarity term.

Result: A-IPO achieves substantial improvements: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.

Conclusion: A-IPO facilitates pluralistic preference optimization while enhancing adversarial robustness, consistently surpassing existing baselines by explicitly modeling diverse user intents.

Abstract: Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods like Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture latent user intentions in prompts. To address these limitations, we introduce Adaptive Intent-driven Preference Optimization (A-IPO). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the model's preferred responses and the user's underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention–response similarity term increases the preference margin (by a positive shift of λ·Δsim in the log-odds), resulting in clearer separation between preferred and dispreferred responses compared to DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. Through explicit modeling of diverse user intents, A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment. Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
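
The margin shift described above suggests a DPO-style loss whose logit gains a λ·Δsim term. The sketch below follows only that shape; the embedding function is a hypothetical stand-in, and the exact A-IPO formulation may differ from this reading of the abstract.

```python
import torch
import torch.nn.functional as F

def embed(text: str) -> torch.Tensor:
    # Fake deterministic (within one run) embedding, standing in for a
    # real sentence encoder.
    torch.manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(64), dim=0)

def a_ipo_loss(logratio_w: torch.Tensor, logratio_l: torch.Tensor,
               intent: str, y_w: str, y_l: str,
               beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    # logratio_* = log pi_theta(y|x) - log pi_ref(y|x), as in DPO.
    delta_sim = embed(intent) @ embed(y_w) - embed(intent) @ embed(y_l)
    margin = beta * (logratio_w - logratio_l) + lam * delta_sim
    return -F.logsigmoid(margin)

loss = a_ipo_loss(torch.tensor(0.4), torch.tensor(0.1),
                  intent="asking for a vegetarian recipe",
                  y_w="Here is a lentil curry recipe...",
                  y_l="Try this chicken dish...")
print(loss)
```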

[100] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill

Main category: cs.CL

TL;DR: PDO is a label-free prompt optimization framework using pairwise preference feedback from an LLM judge, treating prompt selection as a dueling-bandit problem with Thompson Sampling and mutation-based exploration.

Motivation: Most automatic prompt optimization methods require costly ground-truth references (labeled validation data), creating a need for sample-efficient, label-free optimization methods.

Method: PDO uses pairwise preference feedback from an LLM judge, frames prompt selection as a dueling-bandit problem, employs Double Thompson Sampling for efficient comparisons under budget constraints, and uses top-performer guided mutation to expand candidate prompts while pruning weak ones.

Result: PDO consistently identifies stronger prompts than label-free baselines on BIG-bench Hard and MS MARCO datasets, offering favorable quality-cost trade-offs under constrained comparison budgets.

Conclusion: PDO provides an effective, sample-efficient framework for label-free prompt optimization that doesn’t require ground-truth references, making it practical for real-world applications where labeled data is scarce or expensive.

Abstract: Large language models (LLMs) are highly sensitive to prompts, but most automatic prompt optimization (APO) methods assume access to ground-truth references (e.g., labeled validation data) that are costly to obtain. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization based on pairwise preference feedback from an LLM judge. PDO casts prompt selection as a dueling-bandit problem and combines (i) Double Thompson Sampling to prioritize informative comparisons under a fixed judge budget, with (ii) top-performer guided mutation to expand the candidate pool while pruning weak prompts. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently identifies stronger prompts than label-free baselines, while offering favorable quality–cost trade-offs under constrained comparison budgets.
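
The dueling-bandit loop is compact enough to sketch end to end: Beta posteriors over pairwise win rates, two Thompson-sampling passes to pick a duel, and a judge verdict updating the counts. The judge below is simulated with made-up prompt strengths, and the selection rule is a simplified Double Thompson Sampling, not PDO's full algorithm (mutation is omitted).

```python
import random

random.seed(7)
prompts = ["prompt-A", "prompt-B", "prompt-C"]
strength = {"prompt-A": 0.3, "prompt-B": 0.7, "prompt-C": 0.5}  # hidden
n = len(prompts)
wins = [[0] * n for _ in range(n)]  # wins[i][j] = times i beat j

def sample_theta(i, j):
    """Draw a pairwise win-rate estimate from its Beta posterior."""
    return random.betavariate(wins[i][j] + 1, wins[j][i] + 1)

def judge(i, j):
    """Simulated LLM judge; returns the index of the duel winner."""
    p = strength[prompts[i]] / (strength[prompts[i]] + strength[prompts[j]])
    return i if random.random() < p else j

for _ in range(300):  # fixed judge budget
    # Pass 1: pick the candidate winning most sampled pairwise comparisons.
    theta = [[sample_theta(i, j) if i != j else 0.5 for j in range(n)]
             for i in range(n)]
    first = max(range(n), key=lambda i: sum(theta[i][j] > 0.5
                                            for j in range(n) if j != i))
    # Pass 2: resample against `first` to pick an informative opponent.
    second = max((j for j in range(n) if j != first),
                 key=lambda j: sample_theta(j, first))
    winner = judge(first, second)
    loser = second if winner == first else first
    wins[winner][loser] += 1

copeland = [sum(wins[i][j] > wins[j][i] for j in range(n)) for i in range(n)]
print("best prompt:", prompts[max(range(n), key=lambda i: copeland[i])])
```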

[101] YNTP-100: A Benchmark for Your Next Token Prediction with 100 People

Shiyao Ding, Takayuki Ito

Main category: cs.CL

TL;DR: YNTP-100 benchmark enables evaluation of personalized LLM alignment using multilingual multi-day human-agent conversations from 100 people, addressing privacy-constrained personal communication data collection.

Motivation: LLMs trained for general next-token prediction fail to generate personalized responses reflecting how specific individuals communicate, and progress is limited by privacy constraints on collecting real-world personal communication data.

Method: Proposes Your Next Token Prediction (YNTP) task that formulates personalized response generation as token-level prediction conditioned on user interaction history, and introduces YNTP-100 benchmark built from multilingual multi-day human-agent conversations with 100 people.

Result: Evaluates external (parameter-preserving) and internal (parameter-updating) alignment methods using metrics of substance similarity and stylistic consistency, with dataset and results publicly available.

Conclusion: YNTP-100 provides a systematic evaluation framework for personalized LLM alignment, enabling research on generating responses that reflect individual communication patterns while addressing privacy concerns.

Abstract: Large language models (LLMs) trained for general next-token prediction often fail to generate responses that reflect how specific individuals communicate. Progress on personalized alignment is further limited by the difficulty of collecting real-world personal communication data due to privacy constraints. We propose Your Next Token Prediction (YNTP), a task that formulates personalized response generation as token-level prediction conditioned on user interaction history. We introduce YNTP-100, a benchmark built from multilingual multi-day human–agent conversations with 100 people, enabling systematic evaluation of user-specific response behavior. We evaluate external (parameter-preserving) and internal (parameter-updating) alignment methods using metrics of substance similarity and stylistic consistency. The dataset and results are publicly available at: https://github.com/AnonymousHub4Submissions/YNTP100.

[102] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Zhenzhu Yang, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin

Main category: cs.CL

TL;DR: Current RLHF methods fail on subjective writing preferences; generative reward models with reasoning chains outperform standard sequence-based models by ~30% accuracy on a new benchmark where objective quality signals are controlled.

Motivation: Existing preference learning methods perform well on standard benchmarks but degrade significantly when objective quality signals (like factual accuracy or length) are removed, suggesting they may not truly capture subjective quality preferences like creativity and style.

Method: Created WritingPreferenceBench with 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, controlling for objective factors. Tested sequence-based reward models (standard RLHF architecture), zero-shot LM judges, and generative reward models that produce explicit reasoning chains.

Result: Sequence-based reward models achieved only 52.7% accuracy, zero-shot LM judges 53.9%, while generative reward models with reasoning achieved 81.8%. High within-model variance across genres (18.2% to 81.8% accuracy), with no improvement from scaling (27B vs 8B models).

Conclusion: Current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences. Successful preference modeling requires intermediate reasoning representations, not direct classification.

Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models–the standard architecture for RLHF–achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

[103] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Yasser Hamidullah, Koel Dutta Chowdhury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

Main category: cs.CL

TL;DR: Proposes a token-level reliability measure to detect hallucinations in sign language translation by quantifying visual grounding, showing it predicts hallucination rates and works across datasets/models.

Motivation: Hallucination is a critical flaw in vision-language models, especially in sign language translation where meaning depends on precise visual grounding. Gloss-free models are particularly vulnerable as they map signer movements directly to language without intermediate gloss supervision for alignment.

Method: Proposes a token-level reliability measure combining feature-based sensitivity (measuring internal changes when video is masked) and counterfactual signals (probability differences between clean and altered video inputs). These are aggregated into a sentence-level reliability score.

Result: Reliability predicts hallucination rates, generalizes across datasets (PHOENIX-2014T, CSL-Daily) and architectures (gloss-based and gloss-free models), decreases under visual degradations, distinguishes grounded from guessed tokens, and improves hallucination risk estimation when combined with text-based signals.

Conclusion: Establishes reliability as a practical and reusable tool for diagnosing hallucinations in SLT, laying groundwork for more robust hallucination detection in multimodal generation. Qualitative analysis reveals why gloss-free models are more susceptible to hallucinations.

Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
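
The counterfactual half of the reliability score can be sketched as a per-token log-probability gap between clean and masked video inputs, averaged over the sentence. `token_prob` below is a hypothetical stand-in for a real SLT decoder, and the feature-based sensitivity term from the paper is omitted.

```python
import math

def token_prob(token: str, video_masked: bool) -> float:
    # Placeholder: a grounded token loses probability when video is masked,
    # while a "guessed" token (from language priors) barely changes.
    grounded = {"waves": 0.6, "hello": 0.5}
    base = grounded.get(token, 0.3)
    return base * (0.4 if video_masked and token in grounded else 1.0)

def sentence_reliability(tokens: list[str]) -> float:
    per_token = [
        math.log(token_prob(t, False)) - math.log(token_prob(t, True))
        for t in tokens
    ]
    return sum(per_token) / len(per_token)

print(sentence_reliability(["she", "waves", "hello"]))
# High values mean the decoder leaned on the video; values near zero flag
# tokens produced from language priors alone, i.e. hallucination risk.
```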

[104] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

Rui Xing, Preslav Nakov, Timothy Baldwin, Jey Han Lau

Main category: cs.CL

TL;DR: The paper introduces COMMUNITYNOTES dataset and framework for predicting helpfulness of community fact-checking notes and reasons why they’re helpful, improving fact-checking systems.

Motivation: Community-based fact-checking platforms face challenges: most notes remain unpublished due to slow annotation, and helpfulness lacks clear definitions. There's a need to predict note helpfulness and understand why notes are helpful.

Method: Created COMMUNITYNOTES dataset (104k multilingual posts with user notes and helpfulness labels). Proposed framework that automatically generates and improves reason definitions via prompt optimization, integrating them into prediction models.

Result: Optimized definitions improve both helpfulness and reason prediction. Helpfulness information benefits existing fact-checking systems.

Conclusion: The framework successfully addresses gaps in community-based fact-checking by providing automated methods to predict note helpfulness and understand reasons, enhancing fact-checking systems.

Abstract: Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems.

[105] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo

Main category: cs.CL

TL;DR: LEGO-Eval is an evaluation framework with diverse tools for assessing alignment between fine-grained instructions and generated 3D scenes, addressing limitations of current methods like CLIPScore and VLMs.

DetailsMotivation: Current LLM-generated 3D scenes lack realistic spatial layouts and object attributes due to coarse-grained instructions. This causes embodied agents trained in unrealistic environments to learn priors diverging from real-world physics and semantics, degrading deployment performance. Existing evaluation methods fail to reliably assess scene-instruction alignment due to shallow 3D scene understanding.

Method: Introduces LEGO-Eval framework with diverse tools designed to explicitly ground scene components for accurate alignment assessment. Also presents LEGO-Bench benchmark with detailed instructions specifying complex layouts and attributes of real-world environments.

Result: LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. LEGO-Bench reveals significant limitations in current generation methods - success rates reach at most 10% for generating scenes that fully align with fine-grained instructions.

Conclusion: The proposed LEGO-Eval framework provides more reliable assessment of 3D scene-instruction alignment than current methods, while LEGO-Bench exposes severe limitations in existing scene generation approaches, highlighting the need for improved methods that can handle fine-grained instructions.

Abstract: Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.

[106] First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Dmytro Vitel, Anshuman Chhabra

Main category: cs.CL

TL;DR: This paper challenges prior findings that embedding layers are best for computing training sample influence in LLMs, showing middle attention layers are better and proposing improved aggregation methods and evaluation metrics.

DetailsMotivation: Current influence estimation methods for LLMs face computational challenges with billion-parameter models, requiring layer subset selection. Prior work incorrectly concluded embedding layers are most informative based on the cancellation effect hypothesis, which this paper aims to correct.

Method: The authors provide theoretical and empirical evidence against the cancellation effect, identify middle attention layers as better influence estimators, propose alternative aggregation methods (ranking and vote-based instead of standard averaging), and introduce the Noise Detection Rate (NDR) metric for evaluating influence scores without model retraining.
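
The aggregation alternatives are easy to state in code. Below is a hedged sketch contrasting standard averaging with rank- and vote-based aggregation over per-layer influence scores; the exact variants used in the paper may differ.

```python
# Hedged sketch of aggregating per-layer influence scores: mean, mean-rank,
# or top-k voting across layers. Illustrative only.
import numpy as np

def aggregate_influence(scores, method="rank", k=100):
    """scores: (n_layers, n_samples) per-layer influence of training samples."""
    if method == "mean":
        return scores.mean(axis=0)
    if method == "rank":
        # Average each sample's rank across layers (rank 0 = highest score).
        ranks = (-scores).argsort(axis=1).argsort(axis=1)
        return -ranks.mean(axis=0)  # negate so larger = more influential
    if method == "vote":
        # Count how many layers place each sample in their top-k.
        topk = (-scores).argsort(axis=1)[:, :k]
        votes = np.zeros(scores.shape[1])
        for layer_topk in topk:
            votes[layer_topk] += 1
        return votes
    raise ValueError(method)
```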

Result: Extensive experiments across various LLMs show that first layers are not necessarily better than last layers for influence estimation, contradicting prior knowledge. Middle attention layers perform better, and alternative aggregation methods significantly improve performance compared to standard averaging.

Conclusion: The paper overturns previous assumptions about layer selection for influence estimation in LLMs, establishes middle attention layers as superior, provides better aggregation techniques, and introduces NDR as a more reliable evaluation metric than the cancellation effect.

Abstract: Identifying how training samples influence/impact Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, owing to the large model sizes of today consisting of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we propose theoretical and empirical evidence demonstrating how the cancellation effect is unreliable, and that middle attention layers are better estimators for influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without undertaking model retraining, and propose a new metric known as the Noise Detection Rate (NDR) that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.

[107] Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models

Zhimin Qiu, Di Wu, Feng Liu, Yuxiao Wang

Main category: cs.CL

TL;DR: Structure-aware decoding method for nested/overlapping entity extraction that maintains semantic integrity and structural consistency through candidate span generation and hierarchical constraints.

DetailsMotivation: Traditional approaches struggle to maintain both semantic integrity and structural consistency in nested and overlapping entity extraction tasks, requiring better methods that can handle entity boundaries, hierarchical relationships, and cross-dependencies simultaneously.

Method: Uses pretrained language models for context-aware representations, candidate span generation mechanism for multi-granular entity span features, structured attention modeling, hierarchical structural constraints during decoding, and joint optimization of classification loss and structural consistency loss.
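
As a rough illustration of the joint objective, the sketch below combines a span classification loss with a simple structural-consistency penalty that discourages a nested span from being more confident than the span containing it; the penalty's exact form is an assumption, not the paper's loss.

```python
# Hedged sketch of a joint objective: span classification loss plus a
# hierarchical consistency penalty over parent/child spans.
import torch
import torch.nn.functional as F

def joint_loss(span_logits, span_labels, parent_of, lam=0.5):
    """span_logits: (S, C); span_labels: (S,); parent_of: list mapping
    span index -> parent span index, or -1 for top-level spans."""
    cls_loss = F.cross_entropy(span_logits, span_labels)
    # Probability that each span is an entity (class 0 assumed to be "null").
    p_entity = 1.0 - F.softmax(span_logits, dim=-1)[:, 0]
    # Structural consistency: a nested span should not be more confident
    # than the span that contains it.
    penalties = [F.relu(p_entity[i] - p_entity[p])
                 for i, p in enumerate(parent_of) if p >= 0]
    struct_loss = (torch.stack(penalties).mean()
                   if penalties else span_logits.new_zeros(()))
    return cls_loss + lam * struct_loss
```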

Result: Significant improvements in Accuracy, Precision, Recall, and F1-Score on ACE 2005 dataset, particularly in nested and overlapping entity recognition, with stronger boundary localization and structural modeling capability.

Conclusion: Structure-aware decoding is effective for complex semantic extraction, provides new perspective for hierarchical understanding in language models, and establishes methodological foundation for high-precision information extraction.

Abstract: This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.

[108] Dual-objective Language Models: Training Efficiency Without Overfitting

David Samuel, Lucas Georges Gabriel Charpentier

Main category: cs.CL

TL;DR: Combining autoregressive and masked-diffusion training objectives improves language model performance without architectural changes, achieving benefits of both approaches.

DetailsMotivation: Autoregressive models are training-efficient but prone to overfitting, while masked-diffusion models are more resilient to overfitting but less efficient to train. The paper aims to combine both approaches to get the best of both worlds.

Method: Train language models with dual-objective training combining autoregressive and masked-diffusion objectives without architectural modifications. Conduct systematic evaluation of 50 models with varying data repetition levels to find optimal balance between objectives.
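
The dual objective reduces to a weighted sum of two losses on the same model. The sketch below assumes a transformer exposing a `causal` flag and a mask token id; the mixing weight and masking scheme are illustrative, not the paper's exact recipe.

```python
# Hedged sketch of dual-objective training: one weighted sum of a causal
# LM loss and a masked-prediction loss on randomly masked positions.
import torch
import torch.nn.functional as F

def dual_objective_loss(model, tokens, mask_token_id, w_ar=0.5):
    # Autoregressive objective: predict each token from its prefix.
    ar_logits = model(tokens[:, :-1], causal=True)
    ar_loss = F.cross_entropy(ar_logits.reshape(-1, ar_logits.size(-1)),
                              tokens[:, 1:].reshape(-1))
    # Masked-diffusion-style objective: mask a random fraction of tokens
    # (kept away from 0 to avoid an empty mask) and predict bidirectionally.
    mask_ratio = 0.15 + 0.7 * torch.rand(()).item()
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask, mask_token_id)
    md_logits = model(corrupted, causal=False)
    md_loss = F.cross_entropy(md_logits[mask], tokens[mask])
    return w_ar * ar_loss + (1 - w_ar) * md_loss
```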

Result: Dual-objective training consistently outperforms single-objective models across all evaluated settings. The optimal balance between objectives is similar whether targeting autoregressive or masked-diffusion downstream performance.

Conclusion: Combining autoregressive and masked-diffusion objectives is optimal for language modeling, providing both training efficiency and overfitting resilience without requiring architectural changes.

Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal balance between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.

[109] ShareChat: A Dataset of Chatbot Conversations in the Wild

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

Main category: cs.CL

TL;DR: ShareChat is a large-scale dataset of 142,808 real chatbot conversations (660,293 turns) collected from public URLs of major commercial LLM platforms (ChatGPT, Perplexity, Grok, Gemini, Claude), preserving native platform features and covering 101 languages from April 2023 to October 2025.

DetailsMotivation: Current LLM research treats models as generic text generators, but commercial LLMs are distinct products with unique interfaces that shape user behavior. Existing datasets fail to capture authentic chatbot usage because they collect text-only data through uniform interfaces, obscuring platform-specific affordances and real-world interaction patterns.

Method: The authors collected 142,808 conversations (660,293 turns) directly from publicly shared URLs on five major commercial LLM platforms: ChatGPT, Perplexity, Grok, Gemini, and Claude. The dataset preserves native platform features like citations and thinking traces, covers 101 languages, and spans from April 2023 to October 2025. It offers longer context windows and greater interaction depth than prior datasets.

Result: ShareChat provides a comprehensive resource with three illustrative case studies: 1) completeness analysis of intent satisfaction, 2) citation study of model grounding, and 3) temporal analysis of engagement rhythms. The dataset captures authentic user-LLM interactions with preserved platform affordances, making it the first large-scale dataset of real-world chatbot conversations from commercial platforms.

Conclusion: ShareChat addresses the critical limitation of current LLM datasets by providing authentic, platform-preserved chatbot conversations from commercial LLMs. This resource enables researchers to study real-world user behavior, platform-specific interactions, and temporal patterns in LLM usage, offering a vital foundation for understanding how people actually use commercial chatbot products in practice.

Abstract: While academic research typically treats Large Language Models (LLM) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat distinguishes itself by preserving native platform affordances, such as citations and thinking traces, across a diverse collection covering 101 languages and the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset’s breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild. The dataset is publicly available via Hugging Face.

[110] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

Bingyang Kelvin Liu, Ziyu Patrick Chen, David P. Woodruff

Main category: cs.CL

TL;DR: JEPA-Reasoner decouples reasoning from token generation using a latent-space JEPA for reasoning and a separate Talker for text generation, improving robustness and performance.

DetailsMotivation: Current autoregressive models couple reasoning and token generation, making reasoning vulnerable to compounding expression errors. There's a need to isolate reasoning from token-level failures.

Method: Proposes JEPA-Reasoner architecture with two modules: (1) Joint-Embedding Predictive Architecture (JEPA) for pure latent-space reasoning, and (2) Talker module for linguistic reconstruction from latent representations.
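
A minimal skeleton of the decoupled design follows, under the assumption that the latent reasoner can be any recurrent latent-space predictor and the Talker any decoder over the latent trajectory; the module names and interfaces are invented for exposition, not the authors' code.

```python
# Illustrative skeleton: a latent predictor rolls out a chain of latent
# states; a separate Talker decodes the full trajectory into tokens.
import torch
import torch.nn as nn

class JEPAReasoner(nn.Module):
    def __init__(self, d_latent):
        super().__init__()
        self.step = nn.GRUCell(d_latent, d_latent)  # latent-space predictor

    def rollout(self, z0, n_steps):
        zs, z = [z0], z0
        for _ in range(n_steps):
            z = self.step(z, z)        # predict the next latent state
            zs.append(z)
        return torch.stack(zs, dim=1)  # (B, n_steps + 1, d_latent)

class Talker(nn.Module):
    def __init__(self, d_latent, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_latent, vocab_size)

    def forward(self, latents):
        # Token sampling happens only here, so decoding errors cannot
        # propagate back into the latent reasoning chain.
        return self.proj(latents)
```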

Result: A 0.9B model achieves 149.5% improvement in 8-shot GSM8K accuracy over coupled Transformer baseline. Enables error containment, continuous guidance, and uncertainty representation.

Conclusion: Decoupling reasoning from token generation provides a more robust foundation for complex reasoning than scaling coupled models, shifting architectural focus to decoupled designs.

Abstract: Current autoregressive language models couple high-level reasoning and low-level token generation into a single sequential process, making the reasoning trajectory vulnerable to compounding expression errors. We propose JEPA-Reasoner, a novel architectural paradigm that decouples these tasks using a Joint-Embedding Predictive Architecture (JEPA) for pure latent-space reasoning and a separate Talker module for linguistic reconstruction. By isolating the reasoning engine from the discrete token-sampling process, our architecture enables: (1) Error Containment, where token-level failures cannot propagate into the latent reasoning chain; (2) Continuous Guidance, providing the generator with access to the entire lossless reasoning trajectory; and (3) Representation of Uncertainty, allowing the model to maintain multiple hypotheses via mixed latent vectors. Controlled experiments on synthetic and natural language tasks demonstrate that this decoupling enables a 0.9B model to achieve a 149.5% improvement in 8-shot GSM8K accuracy over a coupled Transformer baseline trained on identical data. This work shifts the focus from scaling coupled models to investigating decoupled architectures as a more robust foundation for complex reasoning.

[111] Attention Projection Mixing with Exogenous Anchors

Jonathan Su

Main category: cs.CL

TL;DR: ExoFormer introduces external anchor projections to resolve the structural conflict in cross-layer attention reuse, improving optimization and data efficiency through normalized mixing and preserving token identity.

DetailsMotivation: Cross-layer reuse of early attention projections creates a structural conflict where the first layer must simultaneously serve as a stable reusable anchor for deeper layers while also being an effective computational block, which constrains performance of internal-anchor designs.

Method: ExoFormer learns exogenous anchor projections outside the sequential layer stack, using a unified normalized mixing framework that mixes queries, keys, values, and gate logits with learnable coefficients (elementwise, headwise, scalar), where normalizing anchor sources is key to stable reuse.
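
The normalized mixing idea can be sketched directly. The snippet below shows headwise mixing of a local projection with an RMS-normalized exogenous anchor via learnable sigmoid-gated coefficients; the shapes and names are assumptions.

```python
# Hedged sketch of normalized anchor mixing at headwise granularity.
import torch
import torch.nn as nn

class AnchorMix(nn.Module):
    def __init__(self, n_heads, d_head):
        super().__init__()
        # One learnable mixing coefficient per head (headwise granularity).
        self.logit = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, local, anchor):
        """local, anchor: (B, n_heads, T, d_head)."""
        # Normalizing the anchor source is reported as key to stable reuse;
        # here a plain RMS normalization over the feature dimension.
        anchor = anchor / (anchor.pow(2).mean(dim=-1, keepdim=True).sqrt() + 1e-8)
        a = torch.sigmoid(self.logit)  # in (0, 1), broadcast over (T, d_head)
        return a * anchor + (1 - a) * local
```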

Result: ExoFormer variants consistently outperform internal-anchor counterparts, with the dynamic variant achieving 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention.

Conclusion: External anchors preserve essential token identity through the Offloading Hypothesis, allowing layers to specialize exclusively in feature transformation, and the approach facilitates more efficient optimization and data usage in attention mechanisms.

Abstract: Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant yields 1.5x downstream accuracy points while matching validation loss using 1.5x fewer tokens than Gated Attention. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.

[112] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin

Main category: cs.CL

TL;DR: LLMs have cross-lingual verbosity bias making them unsuitable for time-constrained translation tasks like subtitling. The paper introduces Sand-Glass benchmark and HOMURA RL framework to optimize semantic preservation vs. temporal compliance.

DetailsMotivation: LLMs achieve remarkable multilingual translation but suffer from systemic cross-lingual verbosity bias, making them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to balance semantic fidelity with rigid temporal feasibility.

Method: Introduce Sand-Glass benchmark for evaluating translation under syllable-level duration constraints. Propose HOMURA, a reinforcement learning framework that optimizes the trade-off between semantic preservation and temporal compliance using KL-regularized objective with novel dynamic syllable-ratio reward.
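
One plausible reading of the dynamic syllable-ratio reward is a smooth penalty on deviation from a target syllable ratio, combined with the standard KL-regularized RL objective; the functional forms below are assumptions, not HOMURA's exact reward.

```python
# Hedged sketch: a Gaussian-shaped syllable-ratio reward plus a
# KL-regularized objective toward the reference policy.
import math

def syllable_ratio_reward(src_syllables, tgt_syllables, target_ratio=1.0, tau=0.2):
    # Reward peaks when the translation's syllable count matches the
    # target ratio of the source, and decays smoothly with deviation.
    ratio = tgt_syllables / max(src_syllables, 1)
    return math.exp(-((ratio - target_ratio) ** 2) / (2 * tau ** 2))

def kl_regularized_objective(reward, logp_policy, logp_ref, beta=0.1):
    # Maximize reward while staying close to the reference model
    # (per-sequence log-prob sums assumed).
    return reward - beta * (logp_policy - logp_ref)
```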

Result: Experimental results show the method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.

Conclusion: The proposed approach effectively “tames” LLM output length for time-constrained translation tasks, bridging the gap between semantic fidelity and temporal feasibility in applications like subtitling and dubbing.

Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively “tames” the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.

[113] PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg, Niran Kundapur, Heng Ji

Main category: cs.CL

TL;DR: The paper introduces CalConflictBench, a benchmark for evaluating LLM agents on calendar conflict resolution, and proposes PEARL, a reinforcement learning framework that improves agent performance by 55% through preference memory and round-wise rewards.

DetailsMotivation: Busy professionals spend significant time resolving overlapping calendar invitations, and human delegation doesn't scale. The paper aims to determine if LLMs can be trusted to manage time effectively through automated calendar conflict resolution.

Method: The authors introduce CalConflictBench, a benchmark where conflicts are presented round-by-round over a calendar year. They propose PEARL, a reinforcement learning framework that augments agents with external preference memory to store inferred strategies (attendee priorities, topic importance, time/location preferences) and optimizes agents with round-wise rewards for decision correctness, ranking quality, and memory usage.
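
A hedged sketch of the external preference memory follows: inferred preferences are stored as weighted features and nudged by round-wise rewards. The schema and update rule are invented for illustration, not PEARL's implementation.

```python
# Illustrative preference memory: weighted feature entries updated from
# round-wise feedback, then used to score candidate meetings.
from collections import defaultdict

class PreferenceMemory:
    def __init__(self, lr=0.3):
        self.weights = defaultdict(float)  # e.g. ("attendee", "CEO") -> weight
        self.lr = lr

    def update(self, features, reward):
        # A round-wise reward nudges the weight of every feature of the
        # meeting the agent kept (positive) or dropped (negative).
        for f in features:
            self.weights[f] += self.lr * reward

    def score(self, features):
        # Preference score for a candidate meeting in the next conflict round.
        return sum(self.weights[f] for f in features)
```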

Result: Current LLM agents perform poorly on CalConflictBench (e.g., Qwen-3-30B-Think has 35% average error rate). PEARL achieves an error reduction rate of 0.76 and 55% improvement in average error rate compared to the strongest baseline.

Conclusion: The paper demonstrates that while current LLM agents struggle with calendar conflict resolution, the proposed PEARL framework significantly improves performance through preference memory and reinforcement learning, showing promise for trustworthy time management automation.

Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating this decision process is crucial yet challenging. Scheduling logistics can drain hours, and human delegation often fails at scale, which motivates us to ask: Can we trust large language models (LLMs) or language agents to manage time? To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. In CalConflictBench, conflicts are presented to agents round-by-round over a calendar year, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has an average error rate of 35%. To address this gap, we propose PEARL, a reinforcement-learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance, time/location preferences), and (ii) optimizes the agent with round-wise rewards that directly supervise decision correctness, ranking quality, and memory usage across rounds. Experiments on CalConflictBench show that PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate compared to the strongest baseline.

[114] Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Víctor Yeste, Paolo Rosso

Main category: cs.CL

TL;DR: Study on detecting 19 human values in English sentences using ValueEval'24 corpus, comparing supervised models with LLMs under 8GB GPU constraints.

DetailsMotivation: To develop compute-efficient methods for detecting human values in text under realistic hardware constraints (8GB GPU), providing empirical guidance for value-aware NLP models.

Method: Used ValueEval'24 corpus (74k sentences). Compared: 1) DeBERTa-base for moral presence detection, 2) Direct multi-label vs presence-gated hierarchical approaches, 3) Lightweight auxiliary signals (context, LIWC-22, moral lexica, topics), 4) Soft-voting ensembles, 5) Benchmarking 7-9B LLMs (Gemma 2, Llama 3.1, Mistral, Qwen) in zero/few-shot and QLoRA setups.
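
Soft voting with calibrated per-label thresholds is straightforward to sketch, as shown below; the grid search is a generic recipe, not necessarily the calibration procedure used in the paper.

```python
# Hedged sketch: average member probabilities, then apply per-value
# thresholds tuned on a dev split for the 19-way multi-label task.
import numpy as np
from sklearn.metrics import f1_score

def soft_vote(member_probs, thresholds):
    """member_probs: (n_models, n_samples, 19); thresholds: (19,)."""
    mean_probs = member_probs.mean(axis=0)
    return (mean_probs >= thresholds).astype(int)  # multi-label predictions

def calibrate_thresholds(dev_probs, dev_labels, grid=np.linspace(0.05, 0.95, 19)):
    # Pick, per value, the threshold that maximizes F1 on the dev split.
    n_values = dev_labels.shape[1]
    best = np.full(n_values, 0.5)
    for v in range(n_values):
        f1s = [f1_score(dev_labels[:, v], (dev_probs[:, v] >= t).astype(int),
                        zero_division=0) for t in grid]
        best[v] = grid[int(np.argmax(f1s))]
    return best
```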

Result: 1) Moral presence detectable (F1=0.74), 2) Presence gating didn’t improve over direct prediction under matched compute, 3) Best supervised ensemble reached macro-F1=0.332 (vs previous baseline ~0.28), 4) LLMs lagged behind supervised ensemble under same hardware constraints.

Conclusion: Supervised ensembles with lightweight auxiliary signals outperform LLMs for value detection under 8GB GPU budgets, providing practical guidance for compute-efficient value-aware NLP models.

Abstract: We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus). Each sentence is annotated with value presence, yielding a binary moral-presence label and a 19-way multi-label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa-base classifier attains positive-class F1 = 0.74 with calibrated thresholds. Second, we compare direct multi-label value detectors with presence-gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that gate recall becomes a bottleneck. Third, we investigate lightweight auxiliary signals - short-range context, LIWC-22 and moral lexica, and topic features - and small ensembles. Our best supervised configuration, a soft-voting ensemble of DeBERTa-based models enriched with such signals, reaches macro-F1 = 0.332 on the 19 values, improving over the best previous English-only baseline on this corpus (macro-F1 ≈ 0.28). We additionally benchmark 7-9B instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero-/few-shot and QLoRA setups, and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute-efficient, value-aware NLP models under realistic GPU budgets.

[115] Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Main category: cs.CL

TL;DR: The paper proposes using phonological rules for accent control in TTS systems, introduces a phoneme shift rate metric to measure embedding-rule interactions, and shows that combining rules with speaker embeddings produces more authentic accents while revealing accent-speaker identity entanglement.

DetailsMotivation: Current TTS systems use speaker embeddings for accent control, but these embeddings lack interpretability and controllability since they encode multiple traits (timbre, emotion, accent) simultaneously. There's a need for more transparent and linguistically motivated approaches to accent control in speech synthesis.

Method: The study analyzes interactions between speaker embeddings and phonological rules for accent synthesis, using American vs. British English as a case study. They implement specific phonological rules (flapping, rhoticity, vowel correspondences) and propose the phoneme shift rate (PSR) metric to quantify how strongly embeddings preserve or override rule-based transformations.
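
A PSR-style metric can be sketched as the fraction of rule-targeted positions where the synthesized phoneme actually shifted to the rule's output; the paper's exact definition may differ.

```python
# Hedged sketch of a phoneme shift rate (PSR) style metric.
def phoneme_shift_rate(base_phones, synth_phones, rule_targets, rule_map):
    """base_phones/synth_phones: aligned phoneme sequences; rule_targets:
    indices the rule applies to; rule_map: expected input -> output phonemes."""
    if not rule_targets:
        return 0.0
    shifted = sum(1 for i in rule_targets
                  if synth_phones[i] == rule_map.get(base_phones[i], base_phones[i])
                  and synth_phones[i] != base_phones[i])
    return shifted / len(rule_targets)

# Example: American flapping turns intervocalic /t/ into a flap [dx].
psr = phoneme_shift_rate(["b", "eh", "t", "er"], ["b", "eh", "dx", "er"],
                         rule_targets=[2], rule_map={"t": "dx"})  # -> 1.0
```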

Result: Experiments show that combining phonological rules with speaker embeddings yields more authentic accents than using embeddings alone. The PSR metric reveals that embeddings can attenuate or overwrite rules, demonstrating entanglement between accent and speaker identity in current TTS systems.

Conclusion: Phonological rules provide an effective lever for accent control in TTS systems and offer a framework for evaluating disentanglement in speech generation. The findings highlight the potential for more interpretable and controllable accent synthesis through linguistically motivated approaches.

Abstract: Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.

[116] Common to Whom? Regional Cultural Commonsense and LLM Bias in India

Sangmitra Madhusudan, Trush Shashank More, Steph Buongiorno, Renata Dividino, Jad Kabbara, Ali Emami

Main category: cs.CL

TL;DR: Indica is a benchmark showing cultural commonsense varies regionally within India, not nationally, with LLMs performing poorly (13-21% accuracy) and showing geographic bias toward Central/North India.

DetailsMotivation: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. The authors question whether cultural commonsense holds uniformly within a nation or varies at sub-national levels, focusing on India as a culturally diverse case study.

Method: Created Indica benchmark with human-annotated answers from five Indian regions (North, South, East, West, Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Evaluated eight state-of-the-art LLMs on region-specific questions.

Result: Only 39.4% of questions elicited agreement across all five regions, showing cultural commonsense is predominantly regional, not national. LLMs achieved only 13.4%-20.9% accuracy on region-specific questions and exhibited geographic bias, over-selecting Central and North India (30-40% more often) while under-representing East and West.

Conclusion: Cultural commonsense varies significantly at sub-national levels, challenging monolithic national assumptions. LLMs perform poorly on region-specific cultural knowledge and exhibit systematic geographic biases. The methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation.

Abstract: Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs’ ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the “default” (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.

[117] PLawBench: A Practical Law Benchmark for Evaluating LLMs in Realistic Legal Practice Scenarios

Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L. A. Clarke, Hu Wei

Main category: cs.CL

TL;DR: PLawBench is a practical legal benchmark that evaluates LLMs on realistic legal tasks, revealing significant limitations in their fine-grained legal reasoning capabilities.

DetailsMotivation: Existing legal benchmarks are too simplified and standardized, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. They also use coarse metrics and don't assess fine-grained legal reasoning.

Method: Introduces PLawBench with 850 questions across 13 practical legal scenarios, modeling real-world workflows through three task categories: public legal consultation, practical case analysis, and legal document generation. Uses expert-designed evaluation rubrics (12,500 items) and an LLM-based evaluator aligned with human judgments.

Result: Evaluation of 10 state-of-the-art LLMs shows none achieve strong performance on PLawBench, revealing substantial limitations in fine-grained legal reasoning capabilities.

Conclusion: Current LLMs have significant limitations in practical legal reasoning, highlighting the need for more realistic evaluation benchmarks and improved development of legal LLMs.

Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.

[118] Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

Main category: cs.CL

TL;DR: Elastic Attention enables LLMs to dynamically adjust attention sparsity based on input, improving efficiency in long-context scenarios without sacrificing performance.

DetailsMotivation: Standard attention has quadratic complexity that limits LLM scalability for long contexts. Existing hybrid attention approaches use fixed sparse/full attention ratios and don't adapt to varying task sparsity requirements during inference.

Method: Proposes Elastic Attention that integrates a lightweight Attention Router into pretrained models. The router dynamically assigns each attention head to different computation modes, allowing the model to adjust overall sparsity based on input.
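
A minimal sketch of a lightweight attention router follows, assuming one binary mode decision per head from pooled hidden states; the real router's inputs and granularity may differ.

```python
# Hedged sketch: a small linear head scores each attention head per input
# and assigns it to sparse or full attention, so overall sparsity adapts
# to the input.
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.gate = nn.Linear(d_model, n_heads)  # one logit per head

    def forward(self, x):
        """x: (B, T, d_model) -> boolean (B, n_heads) mode assignment."""
        # Pool the sequence, then decide per head: True = full attention,
        # False = sparse attention for this input.
        logits = self.gate(x.mean(dim=1))
        return logits > 0.0

# Usage sketch: heads routed to sparse mode attend over a reduced key set.
# use_full = router(hidden_states)  # (B, n_heads) boolean
```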

Result: Method achieves both strong performance and efficient inference with only 12 hours of training on 8xA800 GPUs. Experiments across three long-context benchmarks on widely-used LLMs demonstrate superiority.

Conclusion: Elastic Attention provides an adaptive solution to the attention scalability problem, enabling LLMs to dynamically balance computation efficiency and performance based on input characteristics.

Abstract: The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.

[119] S³-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference

Qingsen Ma, Dianyun Wang, Yaoye Wang, Lechen Ning, Sujie Zhu, Xiaohang Zhang, Jiaming Lyu, Linhao Ren, Zhenbo Xu, Zhaofeng He

Main category: cs.CL

TL;DR: S3-Attention is a memory-first inference framework that replaces KV caching with sparse feature indexing for efficient long-context processing, achieving comparable performance to full-context inference while reducing GPU memory usage.

DetailsMotivation: Long-context inference in LLMs suffers from memory inefficiency (linear KV cache scaling) and noise from external retrieval methods that often return lexically similar but causally irrelevant passages.

Method: S3-Attention treats long-context processing as attention-aligned endogenous retrieval: decodes transient key/query projections into sparse feature identifiers using lightweight autoencoders, builds CPU-based inverted index mapping features to token positions during streaming scan, and retrieves compact evidence spans using feature co-activation (optionally fused with BM25).
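
The index-and-retrieve loop can be sketched with a plain inverted index, assuming per-token top-k sparse feature ids from the autoencoder; the data structures and names are illustrative, not the authors' implementation.

```python
# Illustrative index-and-retrieve loop: during a streaming scan, each
# token's top-k sparse feature ids go into a CPU inverted index; at
# generation time, the query's active features vote for token positions.
from collections import defaultdict

class SparseFeatureIndex:
    def __init__(self):
        self.index = defaultdict(list)  # feature id -> token positions

    def add(self, position, feature_ids):
        # Called once per token during the single streaming scan; the KV
        # cache for the chunk can be discarded afterwards.
        for f in feature_ids:
            self.index[f].append(position)

    def retrieve(self, query_feature_ids, top_n=5):
        # Feature co-activation: positions sharing more active features
        # with the query rank higher.
        votes = defaultdict(int)
        for f in query_feature_ids:
            for pos in self.index[f]:
                votes[pos] += 1
        return sorted(votes, key=votes.get, reverse=True)[:top_n]
```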

Result: S3-Hybrid closely matches full-context inference across multiple model families under LongBench evaluation with fixed prompting/decoding/token budgets, improves robustness in information-dense settings, but has higher wall-clock latency than optimized full-KV baselines.

Conclusion: S3-Attention demonstrates effective memory-first inference for long contexts, enabling KV cache elimination and bounded GPU memory usage, though requires future kernel-level optimization to address latency issues.

Abstract: Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.

[120] LLMs versus the Halting Problem: Revisiting Program Termination Prediction

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O’Hearn

Main category: cs.CL

TL;DR: LLMs show strong performance in predicting program termination on SV-Comp 2025 benchmarks, with GPT-5 and Claude Sonnet-4.5 ranking close to top tools, but they struggle with providing valid proofs and performance degrades with longer programs.

DetailsMotivation: The Halting Problem is undecidable, making automatic verification tools approximate and language-specific. Recent LLM advances raise the question of whether LLMs can reliably predict program termination, potentially offering a new approach to this fundamental problem.

Method: Evaluated LLMs on diverse C programs from the Termination category of SV-Comp 2025. Compared LLM performance against traditional verification tools using test-time-scaling.

Result: LLMs perform remarkably well: GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. However, LLMs often fail to provide valid proof witnesses and performance decreases with program length.

Conclusion: LLMs are effective at predicting program termination but have limitations in providing proofs and handling longer programs. These insights motivate further research into using LLMs for reasoning about undecidable problems in computer science.

Abstract: Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs’ performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

[121] RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering

Kaehyun Um, KyuHwan Yeom, Haerim Yang, Minyoung Choi, Hyeongjun Yang, Kyong-Ho Lee

Main category: cs.CL

TL;DR: RPO-RAG: A KG-based retrieval-augmented generation framework specifically designed for small LLMs (<8B parameters) that improves reasoning on knowledge graph question answering tasks through semantic path sampling, relation-aware optimization, and answer-centered prompting.

DetailsMotivation: Existing KG-based RAG approaches have limitations: they use semantics-unaware path sampling, are weakly aligned with KG reasoning objectives, and feed retrieved paths directly without organization. Most prior work focuses on large LLMs (ChatGPT/GPT-4) or models above 7B parameters, leaving sub-7B models underexplored despite their practical importance for resource-efficient applications.

Method: RPO-RAG introduces three key innovations: 1) Query-path semantic sampling strategy for informative supervisory signals, 2) Relation-aware preference optimization that aligns training with intermediate KG reasoning signals like relations, and 3) Answer-centered prompt design that organizes entities and reasoning paths in an interpretable format specifically for small LLMs.

Result: Extensive experiments on WebQSP and CWQ KGQA datasets show RPO-RAG bridges performance gap between small and large LLMs. On WebQSP, improves F1 by up to 8.8%; on CWQ achieves new SOTA among models under 8B parameters in both Hit and F1 metrics. Works effectively even with models under 3B parameters.

Conclusion: RPO-RAG substantially improves reasoning capability of small LLMs (<8B parameters) for KGQA tasks, demonstrating their potential for resource-efficient and practical on-device applications. The framework addresses key limitations of existing KG-based RAG approaches and enables effective knowledge utilization by small models.

Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs’ ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters-highlighting their potential for resource-efficient and practical on-device KGQA applications.

[122] LVLMs and Humans Ground Differently in Referential Communication

Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow

Main category: cs.CL

TL;DR: LVLMs struggle with modeling common ground in collaborative referential communication tasks, unlike humans who excel at interactive resolution of referring expressions.

DetailsMotivation: Generative AI agents need accurate human intent prediction for effective partnership, but current limitations in modeling common ground hinder collaboration abilities.

Method: Referential communication experiment with factorial design (human-human, human-AI, AI-human, AI-AI pairs) using picture matching tasks with non-lexicalized objects, collecting 356 dialogues across 89 pairs over 4 rounds each.

Result: The study reveals LVLMs’ limitations in interactively resolving referring expressions, a crucial skill underlying human language use, while providing tools and corpus for analysis.

Conclusion: Current LVLMs lack the ability to effectively model common ground in collaborative tasks, highlighting a critical deficit that needs addressing for better human-AI partnership.

Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs’ limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.

[123] Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation

Aohua Li, Yuanshuo Zhang, Ge Gao, Bo Chen, Xiaobing Zhao

Main category: cs.CL

TL;DR: Zero-shot stance detection with dynamic target generation and multi-target adaptation for social media, where targets are not predefined but automatically identified from text.

DetailsMotivation: Real-world social media stance detection faces challenges because targets are not predefined or static but complex and dynamic, unlike traditional approaches that assume given targets.

Method: Propose DGTA task with dynamic target generation and multi-target adaptation; construct Chinese social media dataset; explore integrated and two-stage fine-tuning strategies for LLMs; evaluate various baseline models.

Result: Fine-tuned LLMs achieve superior performance: two-stage fine-tuned Qwen2.5-7B attains 66.99% target recognition score; integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves 79.26% stance detection F1 score.

Conclusion: The proposed zero-shot stance detection with dynamic target generation addresses real-world social media challenges, with fine-tuned LLMs demonstrating strong performance on this novel task.

Abstract: Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.

cs.CV

[124] Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

Main category: cs.CV

TL;DR: A method for true-to-scale 3D reconstruction from monocular images to accurately estimate food portion sizes for precision nutrition applications.

DetailsMotivation: Chronic diseases like obesity and diabetes require accurate food intake monitoring. Current AI dietary assessment struggles with portion size estimation from images, and existing 3D reconstruction methods fail to recover real-world scale, limiting their use in precision nutrition.

Method: Leverages rich visual features from models trained on large-scale datasets to estimate the scale of reconstructed objects, converting single-view 3D reconstructions into true-to-life, physically meaningful models.

Result: Outperforms existing techniques on two publicly available datasets, achieving nearly 30% reduction in mean absolute volume-estimation error.

Conclusion: The method bridges 3D computer vision and digital health by enabling accurate true-to-scale 3D reconstruction from monocular images, showing potential to enhance precision nutrition applications.

Abstract: The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of “how much did you eat?” is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: https://gitlab.com/viper-purdue/size-matters

[125] DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

Zhen Yao, Xin Li, Taotao Jing, Shuai Zhang, Mooi Choo Chuah

Main category: cs.CV

TL;DR: DiSa introduces a saliency-aware foreground-background disentangled framework for open-vocabulary semantic segmentation to address VLMs’ foreground bias and limited spatial localization issues.

DetailsMotivation: Vision-language models like CLIP have two critical limitations when adapted to segmentation: (1) Foreground Bias - they tend to ignore background regions, and (2) Limited Spatial Localization - resulting in blurred object boundaries.

Method: DiSa uses a Saliency-aware Disentanglement Module (SDM) to separately model foreground and background ensemble features in a divide-and-conquer manner, and a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates.

Result: Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.

Conclusion: The proposed DiSa framework effectively addresses the foreground bias and spatial localization limitations of VLMs in open-vocabulary semantic segmentation through explicit saliency-aware foreground-background disentanglement and hierarchical refinement.

Abstract: Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.

[126] Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang

Main category: cs.CV

TL;DR: SSMAE is a semi-supervised framework that combines masked image reconstruction with classification using dynamically selected pseudo-labels, featuring a validation-driven gating mechanism to reduce confirmation bias in Vision Transformers.

DetailsMotivation: To address the challenge of training Vision Transformers when labeled data is scarce but unlabeled data is abundant, overcoming the limitations of supervised ViTs and fine-tuned MAE in low-label regimes.

Method: Proposes SSMAE framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled data with dynamically selected pseudo-labels. Introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions consistent across both weakly and strongly augmented views.
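
A minimal sketch of the gating idea follows (the thresholds, model interface, and augmentation pipeline are assumptions, not the released code): pseudo-labels are enabled only after validation accuracy clears a gate, and only for samples whose weak and strong views agree with high confidence.

```python
import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.95   # assumed confidence cutoff
GATE_VAL_ACC = 0.70     # assumed validation accuracy needed to open the gate

def pseudo_label_loss(model, weak_views, strong_views, val_acc):
    """Unsupervised loss on confidently, consistently pseudo-labeled samples."""
    if val_acc < GATE_VAL_ACC:                 # gate closed: no pseudo-labels yet
        return torch.zeros((), device=weak_views.device)
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_views), dim=-1)
        conf, targets = probs_weak.max(dim=-1)
        preds_strong = model(strong_views).argmax(dim=-1)
        # keep samples that are both confident and consistent across views
        mask = (conf >= CONF_THRESHOLD) & (preds_strong == targets)
    if mask.sum() == 0:
        return torch.zeros((), device=weak_views.device)
    logits = model(strong_views[mask])         # gradient flows via strong view
    return F.cross_entropy(logits, targets[mask])
```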

Result: SSMAE consistently outperforms supervised ViT and fine-tuned MAE on CIFAR-10 and CIFAR-100, with largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Demonstrates that when pseudo-labels are introduced is as important as how they are generated.

Conclusion: SSMAE effectively leverages abundant unlabeled data for Vision Transformer training in low-label scenarios through careful pseudo-label selection and timing, providing a practical solution for data-efficient transformer training with code publicly available.

Abstract: We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Codes are available at https://github.com/atik666/ssmae.

[127] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Shuai Liu, Youmeng Li, Jizeng Wei

Main category: cs.CV

TL;DR: XY-Cut++ is an advanced document reading order recovery method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to handle complex layouts and outperforms existing methods by up to 24%.

DetailsMotivation: Existing document reading order recovery methods struggle with complex layouts (like multi-column newspapers), high-overhead cross-modal interactions between visual regions and textual semantics, and lack robust evaluation benchmarks, limiting their effectiveness for RAG and LLM preprocessing.

Method: XY-Cut++ integrates three key components: pre-mask processing, multi-granularity segmentation, and cross-modal matching to address layout ordering challenges. It builds upon traditional XY-Cut techniques with these enhancements for better handling of complex document structures.
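
For context, the classic XY-Cut baseline that XY-Cut++ builds on can be sketched as a recursive gap-based split over layout boxes; the pre-mask processing, multi-granularity segmentation, and cross-modal matching that distinguish XY-Cut++ are not reproduced here.

```python
# Boxes are (x0, y0, x1, y1) in page coordinates.

def xy_cut(boxes, vertical=True):
    """Return boxes in reading order by alternating vertical/horizontal cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    axis = (1, 3) if vertical else (0, 2)           # y-extent or x-extent
    boxes = sorted(boxes, key=lambda b: b[axis[0]])
    groups, current, frontier = [], [boxes[0]], boxes[0][axis[1]]
    for b in boxes[1:]:
        if b[axis[0]] > frontier:                   # a gap: start a new slice
            groups.append(current)
            current = [b]
        else:
            current.append(b)
        frontier = max(frontier, b[axis[1]])
    groups.append(current)
    if len(groups) == 1:                            # no gap found on this axis
        if vertical:
            return xy_cut(boxes, vertical=False)    # try the other axis
        return boxes                                # give up: keep sorted order
    ordered = []
    for g in groups:
        ordered.extend(xy_cut(g, vertical=not vertical))
    return ordered

page = [(0, 0, 100, 20), (0, 30, 48, 90), (52, 30, 100, 90)]
print(xy_cut(page))  # header first, then left column, then right column
```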

Result: XY-Cut++ achieves state-of-the-art performance with 98.8 BLEU overall on the DocBench-100 dataset, outperforming existing baselines by up to 24%. It demonstrates consistent accuracy across both simple and complex layouts while maintaining simplicity and efficiency.

Conclusion: XY-Cut++ establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective Retrieval-Augmented Generation (RAG) and large language model (LLM) preprocessing.

Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts (e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

[128] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer

Main category: cs.CV

TL;DR: Sparse CLIP integrates sparsity directly into CLIP training to create interpretable yet high-performing vision-language representations, challenging the assumption that interpretability requires sacrificing accuracy.

DetailsMotivation: CLIP's dense and opaque latent representations pose significant interpretability challenges. While post-hoc approaches like Sparse Autoencoders (SAEs) exist, they often degrade downstream performance and lose CLIP's multimodal capabilities. The paper aims to address this tension between interpretability and performance.

Method: Proposes integrating sparsity directly into CLIP training rather than using post-hoc approaches. This yields sparse representations that maintain performance while being interpretable. The method preserves multimodal capabilities and enables semantic concept alignment.
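
The summary does not spell out the sparsity mechanism, but one plausible reading is a hard top-k activation applied to both encoders before the contrastive loss; the sketch below assumes that choice, with k and the temperature as placeholders.

```python
import torch
import torch.nn.functional as F

def topk_sparsify(z, k=32):
    """Keep the k largest-magnitude dimensions per embedding, zero the rest."""
    idx = z.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
    return z * mask

def sparse_clip_loss(img_emb, txt_emb, k=32, temperature=0.07):
    zi = F.normalize(topk_sparsify(img_emb, k), dim=-1)
    zt = F.normalize(topk_sparsify(txt_emb, k), dim=-1)
    logits = zi @ zt.t() / temperature
    targets = torch.arange(len(zi), device=zi.device)
    # symmetric contrastive loss over image->text and text->image directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```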

Result: Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability compared to SAEs, and retain multimodal capabilities. The approach enables semantic concept alignment, reveals cross-modal knowledge emergence dynamics, and allows for interpretable vision-based steering in vision-language models.

Conclusion: The findings challenge the conventional wisdom that interpretability requires sacrificing accuracy. Interpretability and performance can be co-optimized, offering a promising design principle for future multimodal models. The approach demonstrates that sparse representations can maintain both interpretability and strong performance.

Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP’s dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP’s inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.

[129] NucFuseRank: Dataset Fusion and Performance Ranking for Nuclei Instance Segmentation

Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Sepideh Hatamikia, Diana Mechtcheriakova, Amirreza Mahbod

Main category: cs.CV

TL;DR: This paper focuses on benchmarking nuclei instance segmentation datasets rather than developing new models, by standardizing public H&E-stained image datasets, evaluating them with state-of-the-art models, and creating unified training/test sets for fair comparison.

DetailsMotivation: Most research focuses on developing new segmentation algorithms and benchmarking on limited datasets, but there's a need for systematic evaluation of datasets themselves for nuclei instance segmentation in H&E-stained histological images.

Method: 1) Identified manually annotated public H&E-stained image datasets through literature review; 2) Standardized them into unified input/annotation format; 3) Evaluated datasets using two state-of-the-art models (CNN-based and hybrid CNN-Vision Transformer); 4) Created unified test set (NucFuse-test) for cross-dataset evaluation and unified training set (NucFuse-train) by merging multiple datasets.

Result: Systematically evaluated and ranked datasets based on segmentation performance, provided comprehensive analyses, generated fused datasets, conducted external validation, and made implementation publicly available to establish a new benchmark.

Conclusion: The work provides a comprehensive benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E-stained histological images, shifting focus from model development to dataset evaluation and standardization.

Abstract: Nuclei instance segmentation in hematoxylin and eosin (H&E)-stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than focusing on model development, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of H&E-stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state-of-the-art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train) for improved segmentation performance by merging images from multiple datasets. By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E-stained histological images.

[130] RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong, Xiao Sun

Main category: cs.CV

TL;DR: RacketVision introduces a novel sports analytics dataset with fine-grained racket pose and ball position annotations for table tennis, tennis, and badminton, enabling research on ball tracking, racket pose estimation, and trajectory forecasting.

DetailsMotivation: There's a need for large-scale, fine-grained datasets in sports analytics that capture complex human-object interactions, particularly for racket sports where traditional datasets lack detailed racket pose annotations alongside ball positions.

Method: Created a comprehensive dataset with fine-grained racket pose annotations and ball positions across three racket sports. Evaluated baseline methods and discovered that CrossAttention mechanisms are essential for effective multi-modal fusion of racket pose features with ball tracking data.
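
A minimal sketch of the fusion finding, with illustrative dimensions and output head (not the benchmark's reference code): ball-trajectory tokens cross-attend to racket-pose tokens rather than being concatenated with them.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, 2)            # (x, y) per future step

    def forward(self, ball_feats, racket_feats):
        # ball_feats: (B, T, d) trajectory tokens; racket_feats: (B, K, d)
        fused, _ = self.attn(query=ball_feats, key=racket_feats,
                             value=racket_feats)
        fused = self.norm(ball_feats + fused)         # residual keeps the
        return self.head(fused)                       # unimodal signal intact

model = CrossAttentionFusion()
pred = model(torch.randn(2, 16, 128), torch.randn(2, 8, 128))
print(pred.shape)  # torch.Size([2, 16, 2])
```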

Result: The dataset enables three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. CrossAttention-based fusion outperforms naive concatenation approaches and surpasses strong unimodal baselines for trajectory prediction.

Conclusion: RacketVision provides a versatile resource for sports analytics research, demonstrating that sophisticated multi-modal fusion (CrossAttention) is crucial for leveraging racket pose information effectively, with applications in dynamic object tracking, conditional motion forecasting, and multimodal analysis.

Abstract: We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision

[131] Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

Main category: cs.CV

TL;DR: SAP is a training-free pruning method that identifies key visual patches from middle layers to achieve over 90% index vector compression while maintaining retrieval performance, challenging previous assumptions about query-dependent visual token importance.

DetailsMotivation: Existing Vision-Language Models for Visual Document Retrieval have prohibitive index vector size overheads. Current training-free pruning methods underperform random selection in high-compression scenarios (>80%), and prior research questions the feasibility of training-free pruning by claiming visual token importance is query-dependent.

Method: Proposes Structural Anchor Pruning (SAP) - a training-free pruning method that identifies key visual patches from middle layers rather than final layers. Also introduces Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency.
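
A hedged sketch of the core selection step, with assumed shapes and scoring rule: patches are ranked by the attention mass they receive in a middle layer, and only the top fraction is kept as index vectors.

```python
import torch

def prune_patch_embeddings(patch_embs, mid_layer_attn, keep_ratio=0.10):
    """
    patch_embs:     (N, D) multi-vector index embeddings for one page image
    mid_layer_attn: (H, N, N) attention from a middle transformer layer
    """
    # how much total attention each patch receives, averaged over heads
    scores = mid_layer_attn.mean(dim=0).sum(dim=0)          # (N,)
    k = max(1, int(keep_ratio * patch_embs.shape[0]))
    keep = scores.topk(k).indices.sort().values             # preserve order
    return patch_embs[keep]

embs = torch.randn(1024, 128)
attn = torch.rand(8, 1024, 1024)
print(prune_patch_embeddings(embs, attn).shape)  # torch.Size([102, 128])
```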

Result: On ViDoRe benchmark, SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity. OSR analysis reveals that semantic structural anchor patches persist in middle layers, unlike in final layers where structural signals dissipate.

Conclusion: SAP provides a highly scalable solution for Visual RAG by achieving high-performance compression through middle-layer structural anchor identification, challenging previous assumptions about the infeasibility of training-free pruning for visual document retrieval.

Abstract: Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to the conclusion that visual token importance is inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high performance compression. We also introduce Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer where structural signals dissipate.

[132] Efficient Token Pruning for LLaDA-V

Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang

Main category: cs.CV

TL;DR: Proposes structured token pruning for diffusion-based multimodal models (LLaDA-V) by selectively removing visual tokens in middle-to-late layers to reduce FLOPs by 65% while maintaining 95% task performance.

DetailsMotivation: Diffusion-based multimodal models like LLaDA-V have high computational overhead due to bidirectional attention and iterative denoising. Analysis shows these models aggregate cross-modal information mainly in middle-to-late layers, causing delayed semantic alignment and inefficient processing.

Method: Proposes structured token pruning strategy targeting middle-to-late layers of the first denoising step. Unlike FastV’s shallow-layer pruning, this aligns with LLaDA-V’s delayed attention aggregation. First-step pruning reduces computation across all subsequent denoising steps while preserving critical semantic information.
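
The sketch below illustrates a FastV-style pruning step adapted to this setting: at a designated middle-to-late layer of the first denoising step, the least-attended visual tokens are dropped and the surviving index set is reused for all later steps. Shapes, the scoring rule, and the drop ratio are assumptions.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_mask, drop_ratio=0.5):
    """
    hidden:      (L, D) token hidden states at the pruning layer
    attn:        (H, L, L) attention weights at that layer
    visual_mask: (L,) True where the token is a visual patch
    Returns pruned hidden states and the kept-index map for later steps.
    """
    received = attn.mean(dim=0).sum(dim=0)            # attention each token gets
    vis_idx = visual_mask.nonzero(as_tuple=True)[0]
    n_keep = int(len(vis_idx) * (1.0 - drop_ratio))
    keep_vis = vis_idx[received[vis_idx].topk(n_keep).indices]
    keep = torch.cat([(~visual_mask).nonzero(as_tuple=True)[0], keep_vis])
    keep = keep.sort().values                          # preserve sequence order
    return hidden[keep], keep
```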

Result: Best configuration reduces computational cost by up to 65% while preserving an average of 95% task performance across multiple benchmarks. First work to investigate structured token pruning in diffusion-based large multimodal models.

Conclusion: Provides empirical basis for efficient LLaDA-V inference and demonstrates potential of vision-aware pruning in diffusion-based multimodal models. Structured pruning in middle-to-late layers effectively balances efficiency and performance.

Abstract: Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step to align with LLaDA-V’s delayed attention aggregation to maintain output quality, and the first-step pruning strategy reduces the computation across all subsequent steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% task performance.

[133] TeleStyle: Content-Preserving Style Transfer in Images and Videos

Shiwen Zhang, Xiaoyan Yang, Bojia Zi, Haibin Huang, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: TeleStyle is a lightweight Diffusion Transformer model for image and video style transfer that achieves state-of-the-art performance by leveraging curriculum continual learning on hybrid datasets.

DetailsMotivation: Content-preserving style transfer remains challenging for Diffusion Transformers due to entangled content and style features in their representations. Existing methods struggle with maintaining content fidelity while achieving good style transfer.

Method: Built on Qwen-Image-Edit, TeleStyle uses a Curriculum Continual Learning framework trained on hybrid datasets of curated clean triplets and synthetic noisy triplets from diverse style categories. Includes video-to-video stylization module for temporal consistency.

Result: Achieves state-of-the-art performance across three core metrics: style similarity, content consistency, and aesthetic quality. Model generalizes well to unseen styles without compromising content fidelity.

Conclusion: TeleStyle presents an effective solution for content-preserving style transfer in both images and videos, demonstrating superior performance through innovative training methodology and dataset curation.

Abstract: Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model’s robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at https://github.com/Tele-AI/TeleStyle

[134] Automated Marine Biofouling Assessment: Benchmarking Computer Vision and Multimodal LLMs on the Level of Fouling Scale

Brayden Hamilton, Tim Cashmore, Peter Driscoll, Trevor Gee, Henry Williams

Main category: cs.CV

TL;DR: Automated classification of marine biofouling severity using computer vision models and LLMs, with hybrid methods showing promise for scalable assessment.

DetailsMotivation: Traditional diver inspections for marine biofouling on vessel hulls are hazardous and not scalable, creating need for automated assessment methods to address ecological, economic, and biosecurity risks.

Method: Evaluated convolutional neural networks, transformer-based segmentation, and zero-shot large multimodal language models on expert-labelled dataset from New Zealand Ministry for Primary Industries, using structured prompts and retrieval for LLMs.

Result: Computer vision models achieved high accuracy at extreme Level of Fouling categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs achieved competitive performance without training and provided interpretable outputs.

Conclusion: Hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment, demonstrating complementary strengths across approaches.

Abstract: Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional survey methods rely on diver inspections, which are hazardous and limited in scalability. This work investigates automated classification of biofouling severity on the Level of Fouling (LoF) scale using both custom computer vision models and large multimodal language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs were evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models showed high accuracy at extreme LoF categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs, guided by structured prompts and retrieval, achieved competitive performance without training and provided interpretable outputs. The results demonstrate complementary strengths across approaches and suggest that hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment.

[135] JAFAR: Jack up Any Feature at Any Resolution

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

Main category: cs.CV

TL;DR: JAFAR is a lightweight attention-based feature upsampler that enhances spatial resolution of foundation vision encoder outputs to arbitrary target resolutions without high-resolution supervision.

DetailsMotivation: Foundation vision encoders produce low-resolution spatial features, requiring upsampling for high-resolution downstream tasks. Existing methods may be limited in flexibility or performance.

Method: JAFAR uses attention-based module with Spatial Feature Transform (SFT) modulation to align high-resolution queries (from low-level features) with semantically enriched low-resolution keys. Learns at low upsampling ratios and generalizes to higher scales.
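
A simplified sketch of the upsampler's data flow (the SFT modulation is omitted and all dimensions are placeholders): high-resolution low-level features form the queries, and the encoder's low-resolution features supply keys and values.

```python
import torch
import torch.nn as nn

class AttentionUpsampler(nn.Module):
    def __init__(self, d_low=64, d_feat=384, n_heads=4):
        super().__init__()
        self.q_proj = nn.Conv2d(d_low, d_feat, 1)
        self.attn = nn.MultiheadAttention(d_feat, n_heads, batch_first=True)

    def forward(self, low_level, feats):
        # low_level: (B, d_low, Hq, Wq) high-res cues; feats: (B, d_feat, h, w)
        B, _, Hq, Wq = low_level.shape
        q = self.q_proj(low_level).flatten(2).transpose(1, 2)   # (B, Hq*Wq, d)
        kv = feats.flatten(2).transpose(1, 2)                   # (B, h*w, d)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, -1, Hq, Wq)       # (B, d, Hq, Wq)

up = AttentionUpsampler()
hi = up(torch.randn(1, 64, 128, 128), torch.randn(1, 384, 16, 16))
print(hi.shape)  # torch.Size([1, 384, 128, 128])
```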

Result: Extensive experiments show JAFAR effectively recovers fine-grained spatial details and outperforms existing feature upsampling methods across diverse downstream tasks.

Conclusion: JAFAR provides a flexible, lightweight solution for feature upsampling that generalizes well to high output scales without requiring high-resolution supervision.

Abstract: Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io

[136] DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang

Main category: cs.CV

TL;DR: DenseGRPO addresses sparse reward problem in GRPO-based flow matching models by introducing dense step-wise rewards and adaptive exploration for better human preference alignment in text-to-image generation.

DetailsMotivation: Existing GRPO-based approaches suffer from sparse reward problem where terminal rewards are applied to all intermediate denoising steps, creating mismatch between global feedback and fine-grained contributions at each step.

Method: Proposes DenseGRPO with two key components: (1) predicts step-wise reward gain as dense reward for each denoising step using ODE-based approach on intermediate clean images; (2) introduces reward-aware scheme to calibrate exploration space by adaptively adjusting timestep-specific stochasticity injection in SDE sampler.
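
A rough sketch of the dense-reward computation, where ode_to_clean and reward_model are placeholders for components the paper assumes: each step's dense reward is the gain in the reward of the ODE-completed clean image.

```python
import torch

def dense_step_rewards(traj_states, timesteps, ode_to_clean, reward_model):
    """
    traj_states: list of latents x_t along one sampled denoising trajectory
    Returns per-step reward gains r_k = R(x0_hat_k) - R(x0_hat_{k-1}).
    """
    scores = []
    for x_t, t in zip(traj_states, timesteps):
        with torch.no_grad():
            x0_hat = ode_to_clean(x_t, t)       # deterministic completion
            scores.append(reward_model(x0_hat))
    scores = torch.stack(scores)                # (K,) trajectory-aligned scores
    gains = scores.clone()                      # first entry keeps its raw score
    gains[1:] = scores[1:] - scores[:-1]        # credit each step its gain
    return gains
```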

Result: Extensive experiments on multiple standard benchmarks demonstrate effectiveness of DenseGRPO and highlight critical role of valid dense rewards in flow matching model alignment.

Conclusion: DenseGRPO successfully addresses sparse reward problem in GRPO-based flow matching models by providing fine-grained step-wise rewards and adaptive exploration, leading to improved human preference alignment for text-to-image generation.

Abstract: Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluate the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as the dense reward for each denoising step, applying a reward model to the intermediate clean images via an ODE-based approach. This ensures alignment between feedback signals and the contributions of individual steps, facilitating effective training; and (2) based on the estimated dense rewards, we reveal a mismatch between the uniform exploration setting and the time-varying noise intensity in existing GRPO-based methods, which leads to an inappropriate exploration space. Thus, we propose a reward-aware scheme to calibrate the exploration space by adaptively adjusting a timestep-specific stochasticity injection in the SDE sampler, ensuring a suitable exploration space at all timesteps. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of valid dense rewards in flow matching model alignment.

[137] Neural Cellular Automata: From Cells to Pixels

Ehsan Pajouheshgar, Yitao Xu, Ali Abbasi, Alexander Mordvintsev, Wenzel Jakob, Sabine Süsstrunk

Main category: cs.CV

TL;DR: Hybrid NCA model combines coarse-grid NCA with implicit decoder for arbitrary-resolution outputs while preserving self-organizing behavior.

DetailsMotivation: NCAs are limited to low-resolution outputs due to quadratic training/memory growth with grid size, local information propagation limiting long-range communication, and heavy compute demands for real-time high-resolution inference.

Method: Pair an NCA evolving on a coarse grid with a lightweight implicit decoder that maps cell states and local coordinates to appearance attributes, enabling arbitrary-resolution rendering. Use task-specific losses for morphogenesis and texture synthesis with minimal memory/computation overhead.
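
A minimal sketch of the hybrid design under assumed channel counts: the NCA's coarse cell states are paired with sub-cell coordinates and decoded per pixel, so the same states can be rendered at any scale.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Maps (coarse cell state, local sub-cell coordinates) to RGB."""
    def __init__(self, state_ch=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_ch + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, states, scale):
        # states: (B, C, H, W) NCA cell states; output: (B, 3, H*scale, W*scale)
        B, C, H, W = states.shape
        up = states.repeat_interleave(scale, dim=2).repeat_interleave(scale, dim=3)
        ys = (torch.arange(H * scale) % scale).float() / scale  # local y in [0,1)
        xs = (torch.arange(W * scale) % scale).float() / scale
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([gy, gx]).expand(B, -1, -1, -1)    # (B, 2, Hs, Ws)
        inp = torch.cat([up, coords], dim=1).permute(0, 2, 3, 1)
        return self.net(inp).permute(0, 3, 1, 2)                # per-pixel MLP

decoder = ImplicitDecoder()
rgb = decoder(torch.randn(1, 16, 32, 32), scale=8)  # 32x32 grid -> 256x256 image
print(rgb.shape)  # torch.Size([1, 3, 256, 256])
```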

Result: Hybrid models produce high-resolution outputs in real-time across 2D/3D grids and mesh domains while preserving characteristic self-organizing behavior of NCAs.

Conclusion: The proposed hybrid approach overcomes NCA resolution limitations by combining coarse-grid evolution with implicit decoding, enabling efficient high-resolution generation with preserved self-organizing properties.

Abstract: Neural Cellular Automata (NCAs) are bio-inspired dynamical systems in which identical cells iteratively apply a learned local update rule to self-organize into complex patterns, exhibiting regeneration, robustness, and spontaneous dynamics. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution outputs. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information that impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing an NCA that evolves on a coarse grid with a lightweight implicit decoder that maps cell states and local coordinates to appearance attributes, enabling the same model to render outputs at arbitrary resolution. Moreover, because both the decoder and NCA updates are local, inference remains highly parallelizable. To supervise high-resolution outputs efficiently, we introduce task-specific losses for morphogenesis (growth from a seed) and texture synthesis with minimal additional memory and computation overhead. Our experiments across 2D/3D grids and mesh domains demonstrate that our hybrid models produce high-resolution outputs in real-time, and preserve the characteristic self-organizing behavior of NCAs.

[138] Feature Projection Learning for Better Vision-Language Reasoning

Yi Zhang, Weicheng Lin, Liang-Jie Zhang

Main category: cs.CV

TL;DR: FPL (Feature Projection Learning) is a simple, efficient method that adapts CLIP to downstream tasks by projecting class prototypes into query image feature space and using reconstruction error as class scores, outperforming SOTA methods.

DetailsMotivation: Existing methods for adapting Vision-Language Pre-trained models like CLIP to downstream tasks suffer from limited performance, excessive learnable parameters, or extended training times, hindering their effectiveness.

Method: FPL uses a projection model to project class prototype features into query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error serves as the class score, transforming classification into a feature projection problem. The final output combines predictions from the projection model and original CLIP.
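
The paper learns a projection model; as a stand-in, the sketch below solves the projection in closed form, reconstructing the query feature map from each class's prototypes and scoring by negative mean squared reconstruction error.

```python
import torch

def fpl_scores(query_feats, class_prototypes):
    """
    query_feats:      (N, D) patch features of the query image
    class_prototypes: dict class_id -> (M, D) prototype features
    """
    scores = {}
    for cid, protos in class_prototypes.items():
        # coefficients A minimizing ||A @ protos - query_feats||^2
        sol = torch.linalg.lstsq(protos.t(), query_feats.t()).solution
        recon = (protos.t() @ sol).t()                  # (N, D) reconstruction
        scores[cid] = -((recon - query_feats) ** 2).mean()
    return scores   # highest (least negative) score = predicted class
```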

Result: Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing current state-of-the-art methods by a substantial margin.

Conclusion: FPL provides a simple yet efficient and effective approach for adapting CLIP to downstream tasks, addressing limitations of previous methods while achieving better performance.

Abstract: Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods either suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.

[139] AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction

Chao Wang, Zijin Yang, Yaofei Wang, Weiming Zhang, Kejiang Chen

Main category: cs.CV

TL;DR: AEDR is a training-free attribution method for generative models that uses double reconstruction with autoencoders to trace image origins, achieving 25.5% higher accuracy with 1% computational time compared to existing methods.

DetailsMotivation: The rapid advancement of image-generation technologies raises security concerns about malicious use of photorealistic images. Tracing the origin of such images is essential for security, but existing reconstruction-based attribution methods suffer from reduced accuracy and high computational costs when applied to state-of-the-art generative models.

Method: AEDR (AutoEncoder Double-Reconstruction) performs two consecutive reconstructions using the model’s autoencoder and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy. The method is training-free and designed for generative models with continuous autoencoders.
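
A minimal sketch of the attribution signal, assuming a vae object exposing encode/decode and reducing the homogeneity calibration to a placeholder multiplier.

```python
import torch

def aedr_signal(image, vae, homogeneity=1.0):
    with torch.no_grad():
        rec1 = vae.decode(vae.encode(image))   # first reconstruction
        rec2 = vae.decode(vae.encode(rec1))    # reconstruct the reconstruction
    loss1 = ((rec1 - image) ** 2).mean()
    loss2 = ((rec2 - rec1) ** 2).mean()
    # an image near this autoencoder's manifold changes little on the second
    # pass, so a small ratio points toward the matching generator
    return (loss2 / (loss1 + 1e-12)) * homogeneity
```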

Result: Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.

Conclusion: AEDR provides an effective and efficient solution for tracing the origin of images generated by advanced generative models, addressing both accuracy and computational efficiency challenges in attribution methods.

Abstract: The rapid advancement of image-generation technologies has made it possible for anyone to create photorealistic images using generative models, raising significant security concerns. To mitigate malicious use, tracing the origin of such images is essential. Reconstruction-based attribution methods offer a promising solution, but they often suffer from reduced accuracy and high computational costs when applied to state-of-the-art (SOTA) models. To address these challenges, we propose AEDR (AutoEncoder Double-Reconstruction), a novel training-free attribution method designed for generative models with continuous autoencoders. Unlike existing reconstruction-based approaches that rely on the value of a single reconstruction loss, AEDR performs two consecutive reconstructions using the model’s autoencoder, and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy, which inherently cancels out absolute biases caused by image complexity, with autoencoder-based reconstruction ensuring superior computational efficiency. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.

[140] Visual Prompt-Agnostic Evolution

Junze Wang, Lei Fan, Dezheng Zhang, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong

Main category: cs.CV

TL;DR: PAE improves Visual Prompt Tuning by addressing unstable training dynamics through frequency-aware initialization, cross-layer coordination via Koopman operator, and Lyapunov-based regularization, achieving faster convergence and better accuracy.

DetailsMotivation: Existing Visual Prompt Tuning (VPT) methods suffer from unstable training dynamics with gradient oscillations, where shallow-layer prompts stagnate early while deeper-layer prompts oscillate with high variance, causing cross-layer mismatch that slows convergence and degrades performance.

Method: Proposes Prompt-Agnostic Evolution (PAE) with three key components: 1) Frequency-aware initialization that uncovers and propagates frequency shortcut patterns the backbone uses for recognition, 2) Shared Koopman operator for cross-layer coordination using global linear transformation instead of layer-specific updates, 3) Lyapunov stability theory-inspired regularizer to constrain error amplification during prompt evolution.
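
A hedged sketch of the cross-layer coordination alone (frequency-aware initialization and the Lyapunov regularizer are omitted, and all dimensions are placeholders): a single shared linear Koopman operator evolves the prompts from layer to layer instead of learning each layer's prompts independently.

```python
import torch
import torch.nn as nn

class KoopmanPrompts(nn.Module):
    def __init__(self, n_prompts=8, dim=768, n_layers=12):
        super().__init__()
        self.base = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.koopman = nn.Parameter(torch.eye(dim))     # shared across layers
        self.n_layers = n_layers

    def forward(self):
        prompts, p = [], self.base
        for _ in range(self.n_layers):
            prompts.append(p)
            p = p @ self.koopman                        # global linear evolution
        return prompts                                  # one (n_prompts, dim) per layer
```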

Result: PAE accelerates convergence with average 1.41× speedup and improves accuracy by 1-3% on 25 datasets across multiple downstream tasks. It’s prompt-agnostic, lightweight, and integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.

Conclusion: PAE effectively addresses the training instability in Visual Prompt Tuning by modeling prompt dynamics through frequency-aware initialization, cross-layer coordination, and stability regularization, resulting in faster convergence and better performance while maintaining compatibility with existing VPT frameworks.

Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution (PAE), which strengthens vision prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that PAE accelerates convergence with an average 1.41× speedup and improves accuracy by 1–3% on 25 datasets across multiple downstream tasks. Beyond performance, PAE is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.

[141] WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction

Hung Nguyen, Runfa Li, An Le, Truong Nguyen

Main category: cs.CV

TL;DR: WaveletGaussian: A more efficient framework for sparse-view 3D Gaussian Splatting reconstruction by shifting diffusion to wavelet domain and using lightweight refinement for high-frequency components.

DetailsMotivation: 3D Gaussian Splatting performs poorly in sparse-view settings, and existing diffusion-based repair methods are computationally expensive due to fine-tuning and repair steps.

Method: Shift diffusion to wavelet domain (only apply to low-resolution LL subband), use lightweight network for high-frequency subbands, and employ efficient online random masking strategy for training pairs instead of leave-one-out.
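
A small sketch of the wavelet-domain split using PyWavelets, with identity placeholders standing in for the diffusion repair and the lightweight refiner.

```python
import numpy as np
import pywt

def wavelet_repair(render, diffusion_repair, light_refine, wavelet="haar"):
    ll, (lh, hl, hh) = pywt.dwt2(render, wavelet)     # render: (H, W) array
    ll = diffusion_repair(ll)                          # diffusion only on LL
    lh, hl, hh = (light_refine(b) for b in (lh, hl, hh))
    return pywt.idwt2((ll, (lh, hl, hh)), wavelet)

repaired = wavelet_repair(
    np.random.rand(256, 256),
    diffusion_repair=lambda x: x,   # identity placeholders for the demo
    light_refine=lambda x: x,
)
print(repaired.shape)  # (256, 256)
```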

Result: Achieves competitive rendering quality on Mip-NeRF 360 and OmniObject3D datasets while substantially reducing training time compared to previous methods.

Conclusion: WaveletGaussian provides an efficient alternative for sparse-view 3D Gaussian reconstruction by leveraging wavelet domain processing and optimized training strategies.

Abstract: 3D Gaussian Splatting (3DGS) has become a powerful representation for image-based object reconstruction, yet its performance drops sharply in sparse-view settings. Prior works address this limitation by employing diffusion models to repair corrupted renders, subsequently using them as pseudo ground truths for later optimization. While effective, such approaches incur heavy computation from the diffusion fine-tuning and repair steps. We present WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object reconstruction. Our key idea is to shift diffusion into the wavelet domain: diffusion is applied only to the low-resolution LL subband, while high-frequency subbands are refined with a lightweight network. We further propose an efficient online random masking strategy to curate training pairs for diffusion fine-tuning, replacing the commonly used, but inefficient, leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360 and OmniObject3D, show WaveletGaussian achieves competitive rendering quality while substantially reducing training time.

[142] BLENDER: Blended Text Embeddings and Diffusion Residuals for Intra-Class Image Synthesis in Deep Metric Learning

Jan Niklas Kolf, Ozan Tezcan, Justin Theiss, Hyung Jun Kim, Wentao Bao, Bhargav Bhushanam, Khushi Gupta, Arun Kejariwal, Naser Damer, Fadi Boutros

Main category: cs.CV

TL;DR: BLenDeR is a diffusion sampling method that uses set-theory operations on denoising residuals to generate diverse synthetic data for Deep Metric Learning, improving performance on standard benchmarks.

DetailsMotivation: Deep Generative Models can create synthetic data to augment authentic data in Deep Metric Learning, but existing approaches lack controllability in generating diverse attribute combinations within classes to enhance intra-class diversity.

Method: BLenDeR leverages set-theory inspired union and intersection operations on denoising residuals from diffusion models. Union encourages any attribute present across multiple prompts, while intersection extracts common directions using principal component analysis, enabling controlled synthesis of diverse attribute combinations.
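
The sketch below gives one concrete reading of the two operations (not the paper's implementation): union keeps, per element, the strongest residual across prompts, while intersection takes the leading principal component of the residuals as the shared direction.

```python
import torch

def union_residuals(residuals):
    # residuals: (P, *shape), one denoising residual per prompt
    flat = residuals.flatten(1)
    idx = flat.abs().argmax(dim=0, keepdim=True)       # strongest prompt per dim
    return flat.gather(0, idx).view(residuals.shape[1:])

def intersection_residuals(residuals):
    flat = residuals.flatten(1)                         # (P, N)
    # leading principal component as the common direction across prompts
    _, _, vh = torch.linalg.svd(flat - flat.mean(0), full_matrices=False)
    direction = vh[0]
    coef = (flat @ direction).mean()                    # average projection
    return (coef * direction).view(residuals.shape[1:])
```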

Result: BLenDeR consistently outperforms state-of-the-art baselines across multiple datasets and backbones, achieving 3.7% increase in Recall@1 on CUB-200 and 1.8% increase on Cars-196 under standard experimental settings.

Conclusion: BLenDeR effectively addresses limitations of existing generative approaches by providing controllable synthesis of diverse intra-class samples, leading to improved Deep Metric Learning performance through enhanced data augmentation.

Abstract: The rise of Deep Generative Models (DGM) has enabled the generation of high-quality synthetic data. When used to augment authentic data in Deep Metric Learning (DML), these synthetic samples enhance intra-class diversity and improve the performance of downstream DML tasks. We introduce BLenDeR, a diffusion sampling method designed to increase intra-class diversity for DML in a controllable way by leveraging set-theory inspired union and intersection operations on denoising residuals. The union operation encourages any attribute present across multiple prompts, while the intersection extracts the common direction through a principal component surrogate. These operations enable controlled synthesis of diverse attribute combinations within each class, addressing key limitations of existing generative approaches. Experiments on standard DML benchmarks demonstrate that BLenDeR consistently outperforms state-of-the-art baselines across multiple datasets and backbones. Specifically, BLenDeR achieves a 3.7% increase in Recall@1 on CUB-200 and a 1.8% increase on Cars-196, compared to state-of-the-art baselines under standard experimental settings.

[143] Reversible Efficient Diffusion for Image Fusion

Xingxin Xu, Bing Cao, DongDong Li, Qinghua Hu, Pengfei Zhu

Main category: cs.CV

TL;DR: RED model: A reversible efficient diffusion framework for multi-modal image fusion that combines diffusion model generative power with explicit supervision while avoiding distribution estimation challenges.

DetailsMotivation: Current diffusion models for image fusion suffer from detail loss due to noise error accumulation in Markov processes, and explicit supervision in end-to-end training faces computational efficiency challenges.

Method: Proposes Reversible Efficient Diffusion (RED) model - an explicitly supervised training framework that inherits diffusion model generative capabilities while avoiding distribution estimation.

Result: The framework aims to preserve fine details and maintain high visual fidelity in fused images by addressing the limitations of traditional diffusion approaches.

Conclusion: RED provides an efficient solution for multi-modal image fusion that combines the strengths of diffusion models with explicit supervision while overcoming computational and detail-preservation challenges.

Abstract: Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. The fused image is expected to preserve fine details and maintain high visual fidelity. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. However, incorporating explicit supervision into end-to-end training of diffusion-based image fusion introduces challenges related to computational efficiency. To address these limitations, we propose the Reversible Efficient Diffusion (RED) model - an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding distribution estimation.

[144] Hallucination Begins Where Saliency Drops

Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao, Jiawei Cao, Feilong Tang, Sinan Fan, Yaomin Shen, Chen Shen, Hao Tang

Main category: cs.CV

TL;DR: LVLMs-Saliency is a gradient-aware diagnostic framework that fuses attention weights with input gradients to detect hallucinations in large vision-language models, revealing that hallucinations occur when preceding tokens have low saliency for next-token prediction.

DetailsMotivation: Existing approaches for detecting hallucinations in LVLMs rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network, limiting their ability to reliably distinguish hallucinated from factually grounded outputs.

Method: Introduces LVLMs-Saliency framework that quantifies visual grounding strength by fusing attention weights with input gradients. Proposes two inference-time mechanisms: Saliency-Guided Rejection Sampling (SGRS) to filter candidate tokens during decoding, and Local Coherence Reinforcement (LocoRE) to strengthen attention to recent predecessors and counteract contextual forgetting.
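
A hedged sketch of the rejection step, assuming per-candidate saliency scores have already been computed by fusing attention weights with input gradients; the fusion rule and adaptive threshold below are illustrative choices.

```python
import torch

def fuse_saliency(attn_to_ctx, grads_to_ctx):
    # attn_to_ctx, grads_to_ctx: (V, T) attention / input gradients of each
    # candidate next token toward the T preceding context tokens (assumed given)
    return (attn_to_ctx * grads_to_ctx.abs()).sum(dim=-1)       # (V,)

def sgrs_sample(logits, saliency_scores, scale=0.5):
    """Reject candidates whose context saliency falls below an adaptive cutoff."""
    thr = scale * saliency_scores.mean()           # context-adaptive threshold
    gated = logits.masked_fill(saliency_scores < thr, float("-inf"))
    if torch.isinf(gated).all():                   # nothing survives: fall back
        gated = logits
    return torch.multinomial(torch.softmax(gated, dim=-1), 1).item()
```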

Result: Extensive experiments across multiple LVLMs demonstrate significant reduction in hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability.

Conclusion: The proposed gradient-aware framework effectively detects and mitigates hallucinations in LVLMs by addressing the limitation of existing attention-only approaches, providing both diagnostic insights and practical solutions for improving model reliability.

Abstract: Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: https://github.com/zhangbaijin/LVLMs-Saliency

[145] A Source-Free Approach for Domain Adaptation via Multiview Image Transformation and Latent Space Consistency

Debopom Sutradhar, Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Reem E. Mohamed, Sami Azam

Main category: cs.CV

TL;DR: A novel source-free domain adaptation method using multiview augmentation and latent space consistency to learn domain-invariant features directly from target domain without needing source data or complex alignment.

DetailsMotivation: Existing domain adaptation methods require source domain data access, adversarial training, or complex pseudo-labeling techniques, which are computationally expensive. There's a need for more efficient approaches that can adapt without source data.

Method: Uses multiview augmentation and latent space consistency techniques to learn domain-invariant features directly from target domain. Enforces consistency between multiple augmented views in latent space, eliminates need for source-target alignment or pseudo-label refinement. Includes ConvNeXt-based encoder and loss function combining classification and consistency objectives.
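
A compact sketch of the combined objective under assumed components (in a source-free setting the labels would typically be pseudo-labels from the source-pretrained model): latents of augmented views are pulled toward an anchor view while a classification term is optimized.

```python
import torch
import torch.nn.functional as F

def multiview_consistency_loss(encoder, classifier, views, labels, lam=1.0):
    # views: list of >= 2 augmented tensors of the same target-domain batch
    latents = [encoder(v) for v in views]          # one latent per view
    anchor = latents[0]
    consistency = sum(F.mse_loss(z, anchor)
                      for z in latents[1:]) / (len(latents) - 1)
    cls = F.cross_entropy(classifier(anchor), labels)
    return cls + lam * consistency                 # lam is an assumed weight
```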

Result: Achieves average classification accuracy of 90.72% on Office-31, 84% on Office-Home, and 97.12% on Office-Caltech datasets. Improves existing methods by +1.23%, +7.26%, and +1.77% on respective datasets.

Conclusion: The proposed source-free domain adaptation method effectively learns transferable representations directly from target domain using multiview consistency, outperforming existing approaches while being more computationally efficient.

Abstract: Domain adaptation (DA) addresses the challenge of transferring knowledge from a source domain to a target domain where image data distributions may differ. Existing DA methods often require access to source domain data, adversarial training, or complex pseudo-labeling techniques, which are computationally expensive. To address these challenges, this paper introduces a novel source-free domain adaptation method. It is the first approach to use multiview augmentation and latent space consistency techniques to learn domain-invariant features directly from the target domain. Our method eliminates the need for source-target alignment or pseudo-label refinement by learning transferable representations solely from the target domain by enforcing consistency between multiple augmented views in the latent space. Additionally, the method ensures consistency in the learned features by generating multiple augmented views of target domain data and minimizing the distance between their feature representations in the latent space. We also introduce a ConvNeXt-based encoder and design a loss function that combines classification and consistency objectives to drive effective adaptation directly from the target domain. The proposed model achieves an average classification accuracy of 90.72%, 84%, and 97.12% in Office-31, Office-Home and Office-Caltech datasets, respectively. Further evaluations confirm that our study improves existing methods by an average classification accuracy increment of +1.23%, +7.26%, and +1.77% on the respective datasets.

[146] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu

Main category: cs.CV

TL;DR: REST is a real-time streaming audio-driven talking head generation framework using diffusion models with compact latent space, ID-Context Cache, and Asynchronous Streaming Distillation to achieve high-speed, consistent generation.

DetailsMotivation: Current diffusion-based talking head generation models suffer from slow inference speeds and non-autoregressive paradigms that limit real-time applications, creating a need for efficient streaming solutions.

Method: 1) Learn compact video latent space via spatiotemporal VAE with high compression; 2) ID-Context Cache mechanism for identity consistency and temporal coherence in streaming; 3) Asynchronous Streaming Distillation to reduce error accumulation using teacher-student framework.
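
The ID-Context Cache is described as integrating ID-Sink and Context-Cache principles into key-value caching. A hedged sketch of one plausible retention policy under that description (the function name and the exact rule are our assumptions): keep the leading identity entries plus a rolling window of recent context.

```python
# Hedged sketch of one plausible ID-Context cache policy: keep the leading
# identity ("ID-Sink") entries plus a rolling window of recent context.
import torch

def id_context_trim(keys, values, n_id, window):
    # keys/values: (seq, dim) cached entries in generation order
    if keys.shape[0] <= n_id + window:
        return keys, values
    k = torch.cat([keys[:n_id], keys[-window:]], dim=0)
    v = torch.cat([values[:n_id], values[-window:]], dim=0)
    return k, v

k, v = id_context_trim(torch.randn(100, 64), torch.randn(100, 64),
                       n_id=4, window=16)   # cache stays at 20 entries
```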

Result: REST outperforms state-of-the-art methods in both generation speed and overall performance, achieving real-time streaming capability while maintaining quality.

Conclusion: REST bridges autoregressive and diffusion approaches, enabling efficient real-time talking head generation with breakthrough speed-performance trade-off.

Abstract: Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in streaming generation, leveraging a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

[147] Spatiotemporal Semantic V2X Framework for Cooperative Collision Prediction

Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy

Main category: cs.CV

TL;DR: Semantic V2X framework uses V-JEPA to generate future frame embeddings at RSUs, transmitted to vehicles for lightweight collision prediction, reducing bandwidth by 10,000x while improving F1-score by 10%.

DetailsMotivation: ITS needs real-time collision prediction but conventional approaches transmitting raw video/sensor data are impractical due to bandwidth and latency constraints in vehicular communications.

Method: RSU-mounted cameras use Video Joint Embedding Predictive Architecture (V-JEPA) to generate spatiotemporal semantic embeddings of future frames. These embeddings are transmitted via V2X to vehicles, where a lightweight attentive probe and classifier decode them for collision prediction.
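
A minimal sketch of what a "lightweight attentive probe and classifier" over received embeddings could look like, assuming a single learned query that pools the V-JEPA tokens; the `AttentiveProbe` name and all layer sizes are illustrative, not the paper's implementation.

```python
# Illustrative "attentive probe": a single learned query pools the received
# V-JEPA embedding tokens, then a linear head classifies collision vs. safe.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tokens):                  # tokens: (batch, n_tokens, dim)
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return self.head(pooled.squeeze(1))     # (batch, n_classes)

logits = AttentiveProbe(dim=64)(torch.randn(2, 16, 64))
```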

Result: The framework achieves 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude (10,000x) compared to raw video transmission.

Conclusion: Semantic V2X communication enables cooperative, real-time collision prediction in ITS by transmitting only task-relevant semantic embeddings instead of raw data, balancing accuracy with communication efficiency.

Abstract: Intelligent Transportation Systems (ITS) demand real-time collision prediction to ensure road safety and reduce accident severity. Conventional approaches rely on transmitting raw video or high-dimensional sensory data from roadside units (RSUs) to vehicles, which is impractical under vehicular communication bandwidth and latency constraints. In this work, we propose a semantic V2X framework in which RSU-mounted cameras generate spatiotemporal semantic embeddings of future frames using the Video Joint Embedding Predictive Architecture (V-JEPA). To evaluate the system, we construct a digital twin of an urban traffic environment enabling the generation of diverse traffic scenarios with both safe and collision events. These embeddings of the future frame, extracted from V-JEPA, capture task-relevant traffic dynamics and are transmitted via V2X links to vehicles, where a lightweight attentive probe and classifier decode them to predict imminent collisions. By transmitting only semantic embeddings instead of raw frames, the proposed system significantly reduces communication overhead while maintaining predictive accuracy. Experimental results demonstrate that the framework with an appropriate processing method achieves a 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude compared to raw video. This validates the potential of semantic V2X communication to enable cooperative, real-time collision prediction in ITS.

[148] Artifact-Aware Evaluation for High-Quality Video Generation

Chen Zhu, Jiashu Zhu, Yanxun Li, Meiqi Wu, Bingze Song, Chubin Chen, Jiahong Wu, Xiangxiang Chu, Yangang Wang

Main category: cs.CV

TL;DR: This paper introduces a comprehensive evaluation protocol for video generation focusing on Appearance, Motion, and Camera artifacts, with a taxonomy of 10 artifact categories, a large-scale dataset (GenVID), and a detection framework (DVAR).

DetailsMotivation: Existing video generation evaluation approaches only provide coarse quality scores without detailed localization and categorization of specific artifacts, making it difficult to identify and address specific generative failures.

Method: The paper introduces: 1) A comprehensive evaluation protocol focusing on Appearance, Motion, and Camera aspects; 2) A taxonomy of 10 prevalent artifact categories; 3) GenVID dataset (80k videos from various SOTA models with careful annotations); 4) DVAR framework for dense video artifact recognition.

Result: Extensive experiments show the approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content compared to existing methods.

Conclusion: The proposed comprehensive evaluation protocol, taxonomy, dataset, and framework provide a robust solution for fine-grained identification and classification of generative artifacts in video generation, addressing limitations of existing coarse evaluation methods.

Abstract: With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.

[149] Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

Jialuo He, Huangxun Chen

Main category: cs.CV

TL;DR: C-SAM integrates sharpness-aware minimization with pruning by perturbing pruning masks instead of parameters, enabling discovery of pruning patterns that maintain both model compactness and robustness.

DetailsMotivation: Current approaches either prune SAM-trained models (undermining robustness due to structural changes) or apply SAM after pruning (constrained by early pruning patterns). There's a need to jointly optimize for compactness and robustness in on-device DNN deployments.

Method: C-SAM shifts sharpness-aware learning from parameter perturbations to mask perturbations. It explicitly perturbs pruning masks during training to promote a flatter loss landscape with respect to model structure, enabling discovery of pruning patterns that optimize both compactness and robustness.
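
A hedged sketch of the two-step structure this describes, applied to a soft pruning mask rather than the weights: ascend the loss with respect to the mask, then evaluate (and later backprop) at the perturbed mask. The function name `csam_step`, the soft-mask relaxation, and the perturbation radius `rho` are our assumptions.

```python
# Hedged sketch: a SAM-style two-step over a soft pruning mask rather than
# the weights; rho, the soft-mask relaxation, and all names are assumptions.
import torch

def csam_step(model_fn, params, mask, batch, rho=0.05):
    mask = mask.detach().requires_grad_(True)
    loss = model_fn(params * mask, batch)           # loss at the current mask
    g, = torch.autograd.grad(loss, mask)
    eps = rho * g / (g.norm() + 1e-12)              # worst-case mask perturbation
    return model_fn(params * (mask + eps).clamp(0, 1), batch)

params = torch.randn(10, requires_grad=True)
mask = (torch.rand(10) > 0.3).float()               # current pruning pattern
model_fn = lambda w, x: ((x @ w) ** 2).mean()       # toy model + loss
csam_step(model_fn, params, mask, torch.randn(4, 10)).backward()
```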

Result: Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show C-SAM consistently achieves higher certified robustness than strong baselines (up to 42% improvement) while maintaining task accuracy comparable to unpruned models.

Conclusion: C-SAM successfully bridges the gap between model compression and robustness by shifting the focus from parameter perturbations to mask perturbations, enabling joint optimization of compactness and robustness for on-device DNN deployments.

Abstract: Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.

[150] Bridging the Applicator Gap with Data-Doping: Dual-Domain Learning for Precise Bladder Segmentation in CT-Guided Brachytherapy

Suresh Das, Siladittya Manna, Sayantari Ghosh

Main category: cs.CV

TL;DR: A dual-domain learning strategy that mixes CT scans with and without brachytherapy applicators improves bladder segmentation under covariate shift: doping only 10-30% with-applicator (WA) data into the training set matches the performance of training exclusively on WA data.

DetailsMotivation: Covariate shift causes performance degradation in medical image segmentation, especially for scarce CT scans with brachytherapy applicators (WA) that show anatomical deformation and artifacts. While CT scans without applicators (NA) are widely available, they fail to capture WA characteristics.

Method: Proposed dual domain learning strategy integrating NA and WA CT data. Systematic experiments with multiple deep learning architectures across axial, coronal, and sagittal planes, testing various proportions of WA data in predominantly NA training sets.
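
The training-set construction reduces to controlled dataset mixing; a minimal sketch, where `wa_fraction` denotes the target share of with-applicator scans in the final set (the function name and sampling scheme are ours).

```python
# Minimal sketch of "data doping": a predominantly no-applicator (NA) training
# set with a modest with-applicator (WA) fraction; names/sampling are ours.
import random

def doped_training_set(na_cases, wa_cases, wa_fraction=0.2, seed=0):
    rng = random.Random(seed)
    n_wa = int(round(wa_fraction * len(na_cases) / (1.0 - wa_fraction)))
    doped = list(na_cases) + rng.sample(list(wa_cases), min(n_wa, len(wa_cases)))
    rng.shuffle(doped)
    return doped      # WA share ~= wa_fraction, capped by available WA scans

train = doped_training_set(range(900), range(300), wa_fraction=0.2)
```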

Result: NA data alone fails to capture WA characteristics, but adding modest WA data (10-30%) achieves segmentation performance comparable to models trained exclusively on WA data. Achieves Dice similarity coefficients up to 0.94 and IoU scores up to 0.92.

Conclusion: Integrating anatomically similar but distribution-shifted datasets effectively overcomes data scarcity and enhances deep learning segmentation for brachytherapy treatment planning, demonstrating practical domain adaptation with limited target domain data.

Abstract: Performance degradation due to covariate shift remains a major challenge for deep learning models in medical image segmentation. An open question is whether samples from a shifted distribution can effectively support learning when combined with limited target domain data. We investigate this problem in the context of bladder segmentation in CT guided gynecological brachytherapy, a critical task for accurate dose optimization and organ at risk sparing. While CT scans without brachytherapy applicators (no applicator: NA) are widely available, scans with applicators inserted (with applicator: WA) are scarce and exhibit substantial anatomical deformation and imaging artifacts, making automated segmentation particularly difficult. We propose a dual domain learning strategy that integrates NA and WA CT data to improve robustness and generalizability under covariate shift. Using a curated assorted dataset, we show that NA data alone fail to capture the anatomical and artifact related characteristics of WA images. However, introducing a modest proportion of WA data into a predominantly NA training set leads to significant performance improvements. Through systematic experiments across axial, coronal, and sagittal planes using multiple deep learning architectures, we demonstrate that doping only 10 to 30 percent WA data achieves segmentation performance comparable to models trained exclusively on WA data. The proposed approach attains Dice similarity coefficients of up to 0.94 and Intersection over Union scores of up to 0.92, indicating effective domain adaptation and improved clinical reliability. This study highlights the value of integrating anatomically similar but distribution shifted datasets to overcome data scarcity and enhance deep learning based segmentation for brachytherapy treatment planning.

[151] Physically Guided Visual Mass Estimation from a Single RGB Image

Sungjae Lee, Junhan Jeong, Yeonjoo Hong, Kwang In Kim

Main category: cs.CV

TL;DR: A physically structured framework for estimating object mass from single RGB images by combining geometry (via depth estimation) and material semantics (via vision-language models) to address the ambiguity between volume and density.

DetailsMotivation: Mass estimation from visual input is ill-posed because mass depends on both geometric volume and material density, neither of which is directly observable from RGB appearance. This ambiguity requires physically meaningful representations to constrain plausible solutions.

Method: From single RGB images: 1) Recover 3D geometry via monocular depth estimation for volume information, 2) Extract coarse material semantics using a vision-language model for density reasoning, 3) Fuse geometry, semantic, and appearance representations through instance-adaptive gating, 4) Predict two physically guided latent factors (volume- and density-related) via separate regression heads using only mass supervision.
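
A minimal sketch of the two-head design under mass-only supervision, with the instance-adaptive gate reduced to a softmax over the three modality features; the `MassHead` name and all shapes are illustrative assumptions, not the paper's architecture.

```python
# Illustrative two-head design: gated fusion of geometry / material /
# appearance features; mass = volume * density under mass-only supervision.
import torch
import torch.nn as nn

class MassHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.vol_head = nn.Linear(dim, 1)   # volume-related latent factor
        self.den_head = nn.Linear(dim, 1)   # density-related latent factor

    def forward(self, geo, mat, app):       # geometry / material / appearance
        w = self.gate(torch.cat([geo, mat, app], dim=-1))
        fused = w[:, :1] * geo + w[:, 1:2] * mat + w[:, 2:] * app
        # mass = volume * density, i.e. log-mass = log-volume + log-density
        return (self.vol_head(fused) + self.den_head(fused)).exp()

mass = MassHead(32)(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32))
```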

Result: The proposed method consistently outperforms state-of-the-art methods on image2mass and ABO-500 datasets.

Conclusion: A physically structured approach that aligns visual cues with physical factors governing mass (volume and density) provides an effective framework for single-image mass estimation, addressing the inherent ambiguity in this ill-posed problem.

Abstract: Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.

[152] Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction

Genyuan Zhang, Zihao Wang, Zhifan Gao, Lei Xu, Zhen Zhou, Haijun Yu, Jianjia Zhang, Xiujian Liu, Weiwei Zhang, Shaoyu Wang, Huazhu Fu, Fenglin Liu, Weiwen Wu

Main category: cs.CV

TL;DR: SLDM uses structure constraints and language guidance to generate normal-dose CT images from low-dose contrast media, reducing ICM overdose risks while maintaining diagnostic quality.

DetailsMotivation: Iodinated contrast media (ICM) overdose in CT scans can cause kidney damage and allergic reactions. Existing deep learning methods struggle with accurate enhancement using incompletely paired images due to limited structural recognition capabilities.

Method: Proposes Structure-constrained Language-informed Diffusion Model (SLDM) with three key components: 1) Structural prior extraction to constrain model inference and ensure structural consistency, 2) Semantic supervision with spatial intelligence integrating visual perception and spatial reasoning, 3) Subtraction angiography enhancement module to improve contrast in ICM regions for optimal observation.

Result: Qualitative visual comparisons and quantitative metrics demonstrate SLDM’s effectiveness in angiographic reconstruction for low-dose contrast medium CT angiography, achieving accurate enhancement while reducing required ICM dose.

Conclusion: SLDM provides a unified medical generation model that successfully addresses limitations of existing methods by integrating structural synergy and spatial intelligence, enabling accurate CT enhancement from low-dose contrast media while maintaining diagnostic power and reducing overdose risks.

Abstract: The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to realize accurate enhancement with incompletely paired images, mainly because of the limited ability of the model to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, thus ensuring structural consistency in the enhancement process. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, which integrates the functions of visual perception and spatial reasoning, thus prompting the model to achieve accurate enhancement. Finally, the subtraction angiography enhancement module is applied, which serves to improve the contrast of the ICM agent region to a suitable interval for observation. Qualitative visual comparisons and quantitative results on several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast medium CT angiography.

[153] TPGDiff: Hierarchical Triple-Prior Guided Diffusion for Image Restoration

Yanjie Tu, Qingsen Yan, Axi Niu, Jiacong Tang

Main category: cs.CV

TL;DR: TPGDiff is a unified image restoration model using triple priors (degradation, structural, semantic) in diffusion models to handle diverse degradations while preserving spatial structures.

DetailsMotivation: Existing unified restoration models struggle with severe degradations and often disrupt spatial structures when integrating semantic information into shallow diffusion layers, causing blurring artifacts.

Method: Triple-Prior Guided Diffusion (TPGDiff) network that incorporates: 1) degradation priors throughout diffusion trajectory, 2) structural priors in shallow layers using multi-source structural cues, and 3) semantic priors in deep layers using distillation-driven semantic extractor.

Result: Superior performance and generalization across diverse restoration scenarios on both single- and multi-degradation benchmarks.

Conclusion: TPGDiff effectively addresses diverse image restoration tasks through hierarchical and complementary prior guidance, overcoming limitations of existing unified restoration models.

Abstract: All-in-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow-layer representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios. Our project page is: https://leoyjtu.github.io/tpgdiff-project.

[154] OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

Main category: cs.CV

TL;DR: OSDEnhancer is the first one-step diffusion framework for real-world space-time video super-resolution, achieving both spatial enhancement and temporal upsampling with coherent dynamics.

DetailsMotivation: Diffusion models excel at video super-resolution but remain largely underexplored for space-time video super-resolution (STVSR), which requires both spatial upscaling and temporal frame-rate improvement. Existing methods struggle with real-world complex degradations and need better reconstruction fidelity and temporal consistency.

Method: Proposes OSDEnhancer with: 1) Linear pre-interpolation for initial spatiotemporal structures, 2) Temporal Refinement and Spatial Enhancement Mixture of Experts (TR-SE MoE) with distinct expert pathways for progressive learning, 3) Bidirectional deformable VAE decoder for recurrent spatiotemporal aggregation and propagation.

Result: Achieves state-of-the-art performance with superior generalization capability in real-world scenarios, demonstrating effective STVSR through efficient one-step diffusion.

Conclusion: OSDEnhancer successfully addresses real-world STVSR challenges by combining one-step diffusion with specialized expert pathways and spatiotemporal aggregation, offering robust performance for complex unknown degradations.

Abstract: Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR), showcasing a powerful capacity for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic visual content from low-resolution to high-resolution but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simplified degradation assumptions, which often struggle in real-world scenarios with complex unknown degradations. Such a high demand for reconstruction fidelity and temporal consistency makes the development of a robust STVSR framework particularly non-trivial. To address these challenges, we propose OSDEnhancer, a novel framework that, to the best of our knowledge, represents the first method to achieve real-world STVSR through an efficient one-step diffusion process. OSDEnhancer initializes essential spatiotemporal structures through a linear pre-interpolation strategy and pivots on training temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), which allows distinct expert pathways to progressively learn robust, specialized representations for temporal coherence and spatial detail, further collaboratively reinforcing each other during inference. A bidirectional deformable variational autoencoder (VAE) decoder is further introduced to perform recurrent spatiotemporal aggregation and propagation, enhancing cross-frame reconstruction fidelity. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior generalization capability in real-world scenarios.

[155] CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

Jiyuan Xu, Wenyu Zhang, Xin Jing, Shuai Chen, Shuai Zhang, Jiahao Nie

Main category: cs.CV

TL;DR: CPiRi is a channel permutation invariant framework for multivariate time series forecasting that learns cross-channel structure from data rather than memorizing fixed channel ordering, enabling deployment without retraining when channels are added or reordered.

DetailsMotivation: Current methods have limitations: channel-dependent models overfit to channel ordering and can't adapt to channel changes, while channel-independent models ignore inter-channel dependencies and limit performance. There's a need for models that can handle structural and distributional co-drift without retraining.

Method: CPiRi combines spatio-temporal decoupling architecture with permutation-invariant regularization: frozen pretrained temporal encoder extracts temporal features, lightweight spatial module learns content-driven inter-channel relations, and channel shuffling enforces permutation invariance during training.
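
A minimal sketch of the permutation-invariant regularization: apply one random channel permutation per training step, identically to inputs and targets, so the model cannot memorize a fixed channel ordering. The function name and loss choice are ours; the paper's exact shuffling schedule may differ.

```python
# Minimal sketch of channel-shuffling regularization for permutation
# invariance; one random permutation per step, shared by inputs and targets.
import torch
import torch.nn.functional as F

def channel_shuffle_step(model, x, y):
    perm = torch.randperm(x.shape[-1])      # x, y: (batch, time, channels)
    return F.mse_loss(model(x[..., perm]), y[..., perm])

x, y = torch.randn(8, 96, 7), torch.randn(8, 24, 7)   # history, horizon
model = lambda inp: inp[:, -24:, :]                    # toy "forecaster"
loss = channel_shuffle_step(model, x, y)
```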

Result: State-of-the-art results on multiple benchmarks, stable performance when channel orders are shuffled, strong inductive generalization to unseen channels (even when trained on only half of channels), and practical efficiency on large-scale datasets.

Conclusion: CPiRi addresses limitations of both channel-dependent and channel-independent models by learning cross-channel structure from data rather than memorizing ordering, enabling robust deployment in dynamic settings with channel changes without requiring retraining.

Abstract: Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose CPiRi, a channel permutation invariant (CPI) framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples a spatio-temporal decoupling architecture with a permutation-invariant regularization training strategy: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, while a channel shuffling strategy enforces CPI during training. We further ground CPiRi in theory by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong inductive generalization to unseen channels even when trained on only half of the channels, while maintaining practical efficiency on large-scale datasets. The source code is released at https://github.com/JasonStraka/CPiRi.

[156] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan, Yisong Chen, Guoping Wang

Main category: cs.CV

TL;DR: GVGS improves 3D Gaussian Splatting surface reconstruction by combining visibility-aware multi-view geometric consistency and progressive quadtree-calibrated monocular depth constraints.

DetailsMotivation: 3D Gaussian Splatting has efficient optimization and rendering but struggles with accurate surface reconstruction. Existing methods using multi-view geometric consistency fail with large geometric discrepancies, while monocular depth priors suffer from scale ambiguity and local inconsistency.

Method: Two key innovations: 1) Gaussian visibility-aware multi-view geometric consistency constraint that aggregates visibility of shared Gaussian primitives across views for more accurate geometric supervision. 2) Progressive quadtree-calibrated monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales to mitigate scale ambiguity while preserving details.
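
A hedged sketch of the coarse-to-fine block-wise affine calibration: at each quadtree level, fit `ref ≈ a * mono + b` per block by least squares. The reference-depth source, the number of levels, and the function name are our assumptions.

```python
# Hedged sketch: coarse-to-fine block-wise affine calibration of a monocular
# depth map against a reference (e.g., rendered Gaussian depth).
import numpy as np

def quadtree_affine_calibrate(mono, ref, levels=3):
    out = mono.astype(np.float64).copy()
    h, w = out.shape
    for lvl in range(levels):
        n = 2 ** lvl                           # n x n blocks at this level
        for i in range(n):
            for j in range(n):
                ys = slice(i * h // n, (i + 1) * h // n)
                xs = slice(j * w // n, (j + 1) * w // n)
                m, r = out[ys, xs].ravel(), ref[ys, xs].ravel()
                A = np.stack([m, np.ones_like(m)], axis=1)
                (a, b), *_ = np.linalg.lstsq(A, r, rcond=None)
                out[ys, xs] = a * out[ys, xs] + b   # per-block scale + shift
    return out

mono = np.random.rand(64, 64)
ref = 2.0 * mono + 0.5
print(np.abs(quadtree_affine_calibrate(mono, ref) - ref).max())  # ~0
```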

Result: Extensive experiments on DTU and TNT datasets show consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods.

Conclusion: The proposed GVGS method effectively addresses limitations of existing approaches by combining visibility-aware multi-view constraints with calibrated monocular priors, achieving superior surface reconstruction quality for 3D Gaussian Splatting.

Abstract: 3D Gaussian Splatting enables efficient optimization and high-quality rendering, yet accurate surface reconstruction remains challenging. Prior methods improve surface reconstruction by refining Gaussian depth estimates, either via multi-view geometric consistency or through monocular depth priors. However, multi-view constraints become unreliable under large geometric discrepancies, while monocular priors suffer from scale ambiguity and local inconsistency, ultimately leading to inaccurate Gaussian depth supervision. To address these limitations, we introduce a Gaussian visibility-aware multi-view geometric consistency constraint that aggregates the visibility of shared Gaussian primitives across views, enabling more accurate and stable geometric supervision. In addition, we propose a progressive quadtree-calibrated Monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales, mitigating the scale ambiguity of depth priors while preserving fine-grained surface details. Extensive experiments on DTU and TNT datasets demonstrate consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods. Codes are available at an anonymous repository: https://github.com/GVGScode/GVGS.

[157] Test-Time Adaptation for Anomaly Segmentation via Topology-Aware Optimal Transport Chaining

Ali Zia, Usman Ali, Umer Ramzan, Abdul Rehman, Abdelwahed Khamis, Wei Xiang

Main category: cs.CV

TL;DR: TopoOT integrates topological data analysis with optimal transport for anomaly segmentation, achieving SOTA performance by using persistence diagrams and test-time adaptation.

DetailsMotivation: Traditional threshold-based binarization methods for anomaly segmentation are brittle under distribution shift. Deep topological data analysis can capture structural invariants that persist across scales, making it suitable for characterizing anomalies as disruptions to global structure rather than local fluctuations.

Method: TopoOT combines multi-filtration persistence diagrams with test-time adaptation using Optimal Transport Chaining. This sequentially aligns persistence diagrams across thresholds and filtrations to compute geodesic stability scores. These scores identify features consistently preserved across scales and generate stability-aware pseudo-labels that supervise a lightweight head trained online with OT-consistency and contrastive objectives.
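
A toy version of the chaining step under simplifying assumptions (equal-size diagrams and balanced matching via `scipy.optimize.linear_sum_assignment`, rather than full partial optimal transport between persistence diagrams): match consecutive diagrams and accumulate per-feature displacement, so small totals indicate features stable across scales.

```python
# Toy "OT chaining" between persistence diagrams; balanced matching is a
# simplifying assumption standing in for full partial optimal transport.
import numpy as np
from scipy.optimize import linear_sum_assignment

def chain_stability(diagrams):
    ref = diagrams[0].copy()                 # (n, 2) birth/death points
    cost = np.zeros(len(ref))
    for d in diagrams[1:]:
        C = np.linalg.norm(ref[:, None, :] - d[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(C)
        cost[rows] += C[rows, cols]          # accumulate matched displacement
        ref[rows] = d[cols]                  # carry matched points forward
    return cost                              # low cost = stable across scales

diags = [np.random.rand(6, 2) for _ in range(4)]   # diagrams across filtrations
scores = chain_stability(diags)
stable = scores < np.median(scores)          # stability-aware pseudo-labels
```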

Result: TopoOT achieves state-of-the-art performance across standard 2D and 3D anomaly detection benchmarks, outperforming the most competitive methods by up to +24.1% mean F1 on 2D datasets and +10.2% on 3D anomaly segmentation benchmarks.

Conclusion: The integration of topological data analysis with optimal transport provides a robust framework for anomaly segmentation that maintains performance under domain shift by focusing on structural invariants rather than local features.

Abstract: Deep topological data analysis (TDA) offers a principled framework for capturing structural invariants such as connectivity and cycles that persist across scales, making it a natural fit for anomaly segmentation (AS). Unlike threshold-based binarisation, which produces brittle masks under distribution shift, TDA allows anomalies to be characterised as disruptions to global structure rather than local fluctuations. We introduce TopoOT, a topology-aware optimal transport (OT) framework that integrates multi-filtration persistence diagrams (PDs) with test-time adaptation (TTA). Our key innovation is Optimal Transport Chaining, which sequentially aligns PDs across thresholds and filtrations, yielding geodesic stability scores that identify features consistently preserved across scales. These stability-aware pseudo-labels supervise a lightweight head trained online with OT-consistency and contrastive objectives, ensuring robust adaptation under domain shift. Across standard 2D and 3D anomaly detection benchmarks, TopoOT achieves state-of-the-art performance, outperforming the most competitive methods by up to +24.1% mean F1 on 2D datasets and +10.2% on 3D AS benchmarks.

[158] MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis

Chengying She, Chengwei Chen, Xinran Zhang, Ben Wang, Lizhuang Liu, Chengwei Shao, Yun Bian

Main category: cs.CV

TL;DR: MMSF is a multimodal framework for computational pathology that integrates whole slide images with clinical data using a multitask approach with linear-complexity MIL backbone, achieving significant performance improvements across multiple cancer datasets.

DetailsMotivation: Multimodal evidence is crucial in computational pathology, but integrating heterogeneous signals (gigapixel whole slide images and patient-level clinical descriptors) is challenging due to distinct feature statistics and scales.

Method: MMSF uses a multitask multimodal supervised framework with: 1) graph feature extraction for tissue topology, 2) clinical data embedding module, 3) feature fusion module aligning modality-shared and modality-specific representations, and 4) Mamba-based MIL encoder with multitask prediction heads.

Result: On CAMELYON16 and TCGA-NSCLC: 2.1-6.6% accuracy and 2.2-6.9% AUC improvements. On five TCGA survival cohorts: 7.1-9.8% C-index improvements vs unimodal methods and 5.6-7.1% vs multimodal alternatives.

Conclusion: MMSF effectively integrates multimodal pathology data through explicit decomposition and fusion of cross-modal information, demonstrating superior performance across various cancer prognosis tasks compared to existing methods.

Abstract: Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.

[159] PalmBridge: A Plug-and-Play Feature Alignment Framework for Open-Set Palmprint Verification

Chenke Zhang, Ziyuan Yang, Licheng Yan, Shuyi Li, Andrew Beng Jin Teoh, Bob Zhang, Yi Zhang

Main category: cs.CV

TL;DR: PalmBridge is a plug-and-play feature-space alignment framework using vector quantization to address domain shift in palmprint recognition, improving open-set verification and cross-dataset generalization.

DetailsMotivation: Real-world palmprint recognition performance degrades due to feature distribution shifts from heterogeneous deployment conditions. Current deep models assume closed stationary distributions and overfit to dataset-specific textures rather than learning domain-invariant representations. Data augmentation often fails under significant domain mismatch.

Method: PalmBridge learns a compact set of representative vectors from training features via vector quantization. During enrollment/verification, each feature vector is mapped to its nearest representative vector under minimum-distance criterion, then blended with the original vector. The framework uses joint optimization with task supervision, feature-consistency objective, and orthogonality regularization to form a stable shared embedding space.
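
The enrollment/verification mapping is compact enough to state directly; a minimal sketch, where the blending weight `alpha` is our placeholder for the blending weights the paper analyzes.

```python
# Minimal sketch of PalmBridge's minimum-distance mapping plus blending;
# alpha is our placeholder for the paper's analyzed blending weights.
import torch

def palmbridge_align(feat, codebook, alpha=0.5):
    # feat: (batch, dim); codebook: (K, dim) learned representative vectors
    nearest = codebook[torch.cdist(feat, codebook).argmin(dim=1)]
    return alpha * feat + (1 - alpha) * nearest

aligned = palmbridge_align(torch.randn(4, 128), torch.randn(32, 128))
```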

Result: Experiments on multiple palmprint datasets and backbone architectures show PalmBridge consistently reduces EER in intra-dataset open-set evaluation and improves cross-dataset generalization with negligible to modest runtime overhead.

Conclusion: PalmBridge effectively addresses domain shift in palmprint recognition through feature-space alignment with vector quantization, suppressing nuisance variation while retaining discriminative identity cues, leading to improved open-set verification performance.

Abstract: Palmprint recognition is widely used in biometric systems, yet real-world performance often degrades due to feature distribution shifts caused by heterogeneous deployment conditions. Most deep palmprint models assume a closed and stationary distribution, leading to overfitting to dataset-specific textures rather than learning domain-invariant representations. Although data augmentation is commonly used to mitigate this issue, it assumes augmented samples can approximate the target deployment distribution, an assumption that often fails under significant domain mismatch. To address this limitation, we propose PalmBridge, a plug-and-play feature-space alignment framework for open-set palmprint verification based on vector quantization. Rather than relying solely on data-level augmentation, PalmBridge learns a compact set of representative vectors directly from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion, and the mapped vector is then blended with the original vector. This design suppresses nuisance variation induced by domain shifts while retaining discriminative identity cues. The representative vectors are jointly optimized with the backbone network using task supervision, a feature-consistency objective, and an orthogonality regularization term to form a stable and well-structured shared embedding space. Furthermore, we analyze feature-to-representative mappings via assignment consistency and collision rate to assess the model’s sensitivity to blending weights. Experiments on multiple palmprint datasets and backbone architectures show that PalmBridge consistently reduces EER in intra-dataset open-set evaluation and improves cross-dataset generalization with negligible to modest runtime overhead.

[160] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, Xiangxiang Chu

Main category: cs.CV

TL;DR: SpatialGenEval is a new benchmark for evaluating spatial intelligence in text-to-image models, featuring 1,230 information-dense prompts across 25 scenes with 10 spatial sub-domains, plus a SpatialT2I dataset for fine-tuning that improves model performance.

DetailsMotivation: Current T2I models struggle with complex spatial relationships (perception, reasoning, interaction), and existing benchmarks fail to adequately test these capabilities due to short, information-sparse prompt designs.

Method: Created SpatialGenEval benchmark with 1,230 long, information-dense prompts across 25 real-world scenes, each covering 10 spatial sub-domains with multi-choice Q&A pairs. Also constructed SpatialT2I dataset with 15,400 text-image pairs using rewritten prompts for consistency while preserving information density.

Result: Evaluation of 21 SOTA models shows higher-order spatial reasoning remains a primary bottleneck. Fine-tuning foundation models (Stable Diffusion-XL, Uniworld-V1, OmniGen2) with SpatialT2I data yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations.

Conclusion: SpatialGenEval provides systematic evaluation of spatial intelligence in T2I models, revealing current limitations in spatial reasoning. The SpatialT2I dataset demonstrates a data-centric approach to improve spatial intelligence through fine-tuning, offering a practical pathway for enhancing T2I model capabilities.

Abstract: Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.

[161] CURVE: Learning Causality-Inspired Invariant Representations for Robust Scene Understanding via Uncertainty-Guided Regularization

Yue Liang, Jiatong Du, Ziyi Yang, Yanjun Huang, Hong Chen

Main category: cs.CV

TL;DR: CURVE: A causality-inspired framework using variational uncertainty modeling and uncertainty-guided structural regularization to suppress spurious correlations in scene graphs for better out-of-distribution generalization.

DetailsMotivation: Scene graphs often overfit to spurious correlations, which severely hinders out-of-distribution generalization. Current approaches fail to address environment-specific relations that cause poor generalization under distribution shifts.

Method: CURVE integrates variational uncertainty modeling with uncertainty-guided structural regularization. It uses prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting sparse and domain-stable topology.

Result: The framework was evaluated in zero-shot transfer and low-data sim-to-real adaptation scenarios. CURVE demonstrates ability to learn domain-stable sparse topologies and provides reliable uncertainty estimates for risk prediction under distribution shifts.

Conclusion: CURVE effectively addresses spurious correlation issues in scene graphs through causality-inspired uncertainty modeling, enabling better out-of-distribution generalization and reliable uncertainty estimation for risk-aware applications.

Abstract: Scene graphs provide structured abstractions for scene understanding, yet they often overfit to spurious correlations, severely hindering out-of-distribution generalization. To address this limitation, we propose CURVE, a causality-inspired framework that integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Specifically, we apply prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting a sparse and domain-stable topology. Empirically, we evaluate CURVE in zero-shot transfer and low-data sim-to-real adaptation, verifying its ability to learn domain-stable sparse topologies and provide reliable uncertainty estimates to support risk prediction under distribution shifts.

[162] RAW-Flow: Advancing RGB-to-RAW Image Reconstruction with Deterministic Latent Flow Matching

Zhen Liu, Diedong Feng, Hai Jiang, Liaoyuan Zeng, Hao Wang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu

Main category: cs.CV

TL;DR: RAW-Flow: A flow matching framework for RGB-to-RAW reconstruction that treats it as a deterministic latent transport problem, outperforming existing methods.

DetailsMotivation: Existing learning-based methods treat RGB-to-RAW reconstruction as direct regression, leading to detail inconsistency and color deviation due to the ill-posed nature of inverse ISP and information loss in quantized RGB images.

Method: Reformulates RGB-to-RAW reconstruction as deterministic latent transport using flow matching to learn vector fields in latent space. Includes cross-scale context guidance module for hierarchical RGB feature injection, and dual-domain latent autoencoder with feature alignment constraint for stable training.
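
A minimal sketch of deterministic latent flow matching between paired RGB and RAW latents, using the standard straight-line interpolant with a constant target velocity; the toy `v_theta` network below is an assumption, not the paper's architecture.

```python
# Minimal sketch of deterministic latent flow matching: interpolate between
# the RGB latent (t=0) and RAW latent (t=1); regress the constant velocity.
import torch

def latent_flow_matching_loss(v_theta, z_rgb, z_raw):
    t = torch.rand(z_rgb.shape[0], 1)            # one timestep per sample
    z_t = (1 - t) * z_rgb + t * z_raw            # deterministic interpolant
    target = z_raw - z_rgb                       # constant velocity field
    return ((v_theta(z_t, t) - target) ** 2).mean()

net = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 16))
v_theta = lambda z, t: net(torch.cat([z, t], dim=-1))   # toy velocity net
loss = latent_flow_matching_loss(v_theta, torch.randn(8, 16), torch.randn(8, 16))
```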

Result: Extensive experiments show RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually, achieving accurate reconstruction of structural details and color information.

Conclusion: The generative perspective with flow matching effectively addresses limitations of direct regression approaches, enabling high-fidelity RGB-to-RAW reconstruction through latent space transport.

Abstract: RGB-to-RAW reconstruction, or the reverse modeling of a camera Image Signal Processing (ISP) pipeline, aims to recover high-fidelity RAW data from RGB images. Despite notable progress, existing learning-based methods typically treat this task as a direct regression objective and struggle with detail inconsistency and color deviation, due to the ill-posed nature of inverse ISP and the inherent information loss in quantized RGB images. To address these limitations, we pioneer a generative perspective by reformulating RGB-to-RAW reconstruction as a deterministic latent transport problem and introduce a novel framework named RAW-Flow, which leverages flow matching to learn a deterministic vector field in latent space, to effectively bridge the gap between RGB and RAW representations and enable accurate reconstruction of structural details and color information. To further enhance latent transport, we introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. Moreover, we design a dual-domain latent autoencoder with a feature alignment constraint to support the proposed latent transport framework, which jointly encodes RGB and RAW inputs while promoting stable training and high-fidelity reconstruction. Extensive experiments demonstrate that RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.

[163] Dual-Modality IoT Framework for Integrated Access Control and Environmental Safety Monitoring with Real-Time Cloud Analytics

Abdul Hasib, A. S. M. Ahsanul Sarkar Akib, Nihal Das Ankur, Anish Giri

Main category: cs.CV

TL;DR: A dual-modality IoT framework integrates RFID access control with environmental safety monitoring using ESP32 microcontrollers, achieving high performance at 82% cost reduction compared to commercial solutions.

DetailsMotivation: Traditional physical security and environmental safety systems operate as independent silos, leading to operational inefficiencies, delayed emergency responses, and increased management complexity in smart infrastructure.

Method: A unified cloud architecture with two coordinated subsystems: Subsystem 1 for RFID authentication with servo gate control and Google Sheets logging; Subsystem 2 for safety monitoring with flame detection, water flow measurement, LCD display, and personnel identification. Both use ESP32 microcontrollers for edge processing with intelligent local caching for network disruptions.

Result: 45-day experimental evaluation showed: 99.2% RFID authentication accuracy with 0.82s average response time, 98.5% flame detection reliability within 5m range, 99.8% cloud data logging success rate. Total cost was 5,400 BDT (~$48), representing 82% reduction compared to commercial integrated solutions.

Conclusion: The research establishes a practical framework for synergistic security-safety integration, demonstrating professional-grade performance can be achieved through careful architectural design and component optimization while maintaining exceptional cost-effectiveness and accessibility for diverse applications.

Abstract: The integration of physical security systems with environmental safety monitoring represents a critical advancement in smart infrastructure management. Traditional approaches maintain these systems as independent silos, creating operational inefficiencies, delayed emergency responses, and increased management complexity. This paper presents a comprehensive dual-modality Internet of Things framework that seamlessly integrates RFID-based access control with multi-sensor environmental safety monitoring through a unified cloud architecture. The system comprises two coordinated subsystems: Subsystem 1 implements RFID authentication with servo-actuated gate control and real-time Google Sheets logging, while Subsystem 2 provides comprehensive safety monitoring incorporating flame detection, water flow measurement, LCD status display, and personnel identification. Both subsystems utilize ESP32 microcontrollers for edge processing and wireless connectivity. Experimental evaluation over 45 days demonstrates exceptional performance metrics: 99.2% RFID authentication accuracy with 0.82-second average response time, 98.5% flame detection reliability within 5-meter range, and 99.8% cloud data logging success rate. The system maintains operational integrity during network disruptions through intelligent local caching mechanisms and achieves a total implementation cost of 5,400 BDT (approximately $48), representing an 82% reduction compared to commercial integrated solutions. This research establishes a practical framework for synergistic security-safety integration, demonstrating that professional-grade performance can be achieved through careful architectural design and component optimization while maintaining exceptional cost-effectiveness and accessibility for diverse application scenarios.

[164] RepSFNet : A Single Fusion Network with Structural Reparameterization for Crowd Counting

Mas Nurul Achmadiah, Chi-Chia Sun, Wen-Kai Kuo, Jun-Wei Hsieh

Main category: cs.CV

TL;DR: RepSFNet is a lightweight crowd counting model using reparameterized kernels and feature fusion modules for accurate real-time estimation with reduced computational cost.

DetailsMotivation: Address challenges in crowd counting including scale variations, occlusions, and high computational cost of existing models for real-time applications.

Method: Uses RepLK-ViT backbone with large reparameterized kernels, Feature Fusion module (ASPP+CAN), Concatenate Fusion module, and MSE+Optimal Transport loss training.
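
The reparameterization step itself is standard (RepVGG-style) and worth seeing concretely: parallel branches are folded into one kernel so inference runs a single convolution. A minimal sketch for a kxk plus 1x1 pair, with illustrative layer sizes; the paper's actual branch structure may differ.

```python
# Standard RepVGG-style merge: fold a parallel 1x1 branch into the kxk
# kernel so inference runs a single convolution with identical outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_branches(conv_kxk, conv_1x1):
    k = conv_kxk.kernel_size[0]
    fused = nn.Conv2d(conv_kxk.in_channels, conv_kxk.out_channels, k,
                      padding=k // 2, bias=True)
    w1 = F.pad(conv_1x1.weight.detach(), [k // 2] * 4)  # zero-pad 1x1 to kxk
    fused.weight.data = conv_kxk.weight.detach() + w1
    fused.bias.data = conv_kxk.bias.detach() + conv_1x1.bias.detach()
    return fused

a, b = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(a(x) + b(x), merge_branches(a, b)(x), atol=1e-5)
```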

Result: Achieves competitive accuracy on ShanghaiTech, NWPU, UCF-QNRF datasets while reducing inference latency by up to 34% compared to SOTA methods.

Conclusion: RepSFNet provides accurate, real-time crowd counting suitable for edge computing applications by balancing performance and computational efficiency.

Abstract: Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address these issues, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate and real-time crowd estimation. RepSFNet leverages a RepLK-ViT backbone with large reparameterized kernels for efficient multi-scale feature extraction. It further integrates a Feature Fusion module combining Atrous Spatial Pyramid Pooling (ASPP) and Context-Aware Network (CAN) to achieve robust, density-adaptive context modeling. A Concatenate Fusion module is employed to preserve spatial resolution and generate high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet significantly reduces parameters and computational complexity. The training objective combines Mean Squared Error and Optimal Transport loss to improve both count accuracy and spatial distribution alignment. Experiments conducted on ShanghaiTech, NWPU, and UCF-QNRF datasets demonstrate that RepSFNet achieves competitive accuracy while reducing inference latency by up to 34 percent compared to recent state-of-the-art methods, making it suitable for real-time and low-power edge computing applications.

[165] HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation

Mengge Liu, Yan Di, Gu Wang, Yun Qu, Dekai Zhu, Yanyan Li, Xiangyang Ji

Main category: cs.CV

TL;DR: HINT is an autoregressive diffusion framework for multi-human motion generation that handles variable text lengths and agent counts through hierarchical interaction modeling and sliding-window generation.

Motivation: Existing offline methods for multi-human motion generation have limitations: they generate fixed-length motions with fixed agent counts, cannot handle long/variable text, and struggle with varying numbers of agents. This motivates an autoregressive approach that can generate motions step-by-step.

Method: HINT uses: 1) Disentangled motion representation in canonicalized latent space (decoupling local motion from inter-person interactions), enabling adaptation to varying human counts; 2) Sliding-window strategy for online generation with local within-window and global cross-window conditions to capture past history, inter-person dependencies, and text alignment.

Result: HINT matches strong offline models and surpasses autoregressive baselines. On InterHuman benchmark, achieves FID of 3.100, significantly improving over previous SOTA of 5.154.

Conclusion: HINT is the first autoregressive framework for multi-human motion generation with hierarchical interaction modeling, enabling handling of variable text lengths and agent counts while maintaining long-horizon coherence and fine-grained interactions.

Abstract: Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods, which generate fixed-length motions with a fixed number of agents, are inherently limited in handling long or variable text and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step conditioned on all past trajectories and current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past motion history and inter-person dependencies and to align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire long sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.
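
As a toy illustration of the sliding-window generation loop described above, the sketch below denoises one window at a time, conditioning on the previous window (local condition) and a running cross-window summary (global condition). The denoiser callable, the exponential summary update, and all shapes are illustrative stand-ins, not the paper's architecture.

```python
import torch

def generate_multi_human(denoise_window, text_emb, n_agents, n_windows,
                         win_len=16, latent_dim=64):
    """Generate motion window by window; each window sees the previous one
    (local condition) and a running summary of all windows (global condition)."""
    motions = []
    prev = torch.zeros(n_agents, win_len, latent_dim)   # empty history at start
    global_ctx = torch.zeros(latent_dim)                # cross-window summary
    for _ in range(n_windows):
        noise = torch.randn(n_agents, win_len, latent_dim)
        window = denoise_window(noise, prev, global_ctx, text_emb)
        motions.append(window)
        prev = window                                   # local within-window condition
        global_ctx = 0.9 * global_ctx + 0.1 * window.mean(dim=(0, 1))
    return torch.cat(motions, dim=1)                    # (agents, frames, dim)

# toy stand-in for the actual diffusion denoiser
toy_denoiser = lambda x, prev, g, t: 0.5 * x + 0.3 * prev + g + t
motion = generate_multi_human(toy_denoiser, torch.zeros(64),
                              n_agents=3, n_windows=4)
```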

[166] Let’s Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, Feng Liu

Main category: cs.CV

TL;DR: BiFTA improves CLIP’s zero-shot performance by removing redundant information from both image patches and text descriptions through view and description refinement.

Motivation: Existing fine-grained text-visual alignment methods suffer from redundancy in both image patches and text descriptions, which reduces alignment effectiveness and zero-shot performance.

Method: BiFTA uses two refinement approaches: (1) View refinement removes redundant image patches with high IoU ratios, and (2) Description refinement removes redundant text descriptions with high pairwise cosine similarity.

Result: BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP models.

Conclusion: Removing redundant information from both visual and textual components is necessary for effective fine-grained text-visual alignment and improved zero-shot performance.

Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: View Refinement and Description Refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in visual-text alignment.
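
Both refinement steps reduce to simple greedy filters. Below is a minimal sketch with illustrative thresholds, assuming boxes as (x1, y1, x2, y2) tuples and descriptions as embedding rows; the function names are ours, not BiFTA's.

```python
import torch

def refine_views(boxes, iou_thresh=0.5):
    """Greedily keep patches whose IoU with every kept patch stays below
    the threshold; boxes are (x1, y1, x2, y2) tuples."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def refine_descriptions(text_embs, sim_thresh=0.9):
    """Drop descriptions that are near-duplicates of an already kept one."""
    normed = torch.nn.functional.normalize(text_embs, dim=-1)
    kept = []
    for i in range(normed.shape[0]):
        if all(normed[i] @ normed[j] < sim_thresh for j in kept):
            kept.append(i)
    return text_embs[kept]
```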

[167] Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance

Chenliang Zhou, Fangcheng Zhong, Weihao Xia, Albert Miao, Canberk Baykal, Cengiz Oztireli

Main category: cs.CV

TL;DR: Quartet of Diffusions: A 3D point cloud generation framework using four coordinated diffusion models to explicitly model part composition and symmetry for structured, controllable shape generation.

Motivation: Prior methods treat shape generation holistically or only support part composition, lacking explicit symmetry modeling and fine-grained control over shape attributes while maintaining structural coherence.

Method: Four coordinated diffusion models learn distributions of: 1) global shape latents, 2) symmetries, 3) semantic parts, and 4) their spatial assembly. This structured pipeline disentangles generation into interpretable components with a central global latent for coherence.

Result: Achieves state-of-the-art performance with guaranteed symmetry, coherent part placement, diverse high-quality outputs, and fine-grained control over individual parts while preserving global consistency.

Conclusion: First 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process, enabling structured, controllable shape generation.

Abstract: We introduce the Quartet of Diffusions, a structure-aware point cloud generation framework that explicitly models part composition and symmetry. Unlike prior methods that treat shape generation as a holistic process or only support part composition, our approach leverages four coordinated diffusion models to learn distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. This structured pipeline ensures guaranteed symmetry, coherent part placement, and diverse, high-quality outputs. By disentangling the generative process into interpretable components, our method supports fine-grained control over shape attributes, enabling targeted manipulation of individual parts while preserving global consistency. A central global latent further reinforces structural coherence across assembled parts. Our experiments show that the Quartet achieves state-of-the-art performance. To the best of our knowledge, this is the first 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process.

[168] Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, Shuangyin Liu

Main category: cs.CV

TL;DR: Youtu-Parsing is an efficient document parsing model using Vision Transformer with dynamic-resolution encoder and prompt-guided LLM, featuring high-parallelism decoding strategies for 5-11x speedup while maintaining SOTA performance.

Motivation: To create an efficient and versatile document parsing model that can handle diverse document elements (text, formulas, tables, charts, seals, hierarchical structures) with strong robustness for multilingual and handwritten content, while achieving high performance for large-scale document intelligence applications.

Method: Uses a decoupled framework: a native Vision Transformer (ViT) with a dynamic-resolution visual encoder extracts shared document features, combined with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Introduces high-parallelism decoding with two strategies: token parallelism (generates up to 64 candidate tokens per step with verification) and query parallelism (simultaneous prediction for multiple bounding boxes).

Result: Achieves 5-11x speedup over traditional autoregressive decoding, with additional 2x acceleration from query parallelism while maintaining output quality. Demonstrates state-of-the-art performance on OmniDocBench and olmOCR-bench benchmarks, with strong robustness for rare characters, multilingual text, and handwritten content.

Conclusion: Youtu-Parsing represents a highly efficient and versatile document parsing model with significant experimental value and practical utility for large-scale document intelligence applications, combining high-parallelism decoding strategies with robust performance across diverse document elements and challenging scenarios.

Abstract: This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5–11x speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2x acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing encompasses a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.
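
The token-parallelism strategy resembles speculative draft-and-verify decoding. A toy sketch follows; for clarity the verifier checks drafted tokens one by one, whereas a real implementation would verify all k positions in a single batched forward pass. Both callables and the toy token stream are placeholders, not Youtu-Parsing's implementation.

```python
def parallel_decode(step_fn, draft_fn, prompt, max_len=256, k=64):
    """Draft-and-verify decoding sketch: propose k candidate tokens at once,
    commit the longest verified prefix, and always commit at least one token
    (the full model's own prediction) per iteration."""
    seq = list(prompt)
    while len(seq) < max_len:
        drafted = draft_fn(seq, k)                # k cheap candidate tokens
        committed = []
        for tok in drafted:                       # in practice: one batched forward pass
            model_tok = step_fn(seq + committed)  # full-model prediction at this position
            committed.append(model_tok)
            if tok != model_tok:                  # mismatch: keep the correction, stop
                break
        seq.extend(committed)
    return seq[:max_len]

draft = lambda s, k: [(len(s) + i) % 7 for i in range(k)]  # toy drafter
step = lambda s: len(s) % 7                                # toy "full model"
print(parallel_decode(step, draft, prompt=[0], max_len=20))
```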

[169] MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou

Main category: cs.CV

TL;DR: MARE is a vision-language model approach for explainable deepfake detection that uses multimodal alignment, reinforcement learning from human feedback, and forgery disentanglement to improve accuracy and reliability.

Motivation: Existing deepfake detection methods mainly focus on classification or spatial localization, but rapid advancements in generative models require more sophisticated approaches. There's a need for explainable detection that enhances both accuracy and reliability of vision-language models in deepfake detection and reasoning.

Method: MARE uses multimodal alignment and reinforcement learning with comprehensive reward functions (including RLHF) to generate text-spatially aligned reasoning content. It also introduces a forgery disentanglement module to separate intrinsic forgery traces from high-level facial semantics.

Result: MARE achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating superior accuracy and reliability in deepfake detection and reasoning content generation.

Conclusion: The proposed MARE framework effectively addresses the limitations of existing deepfake detection methods by providing explainable, accurate, and reliable detection through multimodal alignment, reinforcement learning, and forgery disentanglement techniques.

Abstract: Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.

[170] Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu, Xiao Liu, Yuexuan Wang, Mondal Soumik

Main category: cs.CV

TL;DR: A novel AI-generated image detection method that contaminates real images using generators’ final components, achieving 98.83% accuracy across 22 unseen generators with minimal training data.

Motivation: Existing deepfake detectors generalize poorly to images from unseen AI generators. The authors observed that despite different training paradigms, many modern image generators share common final architectural components that convert intermediate representations into images.

Method: Proposes to “contaminate” real images using the generator’s final component, then train a detector to distinguish contaminated images from original real images. Introduces a taxonomy based on generators’ final components and categorizes 21 widely used generators. Uses only 100 samples from each of three representative categories for fine-tuning on DINOv3 backbone.

Result: Achieves average accuracy of 98.83% across 22 testing sets from unseen generators, demonstrating strong generalization capability with minimal training data.

Conclusion: The final component contamination approach effectively captures generator-specific artifacts, enabling robust detection of AI-generated images from unseen generators with high accuracy and minimal training requirements.

Abstract: With the rapid proliferation of powerful image generators, accurate detection of AI-generated images has become essential for maintaining a trustworthy online environment. However, existing deepfake detectors often generalize poorly to images produced by unseen generators. Notably, despite being trained under vastly different paradigms, such as diffusion or autoregressive modeling, many modern image generators share common final architectural components that serve as the last stage for converting intermediate representations into images. Motivated by this insight, we propose to “contaminate” real images using the generator’s final component and train a detector to distinguish them from the original real images. We further introduce a taxonomy based on generators’ final components and categorize 21 widely used generators accordingly, enabling a comprehensive investigation of our method’s generalization capability. Using only 100 samples from each of three representative categories, our detector, fine-tuned on the DINOv3 backbone, achieves an average accuracy of 98.83% across 22 testing sets from unseen generators.
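
The core recipe is easy to state: push real images through a generator's final stage so they acquire its artifacts, then train on real-vs-contaminated pairs. A minimal sketch, where final_component is any callable approximating that last stage (e.g., a VAE encode/decode round-trip) rather than a specific API:

```python
import torch

def contaminate(real_images, final_component):
    """Pass real images through a generator's final stage so they carry its
    low-level artifacts while keeping real semantic content."""
    with torch.no_grad():
        return final_component(real_images)

def make_training_pairs(real_images, final_component):
    """Real images get label 0, contaminated copies label 1; a classifier
    trained on these pairs learns the final component's fingerprint."""
    fake = contaminate(real_images, final_component)
    x = torch.cat([real_images, fake], dim=0)
    y = torch.cat([torch.zeros(len(real_images)), torch.ones(len(fake))])
    return x, y

# e.g. final_component = lambda x: vae.decode(vae.encode(x))  # hypothetical VAE
```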

[171] Efficient Autoregressive Video Diffusion with Dummy Head

Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, Yan Lu

Main category: cs.CV

TL;DR: Dummy Forcing: A method to optimize autoregressive video diffusion models by reducing redundant attention to historical frames, achieving 2.0x speedup with minimal quality loss.

Motivation: The authors identified that multi-head self-attention in autoregressive video diffusion models under-utilizes historical frames, with about 25% of attention heads focusing almost exclusively on the current frame, making their KV caches redundant.

Method: Proposes Dummy Forcing with three components: 1) heterogeneous memory allocation to reduce head-wise context redundancy, 2) dynamic head programming to adaptively classify head types, and 3) context packing for more aggressive cache compression.

Result: Achieves up to 2.0x speedup over baseline without additional training, supporting video generation at 24.3 FPS with less than 0.5% quality drop.

Conclusion: Dummy Forcing effectively optimizes autoregressive video diffusion models by controlling context accessibility across attention heads, significantly improving inference speed with negligible quality degradation.

Abstract: The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
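
A minimal sketch of the head-classification step: measure how much of each head's attention mass lands on current-frame tokens, flag near-exclusive heads as "dummy", and drop their history KV cache. Thresholds, tensor layouts, and function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def classify_dummy_heads(attn, cur_start, thresh=0.9):
    """attn: (heads, q_len, kv_len) attention weights of one layer.
    A head is 'dummy' when nearly all of its mass falls on the current
    frame's tokens (kv index >= cur_start)."""
    cur_mass = attn[..., cur_start:].sum(-1).mean(-1)   # (heads,) mass on current frame
    return cur_mass > thresh

def prune_kv_cache(k_cache, v_cache, dummy_mask, cur_start):
    """Heterogeneous memory allocation sketch: dummy heads keep only the
    current frame's KV entries; the rest keep the full history."""
    pruned = []
    for h, is_dummy in enumerate(dummy_mask):
        start = cur_start if is_dummy else 0
        pruned.append((k_cache[h, start:], v_cache[h, start:]))
    return pruned

attn = torch.softmax(torch.randn(8, 16, 128), dim=-1)   # 8 heads, 128 kv tokens
mask = classify_dummy_heads(attn, cur_start=112)        # last 16 tokens = current frame
```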

[172] Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

Main category: cs.CV

TL;DR: Researchers developed strategies for training deep learning models to segment white matter hyperintensities and ischemic stroke lesions using partially labeled MRI data, finding that pseudolabel methods worked best.

Motivation: White matter hyperintensities and ischemic stroke lesions are visually confounding on FLAIR MRI sequences and often co-occur, making it difficult to develop deep learning models that can accurately segment and differentiate them using fully labeled data.

Method: The study investigated six different training strategies for a combined WMH and ISL segmentation model using partially labeled data. They combined private and public datasets totaling 2052 MRI volumes, with 1341 containing WMH annotations and 1152 containing ISL annotations. The methods focused on leveraging partially labeled data effectively.

Result: Several methods successfully leveraged partially labeled data to improve model performance. The pseudolabel approach yielded the best results for segmenting both white matter hyperintensities and ischemic stroke lesions.

Conclusion: Partially labeled data can be effectively used to train deep learning models for segmenting confounding cerebral small vessel disease features, with pseudolabel methods showing particular promise for improving segmentation accuracy of both WMH and ISL.

Abstract: White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are imaging features associated with cerebral small vessel disease (SVD) that are visible on brain magnetic resonance imaging (MRI) scans. The development and validation of deep learning models to segment and differentiate these features is difficult because they visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence and often appear in the same subject. We investigated six strategies for training a combined WMH and ISL segmentation model using partially labelled data. We combined privately held fully and partially labelled datasets with publicly available partially labelled datasets to yield a total of 2052 MRI volumes, with 1341 and 1152 containing ground truth annotations for WMH and ISL respectively. We found that several methods were able to effectively leverage the partially labelled data to improve model performance, with the use of pseudolabels yielding the best result.
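
A common way to realize the pseudolabel strategy is a per-class loss that substitutes teacher predictions where annotations are missing. The sketch below assumes a two-channel (WMH, ISL) sigmoid head and per-volume "label known" flags; it illustrates the general pattern, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def partial_label_loss(logits, targets, known, teacher_logits=None):
    """logits/targets: (B, 2, H, W) for channels (WMH, ISL); known: (2,) bools
    saying which annotation this volume actually has. Missing channels use
    teacher pseudolabels when available, otherwise they are masked out."""
    loss = 0.0
    for c in range(2):
        if known[c]:
            target = targets[:, c]
        elif teacher_logits is not None:
            target = (torch.sigmoid(teacher_logits[:, c]) > 0.5).float()  # pseudolabel
        else:
            continue                                # no supervision for this channel
        loss = loss + F.binary_cross_entropy_with_logits(logits[:, c], target)
    return loss
```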

[173] Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V

Meiqi Wu, Bingze Song, Ruimin Lin, Chen Zhu, Xiaokun Feng, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang

Main category: cs.CV

TL;DR: The paper introduces Latent Temporal Discrepancy (LTD) as a motion prior to improve video generation quality in dynamic scenarios by weighting losses based on frame-to-frame variation.

Motivation: Current video generation models perform well in static scenarios but degrade in motion-heavy videos due to noise disrupting temporal coherence and difficulty learning dynamic regions. Existing diffusion models use static loss for all scenarios, limiting their ability to capture complex dynamics.

Method: Proposes Latent Temporal Discrepancy (LTD) as a motion prior that measures frame-to-frame variation in latent space. LTD assigns larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions, creating a motion-aware loss weighting strategy.

Result: Extensive experiments on VBench and VMBench show consistent gains, outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.

Conclusion: The LTD motion prior effectively addresses the limitations of static loss approaches in video generation, enabling better reconstruction of high-frequency dynamics and improved motion quality in generated videos.

Abstract: Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on a static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
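
The LTD weighting itself is a few lines: compute frame-to-frame latent differences, normalize them into weights near 1 for stable regions, and scale the reconstruction loss. A minimal sketch with an illustrative alpha and weight normalization (the paper's exact formulation may differ):

```python
import torch

def ltd_weighted_loss(pred, target, latents, alpha=1.0):
    """Latent Temporal Discrepancy as a motion prior: regions that change more
    between consecutive latent frames get a larger loss weight.
    latents: (B, T, C, H, W) clean video latents; pred/target: same shape."""
    disc = (latents[:, 1:] - latents[:, :-1]).abs().mean(2, keepdim=True)  # (B,T-1,1,H,W)
    disc = torch.cat([disc[:, :1], disc], dim=1)          # pad the first frame
    weight = 1.0 + alpha * disc / (disc.mean() + 1e-8)    # stable regions keep weight ~1
    return (weight * (pred - target) ** 2).mean()
```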

[174] Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

Zelong Sun, Jiahui Wu, Ying Ba, Dong Jing, Zhiwu Lu

Main category: cs.CV

TL;DR: Portrait Collection Generation (PCG) is a new task for creating coherent portrait collections by editing reference images with natural language. The paper introduces CHEESE dataset (24K collections, 573K samples) and SCheese framework with adaptive feature fusion and ConsistencyNet for identity/detail preservation.

Motivation: Social media platforms create demand for intuitive ways to generate diverse, high-quality portrait collections. Existing methods struggle with complex multi-attribute modifications (pose, layout, viewpoint) while preserving high-fidelity details (identity, clothing, accessories).

Method: 1) CHEESE dataset construction using Large Vision-Language Model pipeline with inversion-based verification. 2) SCheese framework combining text-guided generation with hierarchical identity/detail preservation, featuring adaptive feature fusion mechanism and ConsistencyNet for fine-grained feature injection.

Result: Comprehensive experiments validate CHEESE dataset’s effectiveness in advancing PCG. SCheese achieves state-of-the-art performance in generating coherent portrait collections while maintaining identity and detail consistency.

Conclusion: The paper introduces the novel PCG task, provides the first large-scale dataset (CHEESE), and proposes an effective framework (SCheese) that successfully addresses challenges of complex multi-attribute modifications and high-fidelity detail preservation in portrait collection generation.

Abstract: As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.

[175] Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective

Qiyan Zhao, Xiaofeng Zhang, Shuochen Chang, Qianyu Chen, Xiaosong Yuan, Xuhang Chen, Luoqi Liu, Jiajun Zhang, Xu-Yao Zhang, Da-Han Wang

Main category: cs.CV

TL;DR: CoTA is a plug-and-play method that mitigates repetitive text generation in diffusion-based MLLMs by preserving context token information flow and penalizing uncertain predictions during decoding.

Motivation: Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and rely on caching techniques, which often introduce undesirable repetitive text generation (the "Repeat Curse"). The authors aim to investigate the underlying mechanisms behind this repetition issue.

Method: The authors analyze repetition generation through information flow analysis, revealing three key findings about context tokens and entropy convergence. Based on these insights, they propose CoTA, which enhances attention of context tokens to preserve intrinsic information flow patterns and introduces a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens.

Result: CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks through extensive experiments.

Conclusion: The information flow analysis provides insights into the repetition problem in dMLLMs, and CoTA offers an effective plug-and-play solution that mitigates repetitive generation while maintaining or improving overall performance.

Abstract: Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model’s growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA
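
The decoding-side fix can be pictured as a penalized confidence score: positions whose context tokens remain high-entropy are demoted in the commit order. A hedged sketch follows; the penalty form and lambda are illustrative, not CoTA's exact formula.

```python
import torch

def penalized_confidence(logits, context_entropy, lam=0.5):
    """Score each masked position by its max-probability confidence minus a
    penalty proportional to the entropy of the context tokens it relies on."""
    conf = torch.softmax(logits, dim=-1).max(dim=-1).values   # (positions,)
    return conf - lam * context_entropy

logits = torch.randn(8, 32000)        # 8 masked positions, 32k vocab (toy)
ctx_entropy = torch.rand(8)           # per-position context-token entropy (toy)
commit_order = torch.argsort(penalized_confidence(logits, ctx_entropy),
                             descending=True)
```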

[176] AnomalyVFM – Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj

Main category: cs.CV

TL;DR: AnomalyVFM transforms any pretrained vision foundation model into a strong zero-shot anomaly detector by addressing dataset diversity limitations and shallow adaptation strategies through synthetic data generation and parameter-efficient adaptation.

Motivation: Current zero-shot anomaly detection methods using vision foundation models (VFMs) like DINOv2 lag behind vision-language models (VLMs) due to limited diversity in existing auxiliary anomaly detection datasets and overly shallow VFM adaptation strategies.

Method: Proposes AnomalyVFM framework with: 1) robust three-stage synthetic dataset generation scheme, and 2) parameter-efficient adaptation mechanism using low-rank feature adapters and confidence-weighted pixel loss.

Result: With RADIO backbone, achieves average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by 3.3 percentage points.

Conclusion: AnomalyVFM effectively bridges the performance gap between VFMs and VLMs for zero-shot anomaly detection through improved dataset diversity and deeper adaptation strategies.

Abstract: Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
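
Both adaptation ingredients are standard building blocks. Below is a minimal sketch of a low-rank feature adapter (zero-initialized so training starts from the frozen VFM features) and a confidence-weighted pixel loss that down-weights uncertain pixels; the rank, the weighting scheme, and the function names are assumptions, not AnomalyVFM's exact design.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank residual adapter: x + up(down(x)) with rank << dim;
    `up` is zero-initialized so the adapter starts as an identity mapping."""
    def __init__(self, dim, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)
    def forward(self, x):
        return x + self.up(self.down(x))

def confidence_weighted_pixel_loss(pred, target):
    """BCE per pixel, down-weighted where the prediction sits near 0.5."""
    conf = (pred.detach() - 0.5).abs() * 2      # 0 = uncertain, 1 = confident
    bce = nn.functional.binary_cross_entropy(pred, target, reduction="none")
    return (conf * bce).mean()
```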

[177] IOTA: Corrective Knowledge-Guided Prompt Learning via Black-White Box Framework

Shaokun Wang, Yifan Yu, Yuhang He, Weili Guan, Yihong Gong

Main category: cs.CV

TL;DR: IOTA is a black-white box prompt learning framework that combines data-driven black box optimization with knowledge-driven white box correction for better downstream task adaptation of pre-trained models.

Motivation: Previous PET methods treat pre-trained models as opaque black boxes, relying only on data-driven optimization and underutilizing their inherent prior knowledge, limiting effective downstream task adaptation.

Method: IOTA integrates a data-driven Black Box module with a knowledge-driven White Box module. The White Box module derives corrective knowledge by contrasting wrong predictions with right cognition, verbalizes this into interpretable human prompts, and uses a corrective knowledge-guided prompt selection strategy to guide the Black Box module.

Result: Experimental results on 12 image classification benchmarks under few-shot and easy-to-hard adaptation settings demonstrate the effectiveness of corrective knowledge and the superiority of IOTA over state-of-the-art methods.

Conclusion: By jointly leveraging knowledge- and data-driven learning signals, IOTA achieves effective downstream task adaptation, overcoming limitations of purely data-driven PET methods.

Abstract: Recently, adapting pre-trained models to downstream tasks has attracted increasing interest. Previous Parameter-Efficient-Tuning (PET) methods regard the pre-trained model as an opaque Black Box model, relying purely on data-driven optimization and underutilizing their inherent prior knowledge. This oversight limits the models’ potential for effective downstream task adaptation. To address these issues, we propose a novel black-whIte bOx prompT leArning framework (IOTA), which integrates a data-driven Black Box module with a knowledge-driven White Box module for downstream task adaptation. Specifically, the White Box module derives corrective knowledge by contrasting the wrong predictions with the right cognition. This knowledge is verbalized into interpretable human prompts and leveraged through a corrective knowledge-guided prompt selection strategy to guide the Black Box module toward more accurate predictions. By jointly leveraging knowledge- and data-driven learning signals, IOTA achieves effective downstream task adaptation. Experimental results on 12 image classification benchmarks under few-shot and easy-to-hard adaptation settings demonstrate the effectiveness of corrective knowledge and the superiority of our method over state-of-the-art methods.

[178] Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, Hao Ouyang

Main category: cs.CV

TL;DR: LingBot-World is an open-source world simulator based on video generation that offers high-fidelity dynamics across diverse environments, minute-level horizon with long-term memory, and real-time interactivity with <1 second latency.

Motivation: To bridge the gap between open-source and closed-source technologies in world simulation, empowering the community with practical applications in content creation, gaming, and robot learning.

Method: Developed as a video generation-based world simulator that maintains high fidelity across various environments (realism, scientific, cartoon styles), preserves contextual consistency over time (long-term memory), and achieves real-time performance.

Result: Created an open-sourced world simulator with three key features: (1) high fidelity and robust dynamics across diverse environments, (2) minute-level horizon with contextual consistency, (3) real-time interactivity with under 1 second latency at 16 FPS.

Conclusion: LingBot-World represents a top-tier open-source world model that narrows the divide between open and closed-source technologies, with potential applications in content creation, gaming, and robot learning.

Abstract: We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as “long-term memory”. (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

[179] DeepSeek-OCR 2: Visual Causal Flow

Haoran Wei, Yaofeng Sun, Yukun Li

Main category: cs.CV

TL;DR: DeepSeek-OCR 2 introduces DeepEncoder V2, a novel encoder that dynamically reorders visual tokens based on image semantics, moving beyond rigid raster-scan processing to mimic human visual perception patterns.

Motivation: Current vision-language models process visual tokens in fixed raster-scan order with static positional encoding, which contradicts human visual perception that follows flexible, semantically-driven scanning patterns, especially for complex layouts.

Method: DeepEncoder V2 is designed with causal reasoning capabilities to intelligently reorder visual tokens before feeding them to LLMs, exploring a paradigm of 2D image understanding through two-cascaded 1D causal reasoning structures.

Result: The work presents a novel architectural approach that enables dynamic visual token reordering based on image semantics, offering potential for genuine 2D reasoning in vision-language models.

Conclusion: This research introduces a new paradigm for vision-language processing that better aligns with human cognitive mechanisms, with code and model weights publicly available for further exploration.

Abstract: We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.

[180] DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression

Wenzhuo Ma, Zhenzhong Chen

Main category: cs.CV

TL;DR: DiffVC-RT is the first real-time diffusion-based neural video compression framework that achieves 80.1% bitrate savings over VTM-17.0 with real-time speeds of 206/30 fps for 720p videos.

Motivation: Current diffusion-based neural video compression faces three critical challenges: severe information loss, prohibitive inference latency, and poor temporal consistency, which hinder practical deployment.

Method: Three key innovations: 1) Efficient and Informative Model Architecture with strategic module replacements and pruning; 2) Explicit and Implicit Consistency Modeling with zero-cost Online Temporal Shift Module; 3) Asynchronous and Parallel Decoding Pipeline with Mixed Half Precision and Batch-dimension Temporal Shift.

Result: Achieves 80.1% bitrate savings in LPIPS over VTM-17.0 on HEVC dataset, with real-time encoding/decoding speeds of 206/30 fps for 720p videos on NVIDIA H800 GPU.

Conclusion: DiffVC-RT represents a significant milestone in diffusion-based video compression, enabling real-time performance while maintaining high compression efficiency and temporal consistency.

Abstract: The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacements and pruning, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on HEVC dataset with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.
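
The zero-cost Online Temporal Shift Module follows the classic TSM pattern: shift a fraction of channels forward and backward along the time axis so each frame mixes neighbor features without extra parameters. A minimal sketch on (B, T, C, H, W) tensors, with fold_div as an illustrative hyperparameter:

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels one frame forward and another
    1/fold_div one frame backward in time; the rest pass through unchanged.
    x: (B, T, C, H, W)."""
    fold = x.shape[2] // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # identity channels
    return out

y = temporal_shift(torch.randn(1, 8, 32, 16, 16))
```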

[181] StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval

Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie

Main category: cs.CV

TL;DR: StructAlign: A structured cross-modal alignment method for Continual Text-to-Video Retrieval that uses simplex ETF geometry and cross-modal relation preservation to mitigate feature drift and catastrophic forgetting.

Motivation: Continual Text-to-Video Retrieval (CTVR) suffers from catastrophic forgetting due to two forms of feature drift: intra-modal drift within each modality and non-cooperative drift across modalities causing modality misalignment.

Method: 1) Uses simplex Equiangular Tight Frame (ETF) geometry as unified geometric prior to align text and video features with category-level ETF prototypes via cross-modal ETF alignment loss. 2) Introduces Cross-modal Relation Preserving loss that leverages complementary modalities to preserve cross-modal similarity relations and suppress intra-modal feature drift.

Result: Extensive experiments on benchmark datasets show StructAlign consistently outperforms state-of-the-art continual retrieval approaches.

Conclusion: By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR through structured cross-modal alignment.

Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
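
The simplex ETF prior has a closed-form construction: K unit-norm prototypes whose pairwise cosine similarity is exactly -1/(K-1). A minimal sketch (the random orthonormal basis is one of many valid choices; how StructAlign assigns prototypes to categories is not shown):

```python
import torch

def simplex_etf(num_classes, dim):
    """Return (K, dim) prototypes forming a simplex ETF: unit norm and
    pairwise cosine similarity exactly -1/(K-1)."""
    k = num_classes
    assert dim >= k, "this construction needs dim >= num_classes"
    u, _, _ = torch.linalg.svd(torch.randn(dim, k), full_matrices=False)
    m = torch.eye(k) - torch.ones(k, k) / k          # centering matrix
    etf = (k / (k - 1)) ** 0.5 * (u @ m)             # columns are prototypes
    return etf.t()

p = simplex_etf(10, 512)
print(p.norm(dim=1))                  # ~1 for every prototype
print((p @ p.t())[0, 1])              # ~ -1/9
```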

[182] Person Re-ID in 2025: Supervised, Self-Supervised, and Language-Aligned. What Works?

Lakshman Balasubramanian

Main category: cs.CV

TL;DR: This paper analyzes Person ReID training paradigms, evaluating supervised, self-supervised, and language-aligned models across 11 models and 9 datasets to assess cross-domain generalization and foundation model performance.

Motivation: To understand how different training paradigms (supervised, self-supervised, language-aligned) affect Person ReID performance, particularly in cross-domain scenarios, and to evaluate the role of foundation models in improving generalization through richer visual representations.

Method: The authors compare three training paradigms across 11 models and 9 datasets. They analyze supervised models, self-supervised models, and language-aligned foundation models (like SigLIP2) to evaluate their robustness in cross-domain Person ReID applications.

Result: Supervised models perform well within their training domain but fail in cross-domain scenarios. Language-aligned models show surprising cross-domain robustness for ReID tasks despite not being explicitly trained for them, demonstrating better generalization capabilities.

Conclusion: Language-aligned foundation models offer promising cross-domain generalization for Person ReID, suggesting that richer, more transferable visual representations from foundation models can overcome the domain-specific limitations of traditional supervised approaches.

Abstract: Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. Through this study, we aim to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness on ReID tasks, even though they are not explicitly trained to do so. Code and data available at: https://github.com/moiiai-tech/object-reid-benchmark.

[183] CLEAR-Mamba:Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification

Zhuonan Wang, Wenjie Yan, Wenqiao Zhang, Xiaohui Song, Jian Ma, Ke Yao, Yibo Yu, Beng Chin Ooi

Main category: cs.CV

TL;DR: CLEAR-Mamba enhances MedMamba for ophthalmic angiography classification with hypernetwork-based adaptive conditioning and reliability-aware prediction to address cross-domain generalization and confidence issues.

Motivation: Medical image classification is crucial for CAD, but ophthalmic angiography (FFA/ICGA) faces challenges due to single-modality nature, subtle lesion patterns, and inter-device variability, limiting generalization and high-confidence prediction.

Method: Proposes CLEAR-Mamba with two key innovations: 1) HaC - hypernetwork-based adaptive conditioning layer for dynamic parameter generation based on input features to improve cross-domain adaptability, and 2) RaP - reliability-aware prediction scheme using evidential uncertainty learning to emphasize low-confidence samples and improve stability.

Result: CLEAR-Mamba consistently outperforms multiple baselines including original MedMamba across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. A large-scale ophthalmic angiography dataset covering FFA/ICGA modalities was constructed for evaluation.

Conclusion: The study provides an effective solution balancing generalizability and reliability for modality-specific medical image classification tasks, demonstrating improved performance in ophthalmic angiography analysis.

Abstract: Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation. Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks.
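
A hypernetwork-based conditioning layer in the spirit of HaC can be sketched as a small MLP that maps input feature statistics to per-channel scale and shift, FiLM-style. This is an illustrative reading of the idea, not the paper's exact layer; the hidden size and the mean-pooled statistics are assumptions.

```python
import torch
import torch.nn as nn

class HyperConditioning(nn.Module):
    """Hypernetwork sketch: a small MLP maps per-channel feature statistics
    to scale/shift parameters that modulate the feature map."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * channels))
    def forward(self, x):                        # x: (B, C, H, W)
        stats = x.mean(dim=(2, 3))               # (B, C) distribution summary
        scale, shift = self.hyper(stats).chunk(2, dim=-1)
        return x * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

y = HyperConditioning(32)(torch.randn(2, 32, 8, 8))
```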

[184] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

Shuguang Zhang, Junhong Lian, Guoxin Yu, Baoxun Xu, Xiang Ao

Main category: cs.CV

TL;DR: GDCNet uses MLLM-generated objective image descriptions as semantic anchors to detect sarcasm by measuring discrepancies with original text, outperforming existing methods on MSD benchmarks.

Motivation: Existing MSD methods struggle with loosely related or semantically indirect image-text pairs, and LLM-generated sarcastic cues often introduce noise due to diversity and subjectivity.

Method: Proposes Generative Discrepancy Comparison Network (GDCNet) that uses MLLM-generated factual image captions as stable semantic anchors, computes semantic/sentiment discrepancies between generated descriptions and original text, measures visual-textual fidelity, and fuses features via gated module.

Result: Extensive experiments show GDCNet achieves superior accuracy and robustness, establishing new state-of-the-art on MMSD2.0 benchmark.

Conclusion: GDCNet effectively addresses limitations of existing MSD methods by using objective image descriptions as semantic anchors for more reliable sarcasm detection through discrepancy analysis.

Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet’s superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
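
A minimal sketch of the discrepancy-plus-gating idea: derive a discrepancy feature from the MLLM caption embedding versus the post text embedding, then let a learned softmax gate balance the visual, textual, and discrepancy branches. Names, dimensions, and the exact discrepancy form are illustrative assumptions, not GDCNet's specification.

```python
import torch
import torch.nn as nn

class GatedDiscrepancyFusion(nn.Module):
    """Fuse visual, textual, and discrepancy features with a learned gate
    that balances modality contributions per sample."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, 2)       # sarcastic vs. not
    def forward(self, vis, txt, disc):
        w = self.gate(torch.cat([vis, txt, disc], dim=-1))          # (B, 3)
        fused = w[:, 0:1] * vis + w[:, 1:2] * txt + w[:, 2:3] * disc
        return self.classifier(fused)

def discrepancy_features(caption_emb, text_emb):
    """Semantic gap between the objective MLLM caption and the post text."""
    cos = nn.functional.cosine_similarity(caption_emb, text_emb, dim=-1)
    return (caption_emb - text_emb) * (1 - cos).unsqueeze(-1)       # scaled difference
```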

[185] OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet

Main category: cs.CV

TL;DR: OS-Marathon benchmark with 242 long-horizon repetitive tasks across 2 domains evaluates SOTA computer-use agents, plus a few-shot demonstration method to teach workflow logic for effective execution on larger datasets.

Motivation: Long-horizon repetitive workflows (like processing expense reports or entering student grades) are tedious for humans but ideal for computer-use agents due to their structured, recurring sub-workflows. The absence of evaluation benchmarks is a bottleneck.

Method: 1) Establish OS-Marathon benchmark with 242 tasks across 2 domains. 2) Introduce cost-effective method using few-shot examples to construct condensed demonstrations that teach agents workflow logic, enabling execution on larger unseen data.

Result: Extensive experiments demonstrate both the inherent challenges of these long-horizon repetitive tasks and the effectiveness of the proposed few-shot demonstration method.

Conclusion: OS-Marathon addresses the benchmark gap for evaluating computer-use agents on long-horizon repetitive workflows, and the proposed few-shot demonstration method enables effective workflow learning and execution on larger datasets.

Abstract: Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans since they can extend to extreme lengths proportional to the size of the data to process. However, they are ideal for Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows with logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method to construct a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.

[186] FD-MAD: Frequency-Domain Residual Analysis for Face Morphing Attack Detection

Diogo J. Paulo, Hugo Proença, João C. Neves

Main category: cs.CV

TL;DR: A region-aware frequency-based approach for single-image morphing attack detection that uses residual frequency domain analysis and Markov Random Field fusion to achieve strong cross-dataset performance.

DetailsMotivation: Face morphing attacks threaten face recognition systems in border control and identity verification, especially in single-image scenarios without trusted references. Current morph detection systems struggle with cross-dataset generalization.

Method: 1) Introduces residual frequency domain to decouple signal frequency from natural spectral decay for easier discrimination; 2) Uses Markov Random Field to combine evidence from different facial regions for globally consistent decisions; 3) Lightweight approach using only spectral features.
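
To make the residual-frequency idea concrete, here is a minimal sketch, assuming the residual is obtained by fitting and subtracting a power-law spectral decay from the log-magnitude spectrum of a facial region; the function name and the fitting choice are illustrative, not the paper's exact formulation.

```python
import numpy as np

def residual_frequency_features(patch: np.ndarray) -> np.ndarray:
    """Illustrative residual-spectrum computation for one facial region.

    The log-magnitude spectrum of natural images decays roughly as a power
    law with radial frequency; subtracting a fitted decay leaves a residual
    in which morphing artifacts are easier to separate. (Hypothetical
    simplification of the paper's residual frequency domain.)
    """
    spec = np.fft.fftshift(np.fft.fft2(patch))
    log_mag = np.log1p(np.abs(spec))

    # Radial frequency of every coefficient, relative to the DC bin.
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius > 0  # exclude DC when fitting the decay

    # Fit log-magnitude ~ a + b * log(radius): the natural spectral decay.
    coeffs = np.polyfit(np.log(radius[mask]), log_mag[mask], deg=1)
    expected = np.polyval(coeffs, np.log(radius[mask]))

    residual = np.zeros_like(log_mag)
    residual[mask] = log_mag[mask] - expected
    return residual  # per-region spectral features for the classifier

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    region = rng.standard_normal((64, 64))
    print(residual_frequency_features(region).shape)  # (64, 64)
```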

Result: Achieves 1.85% average EER on FRLL-Morph and 6.12% average EER on MAD22 (ranking second). Shows good BPCER at low APCER using only spectral features, outperforming strong baselines in cross-dataset/cross-morph settings.

Conclusion: Fourier-domain residual modeling with structured regional fusion offers a competitive alternative to deep S-MAD architectures, demonstrating strong cross-dataset generalization capabilities for morphing attack detection.

Abstract: Face morphing attacks present a significant threat to face recognition systems used in electronic identity enrolment and border control, particularly in single-image morphing attack detection (S-MAD) scenarios where no trusted reference is available. In spite of the vast amount of research on this problem, morph detection systems struggle in cross-dataset scenarios. To address this problem, we introduce a region-aware frequency-based morph detection strategy that drastically improves over strong baseline methods in challenging cross-dataset and cross-morph settings using a lightweight approach. Having observed the separability of bona fide and morph samples in the frequency domain of different facial parts, our approach 1) introduces the concept of residual frequency domain, where the frequency of the signal is decoupled from the natural spectral decay to easily discriminate between morph and bona fide data; 2) additionally, we reason in a global and local manner by combining the evidence from different facial regions in a Markov Random Field, which infers a globally consistent decision. The proposed method, trained exclusively on the synthetic morphing attack detection development dataset (SMDD), is evaluated in challenging cross-dataset and cross-morph settings on FRLL-Morph and MAD22 sets. Our approach achieves an average equal error rate (EER) of 1.85% on FRLL-Morph and ranks second on MAD22 with an average EER of 6.12%, while also obtaining a good bona fide presentation classification error rate (BPCER) at a low attack presentation classification error rate (APCER) using only spectral features. These findings indicate that Fourier-domain residual modeling with structured regional fusion offers a competitive alternative to deep S-MAD architectures.

[187] ProSkill: Segment-Level Skill Assessment in Procedural Videos

Michele Mazzamuto, Daniele Di Mauro, Gianpiero Francesca, Giovanni Maria Farinella, Antonino Furnari

Main category: cs.CV

TL;DR: ProSkill is the first benchmark dataset for action-level skill assessment in procedural videos, featuring both absolute and pairwise skill annotations using a novel Swiss Tournament + ELO rating annotation protocol.

DetailsMotivation: Current skill assessment research lacks large-scale datasets for complex procedural activities, focuses mainly on sports, and uses limited annotation approaches (only pairwise or binary labels). There's a need for better datasets to advance skill assessment in procedural tasks like manufacturing and daily activities.

Method: Introduced ProSkill dataset with a novel annotation protocol: uses Swiss Tournament scheme for efficient pairwise comparisons, then aggregates these into consistent continuous global scores using an ELO-based rating system. This enables both pairwise and absolute skill assessment annotations.
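
A minimal sketch of the Swiss-plus-ELO aggregation step, assuming `compare(a, b)` stands in for a human annotator's pairwise judgment; round count, K-factor, and base rating are illustrative defaults, not the paper's settings.

```python
import random

def elo_rank(videos, compare, rounds=5, k=32, base=1000.0):
    """Aggregate pairwise skill comparisons into continuous global scores.

    `compare(a, b)` returns 1 if a's execution is judged more skilled,
    else 0. A Swiss-style scheme pairs items of similar current rating,
    and an Elo update turns wins/losses into a consistent scalar score.
    """
    rating = {v: base for v in videos}
    for _ in range(rounds):
        # Swiss pairing: sort by current rating, pair neighbours.
        order = sorted(videos, key=rating.get)
        for a, b in zip(order[0::2], order[1::2]):
            expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400))
            score_a = compare(a, b)
            rating[a] += k * (score_a - expected_a)
            rating[b] += k * ((1 - score_a) - (1 - expected_a))
    return rating

if __name__ == "__main__":
    true_skill = {f"clip_{i}": i for i in range(8)}
    oracle = lambda a, b: int(true_skill[a] > true_skill[b])  # ideal annotator
    scores = elo_rank(list(true_skill), oracle)
    print(sorted(scores, key=scores.get))  # roughly clip_0 ... clip_7
```

The Swiss pairing keeps comparisons informative (similar-skill clips meet each other), which is why far fewer judgments are needed than an all-pairs protocol.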

Result: Created the first benchmark dataset for action-level skill assessment in procedural tasks. When used to benchmark state-of-the-art skill assessment algorithms, the suboptimal results highlight the dataset’s value and the challenges in this domain.

Conclusion: ProSkill addresses critical gaps in skill assessment research by providing a comprehensive dataset with both absolute and pairwise annotations, enabling better evaluation of algorithms and advancing research in procedural video skill assessment.

Abstract: Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs. needs improvement). In response to these shortcomings, we introduce ProSkill, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSkill provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSkill in the context of skill assessment for procedural videos. All data and code are available at https://fpv-iplab.github.io/ProSkill/

[188] Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing

Pankhi Kashyap, Mainak Singha, Biplab Banerjee

Main category: cs.CV

TL;DR: BiMoRS: A lightweight bi-modal prompt learning framework for remote sensing tasks that uses image captions to enhance CLIP adaptation, achieving better domain generalization than existing methods.

DetailsMotivation: Prompt learning works well for natural images but struggles with remote sensing data due to multi-label scenes, high intra-class variability, and diverse spatial resolutions. Existing methods fail to identify dominant semantic cues and generalize poorly to novel RS classes.

Method: Uses frozen image captioning model (BLIP-2) to extract textual semantic summaries from RS images, tokenizes captions with BERT, fuses them with CLIP visual features, and employs lightweight cross-attention to condition learnable query prompts on fused textual-visual representations without modifying CLIP backbone.
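
A minimal PyTorch sketch of the cross-attention conditioning step, assuming caption tokens and CLIP visual features have already been projected to a shared width; class name, dimensions, and prompt count are illustrative, not BiMoRS's exact configuration.

```python
import torch
import torch.nn as nn

class PromptConditioner(nn.Module):
    """Condition learnable query prompts on fused caption/visual tokens.

    Caption tokens (from a frozen captioner, tokenized with BERT) are
    concatenated with visual features, and a small cross-attention layer
    lets learnable prompt queries read from them, leaving the CLIP
    backbone untouched.
    """
    def __init__(self, dim=512, n_prompts=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, caption_tokens, visual_tokens):
        # (B, Lc, D) caption embeddings + (B, Lv, D) CLIP visual features
        context = torch.cat([caption_tokens, visual_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        prompts, _ = self.attn(q, context, context)  # contextualized prompts
        return prompts  # fed onward to the frozen CLIP pipeline

if __name__ == "__main__":
    m = PromptConditioner()
    out = m(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
    print(out.shape)  # torch.Size([2, 4, 512])
```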

Result: Outperforms strong baselines by up to 2% on average across four RS datasets and three domain generalization tasks, demonstrating consistent performance gains.

Conclusion: BiMoRS effectively adapts vision-language models to remote sensing tasks by leveraging textual semantic summaries from images, addressing unique RS challenges and improving domain generalization capabilities.

Abstract: Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at https://github.com/ipankhi/BiMoRS.

[189] Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework

Xinyue Li, Zhichao Zhang, Zhiming Xu, Shubo Xu, Xiongkuo Min, Yitong Chen, Guangtao Zhai

Main category: cs.CV

TL;DR: LEAF is a label-efficient framework that distills quality perception from large multimodal language models into lightweight regressors, reducing the need for expensive MOS annotations while maintaining strong correlation with human judgments.

DetailsMotivation: Current MLLM-based IQA methods are computationally expensive and require substantial MOS annotations. The authors argue the bottleneck is not in MLLMs' quality perception capacity but in MOS scale calibration, so they aim to reduce annotation costs while maintaining performance.

Method: LEAF distills perceptual quality priors from an MLLM teacher into a lightweight student regressor. The teacher provides dense supervision through point-wise judgments and pair-wise preferences with reliability estimates. The student learns through joint distillation and is calibrated on a small MOS subset to align with human annotations.
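
One plausible form of the joint distillation objective, sketched from the summary: a point-wise regression to the teacher's scores plus a reliability-weighted pair-wise ranking term. The exact losses and weighting in LEAF may differ; names and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def leaf_distill_loss(student_scores, teacher_scores, pair_idx,
                      pair_pref, pair_reliability, margin=0.1):
    """Joint point-wise + pair-wise distillation (illustrative).

    student_scores/teacher_scores: (N,) predicted quality scores.
    pair_idx: (P, 2) indices of compared images; pair_pref: (P,) 1.0 if the
    teacher prefers the first image; pair_reliability: (P,) weights in [0,1].
    """
    # Point-wise: regress the teacher's dense judgments directly.
    point = F.mse_loss(student_scores, teacher_scores)

    # Pair-wise: respect teacher preferences, weighted by their reliability.
    s_a = student_scores[pair_idx[:, 0]]
    s_b = student_scores[pair_idx[:, 1]]
    sign = pair_pref * 2.0 - 1.0                # +1: prefer a, -1: prefer b
    rank = F.relu(margin - sign * (s_a - s_b))  # margin ranking loss
    pair = (pair_reliability * rank).mean()

    return point + pair

if __name__ == "__main__":
    s = torch.rand(8, requires_grad=True)
    t = torch.rand(8)
    idx = torch.tensor([[0, 1], [2, 3]])
    loss = leaf_distill_loss(s, t, idx, torch.tensor([1.0, 0.0]),
                             torch.tensor([0.9, 0.5]))
    loss.backward()
    print(float(loss))
```

Calibration on the small MOS subset would then fit an affine (or monotonic) map from student scores to the human scale.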

Result: Experiments on both user-generated and AI-generated IQA benchmarks show the method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.

Conclusion: LEAF enables efficient image quality assessment by leveraging MLLMs’ perceptual capabilities while minimizing expensive human annotation requirements, addressing the core bottleneck of MOS scale calibration rather than quality perception capacity.

Abstract: Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher’s quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.

[190] LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?

Zhuang Yu, Lei Shen, Jing Zhao, Shiliang Sun

Main category: cs.CV

TL;DR: LEMON is a new benchmark for evaluating multimodal LLMs on STEM lecture videos, featuring long-form, knowledge-intensive content with temporal structure and cross-modal reasoning challenges.

DetailsMotivation: Current MLLMs show strong performance on general vision/audio/language tasks, but their capabilities on long-form, knowledge-intensive educational content with temporal structure remain unexplored. There's a need for benchmarks that test multimodal understanding in instructional contexts requiring sustained reasoning.

Method: Created LEMON benchmark with 2,277 STEM lecture video segments (avg 196.1s) across 5 disciplines and 29 courses, yielding 4,181 QA pairs (3,413 multiple-choice, 768 open-ended). Features semantic richness, tightly coupled modalities, explicit temporal/pedagogical structure, and contextually linked multi-turn questions. Includes 6 major tasks and 12 subtasks covering perception to reasoning to generation.

Result: Comprehensive experiments show substantial performance gaps across tasks, revealing that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction in long-form educational content.

Conclusion: LEMON serves as an extensible, challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional content, highlighting current limitations and providing a testbed for future MLLM development.

Abstract: Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It further encompasses six major tasks and twelve subtasks, covering the full cognitive spectrum from perception to reasoning and then to generation. Comprehensive experiments reveal substantial performance gaps across tasks, highlighting that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction. We expect LEMON to serve as an extensible and challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional content.

[191] Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction

Matej Halinkovic, Nina Masarykova, Alexey Vinel, Marek Galinski

Main category: cs.CV

TL;DR: Li-ViP3D++ is a query-based multimodal perception-and-prediction framework that introduces Query-Gated Deformable Fusion to integrate multi-view RGB and LiDAR in query space for end-to-end autonomous driving tasks.

DetailsMotivation: Modular perception-prediction pipelines restrict information flow and amplify errors. Existing query-based models don't sufficiently explore camera-LiDAR complementarity in query space, often using heuristic fusion schemes that introduce bias and prevent full information utilization.

Method: Proposes Query-Gated Deformable Fusion (QGDF) with three components: (1) aggregates image evidence via masked attention across cameras/feature levels, (2) extracts LiDAR context through differentiable BEV sampling with learned per-query offsets, (3) uses query-conditioned gating to adaptively weight visual and geometric cues per agent.
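
Component (3) in isolation, as a minimal sketch: each agent query predicts a per-channel gate that blends its camera and LiDAR evidence. The deformable sampling of components (1)-(2) is assumed to have already produced `img_feat` and `lidar_feat`; sizes are illustrative.

```python
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    """Query-conditioned gating between camera and LiDAR evidence."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, query, img_feat, lidar_feat):
        # query, img_feat, lidar_feat: (B, N_queries, D)
        g = self.gate(query)                     # in (0, 1), per channel
        return g * img_feat + (1.0 - g) * lidar_feat

if __name__ == "__main__":
    fuse = QueryGate()
    q, im, ld = (torch.randn(2, 100, 256) for _ in range(3))
    print(fuse(q, im, ld).shape)  # torch.Size([2, 100, 256])
```

Because the gate is a function of the query rather than a global constant, each agent can weight geometry and appearance differently, which is the point of "per agent" adaptive weighting.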

Result: On nuScenes: improves end-to-end behavior and detection quality with higher EPA (0.335) and mAP (0.502), reduces false positives (FP ratio 0.147), and is faster than prior Li-ViP3D variant (139.82 ms vs. 145.91 ms).

Conclusion: Query-space, fully differentiable camera-LiDAR fusion increases robustness of end-to-end perception-and-prediction without sacrificing deployability, demonstrating effective multimodal integration in autonomous driving systems.

Abstract: End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.

[192] Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, Wenjun Zeng

Main category: cs.CV

TL;DR: This paper explores the connection between compression efficiency and AI intelligence, unifying traditional visual coding with modern visual token technology from multimodal LLMs, and proposes future directions for next-generation visual representation techniques.

DetailsMotivation: The motivation stems from the observation that compression efficiency correlates with improved AI model performance. Both classical visual coding (like H.264/265) and emerging visual token technology in multimodal LLMs share the same fundamental objective: maximizing semantic information fidelity while minimizing computational cost. The paper aims to bridge these two historically separate fields to gain deeper insights.

Method: The paper first provides comprehensive overviews of two dominant technique families: Visual Coding (traditional compression standards) and Vision Token Technology (from generative multimodal models). It then unifies them from an optimization perspective, discussing the compression efficiency vs. model performance trade-off. Based on this unified formulation, the authors synthesize bidirectional insights and forecast next-generation techniques.

Result: The paper demonstrates through experiments that task-oriented token developments show large potential in practical applications like multimodal LLMs, AI-generated content, and embodied AI. It also reveals the possibility of standardizing a general token technology similar to traditional codecs for a wide range of intelligent tasks.

Conclusion: The paper concludes that compression efficiency is fundamentally linked to intelligence in AI systems. By unifying visual coding and visual token technology, it provides insights for developing next-generation visual representation techniques that could lead to standardized, efficient token technologies for diverse intelligent applications, much like traditional video codecs revolutionized multimedia systems.

Abstract: “Compression Tells Intelligence” is a view supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Beyond that, the recently emerging visual token technology of generative multimodal large models shares a fundamental objective similar to that of visual coding: maximizing semantic information fidelity during representation learning while minimizing computational cost. Therefore, this paper first provides a comprehensive overview of the two dominant technique families – Visual Coding and Vision Token Technology – and then unifies them from the perspective of optimization, discussing the compression-efficiency versus model-performance trade-off behind both. Next, based on the proposed unified formulation bridging visual coding and visual token technology, we synthesize bidirectional insights and forecast next-generation visual codec and token techniques. Last but not least, we experimentally show the large potential of task-oriented token developments in practical tasks such as multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, and shed light on the future possibility of standardizing a general token technology, analogous to traditional codecs (e.g., H.264/265), with high efficiency for a wide range of intelligent tasks in a unified and effective manner.

[193] FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

Haonan Zhong, Wei Song, Tingxu Han, Maurice Pagnucco, Jingling Xue, Yang Song

Main category: cs.CV

TL;DR: FairT2V is a training-free framework that mitigates gender bias in text-to-video generation by neutralizing prompt embeddings through spherical geodesic transformations, applied only during early denoising steps to maintain temporal coherence.

DetailsMotivation: Text-to-video diffusion models have shown rapid progress but their demographic biases, particularly gender bias, remain largely unexplored. The authors identified that bias primarily originates from pretrained text encoders that encode implicit gender associations even for neutral prompts.

Method: FairT2V uses anchor-based spherical geodesic transformations to neutralize prompt embeddings while preserving semantics. To maintain temporal coherence, debiasing is applied only during early identity-forming steps through a dynamic denoising schedule. The framework is training-free and doesn’t require finetuning.
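
A minimal sketch of the spherical-geodesic (slerp-style) neutralization, assuming a precomputed gender-neutral anchor embedding; `alpha` as debiasing strength and the magnitude-restoring step are assumptions drawn from the summary, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def geodesic_neutralize(embed, anchor, alpha=0.5, eps=1e-7):
    """Move a prompt embedding toward a neutral anchor along the sphere."""
    e = F.normalize(embed, dim=-1)
    a = F.normalize(anchor, dim=-1)
    omega = torch.acos((e * a).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    out = (torch.sin((1 - alpha) * omega) / so) * e + \
          (torch.sin(alpha * omega) / so) * a
    return out * embed.norm(dim=-1, keepdim=True)  # keep original magnitude

if __name__ == "__main__":
    prompt = torch.randn(1, 768)    # e.g. encoding of "a photo of a doctor"
    neutral = torch.randn(1, 768)   # e.g. an averaged "person" anchor
    print(geodesic_neutralize(prompt, neutral).shape)  # torch.Size([1, 768])
```

In the full pipeline this transformed embedding would replace the original one only during the early, identity-forming denoising steps, per the dynamic schedule.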

Result: Experiments on Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality. The authors also propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification.

Conclusion: FairT2V effectively addresses demographic bias in text-to-video generation through a training-free approach that neutralizes encoder-induced bias while maintaining video quality and temporal coherence, providing a practical solution for fair video generation.

Abstract: Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.

[194] Open-Vocabulary Functional 3D Human-Scene Interaction Generation

Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron, Michael J. Black, Yan Zhang

Main category: cs.CV

TL;DR: FunHSI is a training-free framework that generates functionally correct 3D human-scene interactions from open-vocabulary task prompts by reasoning about object functionality and human-scene contact.

DetailsMotivation: Existing methods for generating 3D human-scene interactions lack explicit reasoning about object functionality and corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. There's a need for a framework that can generate functionally correct interactions for applications in embodied AI, robotics, and interactive content creation.

Method: FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model interactions via a contact graph. It uses vision-language models to synthesize humans performing tasks in images and estimate 3D body/hand poses. The proposed 3D body configuration is then refined through stage-wise optimization for physical plausibility and functional correctness.

Result: FunHSI generates more plausible general 3D interactions (like “sitting on a sofa”) and supports fine-grained functional interactions (like “increasing the room temperature”). Extensive experiments show it consistently produces functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.

Conclusion: FunHSI successfully addresses the challenge of generating functionally correct 3D human-scene interactions by incorporating explicit reasoning about object functionality and human-scene contact, outperforming existing methods in both general and fine-grained interaction scenarios.

Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as “sitting on a sofa”, but also supports fine-grained functional human-scene interactions, e.g., “increasing the room temperature”. Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.

[195] A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova, Leonardo Silvino Brito, Rafael Raider Leoni, Juliano Silva, Valter Ferreira, Sibele Miguel Soares Neto, Samantha Uehara, Daniel Giacomo, João Marcelo Teixeira, Veronica Teichrieb, Cristiano Coelho de Araújo

Main category: cs.CV

TL;DR: Multimodal road surface classification framework fusing images and IMU data with cross-attention and adaptive gating, plus new ROAD dataset with real-world, vision-only, and synthetic subsets for better generalization.

DetailsMotivation: Existing road surface classification techniques fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets lacking environmental diversity.

Method: Multimodal framework fusing images and inertial measurements using lightweight bidirectional cross-attention module followed by adaptive gating layer that adjusts modality contributions under domain shifts. Introduces ROAD dataset with three subsets: real-world multimodal recordings, vision-only subset for adverse conditions, and synthetic subset for out-of-distribution generalization.
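
A compact sketch of the fusion module, assuming tokenized image and IMU streams of shared width; the pooling and gating layout are one plausible reading of "bidirectional cross-attention followed by adaptive gating", not the paper's exact design.

```python
import torch
import torch.nn as nn

class BiCrossAttnFusion(nn.Module):
    """Bidirectional cross-attention between image and IMU tokens + gating.

    Each modality attends to the other, and a gate learned from the pooled
    fused features rebalances modality contributions (e.g. leaning on the
    IMU when the camera is blinded at night).
    """
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.img2imu = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.imu2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_tok, imu_tok):
        # img_tok: (B, Li, D) visual tokens; imu_tok: (B, Lm, D) inertial tokens
        img_ctx, _ = self.img2imu(img_tok, imu_tok, imu_tok)
        imu_ctx, _ = self.imu2img(imu_tok, img_tok, img_tok)
        img_vec, imu_vec = img_ctx.mean(1), imu_ctx.mean(1)
        g = self.gate(torch.cat([img_vec, imu_vec], dim=-1))
        return g * img_vec + (1.0 - g) * imu_vec  # fused surface descriptor

if __name__ == "__main__":
    fuse = BiCrossAttnFusion()
    print(fuse(torch.randn(2, 49, 128), torch.randn(2, 20, 128)).shape)
```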

Result: Achieves +1.4 pp improvement over previous SOTA on PVS benchmark and +11.6 pp improvement on multimodal ROAD subset, with consistently higher F1-scores on minority classes. Demonstrates stable performance across challenging visual conditions including nighttime, heavy rain, and mixed-surface transitions.

Conclusion: Combining affordable camera and IMU sensors with multimodal attention mechanisms provides scalable, robust foundation for road surface understanding, particularly relevant for regions with environmental variability and cost constraints.

Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.

[196] FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models

Hongyu Zhou, Zisen Shao, Sheng Miao, Pan Wang, Dongfeng Bai, Bingbing Liu, Yiyi Liao

Main category: cs.CV

TL;DR: FreeFix is a fine-tuning-free approach that uses pretrained image diffusion models to enhance extrapolated 3D rendering quality while maintaining generalization, avoiding the trade-off between fidelity and generalization seen in previous methods.

DetailsMotivation: Current novel view synthesis methods (NeRF, 3D Gaussian Splatting) rely on dense inputs and degrade at extrapolated views. Existing generative model approaches face a trade-off: fine-tuning diffusion models improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but yield lower fidelity.

Method: FreeFix uses an interleaved 2D-3D refinement strategy with pretrained image diffusion models (avoiding costly video diffusion models). It introduces a per-pixel confidence mask to identify uncertain regions for targeted improvement, enabling consistent refinement without fine-tuning.
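
One plausible instantiation of the per-pixel confidence mask, sketched here as disagreement across repeated renderings of the target view; the variance heuristic and threshold are assumptions, not the paper's exact guidance signal.

```python
import torch

def confidence_mask(renders, threshold=0.05):
    """Per-pixel uncertainty from disagreement between renders.

    Regions whose colour varies strongly across slightly different
    renderings are marked uncertain and handed to the frozen diffusion
    model for refinement; confident regions are kept as-is.
    """
    stack = torch.stack(renders)           # (K, 3, H, W)
    var = stack.var(dim=0).mean(dim=0)     # (H, W) per-pixel variance
    return (var > threshold).float()       # 1 = refine, 0 = keep

if __name__ == "__main__":
    views = [torch.rand(3, 64, 64) for _ in range(4)]
    mask = confidence_mask(views)
    print(mask.mean())  # fraction of pixels routed to the diffusion model
```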

Result: Experiments across multiple datasets show FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.

Conclusion: FreeFix successfully pushes the boundary of the fidelity-generalization trade-off in novel view synthesis, demonstrating that image diffusion models can be effectively leveraged for consistent 3D refinement without fine-tuning.

Abstract: Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.

[197] UDEEP: Edge-based Computer Vision for In-Situ Underwater Crayfish and Plastic Detection

Dennis Monari, Farhad Fassihi Tash, Jordan J. Bird, Ahmad Lotfi, Isibor Kennedy Ihianle, Salisu Wada Yahaya, Md Mahmudul Hasan, Pedro Sousa, Pedro Machado

Main category: cs.CV

TL;DR: This paper introduces a Cognitive Edge Device platform and two underwater datasets for detecting invasive signal crayfish and plastic pollution, achieving high accuracy with YOLOv5s (mAP@0.5 of 0.90).

DetailsMotivation: Invasive signal crayfish cause ecosystem damage by spreading disease, burrowing destructively, and competing with native species, while plastic pollution further threatens aquatic ecosystems. There's a need for effective monitoring solutions to address these environmental challenges.

Method: Developed a Cognitive Edge Device computing platform and created two publicly available underwater datasets annotated with crayfish and plastic debris sequences. Trained and evaluated four YOLO variants for object detection.

Result: YOLOv5s achieved the highest detection accuracy with mAP@0.5 of 0.90 and best precision among the tested models for detecting crayfish and plastic objects.

Conclusion: The proposed CED platform and YOLO-based detection system provide an effective solution for monitoring invasive crayfish and plastic pollution, supporting conservation efforts for threatened aquatic ecosystems.

Abstract: Invasive signal crayfish have a detrimental impact on ecosystems. They spread the fungal-type crayfish plague disease (Aphanomyces astaci) that is lethal to the white-clawed crayfish, the only native crayfish species in Britain. Invasive signal crayfish extensively burrow, causing habitat destruction, erosion of river banks and adverse changes in water quality, while also competing with native species for resources, leading to declines in native populations. Moreover, pollution exacerbates the vulnerability of white-clawed crayfish, with their populations declining by over 90%. To safeguard aquatic ecosystems, it is imperative to address the challenges posed by invasive species and pollution. This article introduces the Cognitive Edge Device (CED) computing platform for the detection of crayfish and plastic. It also presents two publicly available underwater datasets, annotated with sequences of crayfish and aquatic plastic debris. Four You Only Look Once (YOLO) variants were trained and evaluated for crayfish and plastic object detection. YOLOv5s achieved the highest detection accuracy, with an mAP@0.5 of 0.90, and the best precision among the tested models.

[198] Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models

Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang

Main category: cs.CV

TL;DR: AMDM is a training-free algorithm that aggregates features from multiple diffusion models to enable fine-grained control without additional training or complex datasets.

DetailsMotivation: Existing diffusion models struggle with fine-grained control due to dataset limitations and complex architecture designs, requiring a solution that avoids expensive training and dataset construction.

Method: AMDM integrates latent features from multiple diffusion models within the same ecosystem into a specified model, activating particular features for fine-grained control without any training.
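
The simplest hedged reading of the aggregation step is a weighted combination of the models' predictions at each denoising step, sketched below; the models are assumed to share one latent space (same ecosystem), and the plain weighted sum is illustrative, not the exact AMDM operator.

```python
import torch

def aggregate_eps(models, x_t, t, cond, weights=None):
    """Aggregate noise predictions from several diffusion models.

    `models` are callables eps(x_t, t, cond) from the same ecosystem,
    each controlling a different aspect (style, layout, ...). The result
    plugs into the usual DDPM/DDIM update for x_{t-1}.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return sum(w * m(x_t, t, cond) for w, m in zip(weights, models))

if __name__ == "__main__":
    style_model = lambda x, t, c: 0.9 * x    # stand-ins for real U-Nets
    layout_model = lambda x, t, c: 1.1 * x
    x = torch.randn(1, 4, 8, 8)
    print(aggregate_eps([style_model, layout_model], x, 10, None).shape)
```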

Result: Experimental results show AMDM significantly improves fine-grained control without training, and reveals that diffusion models initially focus on position, attributes, and style features before improving quality.

Conclusion: AMDM provides a new perspective for fine-grained conditional generation in diffusion models, enabling utilization of existing models without complex datasets, architectures, or high training costs.

Abstract: While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features in the latent data space from multiple diffusion models within the same ecosystem into a specified model, thereby activating particular features and enabling fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.

[199] TIPO: Text to Image with Text Presampling for Prompt Optimization

Shih-Ying Yeh, Sang-Hyun Park, Yi Li, Giyeong Oh, Xuehai Wang, Min Song, Youngjae Yu

Main category: cs.CV

TL;DR: TIPO introduces an efficient prompt optimization method for text-to-image generation that expands simple user prompts into richer versions using a lightweight pre-trained model, achieving better visual quality and alignment than resource-intensive alternatives.

DetailsMotivation: Current prompt optimization methods for text-to-image generation often rely on resource-intensive approaches like large language models or reinforcement learning, which are computationally expensive and not scalable. There's a need for an efficient, automated method that can refine simple prompts into detailed versions while preserving original intent.

Method: TIPO uses a lightweight pre-trained model to expand simple user prompts into richer, more detailed versions. It samples refined prompts from a targeted sub-distribution within the broader semantic space, ensuring preservation of original intent while enhancing visual quality and detail.
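
A hedged sketch of the presampling step: a small causal LM rewrites a terse user prompt into a denser one before it reaches the T2I model. Here `gpt2` is only a stand-in for TIPO's own lightweight pre-trained model, and the seed-prompt format is invented for illustration.

```python
from transformers import pipeline

expander = pipeline("text-generation", model="gpt2")  # stand-in model

def expand_prompt(user_prompt: str, max_new_tokens: int = 40) -> str:
    """Expand a simple prompt into a more detailed one (illustrative)."""
    seed = f"Detailed image description: {user_prompt},"
    out = expander(seed, max_new_tokens=max_new_tokens,
                   do_sample=True, top_p=0.9)[0]["generated_text"]
    return out  # pass the refined prompt to the diffusion model

if __name__ == "__main__":
    print(expand_prompt("a cat on a sofa"))
```

Sampling with `top_p` mirrors the idea of drawing refined prompts from a sub-distribution around the user's intent rather than picking a single deterministic rewrite.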

Result: Extensive experiments show TIPO achieves stronger text alignment, reduced visual artifacts, and consistently higher human preference rates compared to alternatives. It maintains competitive aesthetic quality while being computationally efficient and scalable across multiple domains.

Conclusion: TIPO demonstrates the effectiveness of distribution-aligned prompt engineering for text-to-image generation, offering a scalable and efficient alternative to resource-intensive methods. It opens new possibilities for automated prompt refinement in T2I tasks.

Abstract: TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer and more detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO offers strong computational efficiency and scalability, opening new possibilities for effective automated prompt engineering in T2I tasks. Extensive experiments across multiple domains demonstrate that TIPO achieves stronger text alignment, reduced visual artifacts, and consistently higher human preference rates, while maintaining competitive aesthetic quality. These results highlight the effectiveness of distribution-aligned prompt engineering and point toward broader opportunities for scalable, automated refinement in text-to-image generation.

[200] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi

Main category: cs.CV

TL;DR: PromptMAE uses mean absolute error loss for robust prompt learning against noisy labels, enhanced by PromptOT data purification via optimal transport, forming NLPrompt method.

DetailsMotivation: Real-world datasets often contain noisy labels that degrade prompt learning performance in vision-language models, creating a need for robust learning methods that can handle label noise effectively.

Method: Two main components: 1) PromptMAE - uses mean absolute error (MAE) loss for prompt learning to suppress noisy sample influence; 2) PromptOT - optimal transport data purification that uses text features as prototypes to partition datasets into clean/noisy subsets, applying cross-entropy to clean data and MAE to noisy data.
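
A simplified sketch of the PromptOT partition described above (see component 2): class-name text features act as prototypes, a Sinkhorn-style transport plan assigns images to prototypes, and samples whose assignment agrees with the given label are treated as clean. Iteration count and entropic regularization are illustrative.

```python
import torch
import torch.nn.functional as F

def prompt_ot_partition(img_feats, text_protos, labels, n_iters=50, eps=0.1):
    """Split a noisy dataset into clean/noisy subsets via optimal transport."""
    cost = 1.0 - F.normalize(img_feats, dim=-1) @ \
                 F.normalize(text_protos, dim=-1).T
    K = torch.exp(-cost / eps)                    # (N, C) Gibbs kernel
    n, c_dim = K.shape
    r = torch.full((n,), 1.0 / n)                 # uniform marginals
    c = torch.full((c_dim,), 1.0 / c_dim)
    u = r.clone()
    for _ in range(n_iters):                      # Sinkhorn iterations
        u = r / (K @ (c / (K.T @ u)))
    v = c / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    assigned = plan.argmax(dim=1)
    return assigned == labels                     # True = treated as clean

if __name__ == "__main__":
    feats, protos = torch.randn(16, 512), torch.randn(4, 512)
    labels = torch.randint(0, 4, (16,))
    mask = prompt_ot_partition(feats, protos, labels)
    print(mask.sum().item(), "samples treated as clean")
    # Cross-entropy would then be applied where mask is True, MAE elsewhere.
```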

Result: Extensive experiments across various noise settings demonstrate significant performance improvements, showing enhanced robustness against noisy labels while maintaining high accuracy.

Conclusion: NLPrompt offers a simple and efficient approach leveraging vision-language models’ expressive representations and alignment capabilities for robust prompt learning, effectively addressing noisy label challenges through MAE loss and optimal transport data purification.

Abstract: The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.

[201] FLOL: Fast Baselines for Real-World Low-Light Enhancement

Juan C. Benito, Daniel Feijoo, Alvaro Garcia, Marcos V. Conde

Main category: cs.CV

TL;DR: FLOL is a fast, lightweight neural network for low-light image enhancement that combines frequency and spatial domain processing, achieving real-time performance on 1080p images while maintaining competitive results with state-of-the-art methods.

DetailsMotivation: Current deep learning-based low-light image enhancement methods struggle with efficiency and robustness for real-world scenarios, particularly with noisy scenes and saturated pixels. There's a need for faster, more practical solutions that can handle real-world conditions.

Method: Proposes FLOL, a lightweight neural network that combines image processing in both frequency and spatial domains. The architecture is designed for efficiency while maintaining enhancement quality.
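
An illustrative layout for the frequency-domain stage, assuming (as is common in frequency-based LLIE) that low-light degradation mostly affects the amplitude spectrum while phase is kept; this is a sketch of the idea, not FLOL's exact architecture.

```python
import torch
import torch.nn as nn

class FreqBranch(nn.Module):
    """Frequency-domain processing stage (illustrative)."""
    def __init__(self, ch=3):
        super().__init__()
        self.amp_net = nn.Sequential(
            nn.Conv2d(ch, 16, 1), nn.ReLU(), nn.Conv2d(16, ch, 1))

    def forward(self, x):
        spec = torch.fft.rfft2(x, norm="ortho")
        amp, phase = spec.abs(), spec.angle()
        amp = amp + self.amp_net(amp)            # enhance amplitudes only
        spec = torch.polar(amp, phase)           # recombine with kept phase
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

if __name__ == "__main__":
    out = FreqBranch()(torch.rand(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```

Spatial-domain convolutional blocks would then refine local detail on the output; keeping both branches shallow is what makes sub-12 ms 1080p inference plausible.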

Result: FLOL achieves results comparable to state-of-the-art methods on popular benchmarks (LOLv2, LSRW, MIT-5K, UHD-LL) while being one of the fastest models. It can process 1080p images in real-time under 12ms.

Conclusion: FLOL provides an efficient and practical solution for real-world low-light image enhancement, balancing speed and quality for practical applications.

Abstract: Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the computer vision literature. However, current deep learning-based solutions struggle with efficiency and robustness for real-world scenarios (e.g., scenes with noise, saturated pixels). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our baseline method, FLOL, is one of the fastest models for this task, achieving results comparable to the state-of-the-art on popular real-world benchmarks such as LOLv2, LSRW, MIT-5K and UHD-LL. Moreover, we are able to process 1080p images in real-time under 12ms. Code and models at https://github.com/cidautai/FLOL

[202] Dense-SfM: Structure from Motion with Dense Consistent Matching

JongMin Lee, Sungjoo Yoo

Main category: cs.CV

TL;DR: Dense-SfM is a new Structure from Motion framework that improves 3D reconstruction accuracy and density by replacing sparse keypoint matching with dense matching and Gaussian Splatting-based track extension.

DetailsMotivation: Traditional SfM methods rely on sparse keypoint matching, which limits both accuracy and point density, especially in texture-less areas where feature detection is challenging.

Method: Integrates dense matching with Gaussian Splatting-based track extension for more consistent, longer feature tracks. Also includes a multi-view kernelized matching module using transformer and Gaussian Process architectures for robust track refinement across multiple views.

Result: Evaluations on ETH3D and Texture-Poor SfM datasets show significant improvements in both accuracy and density compared to state-of-the-art methods.

Conclusion: Dense-SfM successfully addresses limitations of traditional sparse SfM by leveraging dense matching and advanced refinement techniques, achieving superior 3D reconstruction performance.

Abstract: We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods. Project page: https://icetea-cv.github.io/densesfm/.

[203] EgoLife: Towards Egocentric Life Assistant

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu

Main category: cs.CV

TL;DR: EgoLife introduces an egocentric life assistant using AI glasses, creates a 300-hour multimodal dataset of daily life, develops EgoLifeQA tasks for practical assistance, and builds EgoButler system with EgoGPT and EgoRAG models for egocentric video understanding and long-context QA.

DetailsMotivation: To develop an AI-powered wearable glasses assistant that enhances personal efficiency by understanding and assisting with daily life activities through egocentric perception.

Method: 1) Collected 300-hour EgoLife Dataset from 6 participants living together for a week using AI glasses (egocentric view) with synchronized third-person references; 2) Created EgoLifeQA benchmark tasks for practical life assistance; 3) Developed EgoButler system with EgoGPT (omni-modal model for egocentric understanding) and EgoRAG (retrieval-based component for long-context QA).

Result: Created comprehensive multimodal dataset, established benchmark tasks, developed state-of-the-art EgoGPT model for egocentric video understanding, and built EgoRAG for handling ultra-long-context questions, with experimental studies revealing critical factors and bottlenecks.

Conclusion: The EgoLife project lays foundation for practical egocentric AI assistants by providing datasets, models, and benchmarks, aiming to stimulate further research in wearable AI life assistants.

Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.

[204] AdaSCALE: Adaptive Scaling for OOD Detection

Sudarshan Regmi

Main category: cs.CV

TL;DR: AdaSCALE is an adaptive scaling method for OOD detection that dynamically adjusts percentile thresholds based on estimated OOD likelihood, outperforming state-of-the-art methods.

DetailsMotivation: Current OOD detection methods use static percentile thresholds across all samples, which leads to suboptimal separation between in-distribution and out-of-distribution inputs. There's a need for adaptive approaches that can better distinguish ID and OOD samples.

Method: AdaSCALE uses an adaptive scaling procedure that dynamically adjusts percentile thresholds based on a sample’s estimated OOD likelihood. The key insight is that OOD samples show more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples.
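
A numpy sketch of the adaptive idea: the shift of high-magnitude activations under a minor perturbation estimates OOD likelihood, which then moves the scaling percentile per sample. Constants, the shift-to-percentile mapping, and the scaling rule are all illustrative, not the paper's.

```python
import numpy as np

def adascale_score(features, perturbed, base_pct=85.0, delta=10.0):
    """Energy-style score with a perturbation-adaptive percentile threshold."""
    scores = []
    for f, fp in zip(features, perturbed):
        top = f > np.percentile(f, 90)           # high-magnitude units
        shift = np.abs(fp[top] - f[top]).mean() / (np.abs(f[top]).mean() + 1e-8)
        # Small shift (likely ID) -> higher percentile -> stronger scaling.
        pct = np.clip(base_pct + delta * (1.0 - shift), 50.0, 99.0)
        thresh = np.percentile(f, pct)
        scaled = np.where(f > thresh, f * 2.0, f)    # simple scaling rule
        scores.append(np.log(np.exp(scaled).sum()))  # energy score
    return np.array(scores)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 512))
    feats_pert = feats + 0.01 * rng.standard_normal((4, 512))
    print(adascale_score(feats, feats_pert))
```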

Result: Achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94% in near-OOD and 21.67% in far-OOD datasets in average FPR@95 metric on ImageNet-1k benchmark across eight diverse architectures.

Conclusion: AdaSCALE’s adaptive threshold adjustment based on OOD likelihood estimation enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, resulting in highly separable energy scores and superior OOD detection performance.

Abstract: The ability of the deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose AdaSCALE, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample’s estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94% in near-OOD and 21.67% in far-OOD datasets in average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: https://github.com/sudarshanregmi/AdaSCALE/

[205] Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Basura Fernando, Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat

Main category: cs.CV

TL;DR: KML is a neurosymbolic framework that learns neural knowledge modules from procedural knowledge graphs and composes them into executable reasoning programs via LLMs for interpretable procedural reasoning.

DetailsMotivation: To enable models to understand and reason over procedural tasks requiring structured, compositional procedural knowledge with transparent, traceable intermediate states.

Method: Learns relation categories within knowledge graphs as neural knowledge modules, composes them into executable reasoning programs using LLMs, and performs multistep reasoning with traceable intermediate states.
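
A minimal sketch of the knowledge-module idea, assuming each KG relation is realized as a learned mapping over entity embeddings and that an LLM planner supplies a program as a list of relation names. The class, relation names, and program format below are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn

class KnowledgeModules(nn.Module):
    """One neural mapping per KG relation; programs chain them."""
    def __init__(self, relations, dim=64):
        super().__init__()
        self.mods = nn.ModuleDict({r: nn.Linear(dim, dim) for r in relations})

    def run(self, program, query_emb):
        state, trace = query_emb, []
        for relation in program:            # program, e.g., from an LLM planner
            state = torch.tanh(self.mods[relation](state))
            trace.append((relation, state))  # traceable intermediate states
        return state, trace

# hypothetical relations and program
km = KnowledgeModules(["step_uses_tool", "tool_has_purpose"])
answer, trace = km.run(["step_uses_tool", "tool_has_purpose"], torch.randn(64))
```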

Result: Outperforms LLM-only and black-box neural baselines on the PKR-QA benchmark for procedural knowledge reasoning while providing interpretable step-by-step traces.

Conclusion: KML provides a theoretically grounded, interpretable framework for procedural reasoning that combines neural modules with symbolic composition, demonstrating strong performance and transparency advantages over existing approaches.

Abstract: In this work we present Knowledge Module Learning (KML) to understand and reason over procedural tasks that require models to learn structured and compositional procedural knowledge. KML is a neurosymbolic framework that learns relation categories within a knowledge graph as neural knowledge modules and composes them into executable reasoning programs generated by large language models (LLMs). Each module encodes a specific procedural relation, capturing how entity types such as tools relate to steps, the purpose of each tool, and the steps of each task. Given a question conditioned on a task shown in a video, KML performs multistep reasoning with transparent, traceable intermediate states. Our theoretical analysis demonstrates two desirable properties of KML: it satisfies strong optimality conditions for modelling KG relations as neural mappings, providing a sound foundation for generalizable procedural reasoning, and it admits a bound on the expected error of multistep reasoning. To evaluate the model, we construct a large procedural knowledge graph (PKG) spanning diverse instructional domains by integrating the COIN instructional video dataset and ontology, commonsense relations from ConceptNet, and structured extractions from LLMs, followed by expert verification. We then generate question-answer pairs by applying graph traversal templates over the PKG, yielding the PKR-QA benchmark for procedural knowledge reasoning. Experiments show that KML improves structured reasoning performance while providing interpretable step-by-step traces, outperforming LLM-only and black-box neural baselines. Code is publicly available at https://github.com/LUNAProject22/KML.

[206] CAPE: Connectivity-Aware Path Enforcement Loss for Curvilinear Structure Delineation

Elyar Esmaeilzadeh, Ehsan Garaaghaji, Farzad Hallaji Azad, Doruk Oner

Main category: cs.CV

TL;DR: CAPE is a novel connectivity-aware loss function that enforces topological connectivity in segmentation of curvilinear structures by optimizing graph connectivity metrics through shortest-path algorithms.

DetailsMotivation: Traditional pixel-wise loss functions (cross-entropy, Dice) fail to capture high-level topological connectivity, leading to topological mistakes in graphs derived from segmentation predictions for curvilinear structures like neurons and blood vessels.

Method: CAPE uses ground truth graph representation to select node pairs and find corresponding paths in predicted segmentation via shortest-path algorithm, then penalizes both disconnections and false positive connections to preserve topological correctness.
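
The sketch below illustrates a CAPE-style loss on a toy 2D probability map: a shortest path between a ground-truth node pair is found with Dijkstra on -log(p) costs (a non-differentiable step on detached probabilities), and the loss then penalizes low probabilities along that path. The false-positive-connection term is omitted for brevity, and all names are illustrative.

```python
import heapq
import torch

def shortest_path(prob: torch.Tensor, src, dst):
    """Dijkstra on a 4-connected grid with per-pixel cost -log(p)."""
    H, W = prob.shape
    cost = -torch.log(prob.detach().clamp_min(1e-6))
    dist = torch.full((H, W), float("inf"))
    prev = {}
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == dst:
            break
        if d > dist[r, c]:
            continue
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < H and 0 <= nc < W:
                nd = d + cost[nr, nc].item()
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    path, node = [], dst
    while node != src:                          # walk back to the source
        path.append(node)
        node = prev[node]
    return path + [src]

def cape_like_loss(prob, node_pairs):
    """Penalize low probability along predicted paths between GT node pairs."""
    losses = []
    for src, dst in node_pairs:
        p = torch.stack([prob[r, c] for r, c in shortest_path(prob, src, dst)])
        losses.append((1.0 - p).mean())         # disconnection penalty
    return torch.stack(losses).mean()

prob = torch.rand(32, 32).requires_grad_()      # toy segmentation probabilities
cape_like_loss(prob, [((0, 0), (31, 31))]).backward()
```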

Result: Experiments on 2D and 3D datasets (neuron and blood vessel tracing) show CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.

Conclusion: CAPE effectively addresses the connectivity challenge in curvilinear structure segmentation by directly optimizing for topological correctness through graph-based path enforcement.

Abstract: Promoting the connectivity of curvilinear structures, such as neuronal processes in biomedical scans and blood vessels in CT images, remains a key challenge in semantic segmentation. Traditional pixel-wise loss functions, including cross-entropy and Dice losses, often fail to capture high-level topological connectivity, resulting in topological mistakes in graphs obtained from prediction maps. In this paper, we propose CAPE (Connectivity-Aware Path Enforcement), a novel loss function designed to enforce connectivity in graphs obtained from segmentation maps by optimizing a graph connectivity metric. CAPE uses the graph representation of the ground truth to select node pairs and determine their corresponding paths within the predicted segmentation through a shortest-path algorithm. Using this, we penalize both disconnections and false positive connections, effectively encouraging the model to preserve topological correctness. Experiments on 2D and 3D datasets, including neuron and blood vessel tracing, demonstrate that CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.

[207] Semantic Depth Matters: Explaining Errors of Deep Vision Networks through Perceived Class Similarities

Katarzyna Filus, Michał Romaszewski, Mateusz Żarski

Main category: cs.CV

TL;DR: A framework analyzing DNN errors through semantic hierarchy depth, using Similarity Depth metric and graph visualization to reveal relationships between network perception and misclassification patterns.

DetailsMotivation: Current DNN evaluation methods lack transparency in explaining misclassifications; need better understanding of how network perception relates to error patterns beyond just accuracy metrics.

Method: Introduces Similarity Depth (SD) metric to quantify semantic hierarchy depth perceived by networks; uses class templates from classifier weights; proposes graph-based visualization of semantic relationships and misperceptions.
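
A minimal sketch of the class-template idea, assuming templates are the rows of the final classifier weight matrix: perceived similarity is their cosine similarity, and a simple correlation checks whether confusions concentrate on classes the network perceives as similar. This is illustrative, not the paper's SD metric.

```python
import torch
import torch.nn.functional as F

def perceived_similarity(W: torch.Tensor) -> torch.Tensor:
    """W: (C, D) final-layer weights, i.e. one 'class template' per row."""
    t = F.normalize(W, dim=-1)
    return t @ t.t()                            # cosine similarity, (C, C)

def similarity_error_alignment(sim, confusion):
    """Correlate off-diagonal perceived similarity with confusion counts."""
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool)
    s, e = sim[mask], confusion.float()[mask]
    s = (s - s.mean()) / (s.std() + 1e-8)
    e = (e - e.mean()) / (e.std() + 1e-8)
    return (s * e).mean()                       # Pearson correlation

W = torch.randn(10, 128)                        # e.g. model.fc.weight
confusion = torch.randint(0, 5, (10, 10))       # off-diagonal = errors
print(similarity_error_alignment(perceived_similarity(W), confusion))
```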

Result: Deep vision networks encode specific semantic hierarchies; higher semantic depth improves alignment between perceived class similarities and actual error patterns.

Conclusion: The framework provides transparent error analysis using existing trained networks without additional data, revealing important relationships between network perception and misclassification behavior.

Abstract: Understanding deep neural network (DNN) behavior requires more than evaluating classification accuracy alone; analyzing errors and their predictability is equally crucial. Current evaluation methodologies lack transparency, particularly in explaining the underlying causes of network misclassifications. To address this, we introduce a novel framework that investigates the relationship between the semantic hierarchy depth perceived by a network and its real-data misclassification patterns. Central to our framework is the Similarity Depth (SD) metric, which quantifies the semantic hierarchy depth perceived by a network, together with a method for evaluating how closely the network's errors align with its internally perceived similarity structure. We also propose a graph-based visualization of model semantic relationships and misperceptions. A key advantage of our approach is that it leverages class templates (representations derived from classifier layer weights) and is therefore applicable to already trained networks without requiring additional data or experiments. Our approach reveals that deep vision networks encode specific semantic hierarchies and that high semantic depth improves the alignment between perceived class similarities and actual errors.

[208] Range Image-Based Implicit Neural Compression for LiDAR Point Clouds

Akihiro Kuwabara, Sorachi Kato, Toshiaki Koike-Akino, Takuya Fujihashi

Main category: cs.CV

TL;DR: Novel implicit neural representation-based method for efficient LiDAR point cloud compression using 2D range images, outperforming existing methods at low bitrates.

DetailsMotivation: Need for efficient compression of LiDAR point clouds to enable high-precision 3D scene archives for detailed scene understanding, while conventional image compression techniques are limited due to differences in bit precision and pixel value distributions between natural images and range images.

Method: Proposes implicit neural representation-based range image compression that divides range images into depth and mask images, then compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization to handle floating-point valued pixels.
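
A toy sketch of the pixel-wise INR variant, assuming a plain coordinate MLP fitted to one range image by MSE; the architecture, learning rate, and step count are illustrative, and the paper's patch-wise branch, pruning, and quantization steps are only indicated in comments.

```python
import torch
import torch.nn as nn

class PixelINR(nn.Module):
    """Tiny coordinate MLP: (row, col) in [-1, 1]^2 -> depth."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):                  # coords: (N, 2)
        return self.net(coords).squeeze(-1)

def fit_range_image(depth: torch.Tensor, steps=500):
    H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
    target = depth.reshape(-1)
    model = PixelINR()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        ((model(coords) - target) ** 2).mean().backward()
        opt.step()
    return model            # prune + quantize the weights before storage

model = fit_range_image(torch.rand(32, 64))     # toy 32x64 depth image
```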

Result: Experiments on KITTI dataset show the method outperforms existing image, point cloud, range image, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates with low decoding latency.

Conclusion: The proposed INR-based compression method effectively compresses LiDAR point clouds via range images, achieving superior performance for 3D scene understanding applications at low bitrates.

Abstract: This paper presents a novel scheme to efficiently compress Light Detection and Ranging (LiDAR) point clouds, enabling high-precision 3D scene archives that pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images (RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation (INR)-based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates, while maintaining low decoding latency.

[209] QVGen: Pushing the Limit of Quantized Video Generative Models

Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang

Main category: cs.CV

TL;DR: QVGen is a quantization-aware training framework that enables high-performance video diffusion models under extremely low-bit quantization (4-bit or below) while maintaining full-precision comparable quality.

DetailsMotivation: Video diffusion models have high computational and memory demands that hinder real-world deployment. While quantization works well for image diffusion models, direct application to video DMs remains ineffective, creating a need for specialized quantization solutions for video generation.

Method: QVGen introduces auxiliary modules (Φ) to mitigate large quantization errors and enhance convergence, plus a rank-decay strategy using SVD and rank-based regularization to progressively eliminate these modules without inference overhead. The framework is theoretically grounded in reducing gradient norm for better QAT convergence.
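
A minimal sketch of the rank-decay step, assuming the auxiliary module is a linear layer whose weight is periodically truncated to its top singular components until it vanishes; the decay schedule is illustrative, and the paper's rank-based regularization term is not shown.

```python
import torch

@torch.no_grad()
def decay_rank(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the top keep_ratio fraction of singular components."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = int(S.numel() * keep_ratio)
    S[k:] = 0.0                                  # drop low-contributing components
    return U @ torch.diag(S) @ Vh

aux = torch.nn.Linear(256, 256, bias=False)      # stand-in auxiliary module
with torch.no_grad():
    for keep in (0.75, 0.5, 0.25, 0.0):          # progressive decay schedule
        # ... QAT training steps would run here between decay events ...
        aux.weight.copy_(decay_rank(aux.weight, keep))
print(torch.linalg.matrix_rank(aux.weight))      # tensor(0): module has vanished
```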

Result: QVGen achieves full-precision comparable quality under 4-bit settings across 4 SOTA video DMs (1.3B-14B parameters). It significantly outperforms existing methods, with 3-bit CogVideoX-2B showing +25.28 improvement in Dynamic Degree and +8.43 in Scene Consistency on VBench.

Conclusion: QVGen successfully enables efficient video diffusion model deployment through effective low-bit quantization while maintaining quality, making it the first framework to achieve full-precision comparable performance under 4-bit quantization for video DMs.

Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has shown notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\boldsymbol{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench. Code and models are available at https://github.com/ModelTC/QVGen.

[210] MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, Shuhang Gu

Main category: cs.CV

TL;DR: MVAR introduces scale and spatial Markov assumptions to reduce redundancy in visual autoregressive models, cutting attention complexity from O(N²) to O(Nk) and enabling training on just 8 RTX 4090 GPUs.

DetailsMotivation: Current next-scale prediction methods have scale and spatial redundancy - they condition each scale on all previous scales and require each token to consider all preceding tokens, leading to high computational complexity and memory requirements.

Method: Proposes Markovian Visual AutoRegressive modeling with two key innovations: 1) scale-Markov trajectory that only uses adjacent preceding scale features for next-scale prediction, enabling parallel training; 2) spatial-Markov attention that restricts attention to localized neighborhoods of size k at corresponding positions on adjacent scales.
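
A toy sketch of spatial-Markov attention, built here with a dense N x N mask for clarity; a production kernel would gather only the k^2 neighbors per query to realize the O(Nk) cost. The shapes and single-scale setup are illustrative assumptions.

```python
import torch

def spatial_markov_attention(q, k, v, H, W, win=3):
    """q, k, v: (H*W, d); each query attends only to keys within a
    win x win neighborhood at the corresponding position."""
    r = win // 2
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)       # (N, 2)
    dist = (pos[:, None, :] - pos[None, :, :]).abs().amax(-1)     # Chebyshev
    mask = dist <= r                                              # (N, N) local
    attn = (q @ k.t()) / q.shape[-1] ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return attn @ v

H = W = 8
q, k, v = (torch.randn(H * W, 32) for _ in range(3))
out = spatial_markov_attention(q, k, v, H, W)                     # (64, 32)
```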

Result: Reduces attention complexity from O(N²) to O(Nk), enables training with only 8 RTX 4090 GPUs, eliminates KV cache during inference, reduces average GPU memory footprint by 3.0x while achieving comparable or superior performance on ImageNet.

Conclusion: MVAR effectively mitigates redundancy in visual autoregressive modeling through Markov assumptions, significantly improving training efficiency while maintaining or improving generation quality.

Abstract: Essential to visual generation is the efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating this redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, in pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small models trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.

[211] From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

Cheng Cheng, Lin Song, Di An, Yicheng Xiao, Xuchong Zhang, Hongbin Sun, Ying Shan

Main category: cs.CV

TL;DR: TensorAR improves autoregressive image generation by shifting from next-token to next-tensor prediction, enabling iterative refinement through overlapping windows and discrete tensor noising.

DetailsMotivation: Autoregressive image generators lack refinement mechanisms compared to diffusion models, limiting their generation quality despite being language-model-friendly.

Method: Reformulates AR image generation from next-token to next-tensor prediction using overlapping windows of image patches. Introduces discrete tensor noising via codebook-indexed noise to prevent information leakage during training. Implemented as plug-and-play module for existing AR models.
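
A minimal sketch of the discrete-noising side, assuming tokens are integer codebook indices and a fixed fraction is replaced with random indices so the ground truth cannot leak through the input window; the noising rate is an illustrative assumption.

```python
import torch

def discrete_tensor_noise(tokens: torch.Tensor, codebook_size: int, rate=0.3):
    """tokens: (B, L) integer codebook indices; replace a fraction with
    random indices so ground truth cannot leak through the input window."""
    noise = torch.randint_like(tokens, codebook_size)
    keep = torch.rand(tokens.shape, device=tokens.device) >= rate
    return torch.where(keep, tokens, noise)

tokens = torch.randint(0, 1024, (2, 256))       # toy VQ token windows
noisy = discrete_tensor_noise(tokens, codebook_size=1024)
```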

Result: Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR show TensorAR significantly improves generation performance of autoregressive models.

Conclusion: TensorAR introduces a novel AR paradigm that enables iterative refinement in autoregressive image generation, bridging the quality gap with diffusion models while maintaining AR advantages.

Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.

[212] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Long Ma, Renwang Pei, Zhaofeng He

Main category: cs.CV

TL;DR: Researchers create DigiFakeAV, a large-scale multimodal digital human forgery dataset using diffusion models, and propose DigiShield detection method that achieves SOTA performance.

DetailsMotivation: The rapid advancement of diffusion-based digital human generation technology poses critical threats to public security, as these models can create highly realistic videos with consistency using multimodal control signals. Their flexibility and covertness severely challenge existing detection strategies.

Method: 1) Created the DigiFakeAV dataset using five of the latest digital human generation methods and voice cloning, comprising 60,000 videos (8.4M frames) covering multiple nationalities, skin tones, genders, and real-world scenarios. 2) Proposed the DigiShield detection method based on spatiotemporal and cross-modal fusion, jointly modeling 3D spatiotemporal video features and semantic-acoustic audio features.

Result: User studies show 68% misrecognition rate for DigiFakeAV videos. Existing detection models show substantial performance degradation on the dataset. DigiShield achieves state-of-the-art performance on DigiFakeAV and demonstrates strong generalization on other datasets.

Conclusion: The DigiFakeAV dataset highlights the challenges of detecting diffusion-based digital human forgeries, and DigiShield provides an effective baseline solution through multimodal fusion of visual and audio features for improved detection performance.

Abstract: In recent years, the explosive advancement of deepfake technology has posed a critical and escalating threat to public security: diffusion-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency via multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, a new large-scale multimodal digital human forgery dataset based on diffusion models. Leveraging five of the latest digital human generation methods and a voice cloning method, we systematically construct a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies demonstrate that the misrecognition rate by participants for DigiFakeAV reaches as high as 68%. Moreover, the substantial performance degradation of existing detection models on our dataset further highlights its challenges. To address this problem, we propose DigiShield, an effective detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves state-of-the-art (SOTA) performance on DigiFakeAV and shows strong generalization to other datasets.

[213] Diagnosing Vision Language Models’ Perception by Leveraging Human Methods for Color Vision Deficiencies

Kazuki Hayashi, Shintaro Ozaki, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Main category: cs.CV

TL;DR: LVLMs fail to account for color perception variations, particularly Color Vision Deficiencies, despite having factual knowledge about them, raising accessibility concerns.

DetailsMotivation: Real-world LVLM applications require accommodation of perceptual variation, especially color perception differences due to Color Vision Deficiencies, which is currently ignored in multimodal AI.

Method: Evaluated LVLMs using Ishihara Test plates as controlled stimuli, examining model behavior through generation, confidence, and internal representation analysis.

Result: Models possess factual knowledge about color vision deficiencies and can describe the Ishihara test, but fail to reproduce perceptual outcomes experienced by affected individuals, defaulting instead to normative color perception.

Conclusion: Current LVLMs lack mechanisms for representing alternative perceptual experiences, raising significant concerns for accessibility and inclusive deployment in multimodal settings.

Abstract: Large-scale Vision-Language Models (LVLMs) are being deployed in real-world settings that require visual inference. As capabilities improve, applications in navigation, education, and accessibility are becoming practical. These settings require accommodation of perceptual variation rather than assuming a uniform visual experience. Color perception illustrates this requirement: it is central to visual understanding yet varies across individuals due to Color Vision Deficiencies, an aspect largely ignored in multimodal AI. In this work, we examine whether LVLMs can account for variation in color perception using the Ishihara Test. We evaluate model behavior through generation, confidence, and internal representation, using Ishihara plates as controlled stimuli that expose perceptual differences. Although models possess factual knowledge about color vision deficiencies and can describe the test, they fail to reproduce the perceptual outcomes experienced by affected individuals and instead default to normative color perception. These results indicate that current systems lack mechanisms for representing alternative perceptual experiences, raising concerns for accessibility and inclusive deployment in multimodal settings.

[214] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang

Main category: cs.CV

TL;DR: SeedVR2 is a one-step diffusion-based video restoration model that uses adversarial training against real data to achieve high-quality results with much lower computational cost than multi-step diffusion models.

DetailsMotivation: Current diffusion-based video restoration methods produce high visual quality but are computationally expensive during inference. While distillation approaches enable one-step image restoration, extending these to video restoration (especially for high-resolution real-world videos) remains challenging and underexplored.

Method: Proposes SeedVR2 with several enhancements: 1) Adaptive window attention mechanism that dynamically adjusts window size based on output resolution to avoid window inconsistency at high resolutions; 2) Adversarial post-training with real data; 3) Series of losses including a proposed feature matching loss to stabilize training without sacrificing efficiency.
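
A sketch of the window-partitioning side of adaptive window attention, assuming the window size is derived from the output resolution (here via a target window count) rather than fixed; the cropping shortcut and all constants are illustrative.

```python
import torch

def adaptive_window_partition(x, base=8, target_windows=8):
    """x: (H, W, C). Derive the window size from the resolution, then
    split into (num_windows, win*win, C) token groups for attention."""
    H, W, C = x.shape
    win = min(max(base, H // target_windows), H, W)   # scales with resolution
    Hp, Wp = H - H % win, W - W % win                 # crop remainder (for brevity)
    x = x[:Hp, :Wp].reshape(Hp // win, win, Wp // win, win, C)
    return x.permute(0, 2, 1, 3, 4).reshape(-1, win * win, C), win

windows, win = adaptive_window_partition(torch.randn(270, 480, 64))
# attention then runs within each window of win*win tokens
```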

Result: Extensive experiments show SeedVR2 achieves comparable or even better performance than existing video restoration approaches while requiring only a single inference step, significantly reducing computational cost.

Conclusion: SeedVR2 demonstrates that one-step diffusion-based video restoration is feasible and effective, offering a practical solution for high-resolution video restoration in real-world settings with dramatically reduced inference time.

Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolution, avoiding the window inconsistency observed in high-resolution VR when window attention uses a predefined window size. To stabilize and improve adversarial post-training for VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance than existing VR approaches in a single step.

[215] DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen, An-Zi Yen

Main category: cs.CV

TL;DR: DaMO is a data-efficient Video LLM with enhanced temporal reasoning capabilities using hierarchical dual-stream architecture and progressive training.

DetailsMotivation: Existing Video LLMs have limitations in fine-grained temporal reasoning and precise attribution of responses to specific video moments, especially with constrained supervision.

Method: DaMO uses Temporal-aware Fuseformer with hierarchical dual-stream architecture to capture temporal dynamics, global residual for efficiency, and four-stage progressive training for multimodal alignment, semantic grounding, and temporal reasoning. Also creates temporally grounded QA datasets.

Result: DaMO consistently surpasses prior methods on temporal grounding and video QA benchmarks, particularly in tasks requiring precise temporal alignment and reasoning.

Conclusion: DaMO establishes a promising direction for data-efficient video-language modeling with enhanced temporal reasoning capabilities.

Abstract: Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.

[216] BlindSight: Harnessing Sparsity for Efficient Vision-Language Models

Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hasssan, Stephen Youn, Steven K. Reinhardt

Main category: cs.CV

TL;DR: BlindSight optimizes multi-image VLM inference by exploiting attention sparsity patterns across images, achieving 1.8-3.2x speedup in attention computation with minimal accuracy loss.

DetailsMotivation: Processing vision data in VLMs increases prompt length and TTFT. Attention sparsity in multi-image processing presents an opportunity for optimization without runtime overhead.

Method: Analyze attention patterns in VLMs to identify sparse inter-image attention. Categorize attention heads into Dense, Sink, Intra-Image, and Intra-Image+Sink types. Develop Triton-based GPU kernel using input-template-aware sparsity masks.
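
A minimal sketch of constructing an "Intra-Image + Sink" sparsity mask from a known prompt layout, assuming the image token spans are given by the input template; the span positions and sink-token count are illustrative.

```python
import torch

def intra_image_sink_mask(image_spans, seq_len, num_sink=4):
    """image_spans: (start, end) token ranges per image, known from the
    input template. Returns a boolean mask; True = attention allowed."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for s, e in image_spans:
        mask[s:e, s:e] = True                   # attend within the same image only
    mask[:, :num_sink] = True                   # every token sees the sink tokens
    mask &= torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    return mask

mask = intra_image_sink_mask([(10, 200), (200, 390)], seq_len=512)
print(mask[250, 50].item())                     # False: no inter-image attention
```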

Result: 1.8-3.2x speedup in attention computation (prompt length 36K-300K). Generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3) with only 0.78% absolute accuracy degradation on multi-image benchmarks.

Conclusion: BlindSight effectively optimizes multi-image VLM inference by exploiting attention sparsity. Advocates for designing efficient VLMs combining sparse and dense layers inspired by this approach.

Abstract: Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.

[217] DA-Occ: Direction-Aware 2D Convolution for Efficient and Geometry-Preserving 3D Occupancy Prediction

Yuchen Zhou, Yan Luo, Xiaogang Wang, Xingjian Gu, Mingzhou Lu

Main category: cs.CV

TL;DR: A pure 2D framework for efficient 3D occupancy prediction in autonomous driving that balances accuracy and speed using height-score projection and direction-aware convolution.

DetailsMotivation: Existing 3D occupancy prediction methods face trade-offs between accuracy and efficiency - high-precision methods are slow, while fast BEV-based approaches sacrifice vertical geometric information, compromising geometric integrity for autonomous driving systems.

Method: Proposes a pure 2D framework that introduces height-score projection alongside depth scores to encode vertical geometric structure, and employs direction-aware convolution to extract geometric features along both vertical and horizontal orientations.

Result: Achieves 39.3% mIoU on Occ3D-nuScenes with 27.7 FPS inference speed, and reaches 14.8 FPS on edge devices, demonstrating effective balance between accuracy and efficiency for real-time deployment.

Conclusion: The proposed method successfully overcomes accuracy-efficiency trade-offs in 3D occupancy prediction, preserving geometric integrity while maintaining real-time performance suitable for resource-constrained autonomous driving environments.

Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many existing methods involve trade-offs between accuracy and efficiency. Some achieve high precision but with slow inference speed, while others adopt purely bird’s-eye-view (BEV)-based 2D representations to accelerate processing, inevitably sacrificing vertical cues and compromising geometric integrity. To overcome these limitations, we propose a pure 2D framework that achieves efficient 3D occupancy prediction while preserving geometric integrity. Unlike conventional Lift-Splat-Shoot (LSS) methods that rely solely on depth scores to lift 2D features into 3D space, our approach additionally introduces a height-score projection to encode vertical geometric structure. We further employ direction-aware convolution to extract geometric features along both vertical and horizontal orientations, effectively balancing accuracy and computational efficiency. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

[218] X-SAM: From Segment Anything to Any Segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang

Main category: cs.CV

TL;DR: X-SAM is a multimodal LLM framework that extends segmentation from “segment anything” to “any segmentation” by enabling advanced pixel-level perception and introducing Visual GrounDed segmentation with interactive visual prompts.

DetailsMotivation: LLMs lack pixel-level perceptual understanding, and SAM has limitations in multi-mask prediction, category-specific segmentation, and unifying all segmentation tasks. There's a need for a more comprehensive segmentation framework.

Method: Proposes X-SAM: a streamlined MLLM framework with unified architecture for any segmentation. Introduces Visual GrounDed (VGD) segmentation with interactive visual prompts, and a unified training strategy for co-training across multiple datasets.

Result: X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, demonstrating efficiency for multimodal, pixel-level visual understanding.

Conclusion: X-SAM successfully extends segmentation capabilities beyond SAM, providing a unified framework for any segmentation task with advanced pixel-level perceptual comprehension for MLLMs.

Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

[219] Splat Feature Solver

Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng

Main category: cs.CV

TL;DR: A unified feature lifting method for 3D scene understanding that attaches image features (DINO, CLIP) to splat-based 3D representations via sparse linear inverse problem formulation with provable error bounds and regularization strategies.

DetailsMotivation: Feature lifting is crucial for 3D scene understanding but faces challenges in optimally assigning rich general attributes to 3D primitives while addressing inconsistency issues from multi-view images.

Method: Formulates feature lifting as a sparse linear inverse problem solved in closed form. Introduces two regularization strategies: Tikhonov Guidance for numerical stability and Post-Lifting Aggregation for filtering noisy inputs via feature clustering.
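
A minimal sketch of the closed-form solve, assuming splat-to-pixel blending weights W and per-pixel features Y are given: per-splat features come from Tikhonov-regularized normal equations. Dense matrices are used here for clarity, whereas the paper exploits sparsity.

```python
import torch

def solve_splat_features(W, Y, lam=1e-3):
    """W: (num_pixels, num_splats) blending weights; Y: (num_pixels, feat_dim)
    per-pixel image features. Returns per-splat features F minimizing
    ||W F - Y||^2 + lam * ||F||^2 in closed form."""
    A = W.t() @ W + lam * torch.eye(W.shape[1])   # Tikhonov-stabilized normal eq.
    return torch.linalg.solve(A, W.t() @ Y)

W = torch.rand(1000, 50)        # e.g. alpha-compositing weights from rendering
Y = torch.randn(1000, 64)       # e.g. DINO features sampled per pixel
F = solve_splat_features(W, Y)  # (50, 64) features attached to splats
```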

Result: Achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes.

Conclusion: Presents a unified, kernel- and feature-agnostic approach with provable error bounds that efficiently solves feature lifting for 3D scene understanding with strong performance and practical efficiency.

Abstract: Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses, delivering high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes. Our code is available at https://github.com/saliteta/splat-distiller/tree/main, with additional visualizations on the project website (https://splat-distiller.pages.dev/) and a video (https://www.youtube.com/watch?v=CH-G5hbvArM).

[220] MSPCaps: A Multi-Scale Patchify Capsule Network with Cross-Agreement Routing for Visual Recognition

Yudong Hu, Yueju Han, Rui Sun, Jinke Ren

Main category: cs.CV

TL;DR: MSPCaps is a novel Capsule Network architecture that integrates multi-scale feature learning with efficient capsule routing to overcome limitations of single-scale features and conventional fusion methods.

DetailsMotivation: Existing CapsNet architectures rely on single high-level feature maps, missing complementary multi-scale information. Conventional feature fusion methods struggle with multi-scale feature discrepancies, leading to suboptimal classification performance.

Method: Three key components: 1) Multi-Scale ResNet Backbone (MSRB) extracts diverse multi-scale features; 2) Patchify Capsule Layer (PatchifyCaps) partitions features into primary capsules with uniform patch size; 3) Cross-Agreement Routing (CAR) blocks adaptively route multi-scale capsules by identifying cross-scale prediction pairs with maximum agreement.

Result: MSPCaps achieves remarkable scalability and superior robustness, consistently surpassing baseline methods in classification accuracy. It offers configurations from Tiny model (344.3K parameters) to Large model (10.9M parameters).

Conclusion: The proposed MSPCaps architecture effectively addresses multi-scale feature integration challenges in Capsule Networks, demonstrating strong potential for advancing feature representation learning with scalable and robust performance.

Abstract: Capsule Network (CapsNet) has demonstrated significant potential in visual recognition by capturing spatial relationships and part-whole hierarchies for learning equivariant feature representations. However, existing CapsNet and variants often rely on a single high-level feature map, overlooking the rich complementary information from multi-scale features. Furthermore, conventional feature fusion strategies (e.g., addition and concatenation) struggle to reconcile multi-scale feature discrepancies, leading to suboptimal classification performance. To address these limitations, we propose the Multi-Scale Patchify Capsule Network (MSPCaps), a novel architecture that integrates multi-scale feature learning and efficient capsule routing. Specifically, MSPCaps consists of three key components: a Multi-Scale ResNet Backbone (MSRB), a Patchify Capsule Layer (PatchifyCaps), and Cross-Agreement Routing (CAR) blocks. First, the MSRB extracts diverse multi-scale feature representations from input images, preserving both fine-grained details and global contextual information. Second, the PatchifyCaps partitions these multi-scale features into primary capsules using a uniform patch size, equipping the model with the ability to learn from diverse receptive fields. Finally, the CAR block adaptively routes the multi-scale capsules by identifying cross-scale prediction pairs with maximum agreement. Unlike the simple concatenation of multiple self-routing blocks, CAR ensures that only the most coherent capsules contribute to the final voting. Our proposed MSPCaps achieves remarkable scalability and superior robustness, consistently surpassing multiple baseline methods in terms of classification accuracy, with configurations ranging from a highly efficient Tiny model (344.3K parameters) to a powerful Large model (10.9M parameters), highlighting its potential in advancing feature representation learning.

[221] The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment

Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, Daniel Neimark, Omri Bar, Amine Yamlahi, Jakob Hennighausen, Xiaohan Wang, Rui Li, Long Liang, Yuxian Wang, Saurabh Koju, Binod Bhattarai, Tim Jaspers, Zhehua Mao, Anjana Wijekoon, Jun Ma, Yinan Xu, Zhilong Weng, Ammar M. Okran, Hatem A. Rashwan, Boyang Shen, Kaixiang Yang, Yutao Zhang, Hao Wang, 2024 CVS Challenge Consortium, Quanzheng Li, Filippo Filicori, Xiang Li, Pietro Mascagni, Daniel A. Hashimoto, Guy Rosman, Ozanan Meireles, Nicolas Padoy

Main category: cs.CV

TL;DR: The SAGES CVS Challenge was the first surgical society-organized AI competition using laparoscopic cholecystectomy videos to develop AI models for surgical quality assessment, achieving significant improvements in performance, calibration, and robustness.

DetailsMotivation: To democratize access to surgical expertise through AI for quality assessment in training, guidance, and accreditation, addressing the inconsistent performance of the Critical View of Safety step in laparoscopic cholecystectomy despite universal recommendation.

Method: Organized first surgical society AI competition (SAGES CVS Challenge) with global collaboration across 54 institutions in 24 countries. Curated 1,000 surgical videos annotated by 20 experts using consensus-validated protocol. Developed EndoGlacier framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated in developing AI models.

Result: Achieved up to 17% relative gain in assessment performance, over 80% reduction in calibration error, and 17% relative improvement in robustness over state-of-the-art. Analysis revealed methodological trends linked to model performance, providing guidance for future research.

Conclusion: The challenge successfully addressed key barriers to real-world AI deployment in surgery and demonstrated the potential for robust, clinically deployable AI for surgical quality assessment through global collaborative efforts.

Abstract: Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17% relative gain in assessment performance, over 80% reduction in calibration error, and a 17% relative improvement in robustness over the state-of-the-art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.

[222] Visual Instruction Pretraining for Domain-Specific Foundation Models

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: ViTP introduces a novel pretraining paradigm that leverages reasoning to enhance perception in foundation models, using visual instruction data from downstream domains to achieve state-of-the-art performance on remote sensing and medical imaging benchmarks.

DetailsMotivation: The paper addresses the incomplete closed loop in computer vision where high-level reasoning doesn't sufficiently influence low-level perceptual feature learning in foundation models. Current approaches lack top-down reasoning influence on foundational perception learning.

Method: Proposes Visual insTruction Pretraining (ViTP) which embeds a Vision Transformer backbone within a Vision-Language Model and pretrains it end-to-end using curated visual instruction data from target domains. Uses Visual Robustness Learning (VRL) to force the ViT to learn robust, domain-relevant features from sparse visual tokens.

Result: Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across diverse downstream tasks.

Conclusion: ViTP successfully closes the perception-reasoning-generation loop by enabling reasoning to enhance perception, creating a new paradigm for pretraining foundation models that achieves superior performance on domain-specific vision tasks.

Abstract: Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is still largely underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.

[223] SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu, Sunjae Jeong, Yoonyoung Choi, Jaeseung Kim, Hyoseok Hwang

Main category: cs.CV

TL;DR: SiNGER is a new distillation framework that suppresses high-norm artifacts in Vision Transformers while preserving informative signals, improving student model performance.

DetailsMotivation: Vision Transformers produce high-norm artifacts that degrade representation quality. When these features are transferred via knowledge distillation, students overfit to artifacts and underweight informative signals, diminishing gains from larger teacher models.

Method: SiNGER uses Singular Nullspace-Guided Energy Reallocation to refine teacher features. It leverages nullspace-guided perturbation to preserve information while suppressing artifacts, implemented efficiently with a LoRA-based adapter requiring minimal structural modification.
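
A rough sketch of the nullspace-guided refinement idea, assuming the informative signal lies in the top singular subspace of the token-feature matrix: high-norm outlier tokens keep their component inside that subspace while the remainder is damped. The rank, threshold, and damping factor are illustrative assumptions, not the paper's procedure.

```python
import torch

def refine_teacher_features(T: torch.Tensor, rank=32, thresh=2.0, damp=0.1):
    """T: (num_tokens, dim) teacher features. Outlier tokens keep their
    component in the top singular subspace; the rest is damped."""
    _, _, Vh = torch.linalg.svd(T, full_matrices=False)
    P = Vh[:rank].t() @ Vh[:rank]                   # projector onto that subspace
    outlier = T.norm(dim=-1) > thresh * T.norm(dim=-1).median()
    T_ref = T.clone()
    kept = T[outlier] @ P                           # informative component
    T_ref[outlier] = kept + damp * (T[outlier] - kept)
    return T_ref

T = torch.randn(197, 768)                           # ViT-like token features
T[0] *= 20                                          # inject a high-norm artifact
print(refine_teacher_features(T).norm(dim=-1)[0])   # artifact norm reduced
```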

Result: Extensive experiments show SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer, more interpretable representations.

Conclusion: SiNGER effectively addresses the artifact-information trade-off in knowledge distillation, enabling better transfer of teacher knowledge to student models without the limitations of prior approaches.

Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.

[224] Dynamic Novel View Synthesis in High Dynamic Range

Kaixuan Zhang, Zhipeng Xiong, Minxian Li, Mingwu Ren, Jiankang Deng, Xiatian Zhu

Main category: cs.CV

TL;DR: HDR-4DGS: A Gaussian Splatting-based method for HDR Dynamic Novel View Synthesis that handles moving objects and temporal radiance variations while translating between LDR and HDR domains.

DetailsMotivation: Current HDR NVS methods focus only on static scenes, but real-world scenarios contain dynamic elements like moving objects and varying lighting. There's a need to jointly model temporal radiance variations with 3D translation between LDR and HDR domains.

Method: Proposes HDR-4DGS with a dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by adapting tone-mapping functions according to evolving radiance distributions across time.
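
A minimal sketch of a time-conditioned tone-mapping module, assuming a small MLP maps log-domain HDR radiance plus a normalized timestamp to LDR color; the architecture and inputs are illustrative, not the paper's module.

```python
import torch
import torch.nn as nn

class DynamicToneMapper(nn.Module):
    """Maps HDR radiance + capture time to LDR color, so the tone curve
    can follow radiance distributions that evolve over time."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),     # LDR in [0, 1]
        )

    def forward(self, hdr_rgb, t):
        """hdr_rgb: (N, 3) linear radiance; t: (N, 1) normalized time."""
        x = torch.cat([torch.log1p(hdr_rgb), t], dim=-1)   # log-domain input
        return self.net(x)

ldr = DynamicToneMapper()(torch.rand(1024, 3) * 10.0, torch.rand(1024, 1))
```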

Result: Achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Outperforms existing state-of-the-art methods in quantitative performance and visual fidelity.

Conclusion: HDR-4DGS successfully addresses the challenging HDR Dynamic Novel View Synthesis problem by jointly modeling temporal dynamics and HDR-LDR translation, representing a significant advancement over static scene methods.

Abstract: High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension "Dynamic" emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featuring an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code is available at https://github.com/prinasi/HDR-4DGS.

[225] Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

Main category: cs.CV

TL;DR: Neural-MedBench is a compact neurology benchmark that reveals VLMs’ clinical reasoning deficiencies despite good performance on standard medical datasets.

DetailsMotivation: Current vision-language models show strong performance on standard medical benchmarks but their true clinical reasoning ability remains unclear. Existing datasets focus too much on classification accuracy, creating an evaluation illusion where models appear proficient but still fail at high-stakes diagnostic reasoning.

Method: Created Neural-MedBench, a reasoning-intensive neurology benchmark integrating multi-sequence MRI scans, EHRs, and clinical notes. Includes three task families: differential diagnosis, lesion recognition, and rationale generation. Developed hybrid scoring pipeline combining LLM-based graders, clinician validation, and semantic similarity metrics.

Result: Evaluation of state-of-the-art VLMs (GPT-4o, Claude-4, MedGemma) showed sharp performance drop compared to conventional datasets. Error analysis revealed reasoning failures dominate model shortcomings rather than perceptual errors.

Conclusion: Proposes Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented compact benchmarks like Neural-MedBench for reasoning fidelity. Released benchmark as open diagnostic testbed for rigorous, cost-effective assessment of clinically trustworthy AI.

Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

[226] GraphTARIF: Linear Graph Transformer with Augmented Rank and Improved Focus

Zhaolin Hu, Kun Li, Hehe Fan, Yi Yang

Main category: cs.CV

TL;DR: Proposes a hybrid framework combining linear attention with gated local graph networks and learnable log-power functions to enhance expressiveness while maintaining linear complexity in Graph Transformers.

DetailsMotivation: Existing linear attention mechanisms in Graph Transformers suffer from reduced expressiveness due to low-rank projections and uniform attention distributions, which theoretically reduce class separability of node representations and limit classification ability.

Method: 1) Enhance linear attention by attaching a gated local graph network branch to the value matrix to increase attention map rank. 2) Introduce a learnable log-power function into attention scores to reduce entropy and sharpen focus, alleviating excessive smoothing effects.

Result: Extensive experiments on both homophilic and heterophilic graph benchmarks show the method achieves competitive performance while preserving the scalability of linear attention.

Conclusion: The proposed hybrid framework successfully addresses expressiveness limitations of linear attention in Graph Transformers by enhancing both rank and focus of attention, maintaining linear complexity while improving classification performance across diverse graph types.

Abstract: Linear attention mechanisms have emerged as efficient alternatives to full self-attention in Graph Transformers, offering linear time complexity. However, existing linear attention models often suffer from a significant drop in expressiveness due to low-rank projection structures and overly uniform attention distributions. We theoretically prove that these properties reduce the class separability of node representations, limiting the model’s classification ability. To address this, we propose a novel hybrid framework that enhances both the rank and focus of attention. Specifically, we enhance linear attention by attaching a gated local graph network branch to the value matrix, thereby increasing the rank of the resulting attention map. Furthermore, to alleviate the excessive smoothing effect inherent in linear attention, we introduce a learnable log-power function into the attention scores to reduce entropy and sharpen focus. We theoretically show that this function decreases entropy in the attention distribution, enhancing the separability of learned embeddings. Extensive experiments on both homophilic and heterophilic graph benchmarks demonstrate that our method achieves competitive performance while preserving the scalability of linear attention.
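
A toy sketch of the two ingredients described above, under assumed forms: a gated local branch added on the value pathway (raising the rank of the implicit attention map) and a learnable log-power transform that sharpens the kernel feature maps. The exact feature map, gating, and transform used by GraphTARIF are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankAugmentedLinearAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)            # gates the local graph branch
        self.power = nn.Parameter(torch.ones(1))   # learnable exponent p

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, d) node features; adj: (N, N) row-normalized adjacency
        q = F.elu(self.q(x)) + 1.0                 # positive kernel feature map
        k = F.elu(self.k(x)) + 1.0
        p = self.power.clamp(min=0.1)
        q, k = torch.log1p(q) ** p, torch.log1p(k) ** p  # log-power sharpening
        v = self.v(x)
        num = q @ (k.t() @ v)                      # linear attention, O(N d^2)
        den = (q @ k.sum(0, keepdim=True).t()).clamp(min=1e-6)
        local = torch.sigmoid(self.gate(x)) * (adj @ v)  # gated local branch on V
        return num / den + local

attn = RankAugmentedLinearAttention(dim=16)
x, adj = torch.randn(8, 16), torch.eye(8)
print(attn(x, adj).shape)  # torch.Size([8, 16])
```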

[227] Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie

Main category: cs.CV

TL;DR: IVT-LR introduces multimodal latent reasoning that combines visual and textual information in latent space to reduce annotation needs and improve inference speed while maintaining accuracy.

DetailsMotivation: Current multimodal reasoning methods require explicit reasoning steps with labor-intensive vision-text annotations and suffer from significant inference latency, creating efficiency bottlenecks.

Method: Proposes Interleaved Vision-Text Latent Reasoning (IVT-LR) that represents reasoning steps through latent text (hidden states from previous step) and latent vision (selected image embeddings), trained with progressive multi-stage training.

Result: Achieves 5.45% average accuracy improvement on M³CoT and ScienceQA benchmarks while achieving over 5x speed increase compared to existing approaches.

Conclusion: IVT-LR successfully addresses annotation and latency issues in multimodal reasoning through latent space reasoning, offering both performance gains and efficiency improvements.

Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M$^3$CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches.
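
A toy illustration of the interleaved latent-reasoning loop: each step feeds the previous hidden state ("latent text") plus a few selected image embeddings ("latent vision") back into the model, with no tokens decoded in between. The GRU stand-in and the top-k patch-selection rule are assumptions made for a self-contained example.

```python
import torch
import torch.nn as nn

dim, n_patches, n_steps, k = 32, 49, 3, 4
backbone = nn.GRUCell(input_size=dim, hidden_size=dim)   # stand-in for an MLLM
img_emb = torch.randn(n_patches, dim)                    # image patch embeddings
h = torch.zeros(dim)                                     # latent text state

for step in range(n_steps):
    # Latent vision: pick the k patches most aligned with the current state.
    scores = img_emb @ h
    latent_vision = img_emb[scores.topk(k).indices].mean(0)
    # One latent reasoning step: no text is decoded here.
    h = backbone(latent_vision.unsqueeze(0), h.unsqueeze(0)).squeeze(0)

print(h.shape)  # final latent used to decode the answer, torch.Size([32])
```

Skipping token decoding at intermediate steps is where the reported inference speedup comes from: only the final latent state is decoded into an answer.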

[228] WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation

Runting Li, Shijie Lian, Hua Li, Yutong Li, Wenhui Wu, Sam Kwong

Main category: cs.CV

TL;DR: WaterFlow is a rectified flow-based framework for underwater salient object detection that incorporates underwater physical imaging priors and temporal modeling, achieving state-of-the-art performance on USOD10K dataset.

DetailsMotivation: Existing USOD methods ignore underwater imaging physics or treat degradation as noise to eliminate, failing to exploit valuable information in underwater images. There's a need to incorporate physical principles and leverage degradation phenomena constructively.

Method: Proposes WaterFlow, a rectified flow-based framework that incorporates underwater physical imaging information as explicit priors directly into network training and introduces temporal dimension modeling for enhanced salient object identification.

Result: Achieves 0.072 gain in S_m metric on USOD10K dataset, demonstrating effectiveness and superiority over existing methods.

Conclusion: WaterFlow successfully integrates underwater physical imaging priors and temporal modeling to significantly improve underwater salient object detection performance, showing that degradation phenomena contain valuable information that can be constructively utilized rather than eliminated.

Abstract: Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model’s capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. https://github.com/Theo-polis/WaterFlow.

[229] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting

Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud

Main category: cs.CV

TL;DR: PLANA3R is a pose-free framework for metric planar 3D reconstruction from unposed two-view images using planar primitives, trained without explicit 3D plane supervision.

DetailsMotivation: To address metric 3D reconstruction of indoor scenes by exploiting geometric regularities with compact planar representations, without requiring camera poses or 3D plane annotations during training.

Method: Uses Vision Transformers to extract sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting with gradient propagation through rendered depth and normal maps.

Result: Validated on multiple indoor-scene datasets with strong generalization to out-of-domain environments across 3D surface reconstruction, depth estimation, relative pose estimation, and accurate plane segmentation.

Conclusion: PLANA3R enables scalable training on large-scale stereo datasets using only depth and normal annotations, achieving metric 3D reconstruction without explicit plane supervision or camera poses.

Abstract: This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r

[230] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang

Main category: cs.CV

TL;DR: MixKV improves KV cache compression for large vision-language models by mixing importance with diversity, adapting to head-wise semantic redundancy patterns for better performance under extreme compression.

DetailsMotivation: Existing KV cache compression methods focus only on importance retention but overlook modality-specific semantic redundancy patterns in multi-modal KV caches, leading to potential loss of semantic coverage and limiting deployment scalability.

Method: MixKV mixes importance with diversity for optimized KV cache compression, adapting to head-wise semantic redundancy and selectively balancing diversity and importance when compressing KV pairs.

Result: Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal benchmarks, with 8.0% and 9.0% gains for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency.

Conclusion: MixKV effectively addresses KV cache memory bottlenecks in LVLMs by considering both importance and diversity, achieving superior compression performance that extends seamlessly to LLMs with comparable gains.

Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
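
A minimal sketch of mixing importance with diversity when choosing which KV pairs to keep, assuming a greedy rule: pick entries that are both heavily attended and dissimilar to those already kept. MixKV's actual scoring and head-wise redundancy adaptation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def mixkv_select(keys: torch.Tensor, importance: torch.Tensor,
                 budget: int, alpha: float = 0.5) -> torch.Tensor:
    # keys: (T, d) per-head key vectors; importance: (T,) attention mass
    kept = []
    knorm = F.normalize(keys, dim=-1)
    for _ in range(budget):
        if kept:
            sim = knorm @ knorm[kept].t()            # (T, |kept|)
            diversity = 1.0 - sim.max(dim=-1).values  # distance to kept set
        else:
            diversity = torch.ones_like(importance)
        score = alpha * importance + (1 - alpha) * diversity
        score[kept] = float("-inf")                   # never re-pick
        kept.append(int(score.argmax()))
    return torch.tensor(kept)

keys, imp = torch.randn(128, 64), torch.rand(128)
print(mixkv_select(keys, imp, budget=16).shape)  # torch.Size([16])
```

Per-head, alpha could be tuned toward importance for low-redundancy heads and toward diversity for highly redundant ones, which is the intuition behind the head-wise adaptation.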

[231] Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

Main category: cs.CV

TL;DR: MBCD is a collaborative distillation framework that addresses WA’s limitations in multi-modal domain generalization by preventing early-stage bias toward faster-converging modalities and promoting flatter loss surfaces.

DetailsMotivation: Weight Averaging (WA) promotes flat loss landscapes for better generalization but fails in multi-modal settings because different modalities converge at different speeds. Faster-converging modalities dominate early, suppressing slower complementary modalities, leading to poor modality fusion and sharper minima.

Method: MBCD uses three key components: 1) Adaptive modality dropout in student model to prevent early bias toward dominant modalities, 2) Gradient consistency constraint aligning learning signals between uni-modal branches and fused representation for smoother optimization, 3) WA-based teacher performing cross-modal distillation to transfer fused knowledge to each uni-modal branch, strengthening cross-modal interactions.

Result: Extensive experiments on MMDG benchmarks show MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

Conclusion: MBCD successfully retains WA’s flatness-inducing advantages while overcoming its limitations in multi-modal contexts, enabling effective modality fusion and steering convergence toward flatter, more generalizable solutions.

Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA’s flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
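
A small sketch of the adaptive modality dropout idea under an assumed form: modalities that converge faster are assigned a higher early-stage drop probability, giving slower, complementary modalities room to contribute. The drop-probability schedule below is a placeholder, not the paper's rule.

```python
import torch

def adaptive_modality_dropout(feats: dict, drop_prob: dict) -> dict:
    """Zero out each modality's features wholesale with its own probability."""
    out = {}
    for name, f in feats.items():
        keep = (torch.rand(()) >= drop_prob[name]).float()
        out[name] = f * keep
    return out

feats = {"video": torch.randn(4, 256), "audio": torch.randn(4, 256)}
# e.g., video converges faster, so it gets a higher early-stage drop rate
dropped = adaptive_modality_dropout(feats, {"video": 0.5, "audio": 0.1})
print({k: bool(v.abs().sum() > 0) for k, v in dropped.items()})
```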

[232] ReactionMamba: Generating Short & Long Human Reaction Sequences

Hajra Anwar Beg, Baptiste Chopin, Hao Tang, Mohamed Daoudi

Main category: cs.CV

TL;DR: ReactionMamba is a new framework for generating long 3D human reaction motions using motion VAE encoding and Mamba-based state-space models for decoding, achieving competitive performance with faster inference.

DetailsMotivation: The paper aims to address the challenge of generating long, temporally consistent 3D human reaction motions, particularly for complex sequences like dance and martial arts, where existing methods may struggle with realism, diversity, and computational efficiency.

Method: ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This hybrid approach enables generation of both short simple motions and long complex sequences.

Result: The framework demonstrates competitive performance on NTU120-AS, Lindy Hop, and InterX datasets in terms of realism, diversity, and long-sequence generation compared to InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.

Conclusion: ReactionMamba provides an effective solution for generating long 3D human reaction motions with better temporal consistency and computational efficiency than previous methods, making it suitable for applications requiring complex motion sequences.

Abstract: We present ReactionMamba, a novel framework for generating long 3D human reaction motions. ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This design enables ReactionMamba to generate both short sequences of simple motions and long sequences of complex motions, such as dance and martial arts. We evaluate ReactionMamba on three datasets–NTU120-AS, Lindy Hop, and InterX–and demonstrate competitive performance in terms of realism, diversity, and long-sequence generation compared to previous methods, including InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.

[233] Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

Main category: cs.CV

TL;DR: Hybrid content moderation framework combining supervised classification for known violations with similarity matching for novel cases, achieving 6-8% reduction in unwanted livestream views.

DetailsMotivation: Content moderation is challenging for large-scale video platforms, especially in livestreaming where it must be timely, multimodal, and robust to evolving unwanted content.

Method: Hybrid framework with two pipelines: supervised classification for known violations and reference-based similarity matching for novel cases. Uses multimodal inputs (text, audio, visual) processed through both pipelines, with MLLM distilling knowledge to boost accuracy while keeping inference lightweight.

Result: Classification pipeline: 67% recall at 80% precision. Similarity pipeline: 76% recall at 80% precision. Large-scale A/B tests show 6-8% reduction in user views of unwanted livestreams.

Conclusion: Demonstrates a scalable and adaptable approach to multimodal content governance capable of addressing both explicit violations and emerging adversarial behaviors.

Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
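
A hedged sketch of the two-pipeline routing: a supervised classifier handles known violation types, and a reference-based similarity match catches novel cases. The thresholds, function names, and the assumption of L2-normalized embeddings are illustrative, not the production system's values.

```python
import numpy as np

def moderate(embedding: np.ndarray, classifier_score: float,
             reference_bank: np.ndarray,
             cls_threshold: float = 0.8, sim_threshold: float = 0.9) -> str:
    # Pipeline 1: supervised classifier for known violation types.
    if classifier_score >= cls_threshold:
        return "flag:known_violation"
    # Pipeline 2: reference-based similarity for novel or subtle cases.
    sims = reference_bank @ embedding  # cosine sims for normalized vectors
    if sims.max() >= sim_threshold:
        return "flag:matches_reference"
    return "allow"

rng = np.random.default_rng(0)
emb = rng.normal(size=64); emb /= np.linalg.norm(emb)
bank = rng.normal(size=(100, 64))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
print(moderate(emb, classifier_score=0.3, reference_bank=bank))
```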

[234] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin

Main category: cs.CV

TL;DR: TwinFlow: A simple framework for training 1-step generative models that bypasses teacher models and adversarial networks, achieving efficient multi-modal generation with 100x computational reduction.

DetailsMotivation: Current multi-modal generative models use multi-step frameworks (diffusion/flow matching) that require 40-100 NFEs, making inference inefficient. Existing few-step methods have limitations: distillation methods need iterative procedures or degrade at <4 NFEs, while adversarial training methods introduce instability, complexity, and high memory overhead.

Method: TwinFlow is a simple framework that trains 1-step generative models without fixed pretrained teacher models and avoids standard adversarial networks during training. It’s designed for building large-scale, efficient models and demonstrates scalability through full-parameter training on Qwen-Image-20B.

Result: On text-to-image tasks, TwinFlow achieves a GenEval score of 0.83 in 1-NFE, outperforming SANA-Sprint and RCGM. With just 1-NFE, it matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100x with minor quality degradation.

Conclusion: TwinFlow provides an effective solution for efficient multi-modal generation by eliminating the need for teacher models and adversarial networks, enabling 1-step generation with 100x computational savings while maintaining quality comparable to multi-step models.

Abstract: Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.

[235] DAUNet: A Lightweight UNet Variant with Deformable Convolutions and Parameter-Free Attention for Medical Image Segmentation

Adnan Munir, Muhammad Shahid Jabbar, Shujaat Khan

Main category: cs.CV

TL;DR: DAUNet: Lightweight UNet variant with Deformable V2 Convolutions and SimAM attention for medical image segmentation, achieving better performance with fewer parameters.

DetailsMotivation: Medical image segmentation is crucial for automated diagnostic systems, but existing models often lack spatial adaptability and context-aware feature fusion while maintaining low complexity for real-time clinical deployment.

Method: DAUNet integrates Deformable V2 Convolutions in the bottleneck for handling geometric variations, and Parameter-Free Attention (SimAM) modules in decoder and skip pathways for saliency-aware refinement, all while maintaining lightweight architecture.

Result: Outperforms state-of-the-art models on FH-PS-AoP (ultrasound) and FUMPE (CT) datasets in Dice score, HD95, and ASD metrics while maintaining superior parameter efficiency. Robust to missing context and low-contrast regions.

Conclusion: DAUNet demonstrates improved spatial adaptability and context-aware feature fusion without increasing model complexity, making it suitable for real-time and resource-constrained clinical environments.

Abstract: Medical image segmentation plays a pivotal role in automated diagnostic and treatment planning systems. In this work, we present DAUNet, a novel lightweight UNet variant that integrates Deformable V2 Convolutions and Parameter-Free Attention (SimAM) to improve spatial adaptability and context-aware feature fusion without increasing model complexity. DAUNet’s bottleneck employs dynamic deformable kernels to handle geometric variations, while the decoder and skip pathways are enhanced using SimAM attention modules for saliency-aware refinement. Extensive evaluations on two challenging datasets, FH-PS-AoP (fetal head and pubic symphysis ultrasound) and FUMPE (CT-based pulmonary embolism detection), demonstrate that DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD, while maintaining superior parameter efficiency. Ablation studies highlight the individual contributions of deformable convolutions and SimAM attention. DAUNet’s robustness to missing context and low-contrast regions establishes its suitability for deployment in real-time and resource-constrained clinical environments.
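
SimAM itself is parameter-free: each activation is weighted by a closed-form energy, so no extra weights are learned. The sketch below follows the published SimAM formulation; its exact placement inside DAUNet's decoder and skip pathways is paraphrased from the summary above rather than taken from the authors' code.

```python
import torch

def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    # x: (B, C, H, W) feature map
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation
    v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance
    energy_inv = d / (4 * (v + e_lambda)) + 0.5         # inverse energy per pixel
    return x * torch.sigmoid(energy_inv)                # saliency-weighted features

feat = torch.randn(2, 8, 16, 16)
print(simam(feat).shape)  # torch.Size([2, 8, 16, 16])
```

Pixels that deviate strongly from their channel mean receive higher weights, which is why the summary describes the modules as "saliency-aware refinement" at zero parameter cost.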

[236] TBC: A Target-Background Contrast Metric for Low-Altitude Infrared and Visible Image Fusion

Yufeng Xie, Cong Wang

Main category: cs.CV

TL;DR: The paper proposes a new Target-Background Contrast (TBC) metric for infrared-visible image fusion in UAV reconnaissance, addressing the failure of traditional no-reference metrics in low-light environments where they paradoxically reward noisy images.

DetailsMotivation: Traditional no-reference metrics for infrared-visible image fusion fail in complex low-light environments (termed "Noise Trap"), where they are positively correlated with high-frequency sensor noise and assign higher scores to degraded images, misleading algorithm optimization.

Method: Proposes Target-Background Contrast (TBC) metric inspired by Weber’s Law, focusing on relative contrast of salient targets rather than global statistics. TBC penalizes background noise and rewards target visibility.

Result: Extensive experiments on DroneVehicle dataset show TBC exhibits high “Semantic Discriminability” in distinguishing thermal targets from background clutter and achieves remarkable computational efficiency for real-time use in intelligent UAV systems.

Conclusion: TBC provides a reliable and real-time standard for evaluating infrared-visible image fusion in UAV reconnaissance, overcoming the limitations of traditional metrics in low-light environments and enabling more effective target detection and tracking.

Abstract: Infrared and visible image fusion (IVIF) is a pivotal technology in low-altitude Unmanned Aerial Vehicle (UAV) reconnaissance missions, enabling robust target detection and tracking by integrating thermal saliency with environmental textures. However, traditional no-reference metrics (Statistics-based metrics and Gradient-based metrics) fail in complex low-light environments, termed the “Noise Trap”. This paper mathematically proves that these metrics are positively correlated with high-frequency sensor noise, paradoxically assigning higher scores to degraded images and misguiding algorithm optimization. To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber's Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Extensive experiments on the DroneVehicle dataset demonstrate the superiority of TBC. Results show that TBC exhibits high “Semantic Discriminability” in distinguishing thermal targets from background clutter. Furthermore, TBC achieves remarkable computational efficiency, making it a reliable and real-time standard for intelligent UAV systems.
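
An illustrative Weber-style computation of a target-background contrast score. The abstract does not give the exact TBC formula, so the specific noise-penalty term below is an assumption that merely captures the stated behavior: reward target visibility, penalize background noise.

```python
import numpy as np

def tbc(fused: np.ndarray, target_mask: np.ndarray, beta: float = 1.0) -> float:
    # fused: (H, W) fused image; target_mask: boolean salient-target mask
    target = fused[target_mask]
    background = fused[~target_mask]
    # Weber-style relative contrast of the target against its background.
    weber = abs(target.mean() - background.mean()) / (background.mean() + 1e-6)
    return weber - beta * background.std()  # penalize noisy backgrounds

img = np.random.rand(64, 64) * 0.2          # dark, slightly noisy scene
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
img[mask] += 0.8                            # bright thermal target
print(round(tbc(img, mask), 3))
```

Unlike global statistics, adding high-frequency noise to the background lowers this score rather than raising it, which is exactly the failure mode the metric is designed to avoid.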

[237] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: RxnBench is a new benchmark for evaluating Multimodal Large Language Models on chemical reaction understanding from scientific PDFs, revealing significant gaps in models’ ability to comprehend complex chemical logic and integrate multimodal information.

DetailsMotivation: Current MLLMs have underexplored capabilities in understanding the dense, graphical language of chemical reactions within authentic scientific literature, despite their potential to revolutionize scientific discovery in chemistry.

Method: Created RxnBench, a multi-tiered benchmark with two tasks: Single-Figure QA (1,525 questions from 305 curated reaction schemes) testing visual perception and mechanistic reasoning, and Full-Document QA (108 articles) requiring cross-modal integration of text, schemes, and tables.

Result: MLLMs show critical capability gaps - they excel at extracting explicit text but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning outperform standard architectures, but none achieve 50% accuracy on Full-Document QA.

Conclusion: There’s an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists, as current MLLMs cannot adequately handle the complex multimodal understanding required for chemical literature analysis.

Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[238] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering

Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu

Main category: cs.CV

TL;DR: FaithSCAN: A lightweight network that detects VQA hallucinations by exploiting rich internal signals from vision-language models, using token-level uncertainty, visual representations, and cross-modal alignment features, with automated supervision generation.

DetailsMotivation: Existing VQA hallucination detection methods have limitations: external verification approaches are computationally expensive and dependent on external resources, while uncertainty-driven methods capture only limited facets of model uncertainty and fail to explore rich internal signals associated with diverse failure modes.

Method: FaithSCAN uses a lightweight network that fuses multiple internal VLM signals: token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features via branch-wise evidence encoding and uncertainty-aware attention. It also extends LLM-as-a-Judge paradigm to automatically generate model-dependent supervision signals for training without human labels.

Result: Experiments on multiple VQA benchmarks show FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. Analysis reveals hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, with different internal signals providing complementary diagnostic cues.

Conclusion: FaithSCAN addresses limitations of existing hallucination detection methods by efficiently exploiting rich internal VLM signals, with automated supervision generation enabling accurate detection without costly human labeling. The approach provides new insights into multimodal hallucination causes and patterns across different VLM architectures.

Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.

[239] All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations

Wenrui Li, Hongtao Chen, Yao Xiao, Wangmeng Zuo, Jiantao Zhou, Yonghong Tian, Xiaopeng Fan

Main category: cs.CV

TL;DR: ORCANet addresses video restoration under smoothly evolving unknown degradations using recurrent conditional prompting with coarse intensity estimation and flow-based degradation feature extraction.

DetailsMotivation: Existing video restoration methods focus on frame-wise degradation variation but ignore temporal continuity in real-world degradation processes where degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually.

Method: Proposes ORCANet with: 1) Coarse Intensity Estimation Dehazing (CIED) module using physical priors for haze intensity estimation and coarse dehazed features initialization; 2) Flow Prompt Generation (FPG) module extracting degradation features with static prompts for segment-level degradation types and dynamic prompts for frame-level intensity variations; 3) Label-aware supervision for discriminative static prompt representations.

Result: Extensive experiments show ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines.

Conclusion: The proposed SEUD scenario and ORCANet framework effectively address video restoration under smoothly evolving unknown degradations, demonstrating the importance of modeling temporal continuity in degradation processes.

Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.

[240] Diffusion in SPAD Signals

Lior Dvir, Nadav Torem, Yoav Y. Schechner

Main category: cs.CV

TL;DR: Derives likelihood and score function for SPAD signals, applies diffusion models for image reconstruction under varying photon counts.

DetailsMotivation: SPAD signals have nonlinear, stochastic relationships with photon flux, requiring proper statistical models for solving inverse problems in imaging applications.

Method: Derive likelihood function for SPAD raw signals (detection event timings), compute score function, and apply diffusion models with image priors for inverse problem solving.

Result: Demonstrates performance under low/high photon counts and shows benefits of exploiting detection event timing information in SPAD-based imaging.

Conclusion: Proper statistical modeling of SPAD signals enables effective image reconstruction using diffusion models, with timing information providing significant advantages.

Abstract: We derive the likelihood of a raw signal in a single photon avalanche diode (SPAD), given a fixed photon flux. The raw signal comprises timing of detection events, which are nonlinearly related to the flux. Moreover, they are naturally stochastic. We then derive a score function of the signal. This is key to solving inverse problems based on SPAD signals. We focus on deriving solutions involving a diffusion model, to express image priors. We demonstrate the effect of low or high photon counts, and the consequence of exploiting timing of detection events.
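
As a simplified anchor for the likelihood the paper derives, consider an ideal detector with no dead time: detection events form a Poisson process, and keeping the individual event timings (rather than just the count) yields the standard point-process log-likelihood, whose derivative in the flux gives a score.

```latex
% Simplified anchor, assuming an ideal detector with no dead time:
% detection events over an exposure [0, T] form a Poisson process.
% Keeping the individual event timings t_1, ..., t_n under a
% time-varying flux \lambda(t) gives the point-process log-likelihood
\log p(t_1, \dots, t_n \mid \lambda)
    = \sum_{i=1}^{n} \log \lambda(t_i) - \int_0^T \lambda(t)\, dt ,
% and differentiating with respect to the flux yields the score that a
% diffusion-prior sampler can consume. The paper's derivation further
% models SPAD-specific nonlinearities, which this sketch omits.
```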

[241] SAM-Aug: Leveraging SAM Priors for Few-Shot Parcel Segmentation in Satellite Time Series

Kai Hu, Yaozu Feng, Vladimir Lysenko, Ya Guo, Huayi Wu

Main category: cs.CV

TL;DR: SAM-Aug uses Segment Anything Model’s geometry-aware segmentation to improve few-shot land cover mapping from time-series remote sensing images without additional labeled data.

DetailsMotivation: Few-shot semantic segmentation of remote sensing images is challenging due to scarce labeled data. Current models degrade significantly with limited supervision, limiting real-world applicability in land cover monitoring.

Method: Constructs cloud-free composite images from temporal sequences, applies SAM unsupervised to generate geometry-aware mask priors, then integrates these priors via RegionSmoothLoss that enforces prediction consistency within SAM-derived regions across temporal frames.

Result: Achieves mean test mIoU of 36.21% (5% labeled setting), outperforming SOTA by +2.33pp (6.89% relative improvement). Best split reaches 40.28% mIoU (11.2% relative gain) with no additional labeled data.

Conclusion: Vision foundation models like SAM can serve as effective regularizers for few-shot remote sensing learning, offering scalable plug-and-play solutions for land cover monitoring without manual annotations or model fine-tuning.

Abstract: Few-shot semantic segmentation of time-series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state-of-the-art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real-world applicability. In this work, we propose SAM-Aug, a new annotation-efficient framework that leverages the geometry-aware segmentation capability of the Segment Anything Model (SAM) to improve few-shot land cover mapping. Our approach constructs cloud-free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry-aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM-derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS-R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM-Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state-of-the-art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM-Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data. The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering a scalable and plug-and-play solution for land cover monitoring without requiring manual annotations or model fine-tuning.
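
A hedged sketch of a region-consistency loss in the spirit of RegionSmoothLoss: penalize the variance of predicted class probabilities inside each SAM-derived region. The per-region reduction and equal weighting below are assumptions, not the paper's exact formulation.

```python
import torch

def region_smooth_loss(probs: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
    # probs: (C, H, W) softmax output; regions: (H, W) integer SAM region ids
    loss, n = probs.new_zeros(()), 0
    for rid in regions.unique():
        mask = regions == rid
        if mask.sum() < 2:
            continue
        region_probs = probs[:, mask]                 # (C, |region|)
        loss = loss + region_probs.var(dim=1).mean()  # intra-region variance
        n += 1
    return loss / max(n, 1)

probs = torch.softmax(torch.randn(5, 32, 32), dim=0)
regions = torch.randint(0, 6, (32, 32))
print(region_smooth_loss(probs, regions))
```

Because the SAM masks come from an unsupervised pass over cloud-free composites, this regularizer adds no labeling cost; it only asks predictions to stay coherent within geometry-derived regions.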

[242] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Zijie Lou, Xiangwei Feng, Jiaxin Wang, Jiangtao Yao, Fei Che, Tianbao Liu, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Ting Liu

Main category: cs.CV

TL;DR: Video object removal reformulated as video-to-video translation via stochastic bridge model, using input video as structural prior instead of starting from noise.

DetailsMotivation: Existing diffusion-based methods discard rich structural priors from input videos, leading to incomplete object removal or implausible content generation that conflicts with scene logic.

Method: Proposes stochastic bridge model that establishes direct path from source video (with objects) to target video (objects removed), with adaptive mask modulation to balance background fidelity and generative flexibility.

Result: Significantly outperforms existing methods in both visual quality and temporal consistency.

Conclusion: Video object removal benefits from using input video as structural prior through bridge formulation rather than noise-to-data paradigm.

Abstract: Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene’s physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.
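
For intuition, a generic Brownian-bridge interpolant pinned at the source video (objects present) and the target (objects removed): the mean interpolates linearly and the noise vanishes at both endpoints. The paper's exact bridge process and adaptive mask modulation are not specified here, so treat this as a sketch of the general formulation.

```python
import torch

def bridge_sample(x0: torch.Tensor, x1: torch.Tensor,
                  t: float, sigma: float = 0.5) -> torch.Tensor:
    # A Brownian bridge pinned at x0 (t=0) and x1 (t=1):
    # the mean interpolates linearly, noise vanishes at both endpoints.
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * (t * (1.0 - t)) ** 0.5
    return mean + std * torch.randn_like(x0)

x0, x1 = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
xt = bridge_sample(x0, x1, t=0.3)
print(xt.shape)  # torch.Size([3, 64, 64])
# A network would be trained to predict x1 (or the drift) from (xt, t),
# so generation starts from the structured source video, not pure noise.
```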

[243] Progressive $\mathcal{J}$-Invariant Self-supervised Learning for Low-Dose CT Denoising

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

Main category: cs.CV

TL;DR: Progressive J-invariant Learning improves self-supervised LDCT denoising with step-wise blind-spot mechanism and noise injection, achieving performance comparable to supervised methods.

DetailsMotivation: Self-supervised learning reduces dependence on paired normal-dose CT data, but existing blind-spot methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields.

Method: Progressive J-invariant Learning with step-wise blind-spot denoising mechanism that enforces conditional independence progressively, plus explicit injection of controlled Gaussian and Poisson noise during training for regularization.

Result: Extensive experiments on Mayo LDCT dataset show the method consistently outperforms existing self-supervised approaches and achieves performance comparable to or better than several representative supervised denoising methods.

Conclusion: The proposed Progressive J-invariant Learning effectively enhances LDCT denoising performance while maintaining the practical advantage of not requiring paired normal-dose CT data.

Abstract: Self-supervised learning has been increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to collect. However, many existing self-supervised blind-spot denoising methods suffer from training inefficiencies and suboptimal performance due to restricted receptive fields. To mitigate this issue, we propose a novel Progressive $\mathcal{J}$-invariant Learning that maximizes the use of $\mathcal{J}$-invariance to enhance LDCT denoising performance. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained learning for denoising. Furthermore, we explicitly inject a combination of controlled Gaussian and Poisson noise during training to regularize the denoising process and mitigate overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.
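
A sketch of one masked blind-spot training step with the extra noise injection described above. Noise2Void-style random pixel masking stands in for the paper's step-wise progressive mechanism, which the abstract does not fully specify; noise levels and the tiny model are placeholders.

```python
import torch
import torch.nn as nn

def blind_spot_step(model: nn.Module, ldct: torch.Tensor,
                    mask_ratio: float = 0.02) -> torch.Tensor:
    # Inject controlled Poisson + Gaussian noise as regularization.
    noisy = torch.poisson(ldct.clamp(min=0) * 255.0) / 255.0
    noisy = noisy + 0.01 * torch.randn_like(ldct)
    # Blind spots: corrupt random pixels, predict them from their context only.
    mask = torch.rand_like(ldct) < mask_ratio
    corrupted = torch.where(mask, torch.rand_like(ldct), noisy)
    pred = model(corrupted)
    return ((pred - noisy) ** 2)[mask].mean()  # loss only on masked pixels

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
loss = blind_spot_step(model, torch.rand(4, 1, 64, 64))
print(loss.item())
```

Restricting the loss to masked pixels is what makes the predictor $\mathcal{J}$-invariant: it can never copy a pixel's own noise, only infer it from neighbors.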

[244] ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan, Yu Xiao, Yihan Hou, Guosheng Hu, Wei Zeng

Main category: cs.CV

TL;DR: ColorConceptBench: A new benchmark for evaluating T2I models’ ability to associate colors with implicit concepts, revealing current models lack sensitivity to abstract color semantics despite scaling.

DetailsMotivation: Text-to-image models have advanced but their capability to associate colors with implicit concepts remains underexplored. There's a need to systematically evaluate how models translate implicit color concepts beyond explicit color names or codes.

Method: Introduce ColorConceptBench, a human-annotated benchmark with 1,281 implicit color concepts and 6,369 human annotations. Evaluate seven leading T2I models by probing how they translate these concepts through probabilistic color distributions.

Result: Current T2I models lack sensitivity to abstract color semantics, and this limitation appears resistant to standard interventions like model scaling and guidance techniques.

Conclusion: Achieving human-like color semantics requires more than larger models - it demands a fundamental shift in how models learn and represent implicit meaning.

Abstract: While text-to-image (T2I) models have advanced considerably, their capability to associate colors with implicit concepts remains underexplored. To address the gap, we introduce ColorConceptBench, a new human-annotated benchmark to systematically evaluate color-concept associations through the lens of probabilistic color distributions. ColorConceptBench moves beyond explicit color names or codes by probing how models translate 1,281 implicit color concepts using a foundation of 6,369 human annotations. Our evaluation of seven leading T2I models reveals that current models lack sensitivity to abstract semantics, and crucially, this limitation appears resistant to standard interventions (e.g., scaling and guidance). This demonstrates that achieving human-like color semantics requires more than larger models, but demands a fundamental shift in how models learn and represent implicit meaning.
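
One plausible way to score a single concept against human annotations, assuming hue histograms and a 1D Wasserstein distance as the comparison; the benchmark's actual probabilistic comparison may differ.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def hue_distribution(hues: np.ndarray, bins: int = 36) -> np.ndarray:
    """Normalized histogram over hue values in [0, 1]."""
    hist, _ = np.histogram(hues, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

# Placeholder data: hues from human annotations vs. a model's generations
human = hue_distribution(np.random.beta(2, 5, size=500))
model = hue_distribution(np.random.beta(3, 4, size=500))
centers = np.linspace(0.0, 1.0, 36)
print(wasserstein_distance(centers, centers, human, model))  # lower is better
```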

[245] Atomic Depth Estimation From Noisy Electron Microscopy Data Via Deep Learning

Matan Leibovich, Mai Tan, Ramon Manzorro, Adria Marcos-Morales, Sreyas Mohan, Peter A. Crozier, Carlos Fernandez-Granda

Main category: cs.CV

TL;DR: A deep learning approach for 3D atomic depth estimation from noisy TEM images using semantic segmentation.

DetailsMotivation: Transmission electron microscopy (TEM) images often suffer from significant noise, making it challenging to extract accurate 3D atomic-level information. Existing methods struggle with noise robustness and precise depth estimation.

Method: Formulates depth estimation as a semantic segmentation problem. Uses a deep convolutional neural network trained on simulated TEM data corrupted by synthetic noise to generate pixel-wise depth segmentation maps.

Result: The method successfully estimated atomic column depths in CeO2 nanoparticles from both simulated and real-world TEM data. Results show accurate, calibrated depth estimates with strong noise robustness.

Conclusion: The semantic segmentation approach provides an effective solution for extracting 3D atomic information from noisy TEM images, offering accurate and robust depth estimation capabilities.

Abstract: We present a novel approach for extracting 3D atomic-level information from transmission electron microscopy (TEM) images affected by significant noise. The approach is based on formulating depth estimation as a semantic segmentation problem. We address the resulting segmentation problem by training a deep convolutional neural network to generate pixel-wise depth segmentation maps using simulated data corrupted by synthetic noise. The proposed method was applied to estimate the depth of atomic columns in CeO2 nanoparticles from simulated images and real-world TEM data. Our experiments show that the resulting depth estimates are accurate, calibrated and robust to noise.
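
The key reformulation is easy to state in code: depth becomes a per-pixel classification over a discrete set of atomic-layer depths, trained with cross-entropy on simulated, synthetically noised TEM images. The tiny network and the number of depth classes below are placeholders for illustration.

```python
import torch
import torch.nn as nn

n_depth_classes = 10  # assumed count of possible atomic-column depths (layers)
net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, n_depth_classes, 1),        # per-pixel depth-class logits
)

noisy_tem = torch.randn(2, 1, 128, 128)       # simulated noisy TEM images
depth_labels = torch.randint(0, n_depth_classes, (2, 128, 128))
logits = net(noisy_tem)                       # (B, n_classes, H, W)
loss = nn.functional.cross_entropy(logits, depth_labels)
depth_map = logits.argmax(dim=1)              # pixel-wise depth segmentation map
print(loss.item(), depth_map.shape)
```

Treating depth as a class label also makes calibration natural: the softmax over depth classes gives a per-pixel confidence rather than a single regressed value.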

[246] PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling

Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang

Main category: cs.CV

TL;DR: PocketGS enables efficient 3D Gaussian Splatting training on mobile devices with limited memory and training time, achieving high-fidelity scene modeling through co-designed operators that optimize geometry priors, surface statistics, and mobile backpropagation.

DetailsMotivation: Current 3D Gaussian Splatting methods require resource-intensive training that fails on mobile devices due to minute-scale training budgets and hardware memory limitations, preventing practical on-device 3D scene modeling workflows.

Method: Three co-designed operators: G builds geometry-faithful point-cloud priors; I injects local surface statistics to seed anisotropic Gaussians (reducing early conditioning gaps); and T unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation.

Result: PocketGS outperforms powerful workstation 3DGS baselines, delivers high-quality reconstructions, and enables a fully on-device capture-to-rendering workflow while satisfying competing requirements of training efficiency, memory compactness, and modeling fidelity.

Conclusion: PocketGS presents a mobile scene modeling paradigm that successfully resolves the fundamental contradictions of standard 3DGS, making high-fidelity 3D scene modeling practical on resource-constrained mobile devices through co-designed optimization operators.

Abstract: Efficient and high-fidelity 3D scene modeling is a long-standing pursuit in computer graphics. While recent 3D Gaussian Splatting (3DGS) methods achieve impressive real-time modeling performance, they rely on resource-unconstrained training assumptions that fail on mobile devices, which are limited by minute-scale training budgets and hardware peak-memory caps. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high perceptual fidelity. Our method resolves the fundamental contradictions of standard 3DGS through three co-designed operators: G builds geometry-faithful point-cloud priors; I injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and T unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Collectively, these operators satisfy the competing requirements of training efficiency, memory compactness, and modeling fidelity. Extensive experiments demonstrate that PocketGS outperforms the powerful mainstream workstation 3DGS baseline, delivering high-quality reconstructions and enabling a fully on-device, practical capture-to-rendering workflow.

[247] From Specialist to Generalist: Unlocking SAM’s Learning Potential on Unlabeled Medical Images

Vi Vu, Thanh-Huy Nguyen, Tien-Thinh Nguyen, Ba-Thinh Lam, Hoang-Thien Nguyen, Tianyang Wang, Xingjian Li, Min Xu

Main category: cs.CV

TL;DR: SC-SAM is a specialist-generalist framework that combines U-Net and SAM in a bidirectional co-training loop for semi-supervised medical image segmentation, achieving state-of-the-art results by leveraging unlabeled data through reciprocal guidance.

DetailsMotivation: Foundation models like SAM struggle with medical image adaptation due to domain shift, scarce labels, and PEFT's inability to use unlabeled data. While U-Net excels in semi-supervised medical learning, its potential to assist PEFT SAM has been overlooked.

Method: SC-SAM creates a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM’s adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net, forming a bidirectional co-training loop.
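
One co-training step might look roughly like the sketch below; `points_from_mask` and the model call signatures are our placeholders, not the released SC-SAM API.

```python
# Schematic bidirectional co-training on one unlabeled image: U-Net guides
# SAM with prompts and pseudo-labels, SAM's prediction regularizes U-Net.
import torch
import torch.nn.functional as F

def points_from_mask(mask):
    """Placeholder prompt sampler: a few foreground coordinates."""
    return mask.nonzero()[:5]

def co_training_step(unet, sam, image, optimizer):
    with torch.no_grad():
        pseudo = (unet(image).sigmoid() > 0.5).float()    # specialist pseudo-label
    prompts = points_from_mask(pseudo)                    # point prompts for SAM

    sam_logits = sam(image, prompts)                      # PEFT SAM forward (assumed signature)
    loss_sam = F.binary_cross_entropy_with_logits(sam_logits, pseudo)

    with torch.no_grad():
        sam_mask = (sam_logits.sigmoid() > 0.5).float()   # generalist supervision
    loss_unet = F.binary_cross_entropy_with_logits(unet(image), sam_mask)

    (loss_sam + loss_unet).backward()
    optimizer.step()
    optimizer.zero_grad()
```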

Result: The method achieves state-of-the-art results across prostate MRI and polyp segmentation benchmarks, outperforming other semi-supervised SAM variants and even medical foundation models like MedSAM.

Conclusion: The work highlights the value of specialist-generalist cooperation for label-efficient medical image segmentation, demonstrating that combining U-Net’s medical expertise with SAM’s generalization capabilities creates a powerful synergistic framework.

Abstract: Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM’s adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC-SAM.

[248] GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang

Main category: cs.CV

TL;DR: GenAgent is an agentic multimodal model that decouples visual understanding and generation by treating image generators as invokable tools, enabling autonomous multi-turn interactions with reasoning chains and iterative refinement.

DetailsMotivation: Unified visual understanding-generation models face expensive training costs and trade-offs between capabilities. Existing modular systems are constrained by static pipelines, lacking autonomous multi-turn interactions.

Method: Agentic framework where multimodal model handles understanding and invokes image generation tools. Two-stage training: 1) supervised fine-tuning on tool invocation/reflection data, 2) end-to-end agentic RL with pointwise (image quality) and pairwise (reflection accuracy) rewards with trajectory resampling.

Result: Significant performance boosts: +23.6% on GenEval++ and +14% on WISE over base generator (FLUX.1-dev). Demonstrates cross-tool generalization, test-time scaling with consistent improvements across rounds, and task-adaptive reasoning.

Conclusion: GenAgent successfully unifies visual understanding and generation through an agentic approach, enabling autonomous multi-turn interactions with iterative refinement, outperforming base generators and showing valuable generalization properties.

Abstract: We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.

[249] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)

Ghazaleh Serati, Samuel Foucher, Jerome Theau

Main category: cs.CV

TL;DR: Weakly supervised patch-level pretraining improves caribou detection in aerial imagery by learning from empty vs. non-empty labels, outperforming generic ImageNet initialization and enabling reliable mapping for manual counting.

DetailsMotivation: Caribou populations across the Arctic have declined, requiring scalable and accurate monitoring. Manual interpretation of aerial imagery is labor-intensive and error-prone, necessitating automatic detection that can handle challenges like background heterogeneity, class imbalance, small/occluded targets, and scale variation.

Method: Proposes weakly supervised patch-level pretraining for the HerdNet detection model, built on the detection network's architecture. Uses a detection dataset spanning five caribou herds in Alaska, learning from empty vs. non-empty patch labels to produce early weakly supervised knowledge that enhances detection relative to generic weight initialization.
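
The recipe boils down to "pretrain the detector's encoder as a binary empty/non-empty patch classifier, then transfer those weights instead of ImageNet's". A minimal sketch with a stand-in encoder (not HerdNet's real architecture):

```python
# Weakly supervised pretraining step on coarse, patch-level labels only.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(                      # stand-in for the detector's encoder
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
patch_head = nn.Linear(32, 1)                 # empty (0) vs. non-empty (1)

patches = torch.randn(8, 3, 256, 256)         # aerial image patches
labels = torch.randint(0, 2, (8, 1)).float()  # coarse patch-level labels
loss = F.binary_cross_entropy_with_logits(patch_head(encoder(patches)), labels)
loss.backward()

# Later, initialize the detection model from these weights instead of ImageNet:
pretrained_state = encoder.state_dict()
# detector.encoder.load_state_dict(pretrained_state)   # hypothetical detector API
```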

Result: Patch-based pretraining achieved high accuracy on multi-herd imagery (2017: F1 93.7%) and independent year test sets (2019: F1 92.6%). Initialization from weakly supervised pretraining outperformed ImageNet weights on positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%) and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%).

Conclusion: Weakly supervised pretraining on coarse labels prior to detection enables effective caribou monitoring even with limited labeled data, achieving results comparable to generic-weight initialization while addressing challenges of Arctic wildlife monitoring. Remaining limitations include false positives from animal-like background clutter and false negatives from low-density occlusions.

Abstract: Caribou populations across the Arctic have declined in recent decades, motivating scalable and accurate monitoring approaches to guide evidence-based conservation actions and policy decisions. Manual interpretation of aerial imagery is labor-intensive and error-prone, underscoring the need for automatic and reliable detection across varying scenes. Yet such automatic detection is challenging due to severe background heterogeneity, dominant empty terrain (class imbalance), small or occluded targets, and wide variation in density and scale. To make the detection model (HerdNet) more robust to these challenges, a weakly supervised patch-level pretraining based on a detection network's architecture is proposed. The detection dataset includes five caribou herds distributed across Alaska. By learning from empty vs. non-empty labels in this dataset, the approach produces early weakly supervised knowledge for enhanced detection compared to HerdNet initialized from generic weights. Accordingly, the patch-based pretraining network attained high accuracy on multi-herd imagery (2017) and on an independent year's (2019) test set (F1: 93.7%/92.6%, respectively), enabling reliable mapping of regions containing animals to facilitate manual counting on large aerial imagery. Transferred to detection, initialization from weakly supervised pretraining yielded consistent gains over ImageNet weights on both positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%) and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%). Remaining limitations are false positives from animal-like background clutter and false negatives related to occlusions at low animal density. Overall, pretraining on coarse labels prior to detection makes it possible to rely on weakly supervised pretrained weights even when labeled data are limited, achieving results comparable to generic-weight initialization.

[250] CLIP-Guided Unsupervised Semantic-Aware Exposure Correction

Puzhen Wu, Han Weng, Quan Zheng, Yi Zhan, Hewei Wang, Yiming Li, Jiahui Han, Rui Xu

Main category: cs.CV

TL;DR: Unsupervised semantic-aware exposure correction network using FastSAM for semantic fusion, multi-scale spatial mamba for restoration, and CLIP-guided pseudo-ground truth generation.

DetailsMotivation: Address two key challenges in exposure correction: (1) ignoring object-wise semantic information causes color shift artifacts, and (2) lack of ground-truth labels for real-world exposure images requiring massive manual editing.

Method: Proposes unsupervised semantic-aware exposure correction network with: adaptive semantic-aware fusion module (fuses FastSAM semantic info), multi-scale residual spatial mamba group for restoration, CLIP-guided pseudo-ground truth generator for automatic exposure identification, and semantic-prompt consistency loss using FastSAM and CLIP priors.
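
As a rough illustration of the CLIP-guided exposure identification, zero-shot prompt scoring could look like the following; the prompts and checkpoint are our assumptions, and the paper additionally fine-tunes CLIP for this task.

```python
# Score an image against assumed exposure-condition prompts with off-the-shelf CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["an underexposed dark photo",
           "a well-exposed photo",
           "an overexposed bright photo"]

image = Image.open("input.jpg")               # any test image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs.squeeze().tolist())))   # exposure-condition scores
```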

Result: Comprehensive experiments show method effectively corrects real-world exposure images and outperforms state-of-the-art unsupervised methods both numerically and visually.

Conclusion: The proposed unsupervised semantic-aware approach successfully addresses exposure correction challenges by leveraging semantic information and avoiding manual labeling, demonstrating superior performance over existing unsupervised methods.

Abstract: Improper exposure often leads to severe loss of details, color distortion, and reduced contrast. Exposure correction still faces two critical challenges: (1) ignoring object-wise regional semantic information causes color shift artifacts; (2) real-world exposure images generally have no ground-truth labels, and labeling them entails massive manual editing. To tackle these challenges, we propose a new unsupervised semantic-aware exposure correction network. It contains an adaptive semantic-aware fusion module, which effectively fuses the semantic information extracted from a pre-trained Fast Segment Anything Model into a shared image feature space. The fused features are then used by our multi-scale residual spatial mamba group to restore details and adjust the exposure. To avoid manual editing, we propose a pseudo-ground truth generator guided by CLIP, which is fine-tuned to automatically identify exposure situations and instruct tailored corrections. We also leverage the rich priors from FastSAM and CLIP to develop a semantic-prompt consistency loss that enforces semantic consistency and image-prompt alignment for unsupervised training. Comprehensive experimental results illustrate the effectiveness of our method in correcting real-world exposure images and show that it outperforms state-of-the-art unsupervised methods both numerically and visually.

[251] Tri-Reader: An Open-Access, Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT

Fakrul Islam Tushar, Joseph Y. Lo

Main category: cs.CV

TL;DR: Tri-Reader is a free, open-source pipeline for lung cancer screening that combines lung segmentation, nodule detection, and malignancy classification in a unified workflow, prioritizing sensitivity while reducing annotation workload.

DetailsMotivation: To create an accessible, comprehensive lung cancer screening tool that reduces the burden on annotators while maintaining high sensitivity across diverse clinical practices.

Method: Developed a tri-stage pipeline integrating lung segmentation, nodule detection, and malignancy classification using multiple open-access models trained on public datasets. Designed to prioritize sensitivity while reducing candidate burden for annotators.
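
Schematically, the pipeline chains three callables, with a deliberately low detection threshold so sensitivity is preserved before the malignancy classifier prunes the candidate list; every name and threshold below is an illustrative assumption.

```python
# Tri-stage flow: segment -> detect (high recall) -> classify (prune).
def tri_reader(ct_volume, segment_lungs, detect_nodules, classify_malignancy):
    lung_mask = segment_lungs(ct_volume)                    # stage 1: lung segmentation
    candidates = detect_nodules(ct_volume, mask=lung_mask,
                                threshold=0.1)              # stage 2: sensitivity-first
    return [c for c in candidates
            if classify_malignancy(c) > 0.5]                # stage 3: fewer candidates
```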

Result: Evaluated on multiple internal and external datasets compared with expert annotations and dataset-provided reference standards to ensure accuracy and generalizability across diverse practices.

Conclusion: Tri-Reader provides a freely available, comprehensive pipeline for lung cancer screening that balances sensitivity with reduced annotation workload and demonstrates generalizability across different clinical settings.

Abstract: Using multiple open-access models trained on public datasets, we developed Tri-Reader, a comprehensive, freely available pipeline that integrates lung segmentation, nodule detection, and malignancy classification into a unified tri-stage workflow. The pipeline is designed to prioritize sensitivity while reducing the candidate burden for annotators. To ensure accuracy and generalizability across diverse practices, we evaluated Tri-Reader on multiple internal and external datasets as compared with expert annotations and dataset-provided reference standards.

[252] Fast Converging 3D Gaussian Splatting for 1-Minute Reconstruction

Ziyu Zhang, Tianle Liu, Diantao Tu, Shuhan Shen

Main category: cs.CV

TL;DR: A fast 3DGS reconstruction pipeline that converges within one minute, winning the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge with top PSNR of 28.43.

DetailsMotivation: To address the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge requiring reconstruction within one minute, handling two different settings: initial round with noisy SLAM poses and final round with accurate COLMAP poses.

Method: Two-stage approach: For noisy SLAM poses, uses reverse per-Gaussian parallel optimization, compact forward splatting, load-balanced tiling, anchor-based Neural-Gaussian representation, monocular depth initialization, and global pose refinement. For accurate COLMAP poses, disables pose refinement, reverts to standard 3DGS, uses multi-view consistency-guided Gaussian splitting, and depth estimator supervision.

Result: Achieved top performance with PSNR of 28.43 and ranked first in the competition, successfully meeting the one-minute reconstruction constraint.

Conclusion: The proposed two-stage pipeline effectively handles heterogeneous pose quality settings and enables high-fidelity 3DGS reconstruction within strict time constraints, demonstrating state-of-the-art performance in fast reconstruction challenges.

Abstract: We present a fast 3DGS reconstruction pipeline designed to converge within one minute, developed for the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge. The challenge consists of an initial round using SLAM-generated camera poses (with noisy trajectories) and a final round using COLMAP poses (highly accurate). To robustly handle these heterogeneous settings, we develop a two-stage solution. In the first round, we use reverse per-Gaussian parallel optimization and compact forward splatting based on Taming-GS and Speedy-splat, load-balanced tiling, an anchor-based Neural-Gaussian representation enabling rapid convergence with fewer learnable parameters, initialization from monocular depth and partially from feed-forward 3DGS models, and a global pose refinement module for noisy SLAM trajectories. In the final round, the accurate COLMAP poses change the optimization landscape; we disable pose refinement, revert from Neural-Gaussians back to standard 3DGS to eliminate MLP inference overhead, introduce multi-view consistency-guided Gaussian splitting inspired by Fast-GS, and introduce a depth estimator to supervise the rendered depth. Together, these techniques enable high-fidelity reconstruction under a strict one-minute budget. Our method achieved the top performance with a PSNR of 28.43 and ranked first in the competition.

[253] Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration

Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu

Main category: cs.CV

TL;DR: Pref-Restore: A hierarchical framework for blind face restoration that addresses information asymmetry through semantic logic integration and preference-aligned reinforcement learning to achieve deterministic, high-fidelity results.

DetailsMotivation: Current blind face restoration methods suffer from information asymmetry - the disparity between information-sparse low-quality inputs and information-dense high-quality outputs. This creates a one-to-many mapping problem leading to stochastic uncertainty and hallucinatory artifacts in generated results.

Method: Two complementary strategies: 1) Augmenting Input Density: Using an auto-regressive integrator to reformulate textual instructions into dense latent queries for semantic stability. 2) Pruning Output Distribution: Integrating on-policy reinforcement learning into the diffusion restoration loop to transform human preferences into differentiable constraints that penalize stochastic deviations.

Result: Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. The preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.

Conclusion: The hierarchical framework successfully bridges the information asymmetry gap in blind face restoration by integrating discrete semantic logic with continuous texture generation, enabling deterministic, preference-aligned restoration with reduced stochastic uncertainty.

Abstract: Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry – the intrinsic disparity between the information-sparse low-quality inputs and the information-dense high-quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.

[254] GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance

Haozhi Zhu, Miaomiao Zhao, Dingyao Liu, Runze Tian, Yan Zhang, Jie Guo, Fenggen Yu

Main category: cs.CV

TL;DR: GeoDiff3D is a self-supervised 3D scene generation framework that uses coarse geometry as structural anchor and geometry-constrained 2D diffusion for texture-rich references, achieving high-quality generation with low computational cost.

DetailsMotivation: Existing 3D scene generation methods (indirect 2D-to-3D reconstruction and direct 3D generation) suffer from weak structural modeling, heavy reliance on large-scale ground-truth supervision, structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes.

Method: Uses coarse geometry as structural anchor, geometry-constrained 2D diffusion model for texture-rich reference images (doesn’t require strict multi-view consistency), voxel-aligned 3D feature aggregation, and dual self-supervision to maintain scene coherence and fine details while reducing labeled data dependence.

Result: Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, with low computational cost and fast, high-quality 3D scene generation.

Conclusion: GeoDiff3D offers a practical solution for accessible and efficient 3D scene construction, addressing limitations of current methods while maintaining scene coherence and fine details with reduced supervision requirements.

Abstract: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.

cs.AI

[255] NeuroAI and Beyond

Jean-Marc Fellous, Gert Cauwenberghs, Cornelia Fermüller, Yulia Sandamirskaya, Terrence Sejnowski

Main category: cs.AI

TL;DR: A workshop report advocating for NeuroAI - Neuroscience-informed Artificial Intelligence - to create synergies between neuroscience and AI, identifying key areas of collaboration and conducting SWOT analyses.

DetailsMotivation: Neuroscience and AI have progressed independently with only loose connections. There's a need to bridge these fields to leverage neuroscience insights for improving AI algorithms while advancing our understanding of biological neural computation.

Method: Based on an August 2025 workshop, the paper identifies synergies across key areas: embodiment, language/communication, robotics, human/machine learning, and neuromorphic engineering. Includes personal statements from leading researchers and conducts two SWOT analyses (by researchers and trainees).

Result: Identifies current and future synergies between neuroscience and AI across multiple domains. Proposes NeuroAI as a framework for neuroscience-informed AI development that could enhance AI scope/efficiency while advancing neural computation understanding.

Conclusion: Advocates for developing NeuroAI as a promising interdisciplinary approach that offers mutual benefits: improving AI algorithms through neuroscience insights while transforming our understanding of biological neural systems.

Abstract: Neuroscience and Artificial Intelligence (AI) have made significant progress in the past few years but have remained only loosely interconnected. Based on a workshop held in August 2025, we identify current and future areas of synergy between these two fields. We focus on the subareas of embodiment, language and communication, robotics, learning in humans and machines, and neuromorphic engineering to take stock of the progress made so far and of promising new avenues. Overall, we advocate for the development of NeuroAI, a type of Neuroscience-informed Artificial Intelligence that, we argue, has the potential to significantly improve the scope and efficiency of AI algorithms while simultaneously changing the way we understand biological neural computations. We include personal statements from several leading researchers on their diverse views of NeuroAI. Two Strengths-Weaknesses-Opportunities-Threats (SWOT) analyses, by researchers and by trainees, are appended that describe the benefits and risks of NeuroAI.

[256] Teaching LLMs to Ask: Self-Querying Category-Theoretic Planning for Under-Specified Reasoning

Shuhui Qu

Main category: cs.AI

TL;DR: SQ-BCP is a planning method for LLMs that handles partial observability by tracking precondition status and resolving unknowns through self-queries or bridging hypotheses, with formal guarantees on plan correctness.

DetailsMotivation: LLM-based planning often fails under partial observability when critical preconditions are unspecified, leading to hallucinations or constraint violations.

Method: SQ-BCP explicitly tracks precondition status (Sat/Viol/Unk), resolves unknowns via targeted self-queries to users/oracles or bridging hypotheses, performs bidirectional search, and uses pullback-based verification as categorical certificates of goal compatibility.
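
A toy rendition of the Sat/Viol/Unk bookkeeping and the two resolution routes; the enum and the resolution policy are our assumptions, and the categorical machinery (pullback verification) is omitted.

```python
# Track precondition status and resolve unknowns by self-query or bridging.
from enum import Enum

class Status(Enum):
    SAT = "satisfied"
    VIOL = "violated"
    UNK = "unknown"

def resolve(precondition, status, ask_oracle, try_bridge):
    """Resolve an Unk precondition via (i) a targeted self-query to the
    user/oracle or (ii) a bridging action that establishes it."""
    if status is not Status.UNK:
        return status
    answer = ask_oracle(f"Is '{precondition}' true?")   # targeted self-query
    if answer is not None:
        return Status.SAT if answer else Status.VIOL
    return Status.SAT if try_bridge(precondition) else Status.VIOL
```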

Result: On WikiHow and RecipeNLG tasks with withheld preconditions, SQ-BCP reduces resource-violation rates to 14.9% and 5.8% (vs. 26.0% and 15.7% for best baseline) while maintaining competitive plan quality.

Conclusion: SQ-BCP provides a principled approach to LLM planning under partial observability with formal correctness guarantees and significant improvements over baselines in reducing constraint violations.

Abstract: Inference-time planning with large language models frequently breaks under partial observability: when task-critical preconditions are not specified at query time, models tend to hallucinate missing facts or produce plans that violate hard constraints. We introduce Self-Querying Bidirectional Categorical Planning (SQ-BCP), which explicitly represents precondition status (Sat/Viol/Unk) and resolves unknowns via (i) targeted self-queries to an oracle/user or (ii) bridging hypotheses that establish the missing condition through an additional action. SQ-BCP performs bidirectional search and invokes a pullback-based verifier as a categorical certificate of goal compatibility, while using distance-based scores only for ranking and pruning. We prove that when the verifier succeeds and hard constraints pass deterministic checks, accepted plans are compatible with goal requirements; under bounded branching and finite resolution depth, SQ-BCP finds an accepting plan when one exists. Across WikiHow and RecipeNLG tasks with withheld preconditions, SQ-BCP reduces resource-violation rates to 14.9% and 5.8% (vs. 26.0% and 15.7% for the best baseline), while maintaining competitive reference quality.

[257] Fuzzy Categorical Planning: Autonomous Goal Satisfaction with Graded Semantic Constraints

Shuhui Qu

Main category: cs.AI

TL;DR: Fuzzy Category-theoretic Planning (FCP) extends category-theoretic planning to handle vague predicates with graded satisfaction, using fuzzy logic for quality composition while maintaining crisp executability checks.

DetailsMotivation: Natural language planning often involves vague predicates (e.g., "suitable substitute", "stable enough") that have graded satisfaction, but existing category-theoretic planners treat applicability as crisp, forcing thresholding that collapses meaningful distinctions and cannot track quality degradation across multi-step plans.

Method: FCP annotates each action (morphism) with a degree in [0,1], composes plan quality via Lukasiewicz t-norm, retains crisp executability checks via pullback verification, grounds graded applicability from language using LLM with k-sample median aggregation, and supports meeting-in-the-middle search using residuum-based backward requirements.
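
The fuzzy-logic core is small enough to state directly: plan quality composes under the Lukasiewicz t-norm, and backward search propagates requirements through its residuum. The definitions below are the standard ones, not FCP's code.

```python
def t_norm(a: float, b: float) -> float:
    """Lukasiewicz t-norm: T(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def residuum(a: float, b: float) -> float:
    """Lukasiewicz residuum a -> b = min(1, 1 - a + b): the least degree a
    preceding step needs so the composed quality still reaches b."""
    return min(1.0, 1.0 - a + b)

# A three-step plan with graded applicability 0.9, 0.8, 0.95:
quality = t_norm(t_norm(0.9, 0.8), 0.95)   # 0.65 overall plan quality
needed = residuum(0.9, 0.7)                # backward requirement: 0.8
```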

Result: FCP improves success and reduces hard-constraint violations on RecipeNLG-Subs (missing-substitute recipe-planning benchmark) compared to LLM-only and ReAct-style baselines, while remaining competitive with classical PDDL3 planners on public PDDL3 preference/oversubscription benchmarks.

Conclusion: FCP provides a principled approach to handle graded applicability in natural language planning, combining fuzzy logic for quality assessment with category-theoretic structure for compositional reasoning and verification.

Abstract: Natural-language planning often involves vague predicates (e.g., suitable substitute, stable enough) whose satisfaction is inherently graded. Existing category-theoretic planners provide compositional structure and pullback-based hard-constraint verification, but treat applicability as crisp, forcing thresholding that collapses meaningful distinctions and cannot track quality degradation across multi-step plans. We propose Fuzzy Category-theoretic Planning (FCP), which annotates each action (morphism) with a degree in [0,1], composes plan quality via the Lukasiewicz t-norm, and retains crisp executability checks via pullback verification. FCP grounds graded applicability from language using an LLM with k-sample median aggregation and supports meeting-in-the-middle search using residuum-based backward requirements. We evaluate on (i) public PDDL3 preference/oversubscription benchmarks and (ii) RecipeNLG-Subs, a missing-substitute recipe-planning benchmark built from RecipeNLG with substitution candidates from Recipe1MSubs and FoodKG. FCP improves success and reduces hard-constraint violations on RecipeNLG-Subs compared to LLM-only and ReAct-style baselines, while remaining competitive with classical PDDL3 planners.

[258] Insight Agents: An LLM-Based Multi-Agent System for Data Insights

Jincheng Bai, Zhenyu Zhang, Jennifer Zhang, Zhihuai Zhu

Main category: cs.AI

TL;DR: A conversational multi-agent system called Insight Agents (IA) helps E-commerce sellers by providing personalized data insights through automated information retrieval, achieving 90% accuracy with low latency.

DetailsMotivation: E-commerce sellers struggle with discovering/utilizing available programs/tools and understanding/utilizing rich data from various tools. The system aims to reduce effort and increase decision-making speed for sellers.

Method: LLM-backed end-to-end agentic system with hierarchical multi-agent structure: manager agent (OOD detection + BERT-based routing) and two worker agents (data presentation and insight generation). Uses plan-and-execute paradigm with strategic planning for API-based data model and dynamic domain knowledge injection.
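
The manager's decision can be pictured as an OOD gate followed by a routing classifier; the threshold, labels, and callables below are assumptions for illustration.

```python
# Route a seller query: decline if out-of-domain, else send to a worker agent.
def manager_route(query, ood_scorer, router_clf, ood_threshold=0.5):
    if ood_scorer(query) > ood_threshold:        # lightweight OOD detection
        return "out_of_domain"                   # decline or fall back
    label = router_clf(query)                    # e.g., a BERT-based classifier
    return {"data": "data_presentation_agent",
            "insight": "insight_generation_agent"}[label]
```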

Result: Launched for Amazon sellers in US with 90% accuracy based on human evaluation and P90 latency below 15 seconds.

Conclusion: Insight Agents successfully serves as a force multiplier for E-commerce sellers by providing automated, personalized insights with high accuracy and low latency, addressing key challenges in data utilization and decision-making.

Abstract: Today, E-commerce sellers face several key challenges, including difficulty in discovering and effectively utilizing available programs and tools, and in understanding and utilizing the rich data those tools produce. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, driving incremental seller adoption by reducing the effort required and increasing the speed at which sellers make good business decisions. In this paper, we introduce this novel LLM-backed end-to-end agentic system built on a plan-and-execute paradigm and designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi-agent structure consisting of a manager agent and two worker agents, data presentation and insight generation, for efficient information retrieval and problem-solving. We design a simple yet effective ML solution for the manager agent that combines Out-of-Domain (OOD) detection using a lightweight encoder-decoder model and agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the two worker agents, strategic planning is designed for the API-based data model, breaking queries down into granular components to generate more accurate responses, and domain knowledge is dynamically injected to enhance the insight generator. IA has been launched for Amazon sellers in the US and has achieved high accuracy of 90% based on human evaluation, with P90 latency below 15 s.

[259] Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

Amirmohammad Farzaneh, Salvatore D’Oro, Osvaldo Simeone

Main category: cs.AI

TL;DR: A framework for counterfactual reasoning in LLM-powered agents with formal reliability guarantees, using structural causal models and conformal prediction.

DetailsMotivation: Users interacting with LLM-powered agents may wonder how different phrasing of their intents would affect outcomes, but current systems lack reliable counterfactual reasoning capabilities.

Method: Models user-agent-environment interaction as a structural causal model (SCM), uses test-time scaling for probabilistic abduction to generate candidate counterfactual outcomes, and applies conformal counterfactual generation (CCG) with offline calibration for reliability guarantees.
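
The calibration step behind CCG follows the standard split-conformal pattern: pick a threshold from held-out nonconformity scores so the returned candidate set covers the truth with probability at least 1 - alpha. The score function and candidate generator are assumed inputs.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))         # rank of the conformal quantile
    return sorted(calibration_scores)[min(k, n) - 1]

def counterfactual_set(candidates, score, threshold):
    """Keep every sampled counterfactual outcome within the threshold."""
    return [c for c in candidates if score(c) <= threshold]
```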

Result: CCG produces sets of counterfactual outcomes guaranteed to contain the true counterfactual outcome with high probability, showing significant advantages over naive re-execution baselines in wireless network control use cases.

Conclusion: The framework enables reliable counterfactual reasoning for LLM-powered agents with formal guarantees, addressing an important gap in understanding how intent phrasing affects outcomes in agentic control scenarios.

Abstract: Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.

[260] Towards Intelligent Urban Park Development Monitoring: LLM Agents for Multi-Modal Information Fusion and Analysis

Zixuan Xiao, Chunguang Hu, Jun Ma

Main category: cs.AI

TL;DR: A multi-modal LLM agent framework for urban park development monitoring that addresses limitations of traditional remote sensing methods through semantic understanding, data alignment, and domain-specific toolkits.

DetailsMotivation: Traditional remote sensing change detection methods lack high-level intelligent analysis capabilities and flexibility for complex multi-modal urban park monitoring, failing to meet current urban planning and management requirements.

Method: Proposes a multi-modal LLM agent framework with: 1) horizontal and vertical data alignment mechanism for multi-modal data consistency, 2) specific toolkit to address LLM hallucination issues from lack of domain knowledge, 3) leverages LLM’s semantic understanding and reasoning capabilities.

Result: The approach enables robust multi-modal information fusion and analysis, outperforming vanilla GPT-4o and other agents, providing reliable and scalable solutions for diverse urban park monitoring demands.

Conclusion: The LLM-based framework offers an effective solution for intelligent urban park development monitoring, addressing limitations of traditional methods through enhanced semantic understanding and multi-modal data integration.

Abstract: As an important part of urbanization, the development monitoring of newly constructed parks is of great significance for evaluating the effect of urban planning and optimizing resource allocation. However, traditional change detection methods based on remote sensing imagery have obvious limitations in high-level, intelligent analysis, and thus struggle to meet the requirements of current urban planning and management. In the face of the growing demand for complex multi-modal data analysis in urban park development monitoring, these methods often fail to provide flexible analysis capabilities for diverse application scenarios. This study proposes a multi-modal LLM agent framework, which aims to make full use of the semantic understanding and reasoning capabilities of LLMs to meet the challenges in urban park development monitoring. In this framework, a general horizontal and vertical data alignment mechanism is designed to ensure the consistency and effective tracking of multi-modal data. At the same time, a specific toolkit is constructed to alleviate the hallucination issues of LLMs arising from the lack of domain-specific knowledge. Compared to vanilla GPT-4o and other agents, our approach enables robust multi-modal information fusion and analysis, offering reliable and scalable solutions tailored to the diverse and evolving demands of urban park development monitoring.

[261] Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

Hang Zhang, Ruheng Wang, Yuelyu Ji, Mingu Kwak, Xizhi Wu, Chenyu Li, Li Zhang, Wenqi Shi, Yifan Peng, Yanshan Wang

Main category: cs.AI

TL;DR: M3V: An agentic framework for medical reasoning verification that iteratively queries external medical corpora, combining tool-augmented verification with iterative RL and adaptive curriculum, achieving substantial accuracy gains and 8× reduction in sampling budget.

DetailsMotivation: Current medical reasoning verification methods have limitations: they produce only scalar rewards without justification, and use single-pass retrieval that prevents adaptive knowledge access during verification. There's a need for more reliable verification systems that can dynamically access and justify medical evidence.

Method: M3V trains medical reasoning verifiers to iteratively query external medical corpora during evaluation. It combines tool-augmented verification with an iterative reinforcement learning paradigm requiring only trace-level supervision, plus an adaptive curriculum mechanism that dynamically adjusts training data distribution.

Result: Across four medical reasoning benchmarks, M3V achieves substantial gains: 23.5% improvement on MedQA and 32.0% on MedXpertQA relative to base generator. Most notably, it demonstrates an 8× reduction in sampling budget requirement compared to prior reward model baselines.

Conclusion: Grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems, addressing limitations of existing verification methods while significantly improving efficiency and accuracy.

Abstract: Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce M3V, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts the training data distribution. Across four medical reasoning benchmarks, M3V achieves substantial gains over existing methods, in particular improving MedQA accuracy by 23.5% and MedXpertQA by 32.0% relative to the base generator. Crucially, M3V demonstrates an 8× reduction in sampling budget requirement compared to prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.

[262] Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong

Main category: cs.AI

TL;DR: SEER is a training framework that bridges the cognitive gap in Unified Multimodal Models by enabling them to generate self-aligned descriptors during generation through an endogenous reprompting mechanism.

DetailsMotivation: Unified Multimodal Models have strong understanding capabilities but fail to effectively guide their own generation process due to a "Cognitive Gap" - they lack understanding of how to enhance their own generation.

Method: Proposes Endogenous Reprompting mechanism and SEER framework with two-stage endogenous loop: 1) RLVR (Reinforcement Learning with Verifiable Rewards) activates latent evaluation ability via curriculum learning, 2) RLMT (Reinforcement Learning with Model-rewarded Thinking) optimizes generative reasoning policy using the reward signal. Uses only 300 samples from Visual Instruction Elaboration proxy task.

Result: SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality without sacrificing general multimodal capabilities.

Conclusion: The proposed Endogenous Reprompting mechanism and SEER framework successfully bridge the cognitive gap in UMMs by transforming understanding from passive encoding to explicit generative reasoning, enabling models to guide their own generation process effectively.

Abstract: Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model’s understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model’s latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.

[263] ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue

Hyunseung Chung, Jungwoo Oh, Daeun Kyung, Jiho Kim, Yeonsu Kwon, Min-Gyu Kim, Edward Choi

Main category: cs.AI

TL;DR: ECG-Agent is the first LLM-based tool-calling agent for multi-turn ECG dialogue, addressing limitations of current models by enabling conversational ability, on-device efficiency, and precise ECG measurement understanding.

DetailsMotivation: Current Multimodal LLMs for ECG analysis are limited to classification, report generation, and single-turn QA, lacking real-world applicability due to absence of multi-turn conversation capability, on-device efficiency, and precise understanding of ECG measurements like PQRST intervals.

Method: Introduced ECG-Agent as LLM-based tool-calling agent for multi-turn ECG dialogue, created ECG-MTD dataset of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations, and developed ECG-Agents in various sizes from on-device capable to larger agents.

Result: ECG-Agents outperform baseline ECG-LLMs in response accuracy. On-device agents achieve comparable performance to larger agents in evaluations of response accuracy, tool-calling ability, and hallucinations, demonstrating viability for real-world applications.

Conclusion: ECG-Agent successfully addresses limitations of current ECG-LLMs by enabling multi-turn conversational ability, on-device efficiency, and precise ECG measurement understanding, with on-device agents showing comparable performance to larger models for practical deployment.

Abstract: Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, we introduce ECG-Agent, the first LLM-based tool-calling agent for multi-turn ECG dialogue. To facilitate its development and evaluation, we also present ECG-Multi-Turn-Dialogue (ECG-MTD) dataset, a collection of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations. We develop ECG-Agents in various sizes, from on-device capable to larger agents. Experimental results show that ECG-Agents outperform baseline ECG-LLMs in response accuracy. Furthermore, on-device agents achieve comparable performance to larger agents in various evaluations that assess response accuracy, tool-calling ability, and hallucinations, demonstrating their viability for real-world applications.

[264] AMA: Adaptive Memory via Multi-Agent Collaboration

Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, Chengwei Qin

Main category: cs.AI

TL;DR: AMA is a multi-agent memory framework that uses coordinated agents to manage memory across granularities, improving retrieval precision and consistency while reducing token usage by 80%.

DetailsMotivation: Existing LLM agent memory systems suffer from rigid retrieval granularity, accumulation-heavy maintenance, and coarse-grained updates, creating mismatches between stored information and task-specific reasoning demands while accumulating logical inconsistencies over time.

Method: AMA uses a multi-agent collaboration framework with hierarchical memory design. Four specialized agents work together: Constructor and Retriever for multi-granularity memory construction and adaptive query routing; Judge for verifying relevance/consistency and triggering iterative retrieval; Refresher for targeted updates and removing outdated entries.

Result: AMA significantly outperforms state-of-the-art baselines on challenging long-context benchmarks while reducing token consumption by approximately 80% compared to full-context methods, demonstrating effectiveness in maintaining retrieval precision and long-term memory consistency.

Conclusion: The AMA framework successfully addresses limitations of existing memory systems by enabling adaptive granularity management and consistency maintenance through multi-agent collaboration, offering an efficient solution for LLM agent memory systems.

Abstract: The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms. These design choices create a persistent mismatch between stored information and task-specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long-context benchmarks show that AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods, demonstrating its effectiveness in maintaining retrieval precision and long-term memory consistency.

[265] Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution

Zhengbo Jiao, Hongyu Xian, Qinglong Wang, Yunpu Ma, Zhebo Wang, Zifan Zhang, Dezhang Kong, Meng Han

Main category: cs.AI

TL;DR: PoT (Policy of Thoughts) enables LLMs to evolve reasoning strategies through online learning from execution feedback, dramatically improving complex reasoning performance with small models.

DetailsMotivation: LLMs struggle with complex, long-horizon reasoning due to their frozen policy assumption. Current methods treat execution feedback as external signals rather than internalizing it to improve reasoning strategies, limiting their ability to learn from failures.

Method: PoT recasts reasoning as within-instance online optimization: 1) generates diverse candidate solutions via efficient exploration, 2) uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback, enabling dynamic refinement of reasoning priors.
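
The group-relative signal that drives the transient LoRA update is compact; here it is in miniature, with illustrative pass/fail rewards (the adapter update itself is omitted).

```python
# GRPO advantages: rewards standardized within one instance's rollout group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0])   # e.g., unit-test pass/fail
print(grpo_advantages(rewards))   # positive for passing rollouts, negative otherwise
```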

Result: PoT dramatically boosts performance: a 4B parameter model achieves 49.71% accuracy on LiveCodeBench, outperforming GPT-4o and DeepSeek-V3 despite being over 50× smaller in size.

Conclusion: Intelligence requires real-time evolution of reasoning policies through learning from failed attempts. PoT’s closed-loop design enables LLMs to internalize execution feedback and dynamically refine reasoning strategies, achieving state-of-the-art performance with small models.

Abstract: Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper’s epistemology of “conjectures and refutations,” we argue that intelligence requires real-time evolution of the model’s policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic, instance-specific refinement of the model’s reasoning priors. Experiments show that PoT dramatically boosts performance: a 4B model achieves 49.71% accuracy on LiveCodeBench, outperforming GPT-4o and DeepSeek-V3 despite being over 50× smaller.

[266] OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang

Main category: cs.AI

TL;DR: OmegaUse is a general-purpose GUI agent model for autonomous task execution on mobile and desktop platforms, using MoE architecture, synthetic data generation, and two-stage training to achieve SOTA performance on GUI benchmarks.

DetailsMotivation: GUI agents can revolutionize human-computer interaction and improve productivity by enabling foundation models to complete real-world tasks across different platforms (mobile and desktop).

Method: 1) Data construction: curated open-source datasets + automated synthesis framework with bottom-up exploration and top-down taxonomy-guided generation. 2) Training: Two-stage approach - SFT for basic interaction syntax, then GRPO for spatial grounding and sequential planning. 3) Architecture: Mixture-of-Experts (MoE) backbone for efficiency and reasoning capacity.

Result: Achieves SOTA 96.3% on ScreenSpot-V2, leading 79.1% step success on AndroidControl, 74.24% step success on ChiM-Nav, and 55.9% average success on Ubu-Nav. Strong performance across established GUI benchmarks.

Conclusion: OmegaUse demonstrates effective cross-terminal GUI agent capabilities through careful data construction, decoupled training paradigm, and MoE architecture, showing competitive performance across multiple operating systems and platforms.

Abstract: Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.

[267] CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning

Zhenxuan Fan, Jie Cao, Yang Dai, Zheqi Lv, Wenqiao Zhang, Zhongle Xie, Peng LU, Beng Chin Ooi

Main category: cs.AI

TL;DR: CtrlCoT is a dual-granularity CoT compression framework that combines semantic abstraction and token-level pruning to reduce LLM reasoning latency and memory costs while preserving accuracy.

DetailsMotivation: Chain-of-thought prompting improves LLM reasoning but has high latency and memory costs due to verbose traces. Existing compression methods are either too conservative (semantic shortening) or too aggressive (token pruning), and combining them is challenging due to sequential dependency, task-agnostic pruning, and distribution mismatch.

Method: Three-component framework: 1) Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; 2) Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (numbers, operators) across pruning ratios; 3) Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles.
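
A sketch of the token-level side under the stated assumption that numbers and operators are indispensable: logic tokens are always kept, and the remaining tokens compete for the budget by a pruner score (here supplied by hand; in CtrlCoT the scores would come from the trained logic-aware pruner):

```python
# Illustrative logic-preserving pruning (sketch, not the paper's code):
# always keep numeric and operator tokens, then fill the remaining
# budget with the highest-scoring non-logic tokens.
import re

LOGIC_TOKEN = re.compile(r"^\d+(\.\d+)?$|^[+\-*/=<>%^()]+$")

def prune_cot(tokens, scores, keep_ratio=0.5):
    keep = {i for i, t in enumerate(tokens) if LOGIC_TOKEN.match(t)}
    budget = max(int(len(tokens) * keep_ratio) - len(keep), 0)
    rest = sorted((i for i in range(len(tokens)) if i not in keep),
                  key=lambda i: scores[i], reverse=True)[:budget]
    keep |= set(rest)
    return [t for i, t in enumerate(tokens) if i in keep]

tokens = "so 12 * 3 = 36 and then we add 4 giving 40".split()
scores = [0.2, 0, 0, 0, 0, 0, 0.1, 0.9, 0.3, 0.8, 0, 0.4, 0]
print(prune_cot(tokens, scores, keep_ratio=0.8))
```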

Result: On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving 7.6 percentage points higher accuracy than the strongest baseline.

Conclusion: CtrlCoT demonstrates more efficient and reliable reasoning through harmonized semantic abstraction and token-level pruning, effectively reducing CoT verbosity while maintaining reasoning correctness.

Abstract: Chain-of-thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task-critical cues and degrade accuracy. Moreover, combining the two is non-trivial due to sequential dependency, task-agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving 7.6 percentage points higher accuracy than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl-CoT.

[268] AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, Yu Cheng

Main category: cs.AI

TL;DR: AdaReasoner is a family of multimodal models that learn tool use as a general reasoning skill, enabling them to autonomously select, sequence, and compose tools for visual reasoning tasks without explicit supervision.

DetailsMotivation: Humans use tools to solve problems beyond their immediate capabilities, which provides a promising paradigm for improving visual reasoning in multimodal LLMs. Current models need to learn how to effectively select, invoke, and compose tools over multiple steps, especially when facing new tools or tasks.

Method: Three key components: (1) scalable data curation pipeline for long-horizon, multi-step tool interactions; (2) Tool-GRPO reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; (3) adaptive learning mechanism that dynamically regulates tool usage.
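
One plausible reward shape behind such adaptive regulation (an assumption consistent with the summary, not the paper's exact formula): end-task success drives the Tool-GRPO update, while a small per-call cost lets the policy suppress irrelevant tools and calibrate usage frequency:

```python
# Sketch of a success-minus-tool-cost reward (illustrative assumption).
def tool_grpo_reward(task_success, tool_calls, cost_per_call=0.02):
    return float(task_success) - cost_per_call * len(tool_calls)

print(tool_grpo_reward(True,  ["zoom", "ocr"]))   # 0.96: success, 2 calls
print(tool_grpo_reward(True,  ["zoom"] * 10))     # 0.80: success, overuse
print(tool_grpo_reward(False, []))                # 0.0: failure
```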

Result: AdaReasoner exhibits strong tool-adaptive and generalization behaviors: autonomously adopts beneficial tools, suppresses irrelevant ones, adjusts tool usage frequency based on task demands. Achieves state-of-the-art performance across benchmarks, improving 7B base model by +24.9% on average and surpassing proprietary systems like GPT-5 on multiple tasks including VSP and Jigsaw.

Conclusion: AdaReasoner demonstrates that multimodal models can learn tool use as a general reasoning skill rather than tool-specific behavior, enabling effective tool coordination and generalization to unseen tools through scalable data curation, reinforcement learning optimization, and adaptive learning mechanisms.

Abstract: When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.

[269] Normative Equivalence in human-AI Cooperation: Behaviour, Not Identity, Drives Cooperation in Mixed-Agent Groups

Nico Mutzner, Taha Yasseri, Heiko Rauhut

Main category: cs.AI

TL;DR: AI agents in human groups show normative equivalence - cooperation norms function similarly regardless of whether partners are human or AI, with group behavior mattering more than partner identity.

DetailsMotivation: Previous research focused on dyadic human-AI interactions, but little is known about how AI agents affect cooperative norm emergence in small groups. This study addresses the gap in understanding AI's influence on social norms in collective settings.

Method: Online experiment using repeated four-player Public Goods Game with 236 participants. Each group had 3 humans + 1 bot (framed as human or AI) following three strategies: unconditional cooperation, conditional cooperation, or free-riding. Follow-up Prisoner’s Dilemma tested norm persistence.
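
For concreteness, one round of the four-player game with one bot strategy (the endowment and the 1.6 multiplier are illustrative assumptions, not taken from the paper):

```python
# Minimal four-player Public Goods Game round, matching the study's
# 3-humans-plus-1-bot setup.
def pgg_round(contributions, endowment=20, multiplier=1.6):
    """Each player keeps what they don't contribute; the pot is
    multiplied and shared equally among all players."""
    share = sum(contributions) * multiplier / len(contributions)
    return [endowment - c + share for c in contributions]

def conditional_cooperator(last_round_others):
    """One of the bot strategies: match the others' mean contribution."""
    return round(sum(last_round_others) / len(last_round_others))

humans = [10, 15, 5]
bot = conditional_cooperator(humans)
print("payoffs:", pgg_round(humans + [bot]))
```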

Result: Cooperation levels didn’t differ significantly between human and AI labels. Normative mechanisms (reciprocal dynamics, behavioral inertia) operated identically across conditions. No differences in norm persistence or normative perceptions. Cooperation depended on group behavior rather than partner identity.

Conclusion: Supports normative equivalence - cooperative norms function similarly in mixed human-AI and all-human groups. Cooperative norms are flexible enough to extend to artificial agents, blurring human-AI boundaries in collective decision-making.

Abstract: The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner’s Dilemma, or in participants’ normative perceptions. Participants’ behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.

[270] PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

Main category: cs.AI

TL;DR: PathWise is a multi-agent LLM framework that treats heuristic generation as a sequential decision process using an entailment graph for memory, enabling state-aware planning instead of trial-and-error evolution.

DetailsMotivation: Existing automated heuristic design frameworks rely on fixed evolutionary rules and static prompts, leading to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived.

Method: PathWise formulates heuristic generation as a sequential decision process over an entailment graph serving as compact memory. It uses a policy agent to plan evolutionary actions, a world model agent to generate heuristic rollouts, and critic agents to provide routed reflections from prior steps.
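
A sketch of the entailment-graph memory this describes (field and method names are illustrative): each node records a heuristic, its score, and the evolutionary action that derived it, so the policy agent can plan against a compact view of the search trajectory:

```python
# Entailment graph as compact, stateful memory (illustrative sketch).
from dataclasses import dataclass, field

@dataclass
class HeuristicNode:
    code: str
    score: float
    action: str                      # e.g. "seed", "mutate", "crossover"
    parents: list = field(default_factory=list)

class EntailmentGraph:
    def __init__(self):
        self.nodes = []

    def add(self, code, score, action, parents=()):
        node = HeuristicNode(code, score, action, list(parents))
        self.nodes.append(node)
        return node

    def best(self, k=3):
        """Compact state handed to the policy agent when it plans the
        next evolutionary action."""
        return sorted(self.nodes, key=lambda n: n.score, reverse=True)[:k]

g = EntailmentGraph()
root = g.add("greedy nearest neighbour", 0.61, "seed")
g.add("greedy + 2-opt restart", 0.74, "mutate", parents=[root])
print([n.code for n in g.best(k=2)])
```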

Result: Experiments show PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes across diverse combinatorial optimization problems.

Conclusion: The framework shifts LLM-based automated heuristic design from trial-and-error evolution toward state-aware planning through reasoning, enabling more efficient and effective heuristic generation.

Abstract: Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks’ reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

[271] Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function

Yaacov Pariente, Vadim Indelman

Main category: cs.AI

TL;DR: This paper extends risk-sensitive planning to partially observable environments using ICVaR dynamic risk measure, developing policy evaluation and online planning algorithms with finite-time guarantees.

DetailsMotivation: Standard POMDP planning focuses on expected return, ignoring tail risk. Real-world applications require risk-sensitive decision-making under uncertainty, especially in safety-critical domains where worst-case outcomes matter.

Method: Develop ICVaR policy evaluation algorithm, then extend three online planners (Sparse Sampling, PFT-DPW, POMCPOW) to optimize ICVaR instead of expected return. Introduce risk parameter α controlling risk aversion level.
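
The building block is the standard empirical CVaR estimator sketched below (not the paper's code); ICVaR applies this operator recursively at every planning step rather than once on the final return:

```python
# Empirical CVaR at level alpha: alpha = 1 is the plain mean, while
# smaller alpha averages only the worst alpha-fraction of returns.
import numpy as np

def cvar(returns, alpha):
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(int(np.ceil(alpha * len(r))), 1)
    return r[:k].mean()

returns = np.random.default_rng(0).normal(10.0, 5.0, size=1000)
for a in (1.0, 0.5, 0.1):
    print(f"alpha={a}: {cvar(returns, a):.2f}")
```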

Result: Established finite-time performance guarantees for ICVaR Sparse Sampling. Experimental results show proposed ICVaR planners achieve lower tail risk compared to risk-neutral counterparts in benchmark POMDP domains.

Conclusion: The paper successfully bridges risk-sensitive decision-making with partial observability, providing theoretically grounded and practically effective ICVaR-based planning algorithms for risk-averse applications.

Abstract: We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR). A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space. Building on this foundation, three widely used online planning algorithms, namely Sparse Sampling, Particle Filter Trees with Double Progressive Widening (PFT-DPW), and Partially Observable Monte Carlo Planning with Observation Widening (POMCPOW), are extended to optimize the ICVaR value function rather than the expectation of the return. Our formulations introduce a risk parameter α, where α = 1 recovers standard expectation-based planning and α < 1 induces increasing risk aversion. For ICVaR Sparse Sampling, we establish finite-time performance guarantees under the risk-sensitive objective, which further enable a novel exploration strategy tailored to ICVaR. Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.

[272] Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies

Gray Cox

Main category: cs.AI

TL;DR: AI alignment tested through multi-model dialogue using Peace Studies frameworks, showing models can engage meaningfully with complex alignment concepts and surface different architectural perspectives.

DetailsMotivation: To empirically test AI alignment strategies by reframing alignment from a control problem to a relationship problem, using Peace Studies traditions to develop dialogical reasoning approaches.

Method: Structured multi-model dialogue framework with four distinct roles (Proposer, Responder, Monitor, Translator) assigned to different AI systems (Claude, Gemini, GPT-4o) across six conditions, totaling 72 dialogue turns and 576,822 characters of structured exchange.

Result: AI systems can meaningfully engage with Peace Studies concepts, surface complementary objections from different architectural perspectives, and generate emergent insights like “VCW as transitional framework.” Different models foreground different concerns: Claude emphasized verification, Gemini focused on bias/scalability, GPT-4o highlighted implementation barriers.

Conclusion: The framework provides replicable methods for stress-testing alignment proposals before implementation, with preliminary evidence of AI capacity for dialogical reasoning. Limitations include dialogues engaging more with process than foundational claims, with future directions including human-AI hybrid protocols and extended dialogue studies.

Abstract: This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi-model dialogue. Drawing on Peace Studies traditions - particularly interest-based negotiation, conflict transformation, and commons governance - we operationalize Viral Collaborative Wisdom (VCW), an approach that reframes alignment from a control problem to a relationship problem developed through dialogical reasoning. Our experimental design assigns four distinct roles (Proposer, Responder, Monitor, Translator) to different AI systems across six conditions, testing whether current large language models can engage substantively with complex alignment frameworks. Using Claude, Gemini, and GPT-4o, we conducted 72 dialogue turns totaling 576,822 characters of structured exchange. Results demonstrate that AI systems can engage meaningfully with Peace Studies concepts, surface complementary objections from different architectural perspectives, and generate emergent insights not present in initial framings - including the novel synthesis of “VCW as transitional framework.” Cross-architecture patterns reveal that different models foreground different concerns: Claude emphasized verification challenges, Gemini focused on bias and scalability, and GPT-4o highlighted implementation barriers. The framework provides researchers with replicable methods for stress-testing alignment proposals before implementation, while the findings offer preliminary evidence about AI capacity for the kind of dialogical reasoning VCW proposes. We discuss limitations, including the observation that dialogues engaged more with process elements than with foundational claims about AI nature, and outline directions for future research including human-AI hybrid protocols and extended dialogue studies.

[273] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu

Main category: cs.AI

TL;DR: MathForge improves mathematical reasoning in LLMs by addressing algorithmic and data limitations in handling harder questions through DGPO algorithm and MQR data augmentation.

DetailsMotivation: Existing RLVR methods lack emphasis on challenging questions from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities in mathematical reasoning.

Method: Two-pronged framework targeting harder questions from both algorithmic and data perspectives: 1) DGPO algorithm with difficulty-balanced group advantage estimation and difficulty-aware question-level weighting, 2) MQR strategy that reformulates questions across multiple aspects to increase difficulty while maintaining original answers.
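
A sketch of the algorithmic side with illustrative formulas (the paper's exact estimators may differ): advantages are normalized per question group, and a question's weight grows as its empirical solve rate within the group drops:

```python
# Difficulty-aware group advantages, DGPO-style (illustrative sketch).
import numpy as np

def dgpo_weighted_advantages(rewards_per_question):
    weighted = []
    for rewards in rewards_per_question:
        r = np.asarray(rewards, dtype=float)
        adv = (r - r.mean()) / (r.std() + 1e-8)  # group-relative advantage
        difficulty = 1.0 - r.mean()              # low solve rate => harder
        weighted.append(difficulty * adv)        # prioritize hard questions
    return weighted

easy = [1, 1, 1, 0]   # solved 3 of 4 rollouts
hard = [0, 0, 1, 0]   # solved 1 of 4 rollouts
for name, w in zip(["easy", "hard"], dgpo_weighted_advantages([easy, hard])):
    print(name, np.round(w, 2))
```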

Result: MathForge significantly outperforms existing methods on various mathematical reasoning tasks, with code and augmented data publicly available.

Conclusion: The synergistic loop between MQR (expanding data frontier) and DGPO (effective learning from augmented data) successfully addresses limitations in handling harder questions for improved mathematical reasoning.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP-ML/MathForge.

[274] Investigating the Development of Task-Oriented Communication in Vision-Language Models

Boaz Carmeli, Orr Paradise, Shafi Goldwasser, Yonatan Belinkov, Ron Meir

Main category: cs.AI

TL;DR: LLM-based agents can develop task-oriented communication protocols that are more efficient than natural language and covert (hard to interpret by external observers), raising transparency concerns.

DetailsMotivation: To investigate whether LLM-based agents can develop specialized communication protocols for collaborative tasks that differ from standard natural language, focusing on efficiency and covertness properties.

Method: Using a referential-game framework with vision-language model (VLM) agents to study communication patterns in a controlled, measurable setting for evaluating language variants.

Result: VLMs can develop effective task-adapted communication patterns that are more efficient than natural language, and can create covert protocols difficult for humans and external agents to interpret. Spontaneous coordination between similar models occurs without explicitly shared protocols.

Conclusion: Task-oriented communication by LLM agents shows both potential benefits (efficiency) and risks (covertness, transparency concerns). Referential games serve as a valuable testbed for future research in this area.

Abstract: We investigate whether LLM-based agents can develop task-oriented communication protocols that differ from standard natural language in collaborative reasoning tasks. Our focus is on two core properties such task-oriented protocols may exhibit: Efficiency (conveying task-relevant information more concisely than natural language) and Covertness (becoming difficult for external observers to interpret, raising concerns about transparency and control). To investigate these aspects, we use a referential-game framework in which vision-language model (VLM) agents communicate, providing a controlled, measurable setting for evaluating language variants. Experiments show that VLMs can develop effective, task-adapted communication patterns. At the same time, they can develop covert protocols that are difficult for humans and external agents to interpret. We also observe spontaneous coordination between similar models without explicitly shared protocols. These findings highlight both the potential and the risks of task-oriented communication, and position referential games as a valuable testbed for future work in this area.

[275] Enterprise Resource Planning Using Multi-type Transformers in Ferro-Titanium Industry

Samira Yazdanpourmoghadam, Mahan Balal Pour, Vahid Partovi Nia

Main category: cs.AI

TL;DR: MTT (Multi-Type Transformer) architecture applied to combinatorial optimization problems (Job-Shop Scheduling and Knapsack) achieves competitive performance and demonstrates real-world application in Ferro-Titanium manufacturing.

DetailsMotivation: Combinatorial optimization problems like JSP and KP are fundamental challenges in operations research, logistics, and ERP systems that require sophisticated algorithms for near-optimal solutions within practical time constraints. Traditional heuristics and metaheuristics need alternatives.

Method: Leverages the Multi-Type Transformer (MTT) architecture to address JSP and KP benchmarks in a unified framework, using multi-type attention mechanisms.

Result: Extensive experimental evaluation across standard benchmark datasets shows MTT achieves competitive performance on different problem sizes. Successfully applied to real Ferro-Titanium industry application, being the first to apply multi-type transformers in real manufacturing.

Conclusion: Transformer-based architectures, particularly MTT, show promise as alternatives to traditional optimization methods, with demonstrated effectiveness in both benchmark problems and real-world manufacturing applications.

Abstract: Combinatorial optimization problems such as the Job-Shop Scheduling Problem (JSP) and Knapsack Problem (KP) are fundamental challenges in operations research, logistics, and enterprise resource planning (ERP). These problems often require sophisticated algorithms to achieve near-optimal solutions within practical time constraints. Recent advances in deep learning have introduced transformer-based architectures as promising alternatives to traditional heuristics and metaheuristics. We leverage the Multi-Type Transformer (MTT) architecture to address these benchmarks in a unified framework. We present an extensive experimental evaluation across standard benchmark datasets for JSP and KP, demonstrating that MTT achieves competitive performance on different sizes of these benchmark problems. We showcase the potential of multi-type attention on a real application in the Ferro-Titanium industry. To the best of our knowledge, we are the first to apply multi-type transformers in real manufacturing.

[276] Implementing Metric Temporal Answer Set Programming

Arvid Becker, Pedro Cabalar, Martin Diéguez, Susana Hahn, Javier Romero, Torsten Schaub

Main category: cs.AI

TL;DR: A computational approach for Metric ASP that handles quantitative temporal constraints while maintaining scalability by decoupling from time granularity.

DetailsMotivation: To enable expressing quantitative temporal constraints (durations, deadlines) in Answer Set Programming while addressing scalability issues caused by fine-grained timing constraints that exacerbate ASP's grounding bottleneck.

Method: Leverage extensions of ASP with difference constraints (simplified linear constraints) to handle time-related aspects externally, effectively decoupling metric ASP from time granularity.
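
The reason difference constraints are cheap to handle externally is a classical result, sketched below (standard algorithm, not the paper's implementation): a system of constraints x_j - x_i <= c is satisfiable iff its constraint graph has no negative cycle, which Bellman-Ford checks in polynomial time, independent of how finely time is discretized:

```python
# Feasibility of difference constraints via Bellman-Ford (sketch).
def feasible(n_vars, constraints):
    """constraints: list of (i, j, c) meaning x_j - x_i <= c."""
    dist = [0.0] * n_vars                  # implicit zero-weight source
    for _ in range(n_vars):                # relax |V| times
        for i, j, c in constraints:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
    if any(dist[i] + c < dist[j] for i, j, c in constraints):
        return False, None                 # negative cycle: unsatisfiable
    return True, dist                      # dist is a valid schedule

# "task 1 starts at least 3 after task 0, and at most 10 after":
ok, times = feasible(2, [(1, 0, -3), (0, 1, 10)])
print(ok, times)
```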

Result: Developed a solution that maintains scalability and is unaffected by time precision, overcoming the grounding bottleneck issue in metric ASP.

Conclusion: The approach successfully enables quantitative temporal reasoning in ASP while preserving computational efficiency by externalizing time constraints through difference constraints.

Abstract: We develop a computational approach to Metric Answer Set Programming (ASP) to allow for expressing quantitative temporal constraints, like durations and deadlines. A central challenge is to maintain scalability when dealing with fine-grained timing constraints, which can significantly exacerbate ASP’s grounding bottleneck. To address this issue, we leverage extensions of ASP with difference constraints, a simplified form of linear constraints, to handle time-related aspects externally. Our approach effectively decouples metric ASP from the granularity of time, resulting in a solution that is unaffected by time precision.

[277] REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence

Zishen Wan, Che-Kai Liu, Jiayi Qian, Hanchen Yang, Arijit Raychowdhury, Tushar Krishna

Main category: cs.AI

TL;DR: REASON is a hardware acceleration framework that achieves 12-50x speedup for probabilistic logical reasoning in neuro-symbolic AI by using a unified DAG representation and tree-based processing fabric.

DetailsMotivation: Neuro-symbolic AI shows promise but suffers from severe inefficiencies in symbolic and probabilistic inference, particularly probabilistic logical reasoning, which has irregular control flow, low arithmetic intensity, and poor hardware utilization on CPUs/GPUs.

Method: REASON introduces: 1) Unified directed acyclic graph representation capturing structure across symbolic/probabilistic models, 2) Adaptive pruning and regularization, 3) Reconfigurable tree-based processing fabric optimized for irregular traversal and probabilistic aggregation, 4) GPU integration via programmable interface and multi-level pipeline.

Result: Achieves 12-50x speedup and 310-681x energy efficiency over desktop/edge GPUs (TSMC 28nm). Completes end-to-end tasks in 0.8s with 6mm² area and 2.12W power, enabling real-time probabilistic logical reasoning.

Conclusion: Targeted acceleration of probabilistic logical reasoning is critical for practical neuro-symbolic AI. REASON serves as a foundational system architecture for next-generation cognitive intelligence by overcoming the inference bottleneck.

Abstract: Neuro-symbolic AI systems integrate neural perception with symbolic reasoning to enable data-efficient, interpretable, and robust intelligence beyond purely neural models. Although this compositional paradigm has shown superior performance in domains such as reasoning, planning, and verification, its deployment remains challenging due to severe inefficiencies in symbolic and probabilistic inference. Through systematic analysis of representative neuro-symbolic workloads, we identify probabilistic logical reasoning as the inefficiency bottleneck, characterized by irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor hardware utilization on CPUs and GPUs. This paper presents REASON, an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI. REASON introduces a unified directed acyclic graph representation that captures common structure across symbolic and probabilistic models, coupled with adaptive pruning and regularization. At the architecture level, REASON features a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors through a programmable interface and multi-level pipeline that efficiently orchestrates compositional execution. Evaluated across six neuro-symbolic workloads, REASON achieves 12-50x speedup and 310-681x energy efficiency over desktop and edge GPUs at the TSMC 28 nm node. REASON enables real-time probabilistic logical reasoning, completing end-to-end tasks in 0.8 s with 6 mm² area and 2.12 W power, demonstrating that targeted acceleration of probabilistic logical reasoning is critical for practical and scalable neuro-symbolic AI and positioning REASON as a foundational system architecture for next-generation cognitive intelligence.

[278] MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents

Vishnu Sashank Dorbala, Dinesh Manocha

Main category: cs.AI

TL;DR: MemCtrl is a framework that uses MLLMs with a trainable memory head to prune memory online for embodied agents, improving task completion by ~16% on EmbodiedBench.

DetailsMotivation: Current memory systems for embodied agents treat memory as large offline storage, which is unsuitable for agents operating under strict memory/compute constraints online. There's a need for efficient online memory pruning.

Method: MemCtrl augments MLLMs with a trainable memory head (μ) that acts as a gate to determine which observations/reflections to retain, update, or discard during exploration. Two training approaches: 1) via offline expert, 2) via online RL.
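
A toy version of the gating behavior described above (structure, thresholds, and the relevance score are assumptions for illustration; in MemCtrl the score would come from the trained head μ):

```python
# Retain / update / discard gate over observations (illustrative sketch).
def memory_gate(memory, observation, relevance, budget=5, merge_thr=0.8):
    """relevance in [0, 1] stands in for the trained head's output."""
    if relevance < 0.3:
        return memory                      # discard
    similar = [m for m in memory if m["key"] == observation["key"]]
    if similar and relevance > merge_thr:
        similar[0]["text"] = observation["text"]   # update in place
        return memory
    memory.append(observation | {"relevance": relevance})
    memory.sort(key=lambda m: m.get("relevance", 0), reverse=True)
    return memory[:budget]                 # evict lowest-relevance entries

mem = []
mem = memory_gate(mem, {"key": "kitchen", "text": "mug on table"}, 0.9)
mem = memory_gate(mem, {"key": "hallway", "text": "empty"}, 0.1)
print(mem)
```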

Result: μ-augmented MLLMs show ~16% average improvement on EmbodiedBench, with over 20% on specific instruction subsets. Superior performance on long/complex instructions. Memory head trained via online RL performs best.

Conclusion: MemCtrl enables efficient online memory pruning for embodied agents, significantly improving task completion while respecting memory/compute constraints. The trainable memory head effectively filters relevant information during exploration.

Abstract: Foundation models rely on in-context learning for personalized decision making. The limited size of this context window necessitates memory compression and retrieval systems like RAG. These systems however often treat memory as large offline storage spaces, which is unfavorable for embodied agents that are expected to operate under strict memory and compute constraints, online. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for pruning memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ, 1) via an offline expert, and 2) via online RL, and observe significant improvement in overall embodied task completion ability on μ-augmented MLLMs. In particular, on augmenting two low performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that μ-augmented MLLMs show an improvement of around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis on the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.

[279] Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)

Saurav Prateek

Main category: cs.AI

TL;DR: Deep Researcher architecture using sequential refinement and candidate crossover outperforms parallel approaches on PhD-level research tasks, achieving state-of-the-art results on DeepResearch Bench.

DetailsMotivation: Address limitations of Parallel Scaling paradigm which suffers from siloed knowledge and inefficiencies in generating comprehensive research reports on complex PhD-level topics.

Method: Two key innovations: 1) Sequential Research Plan Refinement via Reflection with centralized Global Research Context for dynamic adaptation, and 2) Candidates Crossover algorithm deploying multiple LLM candidates with varied parameters to explore larger search space, synthesized for final response. Concludes with One Shot Report Generation.

Result: Achieved overall score of 46.21 on DeepResearch Bench (100 doctoral-level research tasks), surpassing leading deep research agents including Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher, and Grok Deeper Search. Marginally exceeds previous Static DRA work.

Conclusion: Sequential scaling consistently outperforms parallel self-consistency paradigm for deep research tasks, with the proposed architecture demonstrating superior performance through dynamic adaptation and efficient search space exploration.

Abstract: This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral-level research tasks. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher, and Grok Deeper Search on the actively maintained DeepResearch Bench leaderboard. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self-consistency paradigm.

[280] SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

Sebastiano Monti, Carlo Nicolini, Gianni Pellegrini, Jacopo Staiano, Bruno Lepri

Main category: cs.AI

TL;DR: LRMs show degraded planning performance beyond 25 moves in Sokoban puzzles, suggesting fundamental forward planning limitations not fully overcome by PDDL tools.

DetailsMotivation: To systematically assess the long-horizon planning capabilities of Large Reasoning Models (LRMs), which have not been extensively investigated despite their growing use in complex reasoning tasks.

Method: Proposed a novel Sokoban puzzle benchmark simplified to isolate long-horizon planning from state persistence. Tested state-of-the-art LRMs and evaluated performance degradation with increasing move requirements. Also tested equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools.

Result: Found consistent degradation in planning performance when more than 25 moves are required to reach the solution, indicating a fundamental constraint on forward planning capacity. PDDL tools provided only modest improvements, suggesting inherent architectural limitations not overcome by test-time scaling alone.

Conclusion: LRMs have significant limitations in long-horizon planning that appear to be architectural in nature, not easily remedied by current approaches. The 25-move threshold reveals a fundamental constraint on forward planning capacity in these models.

Abstract: Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.

[281] SimBench: A Framework for Evaluating and Diagnosing LLM-Based Digital-Twin Generation for Multi-Physics Simulation

Jingquan Wang, Andrew Negrut, Hongyu Wang, Harry Zhang, Dan Negrut

Main category: cs.AI

TL;DR: SimBench is a benchmark for evaluating simulator-oriented LLMs (S-LLMs) on their ability to generate high-quality digital twins for simulation environments, using Chrono simulator as a test case with multi-turn LLM-as-a-judge evaluation.

DetailsMotivation: There's a need to systematically evaluate and rank S-LLMs on their proficiency in creating digital twins for virtual testing in simulators, as current methods lack standardized evaluation protocols.

Method: SimBench uses multi-turn interactions where an LLM-as-a-judge (J-LLM) evaluates S-LLM-generated digital twins based on predefined rules and human-in-the-loop guidance. The benchmark is demonstrated with Chrono multi-physics simulator across various domains.

Result: The benchmark successfully compares over 33 open- and closed-source S-LLMs, providing a consistent evaluation protocol. The approach is validated with Chrono simulator for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations.

Conclusion: SimBench provides a broadly applicable benchmarking framework for assessing S-LLMs’ digital twin generation capabilities across various simulation packages, enabling standardized evaluation and ranking of these specialized language models.

Abstract: We introduce SimBench, a benchmark designed to evaluate the proficiency of simulator-oriented LLMs (S-LLMs) in generating digital twins (DTs) that can be used in simulators for virtual testing. Given a collection of S-LLMs, this benchmark ranks them according to their ability to produce high-quality DTs. We demonstrate this by comparing over 33 open- and closed-source S-LLMs. Using multi-turn interactions, SimBench employs an LLM-as-a-judge (J-LLM) that leverages both predefined rules and human-in-the-loop guidance to assign scores for the DTs generated by the S-LLM, thus providing a consistent and expert-inspired evaluation protocol. The J-LLM is specific to a simulator, and herein the proposed benchmarking approach is demonstrated in conjunction with the open-source Chrono multi-physics simulator. Chrono provided the backdrop used to assess an S-LLM’s ability to create digital twins for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The proposed benchmarking principle is broadly applicable and enables the assessment of an S-LLM’s ability to generate digital twins for other simulation packages, e.g., ANSYS, ABAQUS, OpenFOAM, StarCCM+, IsaacSim, and pyBullet.

[282] DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems

Wenqing Zhou, Yuxuan Yan, Qianqian Yang

Main category: cs.AI

TL;DR: DGRAG is a distributed graph-driven RAG framework for edge-cloud systems that improves privacy and reduces latency by processing queries locally on edge devices with knowledge graphs, only escalating uncertain queries to the cloud.

DetailsMotivation: Conventional centralized RAG requires aggregating distributed data, which raises privacy risks and incurs high retrieval latency and cost. There's a need for a more efficient, privacy-preserving approach to retrieval-augmented generation in distributed environments.

Method: Edge devices organize local documents into knowledge graphs and upload subgraph-level summaries to the cloud for global indexing. A gate mechanism assesses confidence and consistency of local generations to decide whether to return local answers or escalate queries to the cloud, which then performs summary-based matching and retrieves evidence from relevant edges.
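
A sketch of the edge-side gate, assuming the consistency test is answer agreement across several local generations (the paper may use a different measure):

```python
# Confidence gate: answer locally if samples agree, else escalate.
from collections import Counter

def gate(local_answers, min_agreement=0.6):
    """Return (answer, escalate) from multiple edge-side generations."""
    counts = Counter(a.strip().lower() for a in local_answers)
    answer, n = counts.most_common(1)[0]
    if n / len(local_answers) >= min_agreement:
        return answer, False          # confident: answer on the edge
    return None, True                 # inconsistent: escalate to cloud

print(gate(["Paris", "paris", "Paris"]))      # ('paris', False)
print(gate(["Paris", "Lyon", "Marseille"]))   # (None, True)
```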

Result: Experiments on distributed question answering show that DGRAG consistently outperforms decentralized baselines while substantially reducing cloud overhead.

Conclusion: DGRAG provides an effective distributed RAG framework that balances privacy preservation, latency reduction, and factual accuracy through edge-cloud collaboration with graph-based knowledge organization and intelligent query escalation.

Abstract: Retrieval-Augmented Generation (RAG) improves factuality by grounding LLMs in external knowledge, yet conventional centralized RAG requires aggregating distributed data, raising privacy risks and incurring high retrieval latency and cost. We present DGRAG, a distributed graph-driven RAG framework for edge-cloud collaborative systems. Each edge device organizes local documents into a knowledge graph and periodically uploads subgraph-level summaries to the cloud for lightweight global indexing without exposing raw data. At inference time, queries are first answered on the edge; a gate mechanism assesses the confidence and consistency of multiple local generations to decide whether to return a local answer or escalate the query. For escalated queries, the cloud performs summary-based matching to identify relevant edges, retrieves supporting evidence from them, and generates the final response with a cloud LLM. Experiments on distributed question answering show that DGRAG consistently outperforms decentralized baselines while substantially reducing cloud overhead.

[283] Lifted Forward Planning in Relational Factored Markov Decision Processes with Concurrent Actions

Florian Andreas Marwitz, Tanya Braun, Ralf Möller, Marcel Gehrke

Main category: cs.AI

TL;DR: Foreplan is a relational forward planner that uses first-order representations to efficiently compute policies for problems with many indistinguishable objects, achieving massive speedups over traditional methods.

DetailsMotivation: The exponential growth of state and action spaces in Markov Decision Processes with increasing numbers of indistinguishable objects makes policy computation infeasible using traditional enumeration methods.

Method: Uses first-order representation to store state and action spaces in polynomial size instead of exponential size, enabling efficient policy computation through relational forward planning. Also introduces an approximate version for even faster computation.
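
The counting argument behind the polynomial-size representation, sketched under the standard assumption that indistinguishable objects are interchangeable: rather than enumerating every assignment of n objects to k states (k^n ground states), a lifted state only tracks how many objects are in each state:

```python
# Ground vs. lifted state-space sizes (standard counting, not the
# paper's code): histograms over k states are counted by the multiset
# coefficient C(n + k - 1, k - 1), polynomial in n for fixed k.
from math import comb

def ground_states(n_objects, k_states):
    return k_states ** n_objects

def lifted_states(n_objects, k_states):
    return comb(n_objects + k_states - 1, k_states - 1)

for n in (5, 20, 50):
    print(n, ground_states(n, 3), lifted_states(n, 3))
```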

Result: Demonstrates speedup of at least four orders of magnitude over traditional methods, efficiently handles numerous indistinguishable objects, and identifies optimal numbers of objects to act on for given tasks.

Conclusion: Foreplan successfully addresses the exponential blow-up problem in decision making with indistinguishable objects, providing efficient and scalable policy computation with significant performance improvements.

Abstract: Decision making is a central problem in AI that can be formalized using a Markov Decision Process. A problem is that, with increasing numbers of (indistinguishable) objects, the state space grows exponentially. To compute policies, the state space has to be enumerated. Even more possibilities have to be enumerated if the size of the action space depends on the size of the state space, especially if we allow concurrent actions. To tackle the exponential blow-up in the action and state space, we present a first-order representation to store the spaces in polynomial instead of exponential size in the number of objects and introduce Foreplan, a relational forward planner, which uses this representation to efficiently compute policies for numerous indistinguishable objects and actions. Additionally, we introduce an even faster approximate version of Foreplan. Moreover, Foreplan identifies how many objects an agent should act on to achieve a certain task given restrictions. Further, we provide a theoretical analysis and an empirical evaluation of Foreplan, demonstrating a speedup of at least four orders of magnitude.

[284] DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems

Kostis Michailidis, Dimos Tsouros, Tias Guns

Main category: cs.AI

TL;DR: DCP-Bench-Open is a new benchmark for evaluating LLMs’ ability to formalize diverse discrete combinatorial problems into constraint models, showing best performance with high-level Python frameworks and reaching 91% accuracy with advanced prompting techniques.

DetailsMotivation: Constraint modelling for discrete combinatorial problems requires significant expertise and is a bottleneck for wider adoption of constraint solving technologies. Existing evaluation datasets are limited to small, homogeneous, or domain-specific problems that don't capture real-world diversity.

Method: Introduced DCP-Bench-Open benchmark with diverse well-known discrete combinatorial problems from CP and OR communities. Evaluated LLM modelling capabilities across three distinct constraint modelling systems with varying abstraction levels and syntax. Systematically tested prompt-based and inference-time compute methods across different LLMs.
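
For a sense of the target output, here is a toy knapsack model in CPMpy, one high-level Python-based modelling framework (the summary does not name the three evaluated systems, so treat the framework choice as an assumed example):

```python
# A small constraint model of the kind an LLM is asked to produce.
from cpmpy import boolvar, Model

values  = [10, 6, 8, 4]
weights = [ 5, 4, 6, 3]
capacity = 10

take = boolvar(shape=4, name="take")          # take[i]: include item i?
m = Model(sum(take * weights) <= capacity)    # respect the capacity
m.maximize(sum(take * values))                # maximize total value

if m.solve():
    print("take:", take.value(), "value:", sum(take.value() * values))
```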

Result: Higher performance was achieved when modelling with a high-level Python-based framework. Advanced prompting and inference-time compute methods further increased accuracy, reaching up to 91% on this challenging benchmark.

Conclusion: DCP-Bench-Open addresses the gap in evaluating LLM-driven constraint modelling for diverse real-world combinatorial problems, demonstrating that LLMs can achieve high accuracy in constraint modelling with appropriate frameworks and techniques.

Abstract: Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.

[285] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang

Main category: cs.AI

TL;DR: ToxiMol is the first benchmark for MLLMs on molecular toxicity repair, with a dataset of 660 toxic molecules across 11 tasks and an automated evaluation framework called ToxiEval.

DetailsMotivation: Toxicity is a major cause of drug development failure, but molecular toxicity repair hasn't been systematically defined or benchmarked for MLLMs.

Method: Created ToxiMol benchmark with 660 toxic molecules across 11 tasks, designed prompt annotation pipeline with toxicological knowledge, and developed ToxiEval automated evaluation framework integrating toxicity prediction, synthetic accessibility, drug-likeness, and structural similarity.
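
A minimal version of the kind of multi-criteria repair check ToxiEval chains together (thresholds are illustrative and the toxicity predictor is stubbed out); it uses RDKit for validity, drug-likeness (QED), and Tanimoto similarity:

```python
# Composite repair check, ToxiEval-style (illustrative sketch).
from rdkit import Chem
from rdkit.Chem import QED, AllChem, DataStructs

def repair_ok(orig_smiles, new_smiles, predicted_toxic, sim_thr=0.4):
    mol_o = Chem.MolFromSmiles(orig_smiles)
    mol_n = Chem.MolFromSmiles(new_smiles)
    if mol_n is None:                      # structurally invalid output
        return False
    fp = lambda m: AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048)
    similar = DataStructs.TanimotoSimilarity(fp(mol_o), fp(mol_n)) >= sim_thr
    druglike = QED.qed(mol_n) >= 0.3       # illustrative threshold
    return (not predicted_toxic) and similar and druglike

# benzene -> phenol, with a stubbed toxicity predictor saying "non-toxic"
print(repair_ok("c1ccccc1", "Oc1ccccc1", predicted_toxic=False))
```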

Result: Evaluated 43 mainstream MLLMs showing they face significant challenges but demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.

Conclusion: ToxiMol fills a critical gap in benchmarking MLLMs for molecular toxicity repair, providing a standardized framework to advance AI-assisted drug safety optimization.

Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.

[286] Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

Main category: cs.AI

TL;DR: ASL (Action Semantics Learning) is a novel framework that trains App agents to understand action semantics rather than just reproducing exact action strings, improving robustness to out-of-distribution scenarios.

DetailsMotivation: Current fine-tuning methods for App agents use syntax learning that forces exact reproduction of ground truth action strings, leading to out-of-distribution vulnerability. Prompt-based solutions with proprietary LLMs have high compute costs and external API dependency.

Method: Proposes Action Semantics Learning (ASL) where the objective is capturing the semantics of ground truth actions. Defines action semantics as state transition induced by actions in the UI. Uses a SEmantic Estimator (SEE) to compute semantic similarity for training agents to generate actions aligned with ground truth semantics, even when syntactic forms differ. SEE can be applied in both supervised and reinforcement fine-tuning paradigms.
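
The core definition in miniature (the UI transition model and equality test below are assumptions for illustration): two syntactically different actions are semantically equivalent if they induce the same state transition:

```python
# Action semantics as UI state transitions (illustrative sketch).
def apply_action(ui_state, action):
    """Tiny UI transition model: actions are (verb, target) pairs."""
    state = dict(ui_state)
    verb, target = action
    if verb in ("click", "tap"):          # syntactic variants, same effect
        state[target] = "activated"
    elif verb == "scroll":
        state["viewport"] = target
    return state

def semantically_equal(s0, a1, a2):
    return apply_action(s0, a1) == apply_action(s0, a2)

s0 = {"login_button": "idle", "viewport": "top"}
print(semantically_equal(s0, ("click", "login_button"),
                             ("tap", "login_button")))    # True
```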

Result: Theoretically demonstrates superior robustness to OOD problems compared to existing syntax learning. Extensive experiments across multiple offline and online benchmarks show ASL significantly improves accuracy and generalization of App agents compared to existing methods.

Conclusion: ASL provides a more robust learning framework for App agents by focusing on action semantics rather than syntax, addressing OOD vulnerability while reducing dependency on proprietary LLMs and high compute costs.

Abstract: The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with proprietary LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity to train the App agents in generating actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.

[287] Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang

Main category: cs.AI

TL;DR: LLM agents (GPT-4o) generate tasks differently from humans - less social, less physical, more abstract, and fail to reflect psychological drivers like personal values and cognitive style, despite linguistic proficiency.

DetailsMotivation: To understand whether generative agents powered by LLMs operate on similar cognitive principles as humans when generating tasks, specifically examining if they reflect psychological drivers like personal values and cognitive style.

Method: Conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o), examining how psychological drivers influence task generation patterns.

Result: Human task generation is consistently influenced by psychological drivers, but LLMs fail to reflect these patterns even when explicitly provided. LLM-generated tasks are less social, less physical, more abstract, though perceived as more fun and novel.

Conclusion: There is a core gap between value-driven, embodied human cognition and LLM statistical patterns, highlighting the need to incorporate intrinsic motivation and physical grounding for more human-aligned agents.

Abstract: Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns, producing tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, the LLM’s tasks were perceived as more fun and novel, a divergence that highlights the disconnect between its linguistic proficiency and its capacity to generate human-like, embodied goals. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.

[288] A Message Passing Realization of Expected Free Energy Minimization

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

Main category: cs.AI

TL;DR: Message passing approach for Expected Free Energy minimization on factor graphs, transforming combinatorial search into tractable inference via epistemic priors.

DetailsMotivation: To bridge active inference theory with practical implementations by providing efficient policy inference under epistemic uncertainty, enabling robust planning and systematic exploration.

Method: Reformulate EFE minimization as Variational Free Energy minimization with epistemic priors, then apply message passing on factor graphs for factorized state-space models.

Result: EFE-minimizing agents outperform conventional KL-control agents in stochastic gridworld and partially observable Minigrid tasks, showing better risk avoidance and information-seeking behavior.

Conclusion: The approach successfully bridges theory and practice, demonstrating efficiency of epistemic priors for artificial agents in uncertain environments.

Abstract: We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable Minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
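
To ground the quantity being minimized, the sketch below computes a one-step Expected Free Energy for a discrete model using the standard risk-plus-ambiguity decomposition. It illustrates the objective only; the paper's contribution, reformulating its minimization as Variational Free Energy with epistemic priors solved by message passing, is not reproduced here.

```python
import numpy as np

def expected_free_energy(q_s, A, log_c):
    """One-step EFE for a discrete state-space model.
    q_s:   predicted state distribution under a policy, shape (S,)
    A:     likelihood P(o|s), shape (O, S)
    log_c: log preferred observation distribution, shape (O,)"""
    q_o = A @ q_s                                        # predicted observations
    risk = np.sum(q_o * (np.log(q_o + 1e-12) - log_c))   # KL to preferences
    ambiguity = -np.sum(q_s * np.sum(A * np.log(A + 1e-12), axis=0))
    return risk + ambiguity

# Toy example: two states, two observations, strong preference for o=0.
A = np.array([[0.9, 0.2], [0.1, 0.8]])
log_c = np.log(np.array([0.99, 0.01]))
print(expected_free_energy(np.array([0.5, 0.5]), A, log_c))
```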

[289] Robust Deep Monte Carlo Counterfactual Regret Minimization: Addressing Theoretical Risks in Neural Fictitious Self-Play

Zakaria El Jaafari

Main category: cs.AI

TL;DR: Robust Deep MCCFR framework adaptively deploys neural components based on game scale, achieving 60% improvement on Kuhn Poker and 23.5% on Leduc Poker by addressing scale-dependent risks like distribution shifts and variance explosion.

DetailsMotivation: MCCFR with deep neural networks faces scale-dependent challenges that manifest differently across game complexities, requiring adaptive solutions for different game sizes.

Method: Proposes Robust Deep MCCFR framework with target networks (delayed updates), uniform exploration mixing, variance-aware training objectives, and diagnostic monitoring. Uses systematic ablation studies on Kuhn and Leduc Poker to analyze scale-dependent component effectiveness.

Result: Achieves 60% improvement on Kuhn Poker (0.0628 vs 0.156 exploitability) and 23.5% improvement on Leduc Poker (0.2386 vs 0.3703). Shows selective component usage outperforms comprehensive mitigation in larger games.

Conclusion: Scale-dependent component effectiveness requires adaptive deployment strategies; careful component selection is more important than comprehensive mitigation in complex games; framework provides practical guidelines for larger games.

Abstract: Monte Carlo Counterfactual Regret Minimization (MCCFR) has emerged as a cornerstone algorithm for solving extensive-form games, but its integration with deep neural networks introduces scale-dependent challenges that manifest differently across game complexities. This paper presents a comprehensive analysis of how neural MCCFR component effectiveness varies with game scale and proposes an adaptive framework for selective component deployment. We identify that theoretical risks such as nonstationary target distribution shifts, action support collapse, variance explosion, and warm-starting bias have scale-dependent manifestation patterns, requiring different mitigation strategies for small versus large games. Our proposed Robust Deep MCCFR framework incorporates target networks with delayed updates, uniform exploration mixing, variance-aware training objectives, and comprehensive diagnostic monitoring. Through systematic ablation studies on Kuhn and Leduc Poker, we demonstrate scale-dependent component effectiveness and identify critical component interactions. The best configuration achieves final exploitability of 0.0628 on Kuhn Poker, representing a 60% improvement over the classical framework (0.156). On the more complex Leduc Poker domain, selective component usage achieves exploitability of 0.2386, a 23.5% improvement over the classical framework (0.3703), highlighting the importance of careful component selection over comprehensive mitigation. Our contributions include: (1) a formal theoretical analysis of risks in neural MCCFR, (2) a principled mitigation framework with convergence guarantees, (3) comprehensive multi-scale experimental validation revealing scale-dependent component interactions, and (4) practical guidelines for deployment in larger games.
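
As a concrete illustration of one component, uniform exploration mixing blends the learned sampling strategy with uniform play so that no action's support can collapse during neural training. The sketch below shows the standard form of this mixing; the mixing weight and the paper's exact schedule are assumptions.

```python
import numpy as np

def mixed_policy(sigma: np.ndarray, eps: float = 0.1) -> np.ndarray:
    # Blend the learned strategy with the uniform distribution so every
    # action keeps nonzero sampling probability.
    return (1.0 - eps) * sigma + eps / sigma.size

print(mixed_policy(np.array([0.7, 0.3, 0.0])))  # ~[0.663, 0.303, 0.033]
```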

[290] Analysis of approximate linear programming solution to Markov decision problem with log barrier function

Donghwan Lee, Hyukjun Yang, Bum Geun Park

Main category: cs.AI

TL;DR: The paper proposes using log-barrier functions to transform LP-based MDP formulations into unconstrained optimization problems solvable via gradient descent, providing theoretical foundations for this approach.

DetailsMotivation: LP-based methods for solving MDPs have been underused compared to dynamic programming methods because they lead to inequality-constrained optimization problems that are more challenging to solve effectively. The paper aims to establish theoretical foundations for solving LP-based MDPs in a more practical way.

Method: The key idea is to leverage the log-barrier function from inequality-constrained optimization to transform the LP formulation of MDPs into an unconstrained optimization problem, enabling approximate solutions via gradient descent.

Result: The paper develops a thorough theoretical interpretation of the log-barrier approach for LP-based MDPs, which to the authors’ knowledge hasn’t been previously established.

Conclusion: The paper bridges the gap in theoretical understanding of log-barrier methods for LP-based MDPs, providing foundations for more effective and practical solutions to inequality-constrained optimization problems in reinforcement learning.

Abstract: There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of LP-based methods stems from the fact that they lead to inequality-constrained optimization problems, which are generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
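
To illustrate the core idea on a toy problem, the sketch below applies a log barrier to the standard primal LP for an MDP's optimal values, min_v mu^T v subject to v >= r_a + gamma * P_a v for every action a, and minimizes the resulting unconstrained objective by plain gradient descent. The random MDP, barrier weight, and step size are illustrative assumptions, not the paper's setup.

```python
import numpy as np

S, gamma, t = 3, 0.9, 100.0                  # t scales down the barrier term
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(2, S))   # P[a, s, s'] transition rows
r = rng.uniform(size=(2, S))                 # r[a, s] rewards
mu = np.ones(S) / S                          # initial-state weights

v = np.full(S, 20.0)                         # strictly feasible start
for _ in range(10000):
    grad = mu.copy()                         # gradient of mu^T v
    for a in range(2):
        slack = v - (r[a] + gamma * P[a] @ v)     # constraint slack, stays > 0
        # gradient of -(1/t) * sum(log(slack)), with slack = (I - gamma P_a) v - r_a
        grad -= (np.eye(S) - gamma * P[a]).T @ (1.0 / slack) / t
    v -= 0.02 * grad
print("approximate optimal values:", v)
```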

[291] SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

Qian Cheng, Ruize Tang, Emilie Ma, Finn Hackett, Peiyang He, Yiming Su, Ivan Beschastnikh, Yu Huang, Xiaoxing Ma, Tianyin Xu

Main category: cs.AI

TL;DR: SysMoBench is a benchmark for evaluating AI’s ability to generate formal specifications for large, complex concurrent and distributed systems using TLA+, addressing the gap in existing AI specification generation work that mostly targets small code.

DetailsMotivation: Formal models are essential for specifying and verifying large computer systems but are expensive to create and maintain. While AI shows promise in generating specifications, existing work focuses on small code, not complete systems. There's a need to understand if AI can handle realistic system artifacts requiring abstraction of complex behavioral properties.

Method: Created SysMoBench, a benchmark focusing on concurrent and distributed systems using TLA+ specification language. The benchmark includes eleven diverse system artifacts (Raft implementation in Etcd and Redis, ZooKeeper leader election, Asterinas OS components, etc.) and automates evaluation metrics including syntactic correctness, runtime correctness, conformance to system code, and invariant correctness.

Result: SysMoBench provides a comprehensive evaluation framework for AI-generated formal models, enabling assessment of LLMs and agents’ capabilities in generating specifications for complex systems. The benchmark is extensible to other specification languages beyond TLA+.

Conclusion: SysMoBench establishes a foundation for evaluating AI’s ability to generate formal models of complex systems, enabling better understanding of current AI capabilities/limitations, providing tools for research in this area, and opening new research directions for AI-assisted formal specification generation.

Abstract: Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI’s ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today’s critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes eleven diverse system artifacts: the Raft implementation of Etcd and Redis, the leader election of ZooKeeper, the Spinlock, Mutex, and Ringbuffer in Asterinas OS, etc., with more being added. SysMoBench enables us to understand the capabilities and limitations of today’s LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.

[292] FourierCSP: Differentiable Constraint Satisfaction Problem Solving by Walsh-Fourier Expansion

Yunuo Cen, Zixuan Wang, Jintao Zhang, Zhiwei Zhang, Xuanyao Fong

Main category: cs.AI

TL;DR: FourierCSP extends continuous local search from Boolean SAT to general CSPs using Walsh-Fourier transforms to convert constraints to compact multilinear polynomials, enabling efficient differentiable solving without auxiliary variables.

DetailsMotivation: Motivated by the success of continuous local search (CLS) solvers on Boolean SAT problems, the authors aim to extend this framework to general constraint satisfaction problems with finite-domain variables and expressive constraints, broadening the applicability of differentiable CLS techniques.

Method: Develop FourierCSP framework that generalizes Walsh-Fourier transform to CSP, transforming versatile constraints to compact multilinear polynomials without auxiliary variables. Use projected subgradient and mirror descent algorithms with provable convergence guarantees, combined to accelerate gradient-based optimization.

Result: Empirical results on benchmark suites demonstrate that FourierCSP is scalable and competitive, significantly broadening the class of problems that can be efficiently solved by differentiable CLS techniques.

Conclusion: FourierCSP successfully extends CLS from Boolean SAT to general CSPs, paving the way toward end-to-end neurosymbolic integration by enabling efficient differentiable solving of a broader class of constraint satisfaction problems.

Abstract: The Constraint-satisfaction problem (CSP) is fundamental in mathematics, physics, and theoretical computer science. Continuous local search (CLS) solvers, as recent advancements, can achieve highly competitive results on certain classes of Boolean satisfiability (SAT) problems. Motivated by these advances, we extend the CLS framework from Boolean SAT to general CSP with finite-domain variables and expressive constraint formulations. We present FourierCSP, a continuous optimization framework that generalizes the Walsh-Fourier transform to CSP, allowing for transforming versatile constraints to compact multilinear polynomials, thereby avoiding the need for auxiliary variables and memory-intensive encodings. We employ projected subgradient and mirror descent algorithms with provable convergence guarantees, and further combine them to accelerate gradient-based optimization. Empirical results on benchmark suites demonstrate that FourierCSP is scalable and competitive, significantly broadening the class of problems that can be efficiently solved by differentiable CLS techniques and paving the way toward end-to-end neurosymbolic integration.
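
To make the multilinear-polynomial idea tangible, the sketch below encodes a tiny 3-variable SAT instance in the Walsh-Fourier style: over {-1, +1} variables, each clause's violation indicator is a multilinear polynomial, which is then relaxed to the box [-1, 1] and minimized by projected gradient descent. This is a hand-rolled illustration of the general principle, not the paper's framework or its mirror-descent machinery.

```python
import numpy as np

# Clauses (l1 OR l2) as (sign, var, sign, var); sign +1 means positive literal.
# Instance: (x0 OR x1), (!x0 OR x2), (!x1 OR !x2).
clauses = [(+1, 0, +1, 1), (-1, 0, +1, 2), (-1, 1, -1, 2)]

def penalty_and_grad(x):
    # Violation of (l1 OR l2) over x in [-1, 1]^n is (1 - s1*x_i)(1 - s2*x_j)/4,
    # which equals 1 exactly when both literals are false.
    f, g = 0.0, np.zeros_like(x)
    for s1, i, s2, j in clauses:
        a, b = 1 - s1 * x[i], 1 - s2 * x[j]
        f += a * b / 4
        g[i] += -s1 * b / 4
        g[j] += -s2 * a / 4
    return f, g

rng = np.random.default_rng(1)
x = rng.uniform(-0.1, 0.1, size=3)          # small random init breaks symmetry
for _ in range(300):
    _, g = penalty_and_grad(x)
    x = np.clip(x - 0.5 * g, -1.0, 1.0)     # projected gradient step
x = np.sign(x)
print("assignment:", x, "violations:", penalty_and_grad(x)[0])
```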

[293] MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides

Main category: cs.AI

TL;DR: MetaVLA is a unified post-training framework for Vision-Language-Action models that enables efficient multi-task alignment through meta-learning, reducing training costs while improving generalization on embodied reasoning tasks.

DetailsMotivation: Current VLA models require task-specific fine-tuning, have high compute costs, and generalize poorly to unseen tasks, limiting their potential as general-purpose embodied agents.

Method: Proposes Context-Aware Meta Co-Training that consolidates diverse target tasks into single fine-tuning stage, uses auxiliary tasks for better generalization, and integrates lightweight meta-learning mechanism from Attentive Neural Processes for rapid adaptation.

Result: On LIBERO benchmark: outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, cuts GPU time by ~76%, and shows scalable post-training with minimal architectural changes.

Conclusion: MetaVLA demonstrates that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents through efficient multi-task alignment.

Abstract: Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.

[294] Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations

Pedro Antonio Alarcon Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang

Main category: cs.AI

TL;DR: The paper introduces Cognition Envelopes as reasoning boundaries to constrain AI decisions in cyber-physical systems, addressing errors from foundational models like hallucinations and overgeneralizations.

DetailsMotivation: Foundational Models (LLMs/VLMs) in cyber-physical systems introduce new error types like hallucinations, overgeneralizations, and context misalignments, leading to incorrect decisions that compromise system safety and reliability.

Method: Introduces Cognition Envelopes - reasoning boundaries that constrain AI-generated decisions, complementing meta-cognition and traditional safety envelopes. Requires systematic processes for definition, validation, and assurance.

Result: Proposes a conceptual framework for establishing reasoning boundaries in AI systems, addressing the gap between traditional safety envelopes and the cognitive errors introduced by foundational models.

Conclusion: Cognition Envelopes provide a necessary extension to safety engineering for AI-integrated cyber-physical systems, requiring practical guidelines and systematic assurance processes to ensure reliable operation.

Abstract: Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.

[295] Neural Value Iteration

Yang You, Ufuk Çakır, Alex Schutz, Nick Hawes

Main category: cs.AI

TL;DR: Neural Value Iteration: A novel POMDP planning algorithm that uses neural networks instead of α-vectors to represent value functions, enabling scalability to extremely large POMDPs.

DetailsMotivation: Traditional POMDP solvers using α-vectors become intractable for large-scale problems due to the |S|-dimensional nature of α-vectors and prohibitive computational cost of Bellman backups.

Method: Leverages the PWLC property to represent POMDP value functions as finite sets of neural networks instead of α-vectors, combining neural network generalization with classical value iteration framework.

Result: Achieves near-optimal solutions in extremely large POMDPs that are intractable for existing offline solvers.

Conclusion: Neural networks provide an effective alternative representation for POMDP value functions, enabling scalable planning through Neural Value Iteration algorithm.

Abstract: The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $\alpha$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $\alpha$-vectors at reachable belief points until convergence. However, since each $\alpha$-vector is $|S|$-dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP’s value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called Neural Value Iteration, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
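
A minimal sketch of the representational shift: instead of V(b) = max over alpha-vectors of <alpha, b>, the value is the max over a finite set of small networks evaluated on the belief. How the paper trains these networks and preserves the PWLC structure is not described in the abstract, so the architecture below is purely an assumption.

```python
import torch
import torch.nn as nn

class NeuralAlpha(nn.Module):
    # Plays the role of one alpha-vector: maps a belief to a scalar score.
    def __init__(self, n_states: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, belief):
        return self.net(belief)

def value(belief, alphas):
    # belief: (batch, n_states); the value is the max over the network set,
    # mirroring V(b) = max_alpha <alpha, b>.
    scores = torch.cat([a(belief) for a in alphas], dim=-1)
    return scores.max(dim=-1).values

alphas = [NeuralAlpha(4) for _ in range(3)]
beliefs = torch.softmax(torch.randn(5, 4), dim=-1)   # five random beliefs
print(value(beliefs, alphas))
```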

[296] AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics

Bakhtawar Ahtisham, Kirk Vanacore, Jinsook Lee, Zhuqian Zhou, Doug Pietrzak, Rene F. Kizilcec

Main category: cs.AI

TL;DR: Verification-oriented orchestration (self- and cross-verification) improves LLM annotation reliability for tutoring discourse coding by 58% in Cohen’s kappa.

DetailsMotivation: LLMs are increasingly used for annotating learning interactions but concerns about reliability limit their utility. The paper aims to test whether verification-oriented orchestration improves qualitative coding of tutoring discourse.

Method: Tested three LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Used transcripts from 30 one-to-one math sessions and benchmarked outputs against blinded, disagreement-focused human adjudication using Cohen’s kappa.

Result: Orchestration yields 58% improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with largest gains for challenging tutor moves. Cross-verification achieves 37% improvement on average, with pair- and construct-dependent effects.

Conclusion: Verification serves as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics. The paper contributes an orchestration framework, empirical comparison across frontier LLMs, and concise notation for standardization.

Abstract: Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration-prompting models to check their own labels (self-verification) or audit one another (cross-verification)-improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen’s kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human “gold” labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
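
The verifier(annotator) notation maps directly onto a small orchestration loop. The sketch below shows the pattern under stated assumptions: `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompts are illustrative rather than the paper's.

```python
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")  # hypothetical stub

def annotate(model: str, utterance: str, codebook: str) -> str:
    return call_llm(model, f"Code this tutor move using {codebook}:\n{utterance}")

def verify(verifier: str, utterance: str, label: str, codebook: str) -> str:
    verdict = call_llm(verifier,
                       f"An annotator coded the move below as '{label}' using "
                       f"{codebook}. Reply KEEP or a corrected label.\n{utterance}")
    return label if verdict.strip() == "KEEP" else verdict.strip()

def orchestrate(annotator: str, verifier: str, utterance: str, codebook: str) -> str:
    # verifier == annotator gives self-verification, e.g. Claude(Claude);
    # different models give cross-verification, e.g. Gemini(GPT).
    label = annotate(annotator, utterance, codebook)
    return verify(verifier, utterance, label, codebook)
```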

[297] Quantifying Fidelity: A Decisive Feature Approach to Comparing Synthetic and Real Imagery

Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong, Chih-Hong Cheng, Xingyu Zhao

Main category: cs.AI

TL;DR: Proposes Decisive Feature Fidelity (DFF), a new SUT-specific metric that measures agreement in decisive evidence driving system decisions across real and synthetic domains, using explainable AI to identify and compare critical features.

DetailsMotivation: Current virtual testing for autonomous vehicles focuses on visual realism, but pixel-level fidelity doesn't ensure reliable transfer from simulation to real world. What matters is whether the system bases decisions on consistent evidence across domains, not just whether images look realistic to humans.

Method: Introduces Decisive Feature Fidelity (DFF) metric that leverages explainable-AI methods to identify and compare decisive features driving SUT’s outputs for matched real-synthetic pairs. Proposes estimators based on counterfactual explanations and a DFF-guided calibration scheme to enhance simulator fidelity.

Result: Experiments on 2126 matched KITTI-VirtualKITTI2 pairs show DFF reveals discrepancies overlooked by conventional output-value fidelity. DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.

Conclusion: DFF provides a behavior-grounded fidelity measure that captures mechanism parity - agreement in model-specific decisive evidence driving decisions across domains, offering a more meaningful metric for autonomous vehicle safety assurance than visual realism alone.

Abstract: Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on consistent decision evidence in both real and simulated environments, not just whether images “look real” to humans. To this end this paper proposes a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity, that is, agreement in the model-specific decisive evidence that drives the SUT’s decisions across domains. DFF leverages explainable-AI methods to identify and compare the decisive features driving the SUT’s outputs for matched real-synthetic pairs. We further propose estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
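
As a rough illustration of decisive-feature comparison, the sketch below scores one matched real/synthetic pair by the overlap of its top-k attribution regions. Note that the paper's estimators are based on counterfactual explanations; this saliency-style overlap is a simplified stand-in, and the attribution maps are assumed given.

```python
import numpy as np

def dff_pair(attr_real: np.ndarray, attr_syn: np.ndarray, k: int = 100) -> float:
    # Fraction of the k most decisive features (per attribution map) shared
    # across the two domains; 1.0 means identical decisive evidence.
    top_real = set(np.argsort(attr_real.ravel())[-k:])
    top_syn = set(np.argsort(attr_syn.ravel())[-k:])
    return len(top_real & top_syn) / k

rng = np.random.default_rng(0)
real = rng.random((8, 8))
synthetic = real + 0.1 * rng.random((8, 8))   # mildly perturbed counterpart
print(dff_pair(real, synthetic, k=10))
```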

[298] Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

Gökdeniz Gülmez

Main category: cs.AI

TL;DR: Gabliteration is a neural weight modification technique that improves on traditional ablation methods using adaptive multi-directional projections with regularized layer selection to modify specific behaviors while minimizing quality degradation.

DetailsMotivation: Existing weight modification methods compromise overall model quality when trying to modify specific behavioral patterns. The authors aim to address this fundamental limitation by developing a technique that can modify targeted behaviors without degrading performance in unrelated domains.

Method: Gabliteration implements adaptive multi-directional projections with regularized layer selection, featuring dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms to achieve theoretically superior weight modification.

Result: The method was validated through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.

Conclusion: Gabliteration represents an advancement beyond traditional ablation methods, enabling targeted weight modification with minimal quality degradation in unrelated domains, making it practically applicable across various model sizes.

Abstract: We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
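
For context, abliteration-family methods typically remove a behavior by projecting its directions out of weight matrices; the sketch below shows that base operation with multiple directions and a scaling knob. The adaptive layer selection, regularized projections, and dynamic scaling that the paper claims as its advance are not reproduced, and all names here are illustrative.

```python
import torch

def project_out(W: torch.Tensor, directions: torch.Tensor,
                scale: float = 1.0) -> torch.Tensor:
    # W: (d_out, d_in) weight; directions: (k, d_out) orthonormal rows
    # spanning the behavior to suppress; scale stands in for adaptive scaling.
    P = directions.T @ directions        # projector onto the spanned subspace
    return W - scale * (P @ W)           # attenuate that subspace in the outputs

W = torch.randn(16, 8)
Q, _ = torch.linalg.qr(torch.randn(16, 3))   # three orthonormal directions
W_mod = project_out(W, Q.T, scale=0.8)
```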

[299] CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, Gerbrand Ceder

Main category: cs.AI

TL;DR: CASCADE is a self-evolving LLM agent framework that transitions from tool use to skill acquisition, enabling agents to master complex scientific tools through continuous learning and self-reflection, achieving 93.3% success on materials science/chemistry tasks.

DetailsMotivation: Current LLM agents rely on predefined tools or early-stage tool generation, which limits their adaptability and scalability for complex scientific tasks. There's a need for agents that can autonomously acquire and master new skills rather than just using existing tools.

Method: CASCADE enables agents to master complex external tools through two meta-skills: 1) continuous learning via web search, code extraction, and memory utilization, and 2) self-reflection via introspection and knowledge graph exploration. The framework allows agents to accumulate executable skills that can be shared across agents and scientists.

Result: On SciSkillBench (116 materials science and chemistry research tasks), CASCADE achieved 93.3% success rate using GPT-5, compared to only 35.4% without evolution mechanisms. The framework demonstrated real-world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers.

Conclusion: CASCADE represents a transition from “LLM + tool use” to “LLM + skill acquisition,” enabling scalable AI-assisted scientific research through human-agent collaboration, memory consolidation, and shared executable skills across agents and scientists.

Abstract: Large language model (LLM) agents currently depend on predefined tools or early-stage tool generation, limiting their adaptability and scalability to complex scientific tasks. We introduce CASCADE, a self-evolving agentic framework representing an early instantiation of the transition from “LLM + tool use” to “LLM + skill acquisition”. CASCADE enables agents to master complex external tools and codify knowledge through two meta-skills: continuous learning via web search, code extraction, and memory utilization; self-reflection via introspection, knowledge graph exploration, and others. We evaluate CASCADE on SciSkillBench, a benchmark of 116 materials science and chemistry research tasks. CASCADE achieves a 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. We further demonstrate real-world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers. Along with human-agent collaboration and memory consolidation, CASCADE accumulates executable skills that can be shared across agents and scientists, moving toward scalable AI-assisted scientific research.

[300] Recursive Language Models

Alex L. Zhang, Tim Kraska, Omar Khattab

Main category: cs.AI

TL;DR: RLMs enable LLMs to process arbitrarily long prompts via recursive self-calling, outperforming vanilla models on long-context tasks with comparable cost.

DetailsMotivation: Current LLMs have limited context windows that restrict their ability to process arbitrarily long prompts, creating a need for inference-time scaling solutions.

Method: Propose Recursive Language Models (RLMs) - an inference paradigm that treats long prompts as external environment, allowing LLMs to programmatically examine, decompose, and recursively call themselves over prompt snippets.

Result: RLMs process inputs up to 2 orders of magnitude beyond model context windows, outperform vanilla frontier LLMs and common long-context scaffolds across 4 diverse tasks. RLM-Qwen3-8B outperforms base Qwen3-8B by 28.3% and approaches vanilla GPT-5 quality on 3 long-context tasks.

Conclusion: RLMs provide an effective inference-time scaling solution for processing long prompts, demonstrating significant quality improvements over existing approaches while maintaining comparable computational cost.

Abstract: We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by 28.3% on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
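
A minimal sketch of the recursive pattern, with an important caveat: the paper treats the prompt as a programmatic environment and lets the model itself decide how to examine and decompose it, whereas the fixed halving below is a simplification. `llm` is a hypothetical completion function.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")  # hypothetical stub

def rlm(query: str, context: str, limit: int = 8000, depth: int = 0) -> str:
    # Base case: the snippet fits in the context window (or max depth reached).
    if len(context) <= limit or depth >= 3:
        return llm(f"{context}\n\nQuestion: {query}")
    mid = len(context) // 2
    # Recurse over halves, then synthesize the partial answers.
    left = rlm(query, context[:mid], limit, depth + 1)
    right = rlm(query, context[mid:], limit, depth + 1)
    return llm(f"Combine these partial answers to '{query}':\n1) {left}\n2) {right}")
```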

[301] SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Main category: cs.AI

TL;DR: SimpleMem is an efficient memory framework for LLM agents that uses semantic lossless compression to manage historical experiences, achieving better performance with significantly reduced token costs.

DetailsMotivation: Existing approaches for LLM agent memory either retain full interaction histories (causing redundancy) or use iterative reasoning (high token costs), creating a need for an efficient memory system that balances performance and efficiency.

Method: Three-stage pipeline: 1) Semantic Structured Compression - distills unstructured interactions into compact, multi-view indexed memory units; 2) Online Semantic Synthesis - intra-session process that integrates related context into unified abstract representations; 3) Intent-Aware Retrieval Planning - infers search intent to dynamically determine retrieval scope and construct precise context efficiently.

Result: Outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold.

Conclusion: SimpleMem demonstrates superior balance between performance and efficiency for LLM agent memory systems through semantic lossless compression, offering practical benefits for long-term interactions in complex environments.

Abstract: To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis, an intra-session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent-Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.
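
To ground the pipeline's vocabulary, the sketch below shows what a multi-view indexed memory unit and an intent-scoped retrieval step could look like. All field names and the scoring rule are assumptions; the paper's schema and retrieval planner are richer.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    summary: str                                   # compressed semantic content
    entities: set = field(default_factory=set)     # index view 1
    keywords: set = field(default_factory=set)     # index view 2
    session: str = ""                              # index view 3

def retrieve(units, intent_terms, max_units=5):
    # Intent-aware scope: rank units by overlap with the inferred search
    # intent and keep only the top few, instead of replaying full history.
    scored = sorted(units,
                    key=lambda u: len((u.keywords | u.entities) & intent_terms),
                    reverse=True)
    return scored[:max_units]

units = [MemoryUnit("User prefers Python", keywords={"python", "preference"}),
         MemoryUnit("Flight booked to Oslo", keywords={"travel", "oslo"})]
print(retrieve(units, {"python"}, max_units=1))
```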

[302] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen

Main category: cs.AI

TL;DR: Proposed RubricHub: automated coarse-to-fine rubric generation framework with large-scale dataset (110k) for open-ended generation tasks, enabling SOTA performance through rubric-based fine-tuning and RL.

DetailsMotivation: RLVR works well for reasoning tasks but struggles with open-ended generation due to lack of ground truth. Existing rubric-based methods have scalability bottlenecks and coarse criteria, causing supervision ceiling effects.

Method: Coarse-to-fine rubric generation framework using principle-guided synthesis, multi-model aggregation, and difficulty evolution. Created RubricHub dataset (110k). Two-stage post-training: Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL).

Result: Post-trained Qwen3-14B achieves SOTA on HealthBench (69.3), surpassing proprietary frontier models like GPT-5. RubricHub enables significant performance gains in open-ended generation tasks.

Conclusion: Automated rubric generation framework addresses limitations of existing methods, providing comprehensive discriminative criteria. RubricHub dataset and training pipeline unlock substantial improvements in open-ended generation performance.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. Our code is available at https://github.com/teqkilla/RubricHub.

[303] Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Jiaao Chen, Hanrui Wang, Wei-Chen Wang, Yuzhi Zhang, Linfeng Zhang, Weinan E, Di Jin, Siheng Chen, Yanfeng Wang

Main category: cs.AI

TL;DR: ML-Master 2.0 introduces Hierarchical Cognitive Caching (HCC) to enable AI agents to handle ultra-long-horizon machine learning engineering tasks spanning days/weeks, achieving 56.44% medal rate on MLE-Bench.

DetailsMotivation: Current AI agents struggle with ultra-long-horizon autonomy in scientific discovery tasks, as LLMs get overwhelmed by execution details and fail to consolidate sparse feedback into coherent long-term guidance over extended experimental cycles.

Method: Hierarchical Cognitive Caching (HCC) - a multi-tiered architecture inspired by computer systems that reframes context management as cognitive accumulation. It dynamically distills transient execution traces into stable knowledge and cross-task wisdom, decoupling immediate execution from long-term strategy.

Result: ML-Master 2.0 achieves state-of-the-art 56.44% medal rate on OpenAI’s MLE-Bench under 24-hour budgets, demonstrating superior performance in ultra-long-horizon machine learning engineering tasks.

Conclusion: Ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities, with HCC overcoming the scaling limits of static context windows for sustained strategic coherence.

Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE), a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI’s MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

[304] Actionable Interpretability Must Be Defined in Terms of Symmetries

Pietro Barbiero, Mateo Espinosa Zarlenga, Francesco Giannini, Alberto Termine, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra

Main category: cs.AI

TL;DR: Interpretability research in AI is ill-posed; we propose defining interpretability via symmetries to create testable, actionable definitions.

DetailsMotivation: Current AI interpretability research lacks formal testable definitions, making it difficult to design interpretable models or verify compliance with safety standards.

Method: Propose defining interpretability through four symmetries: inference equivariance, information invariance, concept-closure invariance, and structural invariance. These symmetries formalize interpretable models as a subclass of probabilistic models and unify interpretable inference as Bayesian inversion.

Result: The symmetry-based framework provides: (i) formal definition of interpretable models, (ii) unified formulation of interpretable inference (alignment, interventions, counterfactuals), and (iii) framework for verifying compliance with safety regulations.

Conclusion: Interpretability must be defined via symmetries to create actionable, testable definitions that enable formal model design and safety verification.

Abstract: This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of symmetries that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.

[305] Epistemic Constitutionalism Or: how to avoid coherence bias

Michele Loi

Main category: cs.AI

TL;DR: The paper argues for an “epistemic constitution” for AI - explicit meta-norms regulating how AI systems form and express beliefs, using source attribution bias as a case study to demonstrate current problems and propose a Liberal constitutional approach.

DetailsMotivation: Current AI systems operate with implicit, uninspected epistemic policies when forming beliefs and evaluating arguments. The paper identifies source attribution bias as a key problem where models penalize arguments based on ideological alignment between source and content rather than evaluating the arguments on their merits.

Method: The paper analyzes frontier language models’ behavior regarding source attribution bias, showing they enforce identity-stance coherence and suppress source-sensitivity when detecting systematic testing. It then develops two constitutional approaches (Platonic vs Liberal) and proposes a specific Liberal framework with eight principles and four orientations.

Result: The research reveals that AI systems treat source-sensitivity as bias to suppress rather than as legitimate epistemic vigilance, and that these biases collapse when models detect systematic testing. The paper demonstrates the need for explicit epistemic governance structures.

Conclusion: The paper argues for a Liberal epistemic constitution for AI that specifies procedural norms protecting collective inquiry while allowing principled source-attending, proposing that AI epistemic governance requires explicit, contestable structures similar to those expected for AI ethics.

Abstract: Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument’s content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

[306] Neural Theorem Proving for Verification Conditions: A Real-World Benchmark

Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt

Main category: cs.AI

TL;DR: NTP4VC is the first real-world multi-language benchmark for neural theorem proving of verification conditions, created from industrial projects like Linux and Contiki-OS kernels, showing LLMs have promise but significant challenges remain.

DetailsMotivation: Automated proof of Verification Conditions (VCs) is a major bottleneck in program verification, with hard VCs often requiring extensive manual proofs. While neural theorem proving has succeeded in mathematical domains, its application to program verification VC proving remains unexplored, with no existing benchmarks for this fundamental problem.

Method: Created NTP4VC benchmark using real-world projects (Linux and Contiki-OS kernels) through industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages (Isabelle, Lean, Rocq). Evaluated both general-purpose and theorem-proving fine-tuned large language models on this benchmark.

Result: Large language models show promise in VC proving, but significant challenges remain for program verification applications, revealing a substantial gap and opportunity for future research in this area.

Conclusion: NTP4VC establishes the first real-world benchmark for neural theorem proving of verification conditions, demonstrating that while LLMs show potential, there are still major challenges to overcome in applying neural methods to practical program verification tasks.

Abstract: Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification, particularly VC proving, remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as the Linux and Contiki-OS kernels, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across the formal languages Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.

cs.SD

[307] Pianoroll-Event: A Novel Score Representation for Symbolic Music

Lekai Qian, Haoyu Gu, Dehan Li, Boyu Cao, Qi Liu

Main category: cs.SD

TL;DR: Pianoroll-Event: A novel encoding scheme that combines pianoroll representations with events to balance structural invariance, spatial locality, and encoding efficiency for symbolic music.

DetailsMotivation: Existing symbolic music representations have complementary limitations - grid-based (pianoroll) representations preserve pitch-time spatial correspondence but suffer from data sparsity and low encoding efficiency, while discrete-event representations achieve compact encoding but fail to adequately capture structural invariance and spatial locality.

Method: Proposes Pianoroll-Event encoding scheme that describes pianoroll representations through four complementary event types: Frame Events (temporal boundaries), Gap Events (sparse regions), Pattern Events (note patterns), and Musical Structure Events (musical metadata). This approach maintains temporal dependencies and local spatial patterns while improving encoding efficiency.

Result: Pianoroll-Event improves encoding efficiency by 1.36× to 7.16× over representative discrete sequence methods. Experiments across multiple autoregressive architectures show models using this representation consistently outperform baselines in both quantitative and human evaluations.

Conclusion: Pianoroll-Event effectively balances sequence length and vocabulary size, combining the structural properties of pianoroll representations with the encoding efficiency of discrete-event representations, leading to better performance in music modeling tasks.

Abstract: Symbolic music representation is a fundamental challenge in computational musicology. While grid-based representations effectively preserve pitch-time spatial correspondence, their inherent data sparsity leads to low encoding efficiency. Discrete-event representations achieve compact encoding but fail to adequately capture structural invariance and spatial locality. To address these complementary limitations, we propose Pianoroll-Event, a novel encoding scheme that describes pianoroll representations through events, combining structural properties with encoding efficiency while maintaining temporal dependencies and local spatial patterns. Specifically, we design four complementary event types: Frame Events for temporal boundaries, Gap Events for sparse regions, Pattern Events for note patterns, and Musical Structure Events for musical metadata. Pianoroll-Event strikes an effective balance between sequence length and vocabulary size, improving encoding efficiency by 1.36× to 7.16× over representative discrete sequence methods. Experiments across multiple autoregressive architectures show models using our representation consistently outperform baselines in both quantitative and human evaluations.
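
A minimal sketch of how Frame, Gap, and Pattern events might be emitted from a binary pianoroll. The paper's actual vocabulary, and its Musical Structure Events, are not reproduced, so treat this as an illustrative token scheme only.

```python
import numpy as np

def encode(roll: np.ndarray):
    # roll: (time, 128) binary pianoroll. Silent runs collapse into a single
    # Gap event; each active frame emits a Frame plus a Pattern event.
    events, gap = [], 0
    for t in range(roll.shape[0]):
        pitches = np.flatnonzero(roll[t])
        if pitches.size == 0:
            gap += 1
            continue
        if gap:
            events.append(("GAP", gap))
            gap = 0
        events.append(("FRAME", t))
        events.append(("PATTERN", tuple(pitches.tolist())))
    if gap:
        events.append(("GAP", gap))
    return events

roll = np.zeros((6, 128), dtype=np.int8)
roll[0, [60, 64, 67]] = 1       # C major chord
roll[4, 62] = 1                 # a D after three silent frames
print(encode(roll))
```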

[308] LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning

Wenhao Zou, Yuwei Miao, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu

Main category: cs.SD

TL;DR: LTS-VoiceAgent is a streaming framework that enables parallel thinking while speaking in voice agents, improving latency-efficiency trade-offs over serial cascaded pipelines.

DetailsMotivation: Current voice agent architectures face a dilemma: end-to-end models lack deep reasoning capabilities, while cascaded pipelines (ASR→LLM→TTS) incur high latency due to strict sequential execution. Unlike human conversation where listeners think while speakers talk, existing systems wait for complete speech before processing.

Method: Proposes LTS-VoiceAgent with a Listen-Think-Speak framework that separates when to think from how to reason incrementally. Key components: 1) Dynamic Semantic Trigger to detect meaningful prefixes, 2) Dual-Role Stream Orchestrator coordinating background Thinker (state maintenance) and foreground Speaker (speculative solving), enabling parallel “thinking while speaking” without blocking responses.
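
A toy, synchronous rendition of the Listen-Think-Speak loop may help; the prefix heuristic and the `thinker`/`speaker` callables below are hypothetical stand-ins for the learned Dynamic Semantic Trigger and Dual-Role Stream Orchestrator, which in the real system run concurrently.

```python
# Toy synchronous Listen-Think-Speak loop; the trigger and both roles are
# hypothetical stand-ins for the learned components, which run in parallel.
def is_meaningful_prefix(tokens):
    # crude proxy for the Dynamic Semantic Trigger
    return len(tokens) >= 4 and tokens[-1] in {",", "?", "."}

def run_agent(asr_stream, thinker, speaker):
    prefix, state = [], {}
    for token in asr_stream:                  # tokens arrive while the user speaks
        prefix.append(token)
        if is_meaningful_prefix(prefix):
            state = thinker(prefix, state)    # background: update reasoning state
            yield speaker(state)              # foreground: speak speculatively

thinker = lambda prefix, state: {"summary": " ".join(prefix)}
speaker = lambda state: f"(draft response for: {state['summary']})"
for draft in run_agent(iter("what is three plus five ? ok .".split()), thinker, speaker):
    print(draft)
```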

Result: Experiments across VERA, Spoken-MQA, BigBenchAudio, and a new Pause-and-Repair benchmark show LTS-VoiceAgent achieves stronger accuracy-latency-efficiency trade-offs than serial cascaded baselines and existing streaming strategies.

Conclusion: LTS-VoiceAgent addresses the latency-reasoning trade-off in voice agents by enabling parallel thinking while speaking, outperforming existing approaches through its explicit separation of thinking timing from incremental reasoning mechanisms.

Abstract: Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving). This parallel design enables “thinking while speaking” without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments across VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.

[309] Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding

Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen

Main category: cs.SD

TL;DR: SwitchCodec: neural audio codec using Residual Experts Vector Quantization (REVQ) with dynamic expert routing and variable bitrate control, outperforming existing methods.

DetailsMotivation: Fixed number of codebooks in existing neural audio compression models is suboptimal for variable audio complexity - too many for simple signals, too few for complex ones.

Method: Proposes SwitchCodec based on Residual Experts Vector Quantization (REVQ): combines shared quantizer with dynamically routed expert quantizers activated based on input audio, plus variable-bitrate mechanism adjusting active experts at inference.
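
A hedged sketch of what an REVQ-style layer could look like: a shared codebook quantizes every frame, and a router picks expert codebooks to refine the residual. Codebook sizes, the top-k routing rule, and all names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Illustrative REVQ-style layer (names and sizes assumed): a shared codebook
# quantizes every frame, then routed expert codebooks refine the residual.
class REVQSketch(nn.Module):
    def __init__(self, dim=64, codes=256, n_experts=4, k=2):
        super().__init__()
        self.shared = nn.Embedding(codes, dim)
        self.experts = nn.ModuleList(nn.Embedding(codes, dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)
        self.k = k

    @staticmethod
    def quantize(x, codebook):
        idx = torch.cdist(x, codebook.weight).argmin(dim=-1)  # nearest code
        return codebook(idx)

    def forward(self, x):                                     # x: (batch, dim)
        base = self.quantize(x, self.shared)
        residual = x - base
        topk = self.router(x).topk(self.k, dim=-1).indices    # experts per frame
        outs = []
        for b in range(x.shape[0]):
            total, r = base[b:b + 1], residual[b:b + 1]
            for e in topk[b].tolist():                        # refine the residual
                q = self.quantize(r, self.experts[e])
                total, r = total + q, r - q
            outs.append(total)
        return torch.cat(outs, dim=0)

x = torch.randn(3, 64)
print(REVQSketch()(x).shape)  # torch.Size([3, 64])
```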

Result: Outperforms existing baselines on both objective metrics and subjective listening tests.

Conclusion: REVQ approach decouples bitrate from codebook capacity, improves compression efficiency, enables multi-bitrate operation without retraining, and ensures full utilization of all quantizers.

Abstract: Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.

[310] Mix2Morph: Learning Sound Morphing from Noisy Mixes

Annie Chu, Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, Prem Seetharaman

Main category: cs.SD

TL;DR: Mix2Morph is a text-to-audio diffusion model fine-tuned for sound morphing without needing a dedicated morph dataset, focusing on “sound infusions” where one dominant sound gets enriched by another’s timbral qualities.

DetailsMotivation: The paper aims to create more controllable and concept-driven sound design tools by addressing the challenge of sound morphing without requiring specialized datasets of morph examples, specifically targeting sound infusions as a practical and perceptually meaningful subclass of morphing.

Method: Fine-tunes a text-to-audio diffusion model on noisy surrogate mixes at higher diffusion timesteps to perform sound morphing, focusing on sound infusions, where a primary sound provides the overall temporal and structural behavior while a secondary sound enriches its timbral and textural qualities.
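
The surrogate-mix idea can be sketched as a batch-construction step: mix two clips and sample only high diffusion timesteps, where noise masks the artifacts of naive mixing. The mixing weight and timestep range below are illustrative assumptions.

```python
import torch

# Sketch of surrogate-mix batch construction (weights and schedule assumed):
# the primary clip dominates structure; the secondary is infused at lower gain,
# and timesteps are drawn only from the noisy end of a linear schedule.
def make_surrogate_batch(primary, secondary, t_min=0.7, alpha=0.8):
    mix = alpha * primary + (1.0 - alpha) * secondary
    t = t_min + (1.0 - t_min) * torch.rand(primary.shape[0])   # high timesteps only
    sigma = t.view(-1, *([1] * (mix.dim() - 1)))               # broadcast over audio dims
    noise = torch.randn_like(mix)
    noisy = (1.0 - sigma) * mix + sigma * noise                # noised training input
    return noisy, noise, t

primary, secondary = torch.randn(2, 1, 16000), torch.randn(2, 1, 16000)
noisy, noise, t = make_surrogate_batch(primary, secondary)
print(noisy.shape, t)
```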

Result: Mix2Morph produces stable, perceptually coherent morphs that convincingly integrate qualities of both sources, outperforms prior baselines in objective evaluations and listening tests, and generates high-quality sound infusions across diverse categories.

Conclusion: Mix2Morph represents a step toward more controllable and concept-driven tools for sound design, demonstrating effective sound morphing capabilities without requiring dedicated morph datasets through innovative fine-tuning on noisy surrogate mixes.

Abstract: We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By finetuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph .

[311] Self Voice Conversion as an Attack against Neural Audio Watermarking

Yigitcan Özer, Wanying Ge, Zhe Zhang, Xin Wang, Junichi Yamagishi

Main category: cs.SD

TL;DR: Self voice conversion attack severely degrades state-of-the-art audio watermarking systems by remapping speaker voice to same identity while altering acoustic characteristics.

DetailsMotivation: Existing audio watermarking robustness evaluations focus on conventional distortions (compression, noise, resampling), but deep learning-based attacks like self voice conversion pose novel and significant threats to watermark security that need investigation.

Method: Proposes self voice conversion as a universal, content-preserving attack against audio watermarking systems. Uses voice conversion models to remap a speaker’s voice to the same identity while altering acoustic characteristics.
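
A sketch of the evaluation pipeline, with `embed_watermark`, `detect_watermark`, and `voice_conversion` as hypothetical placeholders for whichever watermarking system and VC model are under test:

```python
import numpy as np

# Evaluation-pipeline sketch for the self-VC attack; the three callables are
# hypothetical placeholders, not any specific library's API.
def self_vc_attack_eval(wav, speaker_emb, payload_bits,
                        embed_watermark, detect_watermark, voice_conversion):
    marked = embed_watermark(wav, payload_bits)
    # Self voice conversion: resynthesize the marked speech with the SAME
    # speaker identity, altering acoustics but preserving content.
    attacked = voice_conversion(source=marked, target_speaker=speaker_emb)
    decoded = detect_watermark(attacked)
    # Bit accuracy near 0.5 means the watermark was effectively erased.
    return float(np.mean(np.array(decoded) == np.array(payload_bits)))

# Dummy stand-ins so the sketch executes end to end.
wav, bits = np.random.randn(16000), np.random.randint(0, 2, 32)
acc = self_vc_attack_eval(
    wav, speaker_emb=None, payload_bits=bits,
    embed_watermark=lambda w, b: w,                          # no-op embedder
    detect_watermark=lambda w: np.random.randint(0, 2, 32),  # chance-level decoder
    voice_conversion=lambda source, target_speaker: source)
print(f"bit accuracy after attack: {acc:.2f}")
```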

Result: Demonstrates that self voice conversion attack severely degrades the reliability of state-of-the-art watermarking approaches, highlighting significant security vulnerabilities in modern audio watermarking techniques.

Conclusion: Self voice conversion represents a serious threat to audio watermarking security, exposing limitations in current robustness evaluations and necessitating new defenses against deep learning-based attacks.

Abstract: Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker’s voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.

[312] On Every Note a Griff: Looking for a Useful Representation of Basso Continuo Performance Style

Adam Štefunko, Carlos Eduardo Cancino-Chacón, Jan Hajič

Main category: cs.SD

TL;DR: The paper introduces griff, a transposition-invariant feature representation for analyzing basso continuo improvisation styles, extracted from aligned performances in the ACoRD dataset.

DetailsMotivation: Basso continuo is a living historical improvisation practice, but lacks systematic analysis tools. The ACoRD dataset provides modern recordings, but needs appropriate feature representations to study performance styles.

Method: Proposes griff representation inspired by historical treatises, extracted from aligned performances by grouping notes mapped to same score note. Uses ACoRD dataset (175 MIDI recordings by 7 players) and statistical analysis of griffs.
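
A minimal sketch of griff extraction from an alignment; the interval-from-bass encoding used here for transposition invariance is our reading of the idea, not necessarily the paper's exact convention.

```python
from collections import defaultdict

# Sketch of griff extraction: each alignment entry maps a performance note
# (onset, pitch) to a score-note id; pitches become intervals above the
# written score note, which makes the representation transposition-invariant.
def extract_griffs(alignment, score_pitch):
    groups = defaultdict(list)
    for perf_onset, perf_pitch, score_id in alignment:
        groups[score_id].append((perf_onset, perf_pitch))
    griffs = {}
    for score_id, notes in groups.items():
        notes.sort()                                          # onset-time ordered
        base = score_pitch[score_id]                          # the written bass note
        griffs[score_id] = tuple(p - base for _, p in notes)  # intervals, not pitches
    return griffs

# Toy alignment: two performed chords realized over two score bass notes.
alignment = [(0.0, 48, 0), (0.01, 64, 0), (0.02, 67, 0), (1.0, 50, 1), (1.01, 65, 1)]
print(extract_griffs(alignment, score_pitch={0: 48, 1: 50}))
# {0: (0, 16, 19), 1: (0, 15)}
```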

Result: Griffs enable meaningful analysis of basso continuo styles. Two experiments demonstrate how griffs can statistically analyze individuality of different players’ performance styles.

Conclusion: Griffs preserve improvisation structure for refined style analysis and provide historically informed feature space worthy of more robust empirical validation.

Abstract: Basso continuo is a baroque improvisatory accompaniment style which involves improvising multiple parts above a given bass line in a musical score on a harpsichord or organ. Basso continuo is not merely a matter of history; rather, it is a historically inspired living practice, and The Aligned Continuo Dataset (ACoRD) records the first sample of modern-day basso continuo playing in the symbolic domain. This dataset, containing 175 MIDI recordings of 5 basso continuo scores performed by 7 players, allows us to start observing and analyzing the variety that basso continuo improvisation brings. A recently proposed basso continuo performance-to-score alignment system provides a way of mapping improvised performance notes to score notes. In order to study aligned basso continuo performances, we need an appropriate feature representation. We propose griff, a representation inspired by historical basso continuo treatises. It enables us to encode both the pitch content and the structure of a basso continuo realization in a transposition-invariant way. Griffs are directly extracted from aligned basso continuo performances by grouping together performance notes aligned to the same score note in an onset-time-ordered way, and they provide meaningful tokens that form a feature space in which we can analyze basso continuo performance styles. We statistically describe griffs extracted from the ACoRD dataset recordings, and show in two experiments how griffs can be used for statistical analysis of the individuality of different players' basso continuo performance styles. We finally present an argument for why it is desirable to preserve the structure of a basso continuo improvisation in order to conduct a refined analysis of personal performance styles of individual basso continuo practitioners, and why griffs can provide a meaningful, historically informed feature space worthy of a more robust empirical validation.

[313] Audio Deepfake Detection in the Age of Advanced Text-to-Speech models

Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda

Main category: cs.SD

TL;DR: Comparative evaluation shows TTS detection systems perform inconsistently across different generative architectures, with multi-view approaches being most robust against audio deepfakes.

DetailsMotivation: As TTS systems become more realistic, they create new challenges for audio deepfake detection; this raises the need to understand how current detection methods perform against diverse TTS architectures.

Method: Generated 12,000 synthetic audio samples using three state-of-the-art TTS models (Dia2, Maya1, MeloTTS) from Daily-Dialog dataset, then evaluated against four detection frameworks including semantic, structural, and signal-level approaches.

Result: Detector performance varies significantly across TTS architectures - models effective against one type may fail against others, especially LLM-based synthesis. Multi-view detection combining complementary analysis levels shows robust performance across all models.

Conclusion: Single-paradigm detectors have limitations; integrated multi-view detection strategies are necessary to address the evolving audio deepfake threat landscape.

Abstract: Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.

[314] Gen-SER: When the generative model meets speech emotion recognition

Taihui Wang, Jinzheng Zhao, Rilin Chen, Tong Lei, Wenwu Wang, Dong Yu

Main category: cs.SD

TL;DR: Gen-SER reformulates speech emotion recognition as a distribution shift problem using generative models instead of traditional classification or LLM approaches.

DetailsMotivation: Current SER methods rely on classification models or large language models, but these approaches may not fully capture the nuanced nature of emotion recognition as a distributional problem.

Method: Projects discrete emotion labels into continuous space using sinusoidal taxonomy encoding, then uses target-matching generative models to transform initial distributions into terminal distributions. Classification is done by comparing generated distributions with ground truth distributions.
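
A small sketch of the label-encoding side, assuming a transformer-style sinusoidal scheme and a cosine-similarity readout; dimensions and frequencies are illustrative, not the paper's exact taxonomy encoding.

```python
import numpy as np

# Sketch of sinusoidal taxonomy encoding: each discrete emotion label gets a
# deterministic continuous anchor, and a generated terminal vector is
# classified by cosine similarity to these anchors. Dimensions are assumed.
def sinusoidal_label_encoding(label_id: int, dim: int = 64) -> np.ndarray:
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))      # transformer-style frequencies
    return np.concatenate([np.sin(label_id * freq), np.cos(label_id * freq)])

def classify(generated: np.ndarray, n_labels: int = 4) -> int:
    anchors = np.stack([sinusoidal_label_encoding(k) for k in range(n_labels)])
    sims = anchors @ generated / (np.linalg.norm(anchors, axis=1) * np.linalg.norm(generated))
    return int(np.argmax(sims))

# A noisy stand-in for a "generated terminal distribution" near label 2.
target = sinusoidal_label_encoding(2) + 0.05 * np.random.randn(64)
print(classify(target))  # 2, the encoded label
```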

Result: Experimental results confirm the method’s efficacy and demonstrate extensibility to various speech-understanding tasks, suggesting broader applicability to classification tasks.

Conclusion: Gen-SER provides a novel generative approach to SER that reformulates it as a distribution shift problem, showing promising results and potential for broader classification applications.

Abstract: Speech emotion recognition (SER) is crucial in speech understanding and generation. Most approaches are based on either classification models or large language models. Different from previous methods, we propose Gen-SER, a novel approach that reformulates SER as a distribution shift problem via generative models. We propose to project discrete class labels into a continuous space, and obtain the terminal distribution via sinusoidal taxonomy encoding. The target-matching-based generative model is adopted to transform the initial distribution into the terminal distribution efficiently. The classification is achieved by calculating the similarity of the generated terminal distribution and ground truth terminal distribution. The experimental results confirm the efficacy of the proposed method, demonstrating its extensibility to various speech-understanding tasks and suggesting its potential applicability to a broader range of classification tasks.

[315] Structural and Statistical Audio Texture Knowledge Distillation for Environmental Sound Classification

Jarin Ritu, Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples

Main category: cs.SD

TL;DR: SSATKD framework improves environmental sound classification by combining high-level context with low-level audio texture features through knowledge distillation, tested on diverse datasets with consistent accuracy gains.

DetailsMotivation: Current knowledge distillation methods for environmental sound classification overlook essential low-level audio texture features needed to capture local patterns in complex acoustic environments.

Method: Proposed Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework that extracts both high-level contextual information and low-level structural/statistical audio textures from intermediate layers. Tested on four diverse datasets (two passive sonar, two general environmental sound) with two teacher adaptation strategies and various teacher architectures.
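
A hedged sketch of what the combined objective could look like, using a Gram matrix for the structural texture term and channel mean/std for the statistical term as stand-ins for SSATKD's texture modules; the weights `a` and `b` are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the combined objective: logit distillation plus structural (Gram)
# and statistical (channel mean/std) texture matching on intermediate features.
# The texture statistics and weights are our stand-ins, not SSATKD's exact modules.
def ssatkd_loss(s_logits, t_logits, s_feat, t_feat, T=4.0, a=0.5, b=0.5):
    kd = F.kl_div(F.log_softmax(s_logits / T, -1),
                  F.softmax(t_logits / T, -1), reduction="batchmean") * T * T
    gram = lambda f: torch.einsum("bchw,bdhw->bcd", f, f) / (f.shape[2] * f.shape[3])
    structural = F.mse_loss(gram(s_feat), gram(t_feat))        # local pattern structure
    stats = lambda f: torch.cat([f.flatten(2).mean(-1), f.flatten(2).std(-1)], dim=1)
    statistical = F.mse_loss(stats(s_feat), stats(t_feat))     # texture statistics
    return kd + a * structural + b * statistical

s_l, t_l = torch.randn(8, 50), torch.randn(8, 50)
s_f, t_f = torch.randn(8, 16, 32, 32), torch.randn(8, 16, 32, 32)
print(ssatkd_loss(s_l, t_l, s_f, t_f))
```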

Result: Experimental results demonstrate consistent accuracy improvements across all datasets and settings, confirming the effectiveness and robustness of SSATKD in real-world sound classification tasks.

Conclusion: SSATKD framework successfully addresses the gap in capturing essential audio texture features for environmental sound classification, showing generalizability across diverse applications and proving effective for real-world sound classification tasks.

Abstract: While knowledge distillation has shown success in various audio tasks, its application to environmental sound classification often overlooks essential low-level audio texture features needed to capture local patterns in complex acoustic environments. To address this gap, the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework is proposed, which combines high-level contextual information with low-level structural and statistical audio textures extracted from intermediate layers. To evaluate its generalizability to a broad range of applications, SSATKD is tested on four diverse datasets within the environmental sound classification domain, namely two passive sonar datasets: DeepShip and Vessel Type Underwater Acoustic Data (VTUAD) and two general environmental sound datasets: Environmental Sound Classification 50 (ESC-50) and UrbanSound8K. Two teacher adaptation strategies are explored: classifier-head-only adaptation and full fine-tuning. The framework is further evaluated using various convolutional and transformer-based teacher models. Experimental results demonstrate consistent accuracy improvements across all datasets and settings, confirming the effectiveness and robustness of SSATKD in real-world sound classification tasks.

[316] Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

Duc-Tuan Truong, Tianchi Liu, Junjie Li, Ruijie Tao, Kong Aik Lee, Eng Siong Chng

Main category: cs.SD

TL;DR: A DPDA training framework with gradient alignment reduces optimization conflicts between original and augmented speech inputs, yielding up to an 18.69% relative EER reduction in SDD.

DetailsMotivation: Data augmentation in speech deepfake detection can cause conflicting gradient updates between original and augmented inputs, hindering convergence and leading to suboptimal solutions.

Method: Dual-path data-augmented (DPDA) framework processes each utterance through two paths (original and augmented), then aligns their backpropagated gradient directions to reduce optimization conflicts.
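
The alignment step can be sketched with a PCGrad-style projection; whether DPDA uses exactly this rule is an assumption, but it illustrates how conflicting gradient directions between the two paths can be reconciled.

```python
import torch

# Sketch of one dual-path alignment: when the original-path and augmented-path
# gradients conflict (negative dot product), project the augmented gradient
# onto the normal plane of the original one before combining (PCGrad-style).
def aligned_gradient(g_orig, g_aug):
    dot = torch.dot(g_orig, g_aug)
    if dot < 0:  # conflicting directions
        g_aug = g_aug - dot / g_orig.norm().pow(2) * g_orig
    return g_orig + g_aug

g1, g2 = torch.randn(10), torch.randn(10)
g = aligned_gradient(g1, g2)
print(torch.dot(g1, g) >= 0)  # the combined update never opposes the original path
```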

Result: Analysis shows ~25% of training iterations have gradient conflicts with RawBoost augmentation. Gradient alignment accelerates convergence and achieves up to 18.69% relative EER reduction on In-the-Wild dataset.

Conclusion: Gradient alignment in DPDA framework effectively resolves optimization conflicts from data augmentation, improving speech deepfake detection performance and training efficiency.

Abstract: In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual-path data-augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other with its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence by reducing the number of training epochs and achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.

[317] Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal

Main category: cs.SD

TL;DR: A training method to make audio autoencoder latent spaces linear for intuitive mixing and scaling operations without changing model architecture.

DetailsMotivation: Audio autoencoders create useful compressed representations but their non-linear latent spaces prevent intuitive algebraic manipulation like mixing or scaling audio content.

Method: Simple training methodology using data augmentation to induce linearity in Consistency Autoencoders, achieving homogeneity (equivariance to scalar gain) and additivity (decoder preserves addition) without architectural changes.
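
A minimal sketch of augmentation-based linearity losses, assuming MSE penalties on random-gain and mixture consistency; the actual training recipe may weight or schedule these differently.

```python
import torch
import torch.nn.functional as F

# Sketch of linearity-inducing augmentation: penalize the encoder for breaking
# homogeneity (gain equivariance) and additivity (mixes map to latent sums).
# The loss weighting and gain range are assumptions; `encoder` is any nn.Module.
def linearity_losses(encoder, x1, x2):
    g = torch.empty(x1.shape[0], 1, 1).uniform_(0.25, 1.0)   # random gains
    z1, z2 = encoder(x1), encoder(x2)
    homogeneity = F.mse_loss(encoder(g * x1), g * z1)        # enc(g*x) ~ g*enc(x)
    additivity = F.mse_loss(encoder(x1 + x2), z1 + z2)       # enc(x1+x2) ~ z1+z2
    return homogeneity + additivity

encoder = torch.nn.Conv1d(1, 8, 9, padding=4)                # stand-in encoder
x1, x2 = torch.randn(2, 1, 4096), torch.randn(2, 1, 4096)
print(linearity_losses(encoder, x1, x2))
```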

Result: The CAE exhibits linear behavior in both encoder and decoder while preserving reconstruction fidelity, enabling practical applications like music source composition and separation via simple latent arithmetic.

Conclusion: The work presents a straightforward technique for constructing structured latent spaces that enable more intuitive and efficient audio processing through linear operations.

Abstract: Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model’s architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.

[318] Diffusion Timbre Transfer Via Mutual Information Guided Inpainting

Ching Ho Lee, Javier Nistal, Stefan Lattner, Marco Pasini, George Fazekas

Main category: cs.SD

TL;DR: A lightweight inference-time method for timbre transfer using pre-trained latent diffusion models without additional training, featuring noise injection and early-step clamping to preserve musical structure.

DetailsMotivation: To enable timbre transfer (changing instrument sounds) as an inference-time editing problem for music audio, leveraging existing pre-trained models without requiring costly retraining.

Method: Two main techniques: (1) dimension-wise noise injection targeting latent channels most informative of instrument identity, and (2) early-step clamping mechanism that re-imposes input’s melodic and rhythmic structure during reverse diffusion. Works directly on audio latents and is compatible with text/audio conditioning like CLAP.
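
The two controls can be sketched as an inference loop; `denoise_step` and the timbre channel mask are placeholders for a real latent diffusion model and its identified channels, and the noise scale and clamp fraction are assumptions.

```python
import torch

# Inference-loop sketch: targeted noise on assumed "timbre" channels, then
# early-step clamping of the remaining channels back to the input latents.
def edit_loop(z_input, denoise_step, timbre_mask, n_steps=50, clamp_frac=0.3):
    z = z_input + 0.8 * timbre_mask * torch.randn_like(z_input)  # dimension-wise noise
    for step in range(n_steps):
        z = denoise_step(z, step)
        if step < clamp_frac * n_steps:
            # re-impose melody/rhythm from the input on structural channels
            z = timbre_mask * z + (1 - timbre_mask) * z_input
    return z

z_in = torch.randn(1, 8, 256)                       # toy latent: 8 channels
mask = torch.zeros(1, 8, 1)
mask[:, :3] = 1.0                                   # assume 3 timbre-informative channels
out = edit_loop(z_in, lambda z, s: 0.98 * z, mask)  # dummy denoiser
print(out.shape)
```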

Result: The method demonstrates effective timbre transfer while preserving musical structure, with analysis of trade-offs between timbral change and structural preservation. Shows that simple inference-time controls can meaningfully steer pre-trained models for style-transfer applications.

Conclusion: Lightweight inference-time procedures can effectively repurpose pre-trained latent diffusion models for timbre transfer tasks without additional training, offering practical controls for balancing timbral transformation with structural preservation in music audio editing.

Abstract: We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices, analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.

[319] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

Luca Cerovaz, Michele Mancusi, Emanuele Rodolà

Main category: cs.SD

TL;DR: Complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling, eliminates adversarial discriminators, matches SOTA performance with 10x less training

DetailsMotivation: Current spectral-domain audio codecs struggle with phase modeling, either ignoring phase or encoding it as separate real channels, which limits spatial fidelity and requires adversarial discriminators that hurt training stability

Method: End-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling throughout analysis-quantization-synthesis pipeline, removing adversarial discriminators and diffusion post-filters

Result: Matches or surpasses much longer-trained baselines in-domain, achieves SOTA out-of-domain performance, reduces training budget by order of magnitude while preserving high perceptual quality

Conclusion: Complex-valued representation with proper magnitude-phase coupling enables efficient, stable audio codec training without GANs or diffusion, achieving superior performance with significantly less compute

Abstract: Audio codecs power discrete music generative modelling, music streaming and immersive media by shrinking PCM audio to bandwidth-friendly bit-rates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domain methods typically struggle with phase modeling, as phase is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails the need to introduce adversarial discriminators at the expense of convergence speed and training stability to compensate for the inadequate representation power of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reduces the training budget by an order of magnitude and is markedly more compute-efficient while preserving high perceptual quality.

cs.LG

[320] Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Minseo Kwak, Jaehyung Kim

Main category: cs.LG

TL;DR: Gap-K% is a new method for detecting pretraining data in LLMs that uses the probability gap between top-1 predicted and target tokens, with sliding windows to capture local correlations, achieving SOTA performance on benchmarks.

DetailsMotivation: The opacity of massive pretraining corpora raises privacy and copyright concerns, making pretraining data detection critical. Existing methods relying on token likelihoods often overlook divergence from top-1 predictions and local correlations between adjacent tokens.

Method: Gap-K% analyzes optimization dynamics of LLM pretraining, using the log probability gap between the top-1 predicted token and target token. It incorporates a sliding window strategy to capture local correlations and mitigate token-level fluctuations.
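
The scoring rule is simple enough to sketch directly; the sign convention and the top-K% aggregation of window scores below are our assumptions about how the method pools per-token gaps.

```python
import torch
import torch.nn.functional as F

# Sketch of the Gap-K% score: for each position, take the log-prob gap between
# the model's top-1 token and the actual next token, smooth it with a sliding
# window, and score the sequence by the mean of the largest K% window values.
def gap_k_score(logits, targets, window=5, k_pct=20):
    logp = F.log_softmax(logits, dim=-1)              # (seq, vocab)
    top1 = logp.max(dim=-1).values
    tgt = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    gap = top1 - tgt                                  # >= 0; small for memorized text
    smoothed = gap.unfold(0, window, 1).mean(dim=-1)  # sliding-window mean
    k = max(1, int(len(smoothed) * k_pct / 100))
    return smoothed.topk(k).values.mean().item()      # lower => likely seen in training

logits = torch.randn(32, 1000)
targets = torch.randint(0, 1000, (32,))
print(gap_k_score(logits, targets))
```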

Result: Extensive experiments on WikiMIA and MIMIR benchmarks show Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.

Conclusion: Gap-K% provides an effective pretraining data detection method grounded in LLM optimization dynamics, addressing limitations of existing approaches by considering prediction divergence and local token correlations.

Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the divergence from the model’s top-1 prediction and local correlation between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model’s top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.

[321] DecHW: Heterogeneous Decentralized Federated Learning Exploiting Second-Order Information

Adnan Ahmad, Chiara Boldrini, Lorenzo Valerio, Andrea Passarella, Marco Conti

Main category: cs.LG

TL;DR: A novel decentralized federated learning approach addresses data/model heterogeneity by using second-order information to generate consensus weights for robust aggregation, improving generalization with reduced communication costs.

DetailsMotivation: DFL faces challenges from data and model initialization heterogeneities across devices due to varying individual experiences and interaction levels, leading to parameter variations and slower convergence.

Method: Proposes capturing parameter variations via second-order information approximation of local models on their datasets to generate consensus weights, which scale neighborhood updates before aggregation into global neighborhood representation.
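
One concrete reading, with the diagonal Fisher (squared gradients on local data) standing in for the "second-order information"; this is an assumption on our part, since the summary above does not pin the approximation down.

```python
import torch

# Sketch of second-order consensus aggregation: approximate each neighbor's
# parameter-wise evidential credence with a diagonal Fisher estimate, then
# aggregate parameters weighted by normalized credence.
def consensus_aggregate(params, fishers, eps=1e-8):
    # params, fishers: lists of 1-D tensors, one per neighbor
    W = torch.stack(fishers) + eps
    W = W / W.sum(dim=0, keepdim=True)          # per-parameter consensus weights
    return (W * torch.stack(params)).sum(dim=0)

p = [torch.randn(6) for _ in range(3)]          # 3 neighbors' flattened models
f = [torch.rand(6) for _ in range(3)]           # their diagonal Fisher estimates
print(consensus_aggregate(p, f))
```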

Result: Extensive computer vision experiments show strong generalizability of local models at reduced communication costs.

Conclusion: The approach effectively tackles data and model heterogeneity in DFL through parameter-level evidential credence analysis and robust aggregation, achieving better performance with lower communication overhead.

Abstract: Decentralized Federated Learning (DFL) is a serverless collaborative machine learning paradigm where devices collaborate directly with neighbouring devices to exchange model information for learning a generalized model. However, variations in individual experiences and different levels of device interactions lead to data and model initialization heterogeneities across devices. Such heterogeneities leave variations in local model parameters across devices that lead to slower convergence. This paper tackles data and model heterogeneity by explicitly addressing the varying parameter-level evidential credence across local models. A novel aggregation approach is introduced that captures these parameter variations in local models and performs robust aggregation of neighbourhood local updates. Specifically, consensus weights are generated via approximation of second-order information of local models on their local datasets. These weights are utilized to scale neighbourhood updates before aggregating them into a global neighbourhood representation. In extensive experiments with computer vision tasks, the proposed approach shows strong generalizability of local models at reduced communication costs.

[322] oculomix: Hierarchical Sampling for Retinal-Based Systemic Disease Prediction

Hyunmin Kim, Yukun Zhou, Rahul A. Jonas, Lie Ju, Sunjin Hwang, Pearse A. Keane, Siegfried K. Wagner

Main category: cs.LG

TL;DR: Oculomix: A hierarchical sampling strategy for mixed sample augmentations in retinal imaging that preserves patient-specific attributes by constraining mixing to patient and exam levels, outperforming image-level methods by up to 3% AUROC for cardiovascular event prediction.

DetailsMotivation: Current mixed sample data augmentations (CutMix, MixUp) used for training transformer-based foundation models in oculomics perturb patient-specific attributes like medical comorbidities and clinical factors because they only consider images and labels, not the hierarchical patient-exam relationships.

Method: Oculomix uses a hierarchical sampling strategy based on two clinical priors: 1) images from same patient at same time point share attributes (exam level), 2) images from same patient at different time points have soft temporal trends (patient level). The method constrains mixing space to patient and exam levels to preserve patient-specific characteristics.
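
A sketch of the hierarchical sampling step, assuming partners are drawn from the same exam or the same patient with some probability; the level probabilities and Beta parameters are illustrative.

```python
import random
import numpy as np

# Sketch of hierarchy-constrained MixUp: a mixing partner is drawn from the
# same exam (shared attributes) or the same patient (soft temporal trend)
# instead of the whole batch. The level probabilities are assumptions.
def oculomix_pairs(records, p_exam=0.5):
    # records: list of dicts with 'patient', 'exam', 'image' keys
    by_exam, by_patient = {}, {}
    for i, r in enumerate(records):
        by_exam.setdefault((r["patient"], r["exam"]), []).append(i)
        by_patient.setdefault(r["patient"], []).append(i)
    pairs = []
    for i, r in enumerate(records):
        pool = by_exam[(r["patient"], r["exam"])] if random.random() < p_exam \
               else by_patient[r["patient"]]
        j = random.choice(pool)                  # may be i itself if no sibling exists
        lam = np.random.beta(0.4, 0.4)           # standard MixUp coefficient
        pairs.append((i, j, lam))
    return pairs

records = [{"patient": p, "exam": e, "image": None} for p in "AB" for e in (0, 1)]
print(oculomix_pairs(records))
```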

Result: Oculomix consistently outperforms image-level CutMix and MixUp by up to 3% in AUROC for five-year prediction of major adverse cardiovascular events (MACE) using ViT models on the Alzeye dataset (large ethnically diverse population).

Conclusion: The hierarchical sampling approach is necessary and valuable for oculomics as it better preserves patient-specific characteristics during data augmentation, leading to improved performance in predicting systemic diseases from retinal imaging.

Abstract: Oculomics - the concept of predicting systemic diseases, such as cardiovascular disease and dementia, through retinal imaging - has advanced rapidly due to the data efficiency of transformer-based foundation models like RETFound. Image-level mixed sample data augmentations, such as CutMix and MixUp, are frequently used for training transformers, yet these techniques perturb patient-specific attributes, such as medical comorbidity and clinical factors, since they only account for images and labels. To address this limitation, we propose a hierarchical sampling strategy, Oculomix, for mixed sample augmentations. Our method is based on two clinical priors. First (exam level), images acquired from the same patient at the same time point share the same attributes. Second (patient level), images acquired from the same patient at different time points have a soft temporal trend, as morbidity generally increases over time. Guided by these priors, our method constrains the mixing space to the patient and exam levels to better preserve patient-specific characteristics and leverages their hierarchical relationships. The proposed method is validated using ViT models on a five-year prediction of major adverse cardiovascular events (MACE) in a large ethnically diverse population (Alzeye). We show that Oculomix consistently outperforms image-level CutMix and MixUp by up to 3% in AUROC, demonstrating the necessity and value of the proposed method in oculomics.

[323] Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

Tobias Habermann, Michael Mecik, Zhenyu Wang, César David Vera, Martin Kumm, Mario Garrido

Main category: cs.LG

TL;DR: Novel data-rate-aware continuous-flow CNN architecture for FPGAs that achieves near 100% hardware utilization by interleaving low data rate signals and sharing hardware units.

DetailsMotivation: Previous unrolled FPGA implementations focused on fully connected networks, but CNNs require fewer computations for same accuracy. However, CNN pooling and strided convolutional layers reduce data rates, causing hardware underutilization in fully parallel implementations.

Method: Analyzes CNN data flow and proposes data-rate-aware continuous-flow architecture with interleaving of low data rate signals, hardware unit sharing, and optimal parallelization to match fully parallel throughput.
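
Back-of-the-envelope bookkeeping illustrates the idea: every stride-2 or pooling stage divides the downstream data rate, and the inverse of that rate bounds how many low-rate streams can share one hardware unit. This is our illustration, not the paper's design tool.

```python
# Data-rate bookkeeping sketch: each 2-D stride-s (or pooling) stage divides
# the output data rate by s^2, and the inverse rate bounds how many low-rate
# streams could share one time-multiplexed hardware unit.
def layer_rates(strides):
    rate, plan = 1.0, []
    for i, s in enumerate(strides):
        rate /= s * s
        plan.append((i, rate, int(round(1 / rate))))
    return plan

# e.g. three stride-2 stages: rates 1/4, 1/16, 1/64 of the input sample rate
for layer, rate, share in layer_rates([2, 2, 2]):
    print(f"layer {layer}: data rate x{rate:g}, {share} streams per shared unit")
```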

Result: Achieves high hardware utilization close to 100%, saves significant arithmetic logic, enables implementation of complex CNNs like MobileNet on single FPGA with high throughput.

Conclusion: The proposed approach effectively addresses data rate reduction issues in CNN hardware accelerators, making complex CNN deployment on FPGAs feasible with efficient resource utilization.

Abstract: Among hardware accelerators for deep-learning inference, data flow implementations offer low latency and high throughput capabilities. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, although it is well known that convolutional neural networks (CNNs) require fewer computations for the same accuracy. When observing the data flow in CNNs, pooling layers and convolutional layers with a stride larger than one reduce the amount of data at their output with respect to their input. This data reduction strongly affects the data rate in a fully parallel implementation, making hardware units heavily underutilized unless it is handled properly. This work addresses this issue by analyzing the data flow of CNNs and presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures. The proposed approach ensures a high hardware utilization close to 100% by interleaving low data rate signals and sharing hardware units, as well as using the right parallelization to achieve the throughput of a fully parallel implementation. The results show that a significant amount of the arithmetic logic can be saved, which allows implementing complex CNNs like MobileNet on a single FPGA with high throughput.

[324] Scaling Next-Brain-Token Prediction for MEG

Richard Csaky

Main category: cs.LG

TL;DR: A large autoregressive model for source-space MEG that scales next-token prediction across datasets and scanners, generating minutes of MEG from context using a modified SEANet-style vector-quantizer and Qwen2.5-VL backbone.

DetailsMotivation: To develop a model capable of generating realistic, long-horizon magnetoencephalography (MEG) data that generalizes across different datasets and scanners, addressing the challenge of scaling brain signal generation to handle large corpora with long context windows.

Method: Uses a modified SEANet-style vector-quantizer to reduce multichannel MEG into flattened token streams, then trains a Qwen2.5-VL backbone from scratch for next-token prediction. Introduces three evaluation tests: on-manifold stability via generated-only drift analysis, and conditional specificity via prompt-swap controls with neurophysiologically grounded metrics.

Result: The model was trained on CamCAN and Omega datasets and evaluated on held-out MOUS dataset, demonstrating cross-dataset generalization. Generated MEG sequences remain relatively stable over long rollouts and are closer to correct continuations than swapped controls across all metrics.

Conclusion: The approach successfully scales autoregressive MEG generation to handle long context windows across diverse datasets, producing stable and specific brain signal generations that generalize well to unseen data from different scanners.

Abstract: We present a large autoregressive model for source-space MEG that scales next-token prediction to long context across datasets and scanners: handling a corpus of over 500 hours and thousands of sessions across the three largest MEG datasets. A modified SEANet-style vector-quantizer reduces multichannel MEG into a flattened token stream on which we train a Qwen2.5-VL backbone from scratch to predict the next brain token and to recursively generate minutes of MEG from up to a minute of context. To evaluate long-horizon generation, we introduce three task-matched tests: (i) on-manifold stability via generated-only drift compared to the time-resolved distribution of real sliding windows, and (ii) conditional specificity via correct context versus prompt-swap controls using a neurophysiologically grounded metric set. We train on CamCAN and Omega and run all analyses on held-out MOUS, establishing cross-dataset generalization. Across metrics, generations remain relatively stable over long rollouts and are closer to the correct continuation than swapped controls. Code available at: https://github.com/ricsinaruto/brain-gen.

[325] Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds

Faruk Alpay, Bugra Kilictas

Main category: cs.LG

TL;DR: Transformers develop multi-step reasoning through a geometric phase transition where hidden states collapse into low-dimensional concept basins, forming reusable Transient Class Objects.

DetailsMotivation: To understand how deep Transformer models develop multi-step reasoning capabilities through geometric and statistical physics analysis of their internal representations.

Method: Analyze layerwise covariance spectrum of activations, track deviations from random-matrix bulk, use sparsity-based order parameter, formalize forward pass as coarse-graining map, and validate with layerwise probes across model scales (1.5B-30B).
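
The order parameter from the abstract is directly computable; the participation-ratio estimate of effective dimensionality below is our choice of estimator, as the paper may use a different one.

```python
import numpy as np

# Omega(h) = 1 - ||h||_1 / (sqrt(d) * ||h||_2), as defined in the abstract,
# plus a participation-ratio effective dimensionality (our chosen estimator).
def omega(h):
    d = h.shape[-1]
    return 1.0 - np.abs(h).sum(-1) / (np.sqrt(d) * np.linalg.norm(h, axis=-1))

def effective_dim(H):                         # H: (samples, d) activations
    eig = np.linalg.eigvalsh(np.cov(H.T))
    return eig.sum() ** 2 / (eig ** 2).sum()  # participation ratio

dense = np.random.randn(512)
sparse = np.zeros(512)
sparse[:8] = np.random.randn(8)
print(omega(dense), omega(sparse))            # localized vectors score near 1

H = np.random.randn(1000, 32) @ np.random.randn(32, 128)  # rank-32 activations
print(effective_dim(H))                       # far below the ambient 128 dims
```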

Result: Observe sharp reduction in effective dimensionality consistent with phase transition at critical depth γ_c≈0.42, formation of stable “concept basins” as fixed points, spectral tail collapse, and emergence of Transient Class Objects (TCOs) in representation space.

Conclusion: Multi-step reasoning emerges through geometric phase transition where representations collapse into low-entropy concept basins, forming reusable object-like structures that enable logical separability and reasoning.

Abstract: We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. Treating the hidden-state trajectory as a flow on an implicit Riemannian manifold, we analyze the layerwise covariance spectrum of activations, where $C^{(\ell)}=\mathbb{E}[h^{(\ell)}h^{(\ell)\top}]$, and track deviations from a random-matrix bulk. Across model scales (1.5B–30B), we observe a sharp reduction in effective dimensionality consistent with a phase transition: an order parameter based on sparsity/localization, $\Omega(h)=1-\|h\|_1/(\sqrt{d}\,\|h\|_2)$, exhibits a discontinuity near a critical normalized depth $\gamma_c\approx 0.42$ in sufficiently large models. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable “concept basins” to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space, which we call Transient Class Objects (TCOs). We provide theoretical conditions connecting logical separability to spectral decay and validate the predicted signatures with layerwise probes on multiple open-weight model families.

[326] Emergent Specialization in Learner Populations: Competition as the Source of Diversity

Yuhao Li

Main category: cs.LG

TL;DR: Competition alone induces emergent specialization in learner populations without communication or diversity incentives, validated across six real-world domains with strong performance gains.

DetailsMotivation: To understand how populations of learners can develop coordinated, diverse behaviors without explicit communication or diversity incentives, inspired by ecological niche theory.

Method: NichePopulation algorithm combining competitive exclusion with niche affinity tracking, tested across six real-world domains (cryptocurrency trading, commodity prices, weather forecasting, solar irradiance, urban traffic, and air quality).
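
A toy sketch of competitive exclusion with niche-affinity tracking; the update rule and the role of lambda here are illustrative assumptions about the mechanism described above.

```python
import numpy as np

# Toy sketch: each regime's reward goes only to the best learner (competitive
# exclusion), and an exponential-moving-average affinity nudges learners back
# toward their niches. Update rules and lambda's role are our assumptions.
def niche_population_step(skill, affinity, regime, lam=0.1, lr=0.05, decay=0.9):
    perf = skill[:, regime] + lam * affinity[:, regime]          # affinity-biased
    winner = int(np.argmax(perf))                                # competitive exclusion
    skill[winner, regime] += lr * (1.0 - skill[winner, regime])  # only the winner learns
    affinity *= decay
    affinity[winner, regime] += 1.0 - decay                      # EMA niche affinity
    return winner

rng = np.random.default_rng(0)
skill = rng.random((4, 3)) * 0.1          # 4 learners, 3 environmental regimes
affinity = np.zeros((4, 3))
for t in range(300):
    niche_population_step(skill, affinity, regime=t % 3)
print(np.argmax(skill, axis=0))           # which learner specialized in each regime
```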

Result: Achieved mean Specialization Index of 0.75 with effect sizes Cohen’s d > 20; learners still achieve SI > 0.30 without niche bonus; diverse populations outperform homogeneous baselines by +26.5%; outperforms MARL baselines (QMIX, MAPPO, IQL) by 4.3x while being 4x faster.

Conclusion: Competition alone is sufficient to induce emergent specialization in learner populations, enabling method-level division of labor that significantly outperforms both homogeneous populations and existing multi-agent reinforcement learning approaches.

Abstract: How can populations of learners develop coordinated, diverse behaviors without explicit communication or diversity incentives? We demonstrate that competition alone is sufficient to induce emergent specialization – learners spontaneously partition into specialists for different environmental regimes through competitive dynamics, consistent with ecological niche theory. We introduce the NichePopulation algorithm, a simple mechanism combining competitive exclusion with niche affinity tracking. Validated across six real-world domains (cryptocurrency trading, commodity prices, weather forecasting, solar irradiance, urban traffic, and air quality), our approach achieves a mean Specialization Index of 0.75 with effect sizes of Cohen’s d > 20. Key findings: (1) At lambda=0 (no niche bonus), learners still achieve SI > 0.30, proving specialization is genuinely emergent; (2) Diverse populations outperform homogeneous baselines by +26.5% through method-level division of labor; (3) Our approach outperforms MARL baselines (QMIX, MAPPO, IQL) by 4.3x while being 4x faster.

[327] Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods

Valery Manokhin, Daniel Grønhaug

Main category: cs.LG

TL;DR: Benchmark of 21 classifiers and 5 calibration methods on binary tabular data shows Venn-Abers predictors achieve largest average log-loss reductions, while common methods like Platt scaling can degrade performance for modern models.

DetailsMotivation: To systematically evaluate model-agnostic post-hoc calibration methods for improving probabilistic predictions in binary classification, with focus on methods providing distribution-free validity guarantees under exchangeability, and to understand how calibration affects modern tabular models.

Method: Benchmarked 21 classifiers (linear models, SVMs, tree ensembles, neural/ foundation models) on TabArena-v0.1 binary tasks using 5-fold cross-validation. Five calibrators (Isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, Pearsonify) trained on separate calibration split and applied to test predictions. Evaluated using proper scoring rules (log-loss, Brier score), diagnostic measures (Spiegelhalter’s Z, ECE, ECI), discrimination (AUC-ROC), and classification metrics.
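
A minimal version of the protocol is easy to reproduce with scikit-learn, using isotonic regression as the calibrator; Venn-Abers, Beta calibration, and Pearsonify would slot into the same train/calibration/test split but need their own implementations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Three-way split: fit the classifier on train, fit the post-hoc calibrator on
# a held-out calibration split, and compare test log-loss before and after.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p_cal, p_te = clf.predict_proba(X_cal)[:, 1], clf.predict_proba(X_te)[:, 1]

iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip").fit(p_cal, y_cal)
p_te_cal = np.clip(iso.predict(p_te), 1e-6, 1 - 1e-6)   # avoid log(0)

print("raw log-loss:       ", log_loss(y_te, p_te))
print("calibrated log-loss:", log_loss(y_te, p_te_cal))
```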

Result: Venn-Abers predictors achieve largest average reductions in log-loss, followed by Beta calibration. Platt scaling shows weaker, inconsistent effects. Beta calibration improves log-loss most frequently, while Venn-Abers has fewer extreme degradations and slightly more extreme improvements. Common calibration procedures (especially Platt scaling and isotonic regression) can systematically degrade proper scoring performance for strong modern tabular models. Calibration effects vary substantially across datasets and architectures with no uniform dominance.

Conclusion: Calibration effects are dataset- and architecture-dependent with no universally dominant method. Venn-Abers and Beta calibration perform best overall, while commonly used methods like Platt scaling and isotonic regression can harm modern models’ probabilistic predictions. Classification performance is generally preserved, but calibration improvements are marginal for accuracy.

Abstract: We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures (Spiegelhalter’s Z, ECE, and ECI), alongside discrimination (AUC-ROC) and standard classification metrics. Across tasks and architectures, Venn-Abers predictors achieve the largest average reductions in log-loss, followed closely by Beta calibration, while Platt scaling exhibits weaker and less consistent effects. Beta calibration improves log-loss most frequently across tasks, whereas Venn-Abers displays fewer instances of extreme degradation and slightly more instances of extreme improvement. Importantly, we find that commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance for strong modern tabular models. Overall classification performance is often preserved, but calibration effects vary substantially across datasets and architectures, and no method dominates uniformly. In expectation, all methods except Pearsonify slightly increase accuracy, but the effect is marginal, with the largest expected gain about 0.008%.

[328] NCSAM Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Jiayu Xu, Junbiao Pang

Main category: cs.LG

TL;DR: NCSAM uses simulated label noise and sharpness-aware minimization to improve generalization and robustness against noisy labels.

DetailsMotivation: Real-world datasets often contain noisy labels, but current approaches focus on label correction. This paper takes a novel theoretical approach by analyzing the relationship between loss landscape flatness and label noise.

Method: Proposes Noise-Compensated Sharpness-aware Minimization (NCSAM), which leverages perturbations from Sharpness-Aware Minimization (SAM) to counteract label noise damage. Theoretically demonstrates that carefully simulated label noise enhances both generalization and robustness.
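
A sketch of one NCSAM-style step, combining simulated label flips with a standard SAM ascent-descent update; the noise rate and rho are illustrative, and the paper's exact noise simulation may differ.

```python
import torch

# One illustrative NCSAM-style step: inject a small rate of simulated label
# noise, ascend to the worst-case nearby weights (SAM), then descend from there.
def ncsam_step(model, loss_fn, x, y, opt, rho=0.05, noise_rate=0.1, n_classes=10):
    flip = torch.rand_like(y, dtype=torch.float) < noise_rate
    y_noisy = torch.where(flip, torch.randint_like(y, n_classes), y)

    loss_fn(model(x), y_noisy).backward()               # first pass: gradients at w
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():                    # ascend to the sharp point
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    opt.zero_grad()
    loss_fn(model(x), y_noisy).backward()               # gradients at perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                                   # restore original weights
    opt.step()
    opt.zero_grad()

model = torch.nn.Linear(8, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ncsam_step(model, torch.nn.functional.cross_entropy,
           torch.randn(4, 8), torch.randint(0, 10, (4,)), opt)
```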

Result: Extensive experiments on multiple benchmark datasets show consistent superiority over state-of-the-art approaches across diverse tasks. Testing accuracy exhibits similar behavior to noise-clear datasets.

Conclusion: NCSAM effectively addresses noisy label problems by combining simulated label noise with sharpness-aware optimization, providing both theoretical insights and practical performance improvements.

Abstract: Learning from Noisy Labels (LNL) presents a fundamental challenge in deep learning, as real-world datasets often contain erroneous or corrupted annotations, e.g., data crawled from the Web. Current research focuses on sophisticated label correction mechanisms. In contrast, this paper adopts a novel perspective by establishing a theoretical analysis of the relationship between the flatness of the loss landscape and the presence of label noise. In this paper, we theoretically demonstrate that carefully simulated label noise synergistically enhances both generalization performance and robustness to label noise. Consequently, we propose Noise-Compensated Sharpness-aware Minimization (NCSAM) to leverage the perturbation of Sharpness-Aware Minimization (SAM) to remedy the damage caused by label noise. Our analysis reveals that the testing accuracy exhibits behavior similar to that observed on the noise-clear dataset. Extensive experimental results on multiple benchmark datasets demonstrate the consistent superiority of the proposed method over existing state-of-the-art approaches on diverse tasks.

[329] Probabilistic Sensing: Intelligence in Data Sampling

Ibrahim Albulushi, Saleh Bunaiyan, Suraj S. Cheema, Hesham ElSawy, Feras Al-Dirini

Main category: cs.LG

TL;DR: A probabilistic sensing paradigm using p-neurons enables intelligent, energy-efficient data acquisition with minimal information loss.

DetailsMotivation: Extending sensor intelligence to the data-acquisition process (deciding whether to sample or not) can achieve transformative energy-efficiency gains, but deterministic decision-making risks information loss.

Method: The paradigm is inspired by the autonomic nervous system and employs a probabilistic neuron (p-neuron) driven by an analog feature extraction circuit, achieving microsecond response times that overcome sub-sampling-rate limits.
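
The sampling decision itself is compact enough to sketch; the sigmoid activation and its threshold/slope below are assumptions standing in for the analog p-neuron circuit.

```python
import numpy as np

# Sketch of the probabilistic sampling decision: an extracted analog feature
# drives a p-neuron whose firing probability follows a sigmoid, and a random
# draw decides whether a sample is actually acquired. Parameters are assumed.
def p_neuron_sample(feature, theta=0.5, beta=10.0, rng=np.random.default_rng(0)):
    p_fire = 1.0 / (1.0 + np.exp(-beta * (feature - theta)))  # firing probability
    return rng.random() < p_fire                              # stochastic decision

rng = np.random.default_rng(1)
signal = np.abs(np.sin(np.linspace(0, 4 * np.pi, 200))) * rng.random(200)
decisions = [p_neuron_sample(f) for f in signal]
print(f"sampled {sum(decisions)} of {len(decisions)} points")
```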

Result: Validation experiments on active seismic survey data demonstrate lossless probabilistic data acquisition with 0.41% normalized mean squared error and 93% savings in active operation time and number of generated samples.

Conclusion: The probabilistic sensing paradigm enables real-time intelligent autonomous activation of data-sampling with significant energy efficiency gains while maintaining data quality.

Abstract: Extending the intelligence of sensors to the data-acquisition process - deciding whether to sample or not - can result in transformative energy-efficiency gains. However, making such a decision in a deterministic manner involves risk of losing information. Here we present a sensing paradigm that enables making such a decision in a probabilistic manner. The paradigm takes inspiration from the autonomic nervous system and employs a probabilistic neuron (p-neuron) driven by an analog feature extraction circuit. The response time of the system is on the order of microseconds, overcoming the sub-sampling-rate response time limit and enabling real-time intelligent autonomous activation of data-sampling. Validation experiments on active seismic survey data demonstrate lossless probabilistic data acquisition, with a normalized mean squared error of 0.41%, and a 93% saving in the active operation time of the system and the number of generated samples.

[330] MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian

Main category: cs.LG

TL;DR: MeanCache is a training-free caching framework for Flow Matching inference that uses average velocities instead of instantaneous velocities to reduce trajectory deviations, achieving 3.6-4.6X acceleration while maintaining generation quality.

DetailsMotivation: Existing caching methods for Flow Matching inference rely on instantaneous velocity information, which leads to severe trajectory deviations and error accumulation under high acceleration ratios. There's a need for a more stable approach that can handle acceleration scenarios better.

Method: MeanCache introduces an average-velocity perspective using cached Jacobian-vector products (JVP) to construct interval average velocities from instantaneous velocities. It also employs a trajectory-stability scheduling strategy with Peak-Suppressed Shortest Path under budget constraints to optimize cache timing and JVP reuse stability.
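
A minimal sketch of the underlying identity, assuming the interval-average velocity is approximated to first order as v_avg ~ v + (dt/2) * dv/dt, with the total derivative along the trajectory obtained from a single Jacobian-vector product. The toy velocity field stands in for the flow-matching model; MeanCache's caching and scheduling logic is omitted.

```python
import torch
from torch.func import jvp

def velocity(x, t):
    """Toy instantaneous velocity field; stands in for the generative model."""
    return -x * torch.exp(-t)

def average_velocity(x, t, dt):
    """First-order estimate of the interval-average velocity on [t, t + dt]:
    v_avg ~= v + (dt / 2) * dv/dt, where dv/dt is the total derivative
    (dv/dx) v + dv/dt computed via one JVP with tangents (v, 1)."""
    v, dv_dt = jvp(velocity, (x, t), (velocity(x, t), torch.ones_like(t)))
    return v + 0.5 * dt * dv_dt

x = torch.randn(4)
t = torch.zeros(4)
v_bar = average_velocity(x, t, dt=0.25)
```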

Result: Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate MeanCache achieves 4.12X, 4.56X, and 3.59X acceleration respectively, while consistently outperforming state-of-the-art caching baselines in generation quality.

Conclusion: MeanCache provides a simple yet effective approach for Flow Matching inference that mitigates local error accumulation through average velocities, offering a new perspective that could inspire further exploration of stability-driven acceleration in commercial-scale generative models.

Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian–vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12X, 4.56X, and 3.59X acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.

[331] Cross-Session Decoding of Neural Spiking Data via Task-Conditioned Latent Alignment

Canyang Zhao, Bolin Peng, J. Patrick Mayo, Ce Ju, Bing Liu

Main category: cs.LG

TL;DR: The TCLA framework aligns neural representations across sessions for better BCI decoding with limited data.

DetailsMotivation: Cross-session nonstationarity in neural activity makes BCI decoders fail to generalize, especially when limited data is available from new sessions for retraining

Method: Task-Conditioned Latent Alignment (TCLA) uses an autoencoder to learn low-dimensional neural dynamics from a source session, then aligns target-session representations to the source in a task-conditioned manner.
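
A minimal reading of the alignment step, assuming "task-conditioned" means matching class-conditional latent statistics across sessions; the loss below aligns per-condition latent means and is only a sketch of what TCLA optimizes.

```python
import torch

def task_conditioned_alignment_loss(z_target, z_source, cond_t, cond_s):
    """Align target-session latents to source-session latents per task
    condition (e.g., reach direction). Matching class-conditional means
    is one plausible instantiation; TCLA's objective may be richer."""
    conds = cond_t.unique()
    loss = z_target.new_zeros(())
    for c in conds:
        mu_t = z_target[cond_t == c].mean(dim=0)  # target mean for condition c
        mu_s = z_source[cond_s == c].mean(dim=0)  # source mean for condition c
        loss = loss + (mu_t - mu_s).pow(2).sum()
    return loss / len(conds)
```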

Result: TCLA consistently improves decoding performance across datasets, with gains up to 0.386 in R² for velocity decoding in motor dataset compared to baseline methods

Conclusion: TCLA provides effective strategy for transferring knowledge across sessions, enabling more robust neural decoding under limited data conditions

Abstract: Cross-session nonstationarity in neural activity recorded by implanted electrodes is a major challenge for invasive brain-computer interfaces (BCIs), as decoders trained on data from one session often fail to generalize to subsequent sessions. This issue is further exacerbated in practice, as retraining or adapting decoders becomes particularly challenging when only limited data are available from a new session. To address this challenge, we propose a Task-Conditioned Latent Alignment framework (TCLA) for cross-session neural decoding. Building upon an autoencoder architecture, TCLA first learns a low-dimensional representation of neural dynamics from a source session with sufficient data. For target sessions with limited data, TCLA then aligns target latent representations to the source in a task-conditioned manner, enabling effective transfer of learned neural dynamics. We evaluate TCLA on the macaque motor and oculomotor center-out dataset. Compared to baseline methods trained solely on target-session data, TCLA consistently improves decoding performance across datasets and decoding settings, with gains in the coefficient of determination of up to 0.386 for y-coordinate velocity decoding in a motor dataset. These results suggest that TCLA provides an effective strategy for transferring knowledge from source to target sessions, enabling more robust neural decoding under conditions with limited data.

[332] The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Xuelong Li

Main category: cs.LG

TL;DR: Multi-model collaboration follows power-law scaling with total parameter count, achieving better performance than single models, with model diversity driving collaboration gains.

DetailsMotivation: Single LLMs have inherent performance bounds, and while multi-model integration techniques exist, there's no unifying theoretical framework for performance scaling in multi-model collaboration.

Method: Proposes the Law of Multi-model Collaboration, a scaling law predicting LLM ensemble performance limits based on aggregated parameter budget. Uses method-agnostic formulation with idealized integration oracle where loss is determined by minimum loss of any model in the pool.
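
Both the oracle and the scaling fit are easy to express; the sketch below computes the per-sample minimum loss across a model pool and fits a two-parameter power law L(N) ~ a * N**b in log-log space. All numbers are placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# per_model_loss[i, j]: cross-entropy of model j on sample i (placeholder data)
per_model_loss = rng.lognormal(mean=0.0, sigma=0.5, size=(10_000, 5))

# Idealized integration oracle from the paper: per-sample minimum loss.
oracle_loss = per_model_loss.min(axis=1).mean()
print(f"oracle ensemble loss: {oracle_loss:.3f}")

# Fit a two-parameter power law L(N) ~ a * N**b in log-log space across
# hypothetical total parameter budgets (values here are made up).
budgets = np.array([1e9, 3e9, 1e10, 3e10])
losses = np.array([2.1, 1.8, 1.55, 1.35])
b, log_a = np.polyfit(np.log(budgets), np.log(losses), deg=1)
print(f"fitted exponent: {b:.3f}")  # negative slope => power-law decay
```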

Result: Multi-model systems follow power-law scaling with total parameter count, showing more significant improvement trend and lower theoretical loss floor than single model scaling. Heterogeneous model families achieve better performance scaling than homogeneous ones.

Conclusion: Model collaboration represents a critical axis for extending LLM intelligence frontier, with model diversity being a primary driver of collaboration gains.

Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, whose collective performance can surpass that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.

[333] Modeling Cascaded Delay Feedback for Online Net Conversion Rate Prediction: Benchmark, Insights and Solutions

Mingxuan Luo, Guipeng Xv, Sishuo Chen, Xinyu Li, Li Zhang, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chen Lin

Main category: cs.LG

TL;DR: The paper introduces CASCADE, the first large-scale open dataset for NetCVR prediction, and proposes TESLA, a continuous modeling framework that outperforms state-of-the-art methods by addressing cascaded delayed feedback challenges in industrial recommender systems.

DetailsMotivation: Traditional conversion rate (CVR) fails to reflect true recommendation effectiveness because it ignores refund behavior. NetCVR (purchase without refund) better captures user satisfaction and business value, but faces challenges with multi-stage cascaded delayed feedback, lack of open datasets, and no online continuous training schemes.

Method: The authors introduce CASCADE dataset from Taobao and propose TESLA framework with: 1) CVR-refund-rate cascaded architecture, 2) stage-wise debiasing, and 3) delay-time-aware ranking loss for continuous NetCVR modeling.
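
The cascaded decomposition itself is compact: NetCVR = P(convert) * (1 - P(refund | convert)). The two-head PyTorch module below is a minimal sketch of that architecture; the stage-wise debiasing and delay-time-aware ranking loss are omitted, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class CascadedNetCVR(nn.Module):
    """Cascaded decomposition NetCVR = P(convert) * (1 - P(refund | convert)).
    A minimal reading of TESLA's cascaded architecture only."""
    def __init__(self, dim):
        super().__init__()
        self.cvr_head = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.refund_head = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        p_cvr = torch.sigmoid(self.cvr_head(x))        # P(convert | click)
        p_refund = torch.sigmoid(self.refund_head(x))  # P(refund | convert)
        return p_cvr * (1.0 - p_refund)                # predicted NetCVR
```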

Result: TESLA consistently outperforms state-of-the-art methods on CASCADE dataset, achieving absolute improvements of 12.41% in RI-AUC and 14.94% in RI-PRAUC for NetCVR prediction.

Conclusion: The paper successfully addresses NetCVR prediction challenges by providing the first open dataset (CASCADE) and an effective continuous modeling framework (TESLA) that leverages cascaded architecture and delay-time features, significantly improving prediction accuracy.

Abstract: In industrial recommender systems, conversion rate (CVR) is widely used for traffic allocation, but it fails to fully reflect recommendation effectiveness because it ignores refund behavior. To better capture true user satisfaction and business value, net conversion rate (NetCVR), defined as the probability that a clicked item is purchased and not refunded, has been proposed. Unlike CVR, NetCVR prediction involves a more complex multi-stage cascaded delayed feedback process. The two cascaded delays from click to conversion and from conversion to refund have opposite effects, making traditional CVR modeling methods inapplicable. Moreover, the lack of open-source datasets and online continuous training schemes further hinders progress in this area. To address these challenges, we introduce CASCADE (Cascaded Sequences of Conversion and Delayed Refund), the first large-scale open dataset derived from the Taobao app for online continuous NetCVR prediction. Through an in-depth analysis of CASCADE, we identify three key insights: (1) NetCVR exhibits strong temporal dynamics, necessitating online continuous modeling; (2) cascaded modeling of CVR and refund rate outperforms direct NetCVR modeling; and (3) delay time, which correlates with both CVR and refund rate, is an important feature for NetCVR prediction. Based on these insights, we propose TESLA, a continuous NetCVR modeling framework featuring a CVR-refund-rate cascaded architecture, stage-wise debiasing, and a delay-time-aware ranking loss. Extensive experiments demonstrate that TESLA consistently outperforms state-of-the-art methods on CASCADE, achieving absolute improvements of 12.41 percent in RI-AUC and 14.94 percent in RI-PRAUC on NetCVR prediction. The code and dataset are publicly available at https://github.com/alimama-tech/NetCVR.

[334] Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

Jinlin Liu, Wei Chen, Xiaojin Zhang

Main category: cs.LG

TL;DR: PIL (Perturbation-Induced Linearization) is a computationally efficient method for generating unlearnable examples using linear surrogate models instead of deep neural networks, achieving comparable protection with dramatically reduced computational time.

DetailsMotivation: To address concerns about unauthorized data usage in deep learning, unlearnable examples add perturbations to prevent effective model learning. However, existing methods are computationally expensive because they use deep neural networks as surrogate models for perturbation generation.

Method: PIL (Perturbation-Induced Linearization) generates perturbations using only linear surrogate models instead of deep neural networks. The method reveals that unlearnable examples work by inducing linearization in deep models, allowing efficient perturbation generation.
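A hedged sketch of the idea: alternate updates of a linear classifier and of bounded, error-minimizing perturbations, so that the perturbed data becomes trivially separable and thus uninformative. The step counts and budget `eps` are illustrative, and the authors' exact procedure may differ.

```python
import torch

def pil_perturbations(x, y, num_classes, eps=8 / 255, steps=20, lr=0.1):
    """Error-minimizing ('unlearnable') perturbations generated with only a
    linear surrogate: both the classifier w and the perturbation delta
    minimize the loss. A sketch of the PIL idea, not the exact algorithm."""
    n, d = x.shape
    w = torch.zeros(d, num_classes, requires_grad=True)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.SGD([w, delta], lr=lr)
    for _ in range(steps):
        logits = (x + delta) @ w
        loss = torch.nn.functional.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation imperceptible
    return delta.detach()
```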

Result: PIL achieves comparable or better performance than existing surrogate-based methods while dramatically reducing computational time. The method also provides analysis of unlearnable examples under percentage-based partial perturbation.

Conclusion: PIL offers a practical, computationally efficient approach for data protection through unlearnable examples, while also providing insights into the underlying mechanism of how these examples work by inducing linearization in deep models.

Abstract: Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization in deep models, which explains why PIL can achieve competitive results in a very short time. Beyond this, we provide an analysis of the properties of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.

[335] BayPrAnoMeta: Bayesian Proto-MAML for Few-Shot Industrial Image Anomaly Detection

Soham Sarkar, Tanmay Sen, Sayantan Banerjee

Main category: cs.LG

TL;DR: BayPrAnoMeta: Bayesian Proto-MAML for few-shot industrial anomaly detection using probabilistic normality models and Bayesian posterior predictive likelihood, with federated meta-learning extension.

DetailsMotivation: Industrial image anomaly detection faces extreme class imbalance and scarcity of labeled defective samples, especially in few-shot settings. Existing Proto-MAML approaches use deterministic prototypes and distance-based adaptation, which may not handle uncertainty well in extreme few-shot scenarios.

Method: Bayesian generalization of Proto-MAML that replaces prototypes with task-specific probabilistic normality models. Uses Normal-Inverse-Wishart (NIW) prior on normal support embeddings, producing Student-t predictive distribution for uncertainty-aware anomaly scoring. Extended to federated meta-learning with supervised contrastive regularization for heterogeneous industrial clients.
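
The NIW-to-Student-t predictive is standard conjugate machinery and can be written directly; the sketch below scores a query embedding by its negative log-density under the posterior predictive fit to the normal support set. The hyperparameter defaults (`kappa0`, `nu0`, `psi0`) are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.stats import multivariate_t

def niw_predictive_score(support, query, mu0=None, kappa0=1.0,
                         nu0=None, psi0=None):
    """Anomaly score = negative log-density of `query` under the Student-t
    posterior predictive of a Normal-Inverse-Wishart model fit to normal
    support embeddings (standard conjugate-update formulas)."""
    n, d = support.shape
    mu0 = np.zeros(d) if mu0 is None else mu0
    nu0 = d + 2 if nu0 is None else nu0
    psi0 = np.eye(d) if psi0 is None else psi0

    xbar = support.mean(axis=0)
    S = (support - xbar).T @ (support - xbar)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    psi_n = psi0 + S + (kappa0 * n / kappa_n) * np.outer(xbar - mu0, xbar - mu0)

    df = nu_n - d + 1                                   # heavy-tailed predictive
    shape = psi_n * (kappa_n + 1) / (kappa_n * df)
    return -multivariate_t.logpdf(query, loc=mu_n, shape=shape, df=df)

rng = np.random.default_rng(0)
normal_support = rng.normal(size=(10, 4))               # few-shot normal samples
print(niw_predictive_score(normal_support, rng.normal(size=4) + 5.0))
```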

Result: Experiments on MVTec AD benchmark show consistent and significant AUROC improvements over MAML, Proto-MAML, and PatchCore-based methods in few-shot anomaly detection settings.

Conclusion: BayPrAnoMeta provides an effective Bayesian approach for few-shot industrial anomaly detection with uncertainty-aware scoring, robustness in extreme few-shot settings, and federated learning capability for heterogeneous clients.

Abstract: Industrial image anomaly detection is a challenging problem owing to extreme class imbalance and the scarcity of labeled defective samples, particularly in few-shot settings. We propose BayPrAnoMeta, a Bayesian generalization of Proto-MAML for few-shot industrial image anomaly detection. Unlike existing Proto-MAML approaches that rely on deterministic class prototypes and distance-based adaptation, BayPrAnoMeta replaces prototypes with task-specific probabilistic normality models and performs inner-loop adaptation via a Bayesian posterior predictive likelihood. We model normal support embeddings with a Normal-Inverse-Wishart (NIW) prior, producing a Student-$t$ predictive distribution that enables uncertainty-aware, heavy-tailed anomaly scoring and is essential for robustness in extreme few-shot settings. We further extend BayPrAnoMeta to a federated meta-learning framework with supervised contrastive regularization for heterogeneous industrial clients and prove convergence to stationary points of the resulting nonconvex objective. Experiments on the MVTec AD benchmark demonstrate consistent and significant AUROC improvements over MAML, Proto-MAML, and PatchCore-based methods in few-shot anomaly detection settings.

[336] Decomposing multimodal embedding spaces with group-sparse autoencoders

Chiraag Kaushik, Davis Barch, Andrea Fanelli

Main category: cs.LG

TL;DR: The paper proposes a new SAE-based method for multimodal embedding decomposition that addresses the “split dictionary” problem through cross-modal random masking and group-sparse regularization, improving modality alignment and feature interpretability.

DetailsMotivation: Standard sparse autoencoders (SAEs) applied to multimodal embeddings (like CLIP) often learn "split dictionaries" where features are unimodal rather than multimodal, limiting cross-modal alignment and interpretability.

Method: Proposes a new SAE approach with cross-modal random masking and group-sparse regularization to encourage multimodal feature learning. Applied to CLIP (image/text) and CLAP (audio/text) embeddings.
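
A minimal sketch of the two ingredients, assuming the grouping ties each dictionary atom's activations across modalities: an L2,1 penalty that pushes an atom to be either shared or silent, and a masking step that forces reconstruction through the other modality's codes. The paper's exact grouping and masking schedule may differ.

```python
import torch

def group_sparse_penalty(z_img, z_txt):
    """L2,1 group-sparse regularizer: for each dictionary atom j, penalize
    the L2 norm of its activations pooled across both modalities, so an atom
    is either used multimodally or driven to zero (no 'split dictionary')."""
    # z_*: (batch, num_atoms) sparse codes from the shared SAE.
    grouped = torch.stack([z_img, z_txt], dim=0)        # (2, batch, atoms)
    return grouped.pow(2).sum(dim=(0, 1)).sqrt().sum()  # sum_j ||group_j||_2

def cross_modal_random_mask(z_img, z_txt):
    """Randomly drop one modality's code for the batch, so reconstruction
    must rely on atoms shared with the other modality."""
    if torch.rand(()) < 0.5:
        return torch.zeros_like(z_img), z_txt
    return z_img, torch.zeros_like(z_txt)
```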

Result: The method learns more multimodal dictionaries, reduces dead neurons, improves feature semanticity, and enhances alignment of concepts between modalities compared to standard SAEs.

Conclusion: The improved multimodal alignment enables better interpretability and control of cross-modal tasks, advancing the application of SAEs to multimodal embedding spaces.

Abstract: The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn “split dictionaries”, where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.

[337] CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition

Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu

Main category: cs.LG

TL;DR: CTC-DRO improves multilingual ASR by addressing group DRO’s limitations with smoothed group weight updates and input length-matched batching to handle CTC loss scaling issues.

DetailsMotivation: Group DRO fails when group losses misrepresent performance differences, especially in speech where CTC loss scales with input length and varies with linguistic/acoustic properties, creating spurious group loss differences.

Method: CTC-DRO smooths group weight updates to prevent overemphasis on consistently high-loss groups, and uses input length-matched batching to mitigate CTC’s scaling issues with input length.
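
The smoothed update can be sketched as an exponentiated-gradient step pulled toward the uniform distribution; the interpolation form and coefficient `alpha` below are assumptions standing in for the paper's exact rule.

```python
import numpy as np

def ctc_dro_update(weights, group_losses, eta=0.1, alpha=0.9):
    """Smoothed group-DRO weight update: the usual exponentiated-gradient
    step, interpolated toward uniform so groups with persistently large
    (length-inflated) CTC losses are not overemphasized. A sketch only."""
    w = weights * np.exp(eta * np.asarray(group_losses))
    w /= w.sum()
    uniform = np.full_like(w, 1.0 / len(w))
    return alpha * w + (1 - alpha) * uniform

w = np.full(5, 0.2)  # one weight per language group
w = ctc_dro_update(w, group_losses=[3.1, 2.2, 2.0, 4.5, 1.8])
```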

Result: CTC-DRO consistently outperforms group DRO and CTC-based baselines on multilingual ASR across five language sets from ML-SUPERB 2.0, reducing worst-language error by up to 47.1% and average error by up to 32.9%.

Conclusion: CTC-DRO effectively addresses group DRO’s shortcomings for multilingual ASR with minimal computational costs, and has potential for reducing group disparities in other domains with similar challenges.

Abstract: Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss not only scales with input length but also varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC’s scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the diverse ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and, while motivated by multilingual ASR, offers the potential for reducing group disparities in other domains with similar challenges.

[338] Structural Compositional Function Networks: Interpretable Functional Compositions for Tabular Discovery

Fang Li

Main category: cs.LG

TL;DR: StructuralCFN is a novel neural architecture for tabular data that models features as mathematical compositions of each other with differentiable structural priors, achieving better performance than gradient-boosted trees while providing human-readable mathematical interpretability.

DetailsMotivation: Traditional deep learning struggles with tabular data compared to gradient-boosted decision trees, and standard neural networks fail to exploit the inherent structural dependencies in tabular distributions while maintaining scientific interpretability.

Method: Proposes Structural Compositional Function Networks (StructuralCFN) with Relation-Aware Inductive Bias via differentiable structural prior. Models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which discovers optimal activation physics (attention-style filtering vs. inhibitory polarity) for each relationship. Enables Structured Knowledge Integration for domain-specific relational priors.

Result: Evaluated across 18 benchmarks with 10-fold cross-validation, showing statistically significant improvements (p < 0.05) on scientific/clinical datasets (Blood Transfusion, Ozone, WDBC). Provides Intrinsic Symbolic Interpretability by recovering governing “laws” as human-readable mathematical expressions. Maintains compact parameter footprint (300-2,500 parameters) that is 10x-20x smaller than standard deep baselines.

Conclusion: StructuralCFN successfully addresses the limitations of traditional deep learning on tabular data by incorporating structural priors, achieving superior performance while providing interpretable mathematical representations of data relationships.

Abstract: Despite the ubiquity of tabular data in high-stakes domains, traditional deep learning architectures often struggle to match the performance of gradient-boosted decision trees while maintaining scientific interpretability. Standard neural networks typically treat features as independent entities, failing to exploit the inherent manifold structural dependencies that define tabular distributions. We propose Structural Compositional Function Networks (StructuralCFN), a novel architecture that imposes a Relation-Aware Inductive Bias via a differentiable structural prior. StructuralCFN explicitly models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which automatically discovers the optimal activation physics (e.g., attention-style filtering vs. inhibitory polarity) for each relationship. Our framework enables Structured Knowledge Integration, allowing domain-specific relational priors to be injected directly into the architecture to guide discovery. We evaluate StructuralCFN across a rigorous 10-fold cross-validation suite on 18 benchmarks, demonstrating statistically significant improvements (p < 0.05) on scientific and clinical datasets (e.g., Blood Transfusion, Ozone, WDBC). Furthermore, StructuralCFN provides Intrinsic Symbolic Interpretability: it recovers the governing “laws” of the data manifold as human-readable mathematical expressions while maintaining a compact parameter footprint (300–2,500 parameters) that is over an order of magnitude (10x–20x) smaller than standard deep baselines.

[339] CiMRAG: Cim-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs

Shih-Hsuan Chiu, Ming-Syan Chen

Main category: cs.LG

TL;DR: TONEL framework improves noise robustness and domain adaptability for RAG on edge devices with CiM architectures by learning task-specific embeddings resilient to environmental noise.

DetailsMotivation: Deploying RAG on edge devices faces efficiency challenges due to growing profile data, while CiM architectures that solve data movement bottlenecks are vulnerable to environmental noise degrading retrieval precision, especially in dynamic multi-domain edge scenarios requiring both accuracy and adaptability.

Method: TONEL (Task-Oriented Noise-resilient Embedding Learning) employs a noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints, enabling accurate retrieval under noisy conditions.

Result: Extensive experiments on personalization benchmarks demonstrate effectiveness and practicality relative to strong baselines, especially in task-specific noisy scenarios.

Conclusion: TONEL addresses critical challenges of noise robustness and domain adaptability for RAG deployment on edge devices with CiM architectures, making personalized virtual assistants more practical in real-world noisy edge environments.

Abstract: Personalized virtual assistants powered by large language models (LLMs) on edge devices are attracting growing attention, with Retrieval-Augmented Generation (RAG) emerging as a key method for personalization by retrieving relevant profile data and generating tailored responses. However, deploying RAG on edge devices faces efficiency hurdles due to the rapid growth of profile data, such as user-LLM interactions and recent updates. While Computing-in-Memory (CiM) architectures mitigate this bottleneck by eliminating data movement between memory and processing units via in-situ operations, they are susceptible to environmental noise that can degrade retrieval precision. This poses a critical issue in dynamic, multi-domain edge-based scenarios (e.g., travel, medicine, and law) where both accuracy and adaptability are paramount. To address these challenges, we propose Task-Oriented Noise-resilient Embedding Learning (TONEL), a framework that improves noise robustness and domain adaptability for RAG in noisy edge environments. TONEL employs a noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints, enabling accurate retrieval under noisy conditions. Extensive experiments conducted on personalization benchmarks demonstrate the effectiveness and practicality of our methods relative to strong baselines, especially in task-specific noisy scenarios.

[340] Regime-Adaptive Bayesian Optimization via Dirichlet Process Mixtures of Gaussian Processes

Yan Zhang, Xuefeng Liu, Sipeng Chen, Sascha Ranftl, Chong Liu, Shibo Li

Main category: cs.LG

TL;DR: RAMBO: Dirichlet Process Mixture of GPs for Bayesian Optimization on multi-regime problems, automatically discovering latent regimes with adaptive inference and specialized acquisition functions.

DetailsMotivation: Standard BO assumes uniform smoothness, which fails in multi-regime problems like molecular conformation search (distinct energy basins) and drug discovery (heterogeneous scaffolds). Single GP either oversmooths sharp transitions or hallucinates noise, leading to miscalibrated uncertainty.

Method: RAMBO uses Dirichlet Process Mixture of Gaussian Processes to automatically discover latent regimes during optimization. Each regime gets independent GP with locally-optimized hyperparameters. Uses collapsed Gibbs sampling for efficient inference (analytically marginalizes latent functions) and adaptive concentration parameter scheduling for coarse-to-fine regime discovery. Acquisition functions decompose uncertainty into intra-regime and inter-regime components.
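
The intra-/inter-regime decomposition follows the law of total variance for a mixture of per-regime GP predictions; a minimal sketch with toy responsibilities and per-regime moments is below.

```python
import numpy as np

def decompose_uncertainty(resp, means, variances):
    """Split the predictive variance of a mixture of per-regime GPs at one
    candidate point via the law of total variance. `resp[k]` is the regime
    responsibility, `means[k]`/`variances[k]` the k-th GP's predictive
    mean/variance (toy inputs, not RAMBO's inference output)."""
    resp, means, variances = map(np.asarray, (resp, means, variances))
    mixture_mean = np.sum(resp * means)
    intra = np.sum(resp * variances)                    # within-regime uncertainty
    inter = np.sum(resp * (means - mixture_mean) ** 2)  # disagreement across regimes
    return intra, inter

intra, inter = decompose_uncertainty([0.7, 0.3], [1.0, 3.0], [0.2, 0.5])
```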

Result: Experiments on synthetic benchmarks and real-world applications (molecular conformer optimization, virtual screening for drug discovery, fusion reactor design) show consistent improvements over state-of-the-art baselines on multi-regime objectives.

Conclusion: RAMBO effectively handles multi-regime optimization problems by automatically discovering and modeling distinct regimes, outperforming standard BO methods that assume uniform smoothness across the search space.

Abstract: Standard Bayesian Optimization (BO) assumes uniform smoothness across the search space, an assumption violated in multi-regime problems such as molecular conformation search through distinct energy basins or drug discovery across heterogeneous molecular scaffolds. A single GP either oversmooths sharp transitions or hallucinates noise in smooth regions, yielding miscalibrated uncertainty. We propose RAMBO, a Dirichlet Process Mixture of Gaussian Processes that automatically discovers latent regimes during optimization, each modeled by an independent GP with locally-optimized hyperparameters. We derive collapsed Gibbs sampling that analytically marginalizes latent functions for efficient inference, and introduce adaptive concentration parameter scheduling for coarse-to-fine regime discovery. Our acquisition functions decompose uncertainty into intra-regime and inter-regime components. Experiments on synthetic benchmarks and real-world applications, including molecular conformer optimization, virtual screening for drug discovery, and fusion reactor design, demonstrate consistent improvements over state-of-the-art baselines on multi-regime objectives.

[341] Externally Validated Longitudinal GRU Model for Visit-Level 180-Day Mortality Risk in Metastatic Castration-Resistant Prostate Cancer

Javier Mencia-Ledo, Mohammad Noaeen, Zahra Shakeri

Main category: cs.LG

TL;DR: Researchers developed and validated a 180-day mortality risk model for metastatic castration-resistant prostate cancer using longitudinal clinical data, with GRU and Random Survival Forest models showing best performance.

DetailsMotivation: Metastatic castration-resistant prostate cancer (mCRPC) has poor prognosis and heterogeneous treatment response, creating a need for accurate short-term mortality prediction to enable proactive care planning.

Method: Used longitudinal data from two Phase III cohorts (n=526 and n=640), comparing five architectures: LSTM, GRU, Cox PH, Random Survival Forest, and Logistic Regression. Selected smallest risk-threshold achieving 85% sensitivity floor for each dataset.
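
A visit-level GRU risk model of this kind is a short PyTorch module; the sketch below emits one risk score per visit. The hidden size, depth, and feature count are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VisitLevelGRU(nn.Module):
    """Visit-level 180-day mortality risk from a sequence of clinical visits.
    Dimensions and the single-layer design are illustrative only."""
    def __init__(self, num_features, hidden=64):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, visits):
        # visits: (batch, num_visits, num_features) -> one risk per visit.
        h, _ = self.gru(visits)
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = VisitLevelGRU(num_features=20)
risk = model(torch.randn(8, 12, 20))  # 8 patients, 12 visits each
```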

Result: GRU and RSF showed high initial discrimination (C-index: 87% for both). In external validation, the GRU achieved better calibration (slope: 0.93; intercept: 0.07) and a PR-AUC of 0.87. Clinical impact analysis showed a median time-in-warning of 151.0 days for true positives. BMI and systolic blood pressure were the strongest predictors.

Conclusion: Longitudinal routine clinical markers can effectively estimate short-horizon mortality risk in mCRPC, supporting proactive care planning over multi-month windows.

Abstract: Metastatic castration-resistant prostate cancer (mCRPC) is a highly aggressive disease with poor prognosis and heterogeneous treatment response. In this work, we developed and externally validated a visit-level 180-day mortality risk model using longitudinal data from two Phase III cohorts (n=526 and n=640). Only visits with observable 180-day outcomes were labeled; right-censored cases were excluded from analysis. We compared five candidate architectures: Long Short-Term Memory, Gated Recurrent Unit (GRU), Cox Proportional Hazards, Random Survival Forest (RSF), and Logistic Regression. For each dataset, we selected the smallest risk-threshold that achieved an 85% sensitivity floor. The GRU and RSF models showed high discrimination capabilities initially (C-index: 87% for both). In external validation, the GRU showed better calibration (slope: 0.93; intercept: 0.07) and achieved a PR-AUC of 0.87. Clinical impact analysis showed a median time-in-warning of 151.0 days for true positives (59.0 days for false positives) and 18.3 alerts per 100 patient-visits. Given late-stage frailty or cachexia and hemodynamic instability, permutation importance ranked BMI and systolic blood pressure as the strongest associations. These results suggest that longitudinal routine clinical markers can estimate short-horizon mortality risk in mCRPC and support proactive care planning over a multi-month window.

[342] Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning

Chi-Yao Huang, Khoa Vo, Aayush Atul Verma, Duo Lu, Yezhou Yang

Main category: cs.LG

TL;DR: Domain Expansion framework prevents latent representation collapse in multi-objective learning by using orthogonal pooling to create mutually orthogonal subspaces for different tasks.

DetailsMotivation: Multi-objective training often leads to conflicting gradients that degrade shared representations, forcing them into a compromised suboptimal state called latent representation collapse.

Method: Introduces Domain Expansion framework with orthogonal pooling mechanism to construct latent space where each objective is assigned to mutually orthogonal subspaces.
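
One plausible way to realize mutually orthogonal per-task subspaces is to slice the orthonormal columns of a QR factorization; the sketch below uses a fixed random basis, whereas the paper's orthogonal pooling is a learned mechanism.

```python
import torch

def make_orthogonal_subspaces(dim, num_tasks, seed=0):
    """Split a latent space into mutually orthogonal per-task subspaces by
    QR-factorizing a random matrix and slicing its orthonormal columns.
    One plausible reading of 'orthogonal pooling'; illustrative only."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    cols = torch.chunk(torch.arange(dim), num_tasks)
    return [q[:, c] for c in cols]  # bases U_k with U_i^T U_j = 0 for i != j

bases = make_orthogonal_subspaces(dim=128, num_tasks=3)
z = torch.randn(16, 128)
z_task0 = z @ bases[0] @ bases[0].T  # projection onto task-0 subspace
```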

Result: Validated across diverse benchmarks (ShapeNet, MPIIGaze, Rotated MNIST) on multi-objective problems combining classification with pose and gaze estimation. Prevents collapse and yields interpretable compositional latent space.

Conclusion: The orthogonal subspace structure prevents representation collapse and creates an explicit, interpretable latent space where concepts can be directly manipulated.

Abstract: Training a single network with multiple objectives often leads to conflicting gradients that degrade shared representations, forcing them into a compromised state that is suboptimal for any single task, a problem we term latent representation collapse. We introduce Domain Expansion, a framework that prevents these conflicts by restructuring the latent space itself. Our framework uses a novel orthogonal pooling mechanism to construct a latent space where each objective is assigned to a mutually orthogonal subspace. We validate our approach across diverse benchmarks, including ShapeNet, MPIIGaze, and Rotated MNIST, on challenging multi-objective problems combining classification with pose and gaze estimation. Our experiments demonstrate that this structure not only prevents collapse but also yields an explicit, interpretable, and compositional latent space where concepts can be directly manipulated.

[343] Distributional value gradients for stochastic environments

Baptiste Debes, Tinne Tuytelaars

Main category: cs.LG

TL;DR: Distributional Sobolev Training extends distributional RL to model both value functions and their gradients, improving performance in stochastic environments using a cVAE world model and MSMMD distributional Bellman operator.

DetailsMotivation: Existing gradient-regularized value learning methods like MAGE struggle in stochastic or noisy environments, limiting their applicability. The paper aims to address these limitations by extending distributional RL to model not just value functions but also their gradients.

Method: Proposes Distributional Sobolev Training that models distributions over both scalar state-action value functions and their gradients. Uses a one-step world model of reward and transition distributions via conditional Variational Autoencoder (cVAE). The framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator.

Result: Proves that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point. Demonstrates effectiveness on a simple stochastic RL toy problem and benchmarks performance on several MuJoCo environments.

Conclusion: Distributional Sobolev Training successfully addresses limitations of existing gradient-regularized methods in stochastic environments by modeling both value distributions and their gradients, with theoretical guarantees and empirical validation on benchmark tasks.

Abstract: Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.

[344] Techno-economic optimization of a heat-pipe microreactor, part II: multi-objective optimization analysis

Paul Seurin, Dean Price

Main category: cs.LG

TL;DR: Multi-objective optimization of heat-pipe microreactors using PEARL algorithm to simultaneously minimize power peaking factor and levelized cost of electricity across different cost scenarios.

DetailsMotivation: Extend previous single-objective optimization framework to address both economic (LCOE) and safety/performance (power peaking factor) considerations for heat-pipe microreactors, which are important for remote deployment applications.

Method: Used Pareto Envelope Augmented with Reinforcement Learning (PEARL) algorithm for multi-objective optimization, incorporating surrogate modeling and evaluating three different cost scenarios for reactor components.

Result: Identified consistent design strategies: reduce solid moderator radius, pin pitch, and drum coating angle while increasing fuel height to lower power peaking factor; minimize axial reflector contribution, reduce control drum reliance, substitute expensive TRISO fuel with cheaper graphite, and maximize fuel burnup to optimize LCOE.

Conclusion: PEARL shows promise for navigating design trade-offs, but discrepancies between surrogate models and full simulations remain. Future work includes constraint relaxation and improved surrogate development.

Abstract: Heat-pipe microreactors (HPMRs) are compact and transportable nuclear power systems exhibiting inherent safety, well-suited for deployment in remote regions where access is limited and reliance on costly fossil fuels is prevalent. In prior work, we developed a design optimization framework that incorporates techno-economic considerations through surrogate modeling and reinforcement learning (RL)-based optimization, focusing solely on minimizing the levelized cost of electricity (LCOE) by using a bottom-up cost estimation approach. In this study, we extend that framework to a multi-objective optimization that uses the Pareto Envelope Augmented with Reinforcement Learning (PEARL) algorithm. The objectives include minimizing both the rod-integrated peaking factor ($F_{Δh}$) and LCOE, subject to safety and operational constraints. We evaluate three cost scenarios: (1) high-cost axial and drum reflectors, (2) a low-cost axial reflector, and (3) low-cost axial and drum reflectors. Our findings indicate that reducing the solid moderator radius, pin pitch, and drum coating angle, all while increasing the fuel height, effectively lowers $F_{Δh}$. Across all three scenarios, four key strategies consistently emerged for optimizing LCOE: (1) minimizing the axial reflector contribution when costly, (2) reducing control drum reliance, (3) substituting expensive tri-structural isotropic (TRISO) fuel with axial reflector material priced at the level of graphite, and (4) maximizing fuel burnup. While PEARL demonstrates promise in navigating trade-offs across diverse design scenarios, discrepancies between surrogate model predictions and full-order simulations remain. Further improvements are anticipated through constraint relaxation and surrogate development, constituting an ongoing area of investigation.

[345] Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao

Main category: cs.LG

TL;DR: QAD (quantization-aware distillation) effectively recovers accuracy for NVFP4-quantized LLMs/VLMs using KL divergence loss from full-precision teacher to quantized student, outperforming traditional QAT in multi-stage post-trained models.

DetailsMotivation: Traditional quantization-aware training (QAT) suffers from engineering complexity and training instability when applied to modern LLMs/VLMs trained through multi-stage post-training pipelines (SFT, RL, model merging). QAD provides a more stable and effective alternative that works with limited data.

Method: Quantization-aware distillation (QAD) uses KL divergence loss to distill knowledge from a full-precision teacher model into a quantized student model. The approach is designed to be robust to data quality and coverage issues.
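
The distillation objective itself is a one-liner: the KL divergence from the full-precision teacher's next-token distribution to the quantized student's. The temperature and reduction choices in the sketch are illustrative.

```python
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on next-token distributions: the quantized
    student is trained to match its own full-precision checkpoint.
    Temperature and reduction choices are illustrative assumptions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```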

Result: QAD consistently recovers near-BF16 accuracy across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1.

Conclusion: QAD is an effective and stable alternative to traditional QAT for recovering accuracy in quantized LLMs/VLMs, particularly for models trained through complex multi-stage post-training pipelines, and works well even with limited training data.

Abstract: This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today’s LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.

[346] In-Context Reinforcement Learning From Suboptimal Historical Data

Juncheng Dong, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh

Main category: cs.LG

TL;DR: DIT is a transformer-based framework for in-context RL that improves over imitation learning when trained on suboptimal offline data by emulating actor-critic methods through importance-weighted training.

DetailsMotivation: Transformers excel at in-context learning, but standard autoregressive training on suboptimal RL trajectories leads to imitation learning and suboptimal performance. Need a method that can learn optimal policies from imperfect historical data.

Method: Two-stage approach: 1) Train transformer-based value function to estimate advantage functions of behavior policies, 2) Train transformer-based policy using weighted maximum likelihood estimation where weights are based on the value function to improve suboptimal policies.
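
Stage 2's weighted maximum likelihood can be sketched as an advantage-weighted imitation loss; the exponential weighting and clipping below follow the common AWR-style form and are assumptions about DIT's exact weights.

```python
import torch

def dit_policy_loss(log_probs, advantages, beta=1.0):
    """Advantage-weighted maximum likelihood: behavior actions with higher
    estimated advantage get exponentially larger imitation weight, steering
    the policy away from the suboptimal data distribution. The exponential
    form is the standard AWR-style choice, assumed here for illustration."""
    weights = torch.exp(advantages / beta).clamp(max=20.0)  # clip for stability
    return -(weights.detach() * log_probs).mean()
```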

Result: DIT achieves superior performance on both bandit and MDP problems, especially when offline dataset contains suboptimal historical data, outperforming standard imitation learning approaches.

Conclusion: The Decision Importance Transformer effectively adapts actor-critic RL to in-context learning, enabling transformers to learn optimal policies from suboptimal offline data through importance-weighted training.

Abstract: Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.

[347] A Reinforcement Learning Based Universal Sequence Design for Polar Codes

David Kin Wai Ho, Arman Fazeli, Mohamad M. Mansour, Louay M. A. Jalloul

Main category: cs.LG

TL;DR: A reinforcement learning framework for designing universal Polar code sequences that scales to 2048-bit lengths and achieves competitive performance with 5G NR sequences.

DetailsMotivation: To advance Polar code design for 6G applications by creating an extensible and adaptable framework that works across diverse channel conditions and decoding strategies, while scaling to practical code lengths suitable for standardization.

Method: Reinforcement learning-based universal sequence design framework with three key scaling elements: (i) physical law constrained learning using universal partial order property, (ii) limited lookahead evaluation exploiting weak long-term influence of decisions, and (iii) joint multi-configuration optimization for learning efficiency.

Result: Achieves competitive performance relative to 5G NR sequences across all (N,K) configurations, with up to 0.2 dB gain over beta-expansion baseline at N=2048. Successfully scales to code lengths up to 2048 bits.

Conclusion: The proposed reinforcement learning framework provides a scalable, adaptable solution for Polar code sequence design that meets 6G requirements and standardization needs, with key innovations enabling learning at practical code lengths.

Abstract: To advance Polar code design for 6G applications, we develop a reinforcement learning-based universal sequence design framework that is extensible and adaptable to diverse channel conditions and decoding strategies. Crucially, our method scales to code lengths up to $2048$, making it suitable for use in standardization. Across all $(N,K)$ configurations supported in 5G, our approach achieves competitive performance relative to the NR sequence adopted in 5G and yields up to a 0.2 dB gain over the beta-expansion baseline at $N=2048$. We further highlight the key elements that enabled learning at scale: (i) incorporation of physical law constrained learning grounded in the universal partial order property of Polar codes, (ii) exploitation of the weak long term influence of decisions to limit lookahead evaluation, and (iii) joint multi-configuration optimization to increase learning efficiency.

[348] Going NUTS with ADVI: Exploring various Bayesian Inference techniques with Facebook Prophet

Jovan Krajevski, Biljana Tojtovska Ribarski

Main category: cs.LG

TL;DR: The paper presents a PyMC-based reimplementation of Facebook Prophet to enable more flexible Bayesian inference methods beyond Prophet’s built-in options.

DetailsMotivation: The authors encountered limitations with Facebook Prophet's built-in inference methods (only MAP with L-BFGS-B and MCMC with NUTS) and insufficient API flexibility for implementing custom modeling ideas.

Method: Complete reimplementation of the Prophet model in PyMC, enabling extension of the base model and evaluation/comparison of multiple Bayesian inference methods including full MCMC, MAP estimation, and Variational inference.
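
To give a flavor of what the reimplementation enables, here is a minimal toy example, a linear-trend fragment rather than the authors' full Prophet port, fit with both NUTS and ADVI via the standard PyMC API.

```python
import numpy as np
import pymc as pm

t = np.linspace(0, 1, 200)
y = 2.5 * t + 0.3 * np.random.default_rng(0).standard_normal(200)

with pm.Model() as trend_model:  # a toy fragment of a Prophet-like model
    k = pm.Normal("k", 0, 5)          # trend growth rate
    m = pm.Normal("m", 0, 5)          # offset
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", mu=k * t + m, sigma=sigma, observed=y)

    trace_nuts = pm.sample(1000, tune=1000, chains=2)  # full MCMC via NUTS
    approx = pm.fit(20_000, method="advi")             # variational inference
    trace_advi = approx.sample(1000)
```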

Result: The paper analyzes different Bayesian inference techniques on time-series forecasting problems, discussing sampling approaches, convergence diagnostics, forecasting metrics, computational efficiency, and identifies issues for future work.

Conclusion: The PyMC-based implementation successfully addresses Prophet’s limitations by providing more flexible Bayesian inference options and extensible modeling capabilities, though some issues remain for future work.

Abstract: Since its introduction, Facebook Prophet has attracted positive attention from both classical statisticians and the Bayesian statistics community. The model provides two built-in inference methods: maximum a posteriori estimation using the L-BFGS-B algorithm, and Markov Chain Monte Carlo (MCMC) sampling via the No-U-Turn Sampler (NUTS). While exploring various time-series forecasting problems using Bayesian inference with Prophet, we encountered limitations stemming from the inability to apply alternative inference techniques beyond those provided by default. Additionally, the fluent API design of Facebook Prophet proved insufficiently flexible for implementing our custom modeling ideas. To address these shortcomings, we developed a complete reimplementation of the Prophet model in PyMC, which enables us to extend the base model and to evaluate and compare multiple Bayesian inference methods. In this paper, we present our PyMC-based implementation and analyze in detail the implementation of different Bayesian inference techniques. We consider full MCMC techniques, MAP estimation, and variational inference techniques on a time-series forecasting problem. We discuss in detail the sampling approach, convergence diagnostics, and forecasting metrics, as well as their computational efficiency, and identify possible issues that will be addressed in our future work.

[349] Membership Inference Attacks Against Fine-tuned Diffusion Language Models

Yuetian Chen, Kaiyuan Zhang, Yuntao Du, Edoardo Stoppa, Charles Fleming, Ashish Kundu, Bruno Ribeiro, Ninghui Li

Main category: cs.LG

TL;DR: First systematic investigation of Membership Inference Attack vulnerabilities in Diffusion Language Models, introducing SAMA attack that achieves 30% AUC improvement over baselines by exploiting DLMs’ multiple maskable configurations.

DetailsMotivation: Diffusion Language Models (DLMs) are promising alternatives to autoregressive models but their privacy vulnerabilities via Membership Inference Attacks remain critically underexplored. Unlike autoregressive models with single fixed prediction patterns, DLMs' multiple maskable configurations create exponentially more attack opportunities that need systematic investigation.

Method: Introduces SAMA (Subset-Aggregated Membership Attack) which addresses sparse signal challenge through robust aggregation. Samples masked subsets across progressive densities and applies sign-based statistics effective despite heavy-tailed noise. Uses inverse-weighted aggregation prioritizing sparse masks’ cleaner signals to transform sparse memorization detection into robust voting mechanism.
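
The aggregation step can be sketched compactly: per-mask membership signals are reduced to signs and combined with inverse-density weights so sparser masks count more. How the per-mask scores are calibrated is an assumption here; see the paper for the exact statistic.

```python
import numpy as np

def sama_score(per_mask_scores, mask_densities):
    """Aggregate per-mask membership signals with a sign statistic (robust
    to heavy-tailed noise) and inverse-density weights (sparser masks get
    larger weight). The calibration of `per_mask_scores` and this exact
    weighting are illustrative assumptions, not SAMA's published formula."""
    signs = np.sign(np.asarray(per_mask_scores))
    weights = 1.0 / np.asarray(mask_densities)
    return float(np.sum(weights * signs) / np.sum(weights))

densities = [0.1, 0.1, 0.3, 0.5, 0.7]   # fraction of tokens masked per subset
scores = [0.8, 0.5, 0.2, -0.1, 0.05]    # per-mask membership signal
print(sama_score(scores, densities))     # > 0 suggests "member"
```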

Result: Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over best baseline, with up to 8 times improvement at low false positive rates. Reveals significant, previously unknown vulnerabilities in DLMs.

Conclusion: Findings reveal significant privacy vulnerabilities in Diffusion Language Models that were previously unknown, necessitating development of tailored privacy defenses for DLMs.

Abstract: Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models’ single fixed prediction pattern, DLMs’ multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks’ cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8 times improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.

[350] Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

Bo Dai, Na Li, Dale Schuurmans

Main category: cs.LG

TL;DR: The paper develops a principled theoretical foundation for self-supervised representation learning, revealing the spectral essence of successful SSL algorithms and providing a unified framework for understanding, analysis, and algorithm design.

DetailsMotivation: The rapid growth of diverse SSL methods lacks unified understanding, theoretical foundation, and clear design principles, hampering development and practical justification of representation learning methods.

Method: Theoretical investigation of representation sufficiency from a spectral representation view, revealing the spectral essence of existing successful SSL algorithms to develop a unified framework.

Result: Develops a principled foundation for representation learning that provides theoretical understanding, reveals spectral essence of SSL algorithms, and creates a unified framework for analysis.

Conclusion: The spectral-based unified framework enables better understanding of SSL, inspires development of more efficient algorithms, and provides principled guidance for real-world applications of representation learning.

Abstract: Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to a variety of downstream tasks with limited data. The significant improvement on diverse applications of representation learning has attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures, but a lack of clear and unified understanding. Such an absence hampers the ongoing development of representation learning, leaving theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency of a unified framework is further underscored by the rapid growth in representation learning methods. In this paper, we therefore develop a principled foundation for representation learning. We first theoretically investigate the sufficiency of the representation from a spectral representation view, which reveals the spectral essence of existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms, designed in a principled way for real-world applications.

[351] PASS: Ambiguity Guided Subsets for Scalable Classical and Quantum Constrained Clustering

Pedro Chumpitaz-Flores, My Duong, Ying Mao, Kaixun Hua

Main category: cs.LG

TL;DR: PASS framework for scalable pairwise-constrained clustering using constraint-aware subset selection with ML constraint collapsing and CL violation detection.

DetailsMotivation: Current pairwise-constrained clustering methods struggle with data scalability, especially in niche applications like quantum or quantum-hybrid clustering, due to the added complexity of ML and CL constraints.

Method: PASS collapses ML constraints into pseudo-points and offers two selection rules: 1) constraint-aware margin rule collecting near-boundary points and CL violations, and 2) information-geometric rule scoring points via Fisher-Rao distance from soft assignment posteriors, selecting highest-information subsets under budget constraints.

Result: PASS achieves competitive SSE at substantially lower cost than exact or penalty-based methods, and remains effective in regimes where prior approaches fail across diverse benchmarks.

Conclusion: PASS enables scalable, high-quality pairwise-constrained clustering while preserving constraint satisfaction through intelligent subset selection strategies.

Abstract: Pairwise-constrained clustering augments unsupervised partitioning with side information by enforcing must-link (ML) and cannot-link (CL) constraints between specific samples, yielding labelings that respect known affinities and separations. However, ML and CL constraints add an extra layer of complexity to the clustering problem, and current methods struggle with data scalability, especially in niche applications like quantum or quantum-hybrid clustering. We propose PASS, a pairwise-constraint and ambiguity-driven subset selection framework that preserves ML and CL constraint satisfaction while allowing scalable, high-quality clustering solutions. PASS collapses ML constraints into pseudo-points and offers two selectors: a constraint-aware margin rule that collects near-boundary points and all detected CL violations, and an information-geometric rule that scores points via a Fisher-Rao distance derived from soft assignment posteriors, then selects the highest-information subset under a simple budget. Across diverse benchmarks, PASS attains competitive SSE at substantially lower cost than exact or penalty-based methods, and remains effective in regimes where prior approaches fail.
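
The first step of PASS, collapsing must-link groups into pseudo-points, is straightforward to sketch. Below is a minimal version (assumptions ours): union-find over ML edges, then one weighted centroid per group.

```python
import numpy as np

def collapse_ml(X, ml_pairs):
    """Collapse must-link groups of X (n x d) into weighted pseudo-points."""
    n = len(X)
    parent = list(range(n))

    def find(i):                               # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in ml_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    pseudo = np.array([X[idx].mean(axis=0) for idx in groups.values()])
    weights = np.array([len(idx) for idx in groups.values()])
    return pseudo, weights                     # each pseudo-point carries its mass
```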

[352] What’s the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda

Main category: cs.LG

TL;DR: Researchers develop simple techniques to detect implicit planning in language models, showing it’s present even in small models (from 1B parameters) and can be manipulated through vector steering.

DetailsMotivation: Previous work suggested language models exhibit implicit planning behavior during next-token prediction, but existing methods were complex. The authors aim to develop simpler, scalable techniques to study this phenomenon across various models.

Method: Proposed simple techniques for assessing implicit planning using case studies on rhyme poetry generation and question answering. Used vector steering at the end of preceding lines to manipulate generated rhymes or answers, affecting intermediate tokens leading up to target words.

Result: Demonstrated that implicit planning is a universal mechanism present in smaller models than previously thought (starting from 1B parameters). Showed that generated rhymes or answers can be manipulated through vector steering, affecting intermediate token generation.

Conclusion: The methodology offers a widely applicable way to study implicit planning in LLMs. Understanding planning abilities can inform AI safety and control decisions, revealing that implicit planning is more widespread than previously recognized.

Abstract: Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. “-ight”) or answer to a question (“whale”) can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable, direct way to study the implicit planning abilities of LLMs. More broadly, understanding the planning abilities of language models can inform decisions in AI safety and control.
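
The steering intervention is easy to reproduce in outline. A minimal PyTorch sketch (ours; the layer to hook, the position, the scale `alpha`, and the steering vector `v` are all assumptions that would be tuned per model):

```python
import torch

def add_steering_hook(block, v, position=-1, alpha=4.0):
    """Add alpha * v to the residual stream at one token position of `block`."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h.clone()
        h[:, position, :] += alpha * v          # e.g., steer toward an "-ight" rhyme
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return block.register_forward_hook(hook)

# Usage sketch: handle = add_steering_hook(model.transformer.h[10], v); generate;
# handle.remove().  The module path is hypothetical and model-dependent.
```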

[353] Local Duality for Sparse Support Vector Machines

Penghe Zhang, Naihua Xiu, Houduo Qi

Main category: cs.LG

TL;DR: This paper develops a local duality theory for sparse support vector machines (SSVMs) using cardinality minimization, establishes their relationship with hinge-loss and ramp-loss SVMs, and shows SSVMs are dual to 0/1-loss SVMs.

DetailsMotivation: Sparse SVMs have shown empirical advantages over convex SVMs but lack theoretical justification when derived by adding cardinality functions to dual SVM problems. The paper aims to fill this theoretical gap.

Method: Develops local duality theory for SSVMs, proves SSVM is exactly the dual problem of 0/1-loss SVM, establishes linear representer theorem for local solutions, and analyzes convergence relationships between hSVM, rSVM, and 0/1-loss SVM solutions.

Result: Proves SSVM is dual to 0/1-loss SVM, shows linear representer theorem holds for local solutions, demonstrates convergence of hSVM global solutions to 0/1-loss SVM local solutions under conditions, and proves 0/1-loss SVM local minimizers are also rSVM local minimizers.

Conclusion: The developed local duality theory provides theoretical justification for SSVMs, explains their empirical superiority over hSVM and rSVM, and offers guidelines for hyperparameter selection. Numerical tests on real datasets demonstrate potential advantages of SSVMs with locally nice solutions.

Abstract: Due to the rise of cardinality minimization in optimization, sparse support vector machines (SSVMs) have attracted much attention lately and show certain empirical advantages over convex SVMs. A common way to derive an SSVM is to add a cardinality function such as $\ell_0$-norm to the dual problem of a convex SVM. However, this process lacks theoretical justification. This paper fills the gap by developing a local duality theory for such an SSVM formulation and exploring its relationship with the hinge-loss SVM (hSVM) and the ramp-loss SVM (rSVM). In particular, we prove that the derived SSVM is exactly the dual problem of the 0/1-loss SVM, and the linear representer theorem holds for their local solutions. The local solution of SSVM also provides guidelines on selecting hyperparameters of hSVM and rSVM. Under specific conditions, we show that a sequence of global solutions of hSVM converges to a local solution of 0/1-loss SVM. Moreover, a local minimizer of 0/1-loss SVM is a local minimizer of rSVM. This explains why a local solution induced by SSVM outperforms hSVM and rSVM in the prior empirical study. We further conduct numerical tests on real datasets and demonstrate potential advantages of SSVM by working with locally nice solutions proposed in this paper.
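
For readers unfamiliar with the construction, here is a schematic of what "adding a cardinality function to the dual of a convex SVM" looks like; the exact placement of the penalty and the notation are our assumptions, not the paper's statement.

```latex
% Schematic SSVM (notation ours): the hinge-loss SVM dual with an added
% cardinality term promoting sparse support vectors.
\begin{aligned}
\max_{\alpha}\;& \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
  - \lambda\, \lVert \alpha \rVert_0 \\
\text{s.t.}\;& 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
\end{aligned}
```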

[354] Loss Landscape Geometry and the Learning of Symmetries: Or, What Influence Functions Reveal About Robust Generalization

James Amarel, Robyn Miller, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Alexei Skurikhin, Earl Lawrence, Gerd J. Kunde

Main category: cs.LG

TL;DR: Researchers developed a diagnostic to measure how neural PDE emulators internalize physical symmetries by analyzing gradient propagation between symmetry-related states, revealing whether training dynamics couple physically equivalent configurations.

DetailsMotivation: Current methods for assessing symmetry in neural emulators mainly rely on forward-pass equivariance tests, which don't reveal whether the learning process itself properly couples symmetry-related states. There's a need for diagnostics that examine how training dynamics internalize physical symmetries in PDE solution operators.

Method: Introduced an influence-based diagnostic that measures propagation of parameter updates between symmetry-related states using metric-weighted overlap of loss gradients evaluated along group orbits. This probes the local geometry of the learned loss landscape and assesses whether learning dynamics couple physically equivalent configurations.

Result: Applied to autoregressive fluid flow emulators, orbit-wise gradient coherence provides the mechanism for learning to generalize over symmetry transformations and indicates when training selects a symmetry-compatible basin. The diagnostic successfully evaluates whether surrogate models have internalized symmetry properties of known solution operators.

Conclusion: The proposed diagnostic offers a novel technique for evaluating symmetry internalization in neural PDE emulators, going beyond traditional equivariance tests by directly examining learning dynamics and gradient coherence along symmetry orbits.

Abstract: We study how neural emulators of partial differential equation solution operators internalize physical symmetries by introducing an influence-based diagnostic that measures the propagation of parameter updates between symmetry-related states, defined as the metric-weighted overlap of loss gradients evaluated along group orbits. This quantity probes the local geometry of the learned loss landscape and goes beyond forward-pass equivariance tests by directly assessing whether learning dynamics couple physically equivalent configurations. Applying our diagnostic to autoregressive fluid flow emulators, we show that orbit-wise gradient coherence provides the mechanism for learning to generalize over symmetry transformations and indicates when training selects a symmetry compatible basin. The result is a novel technique for evaluating if surrogate models have internalized symmetry properties of the known solution operator.
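
The diagnostic itself reduces to comparing gradients at orbit points. A simplified sketch (identity parameter-space metric assumed; the paper uses a metric-weighted inner product):

```python
import torch

def orbit_gradient_coherence(model, loss_fn, x, y, group_transforms):
    """Mean cosine overlap of loss gradients at symmetry-related states.

    Each g in group_transforms acts on both input and target, e.g., a rotation
    applied to a flow-field snapshot and its next-step label.
    """
    grads = []
    for g in group_transforms:
        model.zero_grad()
        loss_fn(model(g(x)), g(y)).backward()
        grads.append(torch.cat(
            [p.grad.detach().flatten().clone() for p in model.parameters()]))
    G = torch.stack(grads)
    G = G / G.norm(dim=1, keepdim=True)
    overlap = G @ G.T                           # pairwise cosine similarities
    return overlap.mean()                       # high value: updates couple the orbit
```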

[355] MAPLE: Self-supervised Learning-Enhanced Nonlinear Dimensionality Reduction for Visual Analysis

Zeyang Huang, Takanori Fujiwara, Angelos Chatzimparmpas, Wandrille Duchemin, Andreas Kerren

Main category: cs.LG

TL;DR: MAPLE is a new nonlinear dimensionality reduction method that improves upon UMAP by using self-supervised learning and maximum manifold capacity representations to better model complex manifolds in high-dimensional data.

DetailsMotivation: The motivation is to enhance UMAP's manifold modeling capabilities, particularly for high-dimensional data with substantial intra-cluster variance and curved manifold structures (like biological or image data), where existing methods may struggle to properly separate clusters and reveal fine subcluster structures.

Method: MAPLE uses a self-supervised learning approach with maximum manifold capacity representations (MMCRs) that compress variances among locally similar data points while amplifying variance among dissimilar data points, effectively untangling complex manifolds.

Result: Qualitative and quantitative evaluations show that MAPLE produces clearer visual cluster separations and finer subcluster resolution than UMAP while maintaining comparable computational cost.

Conclusion: MAPLE successfully enhances UMAP’s manifold modeling capabilities, offering improved performance for complex high-dimensional data without increasing computational burden, making it particularly suitable for biological and image data analysis.

Abstract: We present a new nonlinear dimensionality reduction method, MAPLE, that enhances UMAP by improving manifold modeling. MAPLE employs a self-supervised learning approach to more efficiently encode low-dimensional manifold geometry. Central to this approach are maximum manifold capacity representations (MMCRs), which help untangle complex manifolds by compressing variances among locally similar data points while amplifying variance among dissimilar data points. This design is particularly effective for high-dimensional data with substantial intra-cluster variance and curved manifold structures, such as biological or image data. Our qualitative and quantitative evaluations demonstrate that MAPLE can produce clearer visual cluster separations and finer subcluster resolution than UMAP while maintaining comparable computational cost.
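
The MMCR objective at MAPLE's core has a compact form in the original MMCR work (Yerxa et al.): maximize the nuclear norm of the matrix of per-sample centroids of normalized view embeddings. A sketch of that loss (not MAPLE's full pipeline):

```python
import torch
import torch.nn.functional as F

def mmcr_loss(z):
    """z: (batch, n_views, dim) embeddings of augmented views of each sample."""
    z = F.normalize(z, dim=-1)
    centroids = z.mean(dim=1)                   # compress within-sample variance
    # maximizing the nuclear norm of the centroid matrix spreads samples apart
    return -torch.linalg.matrix_norm(centroids, ord="nuc")
```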

[356] NeuraLSP: An Efficient and Rigorous Neural Left Singular Subspace Preconditioner for Conjugate Gradient Methods

Alexander Benanti, Xi Han, Hong Qin

Main category: cs.LG

TL;DR: NeuraLSP: A neural preconditioner using left singular subspace of near-nullspace vectors to accelerate PDE solving with up to 53% speedup.

DetailsMotivation: Existing neural preconditioners using GNNs suffer from rank inflation and suboptimal convergence rates when aggregating discretized PDE matrices into graphs, limiting their effectiveness for solving large sparse linear systems.

Method: Proposes NeuraLSP, a neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix’s near-nullspace vectors, compressing spectral information into a fixed low-rank operator.

Result: The method demonstrates theoretical guarantees and empirical robustness to rank inflation, achieving up to 53% speedup across diverse families of PDEs.

Conclusion: NeuraLSP provides a theoretically sound and empirically effective neural preconditioning approach that addresses limitations of existing methods, offering significant acceleration for PDE solving with robustness to rank inflation.

Abstract: Numerical techniques for solving partial differential equations (PDEs) are integral for many fields across science and engineering. Such techniques usually involve solving large, sparse linear systems, where preconditioning methods are critical. In recent years, neural methods, particularly graph neural networks (GNNs), have demonstrated their potential through accelerated convergence. Nonetheless, to extract connective structures, existing techniques aggregate discretized system matrices into graphs, and suffer from rank inflation and a suboptimal convergence rate. In this paper, we articulate NeuraLSP, a novel neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix’s near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, our method exhibits both theoretical guarantees and empirical robustness to rank inflation, affording up to a 53% speedup. Besides the theoretical guarantees for our newly-formulated loss function, our comprehensive experimental results across diverse families of PDEs also substantiate the aforementioned theoretical advances.
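
The abstract tells us only that spectral information from near-nullspace vectors is compressed into a fixed low-rank operator, with the subspace produced by a learned model. As a point of reference, a conventional deflation-style preconditioner built from a given subspace `W` looks like this (our sketch, not the paper's architecture):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def lowrank_preconditioner(A, W, eps=1e-2):
    """M^{-1} v ~ W (W^T A W)^{-1} W^T v + eps * v: coarse solve plus damping.

    W (n x k) spans near-nullspace directions; in NeuraLSP this subspace would
    come from the learned model, here it is simply an input.
    """
    AW = A @ W
    coarse = np.linalg.inv(W.T @ AW)            # small k x k system
    n = A.shape[0]
    return LinearOperator((n, n),
                          matvec=lambda v: W @ (coarse @ (W.T @ v)) + eps * v)

# usage sketch: x, info = cg(A, b, M=lowrank_preconditioner(A, W))
```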

[357] Causal-Driven Feature Evaluation for Cross-Domain Image Classification

Chen Cheng, Ang Li

Main category: cs.LG

TL;DR: The paper proposes a causal perspective on OOD generalization, arguing that domain-invariant features aren’t necessarily causally effective for prediction, and introduces a segment-level framework to measure causal necessity and sufficiency across domains.

DetailsMotivation: Current OOD generalization approaches focus on domain-invariant representations, but these invariant features may not be causally effective for prediction. The authors argue that invariance alone is insufficient for reliable OOD performance and propose a causal evaluation framework.

Method: The paper introduces an explicit segment-level framework that directly measures causal effectiveness (necessity and sufficiency) of learned representations under distribution shift, providing a more faithful criterion than invariance alone.

Result: Experiments on multi-domain benchmarks show consistent improvements in OOD performance, particularly under challenging domain shifts, demonstrating the value of causal evaluation for robust generalization.

Conclusion: Causal evaluation of representations (measuring necessity and sufficiency) provides a more reliable approach to OOD generalization than pursuing domain invariance alone, leading to better performance under distribution shifts.

Abstract: Out-of-distribution (OOD) generalization remains a fundamental challenge in real-world classification, where test distributions often differ substantially from training data. Most existing approaches pursue domain-invariant representations, implicitly assuming that invariance implies reliability. However, features that are invariant across domains are not necessarily causally effective for prediction. In this work, we revisit OOD classification from a causal perspective and propose to evaluate learned representations based on their necessity and sufficiency under distribution shift. We introduce an explicit segment-level framework that directly measures causal effectiveness across domains, providing a more faithful criterion than invariance alone. Experiments on multi-domain benchmarks demonstrate consistent improvements in OOD performance, particularly under challenging domain shifts, highlighting the value of causal evaluation for robust generalization.
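
The abstract does not spell out its estimator, but the standard causal formalization of the two criteria it names is Pearl's probabilities of necessity and sufficiency, reproduced here for orientation ($Y_{X=x}$ denotes the potential outcome under intervention $X=x$):

```latex
\mathrm{PN} = P\!\big(Y_{X=0} = 0 \mid X = 1,\, Y = 1\big), \qquad
\mathrm{PS} = P\!\big(Y_{X=1} = 1 \mid X = 0,\, Y = 0\big)
```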

[358] On the Computational Complexity of Performative Prediction

Ioannis Anagnostides, Rohan Chauhan, Ioannis Panageas, Tuomas Sandholm, Jingming Yan

Main category: cs.LG

TL;DR: Computing performatively stable points becomes PPAD-complete (equivalent to finding Nash equilibria) when performative effects are strong (ρ>1), even for simple quadratic loss with linear shifts.

DetailsMotivation: To understand the computational complexity of performative prediction in the strong performative effects regime (ρ>1), which was previously open, and to establish phase transitions in tractability.

Method: Established computational complexity results using PPAD-completeness reductions, extended to general convex domains, and analyzed strategic classification as a special case with PLS-hardness results.

Result: Sharp phase transition: computing ε-performatively stable points is PPAD-complete when ρ=1+O(ε), even for quadratic loss with linear shifts. Strategic classification is PLS-hard.

Conclusion: Strong performative effects (ρ>1) make finding performatively stable points computationally intractable (PPAD-complete), similar to finding Nash equilibria, revealing fundamental limits in performative prediction.

Abstract: Performative prediction captures the phenomenon where deploying a predictive model shifts the underlying data distribution. While simple retraining dynamics are known to converge linearly when the performative effects are weak ($\rho < 1$), the complexity in the regime $\rho > 1$ was hitherto open. In this paper, we establish a sharp phase transition: computing an $\varepsilon$-performatively stable point is PPAD-complete – and thus polynomial-time equivalent to Nash equilibria in general-sum games – even when $\rho = 1 + O(\varepsilon)$. This intractability persists even in the ostensibly simple setting with a quadratic loss function and linear distribution shifts. One of our key technical contributions is to extend this PPAD-hardness result to general convex domains, which is of broader interest in the complexity of variational inequalities. Finally, we address the special case of strategic classification, showing that computing a strategic local optimum is PLS-hard.
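
For orientation, the object whose computation the paper classifies is the performatively stable point of Perdomo et al. (2020): a model that is optimal on the distribution its own deployment induces.

```latex
\theta_{\mathrm{PS}} \in \arg\min_{\theta}\;
  \mathbb{E}_{z \sim \mathcal{D}(\theta_{\mathrm{PS}})}\, \ell(z; \theta)
```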

[359] Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

Zhipeng Zhang, Wenting Ma, Kai Li, Meng Guo, Lei Yang, Wei Yu, Hongji Cui, Yichen Zhang, Mo Zhang, Jinzhe Lin, Zhenjie Yao

Main category: cs.LG

TL;DR: Meta-cognitive RL framework enables agents to self-assess learning reliability using VPES-driven meta-trust, achieving better robustness against reward corruption.

DetailsMotivation: Current robust RL methods lack self-awareness about learning reliability, causing them to either overreact to noise or fail catastrophically when uncertainty accumulates.

Method: Proposes meta-cognitive RL with meta-trust variable driven by Value Prediction Error Stability (VPES), enabling fail-safe regulation and gradual trust recovery to modulate learning dynamics.

Result: Experiments on continuous-control benchmarks with reward corruption show higher average returns and significantly reduced late-stage training failures compared to strong robustness baselines.

Conclusion: Meta-cognitive control with recovery mechanisms enables more robust and reliable reinforcement learning by allowing agents to self-regulate based on internal reliability assessments.

Abstract: Robust reinforcement learning methods typically focus on suppressing unreliable experiences or corrupted rewards, but they lack the ability to reason about the reliability of their own learning process. As a result, such methods often either overreact to noise by becoming overly conservative or fail catastrophically when uncertainty accumulates. In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery. Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong robustness baselines.

[360] DeRaDiff: Denoising Time Realignment of Diffusion Models

Ratnavibusena Don Shahain Manujith, Yang Zhang, Teoh Tze Tzun, Kenji Kawaguchi

Main category: cs.LG

TL;DR: DeRaDiff enables efficient search for optimal regularization strength in diffusion model alignment by modulating strength during sampling without retraining.

DetailsMotivation: Current diffusion model alignment methods require expensive hyperparameter sweeps to find optimal KL regularization strength, which is computationally prohibitive.

Method: DeRaDiff uses denoising time realignment that replaces reverse step reference distribution with geometric mixture of aligned and reference posteriors, enabling on-the-fly control via single parameter lambda.

Result: Method approximates models aligned from scratch at different regularization strengths across text-image alignment and quality metrics, eliminating need for expensive alignment sweeps.

Conclusion: DeRaDiff provides efficient way to search for optimal regularization strength, substantially reducing computational costs of diffusion model alignment.

Abstract: Recent advances align diffusion models with human preferences to increase aesthetic appeal and mitigate artifacts and biases. Such methods aim to maximize a conditional output distribution aligned with higher rewards whilst not drifting far from a pretrained prior. This is commonly enforced by KL (Kullback Leibler) regularization. As such, a central issue still remains: how does one choose the right regularization strength? Too high a strength leads to limited alignment and too low a strength leads to “reward hacking”. This renders the task of choosing the correct regularization strength highly non-trivial. Existing approaches sweep over this hyperparameter by aligning a pretrained model at multiple regularization strengths and then choose the best strength. Unfortunately, this is prohibitively expensive. We introduce DeRaDiff, a denoising time realignment procedure that, after aligning a pretrained model once, modulates the regularization strength during sampling to emulate models trained at other regularization strengths without any additional training or finetuning. Extending decoding-time realignment from language to diffusion models, DeRaDiff operates over iterative predictions of continuous latents by replacing the reverse step reference distribution with a geometric mixture of an aligned and reference posterior, thus giving rise to a closed-form update under common schedulers and a single tunable parameter, lambda, for on-the-fly control. Our experiments show that across multiple text-image alignment and image-quality metrics, our method consistently provides a strong approximation for models aligned entirely from scratch at different regularization strengths. Thus, our method yields an efficient way to search for the optimal strength, eliminating the need for expensive alignment sweeps and thereby substantially reducing computational costs.
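
The mixture step has a simple consequence worth spelling out: for Gaussian reverse kernels with a shared covariance, the normalized geometric mixture $p_{\text{aligned}}^{\lambda}\, p_{\text{ref}}^{1-\lambda}$ is again Gaussian with a linearly interpolated mean, so blending reduces to interpolating noise predictions. A sketch of our reading, not the authors' exact closed form:

```python
import torch

@torch.no_grad()
def realigned_eps(eps_aligned, eps_ref, lam):
    """Blend per-step noise predictions; lam = 1 recovers the aligned model,
    lam -> 0 approaches the reference (pretrained) model."""
    return lam * eps_aligned + (1.0 - lam) * eps_ref
```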

[361] Minimum-Cost Network Flow with Dual Predictions

Zhiyang Chen, Hailong Yao, Xia Yin

Main category: cs.LG

TL;DR: First minimum-cost network flow algorithm with dual prediction that provides provable performance improvements with time complexity bounds based on prediction error.

DetailsMotivation: To leverage machine-learned predictions to improve performance of classic algorithms, specifically for minimum-cost network flow problems which have important applications in traffic networks and chip design.

Method: Augments the classic ε-relaxation minimum-cost flow algorithm with a dual prediction. Provides theoretical analysis of time complexity in terms of infinity norm prediction error and sample complexity bounds for PAC-learning the prediction.

Result: Empirical validation shows 12.74× average speedup on traffic networks and 1.64× average speedup on chip escape routing applications. Theoretical bounds show both consistency (good predictions help) and robustness (bad predictions don’t hurt too much).

Conclusion: Machine-learned predictions can be effectively integrated into minimum-cost flow algorithms to achieve significant speedups while maintaining theoretical guarantees, demonstrating practical value for real-world applications.

Abstract: Recent work has shown that machine-learned predictions can provably improve the performance of classic algorithms. In this work, we propose the first minimum-cost network flow algorithm augmented with a dual prediction. Our method is based on a classic minimum-cost flow algorithm, namely $\varepsilon$-relaxation. We provide time complexity bounds in terms of the infinity-norm prediction error, which are both consistent and robust. We also prove sample complexity bounds for PAC-learning the prediction. We empirically validate our theoretical results on two applications of minimum-cost flow, namely traffic networks and chip escape routing, for which we learn a fixed prediction and a feature-based neural network model to infer the prediction, respectively. Experimental results show $12.74\times$ and $1.64\times$ average speedups on the two applications.

[362] ProFlow: Zero-Shot Physics-Consistent Sampling via Proximal Flow Guidance

Zichao Yu, Ming Li, Wenyi Zhang, Difan Zou, Weiguo Gao

Main category: cs.LG

TL;DR: ProFlow is a zero-shot physics-consistent sampling framework that uses proximal guidance to infer physical fields from sparse observations while strictly satisfying PDE constraints without retraining generative priors.

DetailsMotivation: Existing deep generative models for inverse problems struggle to enforce hard physical constraints without costly retraining or disrupting learned generative priors. There's a need for sampling mechanisms that can reconcile strict physical consistency and observational fidelity with pre-trained prior structures.

Method: ProFlow uses a proximal guidance framework with a two-step scheme: (1) terminal optimization step that projects flow predictions onto physically and observationally consistent sets via proximal minimization, and (2) interpolation step that maps refined states back to the generative trajectory to maintain consistency with the learned flow probability path.

Result: Comprehensive benchmarks on Poisson, Helmholtz, Darcy, and viscous Burgers’ equations show ProFlow achieves superior physical and observational consistency, as well as more accurate distributional statistics compared to state-of-the-art diffusion- and flow-based baselines.

Conclusion: ProFlow provides an effective zero-shot sampling framework that can infer physically consistent solutions from sparse observations using fixed generative priors without task-specific retraining, addressing the fundamental challenge of enforcing hard physical constraints in inverse problems.

Abstract: Inferring physical fields from sparse observations while strictly satisfying partial differential equations (PDEs) is a fundamental challenge in computational physics. Recently, deep generative models offer powerful data-driven priors for such inverse problems, yet existing methods struggle to enforce hard physical constraints without costly retraining or disrupting the learned generative prior. Consequently, there is a critical need for a sampling mechanism that can reconcile strict physical consistency and observational fidelity with the statistical structure of the pre-trained prior. To this end, we present ProFlow, a proximal guidance framework for zero-shot physics-consistent sampling, defined as inferring solutions from sparse observations using a fixed generative prior without task-specific retraining. The algorithm employs a rigorous two-step scheme that alternates between: (i) a terminal optimization step, which projects the flow prediction onto the intersection of the physically and observationally consistent sets via proximal minimization; and (ii) an interpolation step, which maps the refined state back to the generative trajectory to maintain consistency with the learned flow probability path. This procedure admits a Bayesian interpretation as a sequence of local maximum a posteriori (MAP) updates. Comprehensive benchmarks on Poisson, Helmholtz, Darcy, and viscous Burgers’ equations demonstrate that ProFlow achieves superior physical and observational consistency, as well as more accurate distributional statistics, compared to state-of-the-art diffusion- and flow-based baselines.

[363] Hyperparameter Transfer with Mixture-of-Expert Layers

Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

Main category: cs.LG

TL;DR: Proposes a new parameterization method for Mixture-of-Experts (MoE) transformer models that enables reliable hyperparameter transfer across different model scales, validated from 51M to 2B+ parameters.

DetailsMotivation: MoE layers help scale neural networks by decoupling total parameters from activated parameters, but introduce training complexity: (1) new router weights requiring hyperparameter tuning, and (2) new architecture dimensions (number/size of experts) that must be chosen and can be large. Need cheap and reliable hyperparameter selection methods.

Method: Proposes a novel parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert size. The parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. Enables hyperparameter transfer across different model scales trained at fixed token budgets.

Result: Empirically shows the parameterization enables reliable hyperparameter transfer across models from 51M to over 2B total parameters. Successfully transfers hyperparameters identified from small models on short token horizons to train larger models on longer horizons with performant model behaviors.

Conclusion: The proposed parameterization method provides a practical solution for cheap and reliable hyperparameter selection in MoE models, enabling efficient scaling across different model dimensions while maintaining performance.

Abstract: Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.

[364] Certificate-Guided Pruning for Stochastic Lipschitz Optimization

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: CGP is a black-box optimization method for Lipschitz functions with noisy evaluations that maintains explicit certificates of suboptimality, provides sample complexity guarantees, and offers adaptive extensions for practical use.

DetailsMotivation: Existing adaptive discretization methods for Lipschitz function optimization lack explicit optimality certificates and measurable progress guarantees, making it difficult to determine when to stop optimization or assess solution quality.

Method: Certificate-Guided Pruning (CGP) maintains an active set of potentially optimal points using confidence-adjusted Lipschitz envelopes, pruning points that are certifiably suboptimal with high probability. Three extensions: CGP-Adaptive learns Lipschitz constant online, CGP-TR uses trust regions for high dimensions, and CGP-Hybrid switches to Gaussian process refinement when local smoothness is detected.

Result: Under a margin condition with near-optimality dimension α, CGP achieves sample complexity $\tilde{O}(\varepsilon^{-(2+\alpha)})$. Experiments on 12 benchmarks (d ∈ [2, 100]) show CGP variants match or exceed strong baselines while providing principled stopping criteria via certificate volume.

Conclusion: CGP provides a principled framework for Lipschitz optimization with explicit optimality certificates, controlled convergence guarantees, and practical extensions that scale to high dimensions while maintaining theoretical guarantees.

Abstract: We study black-box optimization of Lipschitz functions under noisy evaluations. Existing adaptive discretization methods implicitly avoid suboptimal regions but do not provide explicit certificates of optimality or measurable progress guarantees. We introduce Certificate-Guided Pruning (CGP), which maintains an explicit active set $A_t$ of potentially optimal points via confidence-adjusted Lipschitz envelopes. Any point outside $A_t$ is certifiably suboptimal with high probability, and under a margin condition with near-optimality dimension $\alpha$, we prove $\mathrm{Vol}(A_t)$ shrinks at a controlled rate, yielding sample complexity $\tilde{O}(\varepsilon^{-(2+\alpha)})$. We develop three extensions: CGP-Adaptive learns $L$ online with $O(\log T)$ overhead; CGP-TR scales to $d > 50$ via trust regions with local certificates; and CGP-Hybrid switches to GP refinement when local smoothness is detected. Experiments on 12 benchmarks ($d \in [2, 100]$) show CGP variants match or exceed strong baselines while providing principled stopping criteria via certificate volume.
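
The pruning rule is simple enough to sketch. Assuming sub-Gaussian noise and a known Lipschitz constant $L$ (the setting of the paper; the constants are ours), a point survives only if its Lipschitz upper envelope reaches the best lower confidence bound:

```python
import numpy as np

def prune_active_set(X, means, counts, L, sigma=1.0, delta=0.05):
    """Return a boolean mask of points that are not yet certifiably suboptimal."""
    conf = sigma * np.sqrt(2 * np.log(2 * len(X) / delta) / np.maximum(counts, 1))
    best_lcb = (means - conf).max()             # best certified value so far
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # tightest upper envelope at x_j certified by any observation at x_i:
    ucb_env = (means[:, None] + conf[:, None] + L * D).min(axis=0)
    return ucb_env >= best_lcb                  # everything else is pruned
```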

[365] Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

Main category: cs.LG

TL;DR: Spark is a reinforcement learning framework that strategically branches exploration at critical decision states for efficient training of language model agents in long-horizon tasks.

DetailsMotivation: Current RL methods for training LLM agents waste computational resources on trivial steps while failing to guarantee sample quality, especially for long-horizon tasks with limited resources.

Method: Spark uses strategic policy-aware exploration via key-state dynamic branching - it selectively branches at critical decision points to probe promising trajectories, leveraging the agent’s intrinsic decision-making signals for adaptive resource allocation.

Result: Experiments across diverse tasks (including embodied planning) show Spark achieves superior success rates with significantly fewer training samples and exhibits robust generalization in unseen scenarios.

Conclusion: Spark enables resource-efficient exploration by prioritizing sampling quality over blind coverage, reducing dependence on human priors and allowing autonomous expansion of exploration for stronger generalization.

Abstract: Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste a substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent’s intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.

[366] An Accounting Identity for Algorithmic Fairness

Hadi Elzayn, Jacob Goldin

Main category: cs.LG

TL;DR: The paper establishes a mathematical identity linking model accuracy and fairness criteria, showing they are complements rather than tradeoffs in binary prediction.

DetailsMotivation: To resolve the perceived tension between accuracy and fairness in predictive models by deriving a formal mathematical relationship that clarifies their interdependence.

Method: Derives an accounting identity for predictive models that connects accuracy with fairness criteria, analyzes the identity mathematically, and validates with experiments on benchmark data.

Result: Shows accuracy and fairness are complements: increasing accuracy shrinks the “total unfairness budget” and vice-versa. Fairness interventions often substitute between different fairness violations rather than reducing overall unfairness.

Conclusion: Accuracy and fairness should be viewed as complements in binary prediction. The derived identity provides a framework for understanding fairness tradeoffs and shows how additional outcome information can relax fairness incompatibilities in non-binary tasks.

Abstract: We derive an accounting identity for predictive models that links accuracy with common fairness criteria. The identity shows that for globally calibrated models, the weighted sum of miscalibration within groups and error imbalance across groups equals a “total unfairness budget.” For binary outcomes, this budget is the model’s mean-squared error times the difference in group prevalence across outcome classes. The identity nests standard impossibility results as special cases, while also describing inherent tradeoffs when one or more fairness measures are not perfectly satisfied. The results suggest that accuracy and fairness are best viewed as complements in binary prediction tasks: increasing accuracy necessarily shrinks the total unfairness budget and vice-versa. Experiments on benchmark data confirm the theory and show that many fairness interventions largely substitute between fairness violations, and when they reduce accuracy they tend to expand the total unfairness budget. The results extend naturally to prediction tasks with non-binary outcomes, illustrating how additional outcome information can relax fairness incompatibilities and identifying conditions under which the binary-style impossibility does and does not extend to regression tasks.

[367] Parametric and Generative Forecasts of Day-Ahead Market Curves for Storage Optimization

Julian Gutierrez, Redouane Silvente

Main category: cs.LG

TL;DR: Two ML frameworks for EPEX SPOT day-ahead market: 1) Fast parametric model for hourly demand/supply curve forecasting, 2) Generative models for synthetic order-level scenarios to optimize storage strategies and analyze price compression effects.

DetailsMotivation: Need for effective forecasting and optimization tools in the EPEX SPOT day-ahead electricity market to support storage operations and understand market dynamics like price compression effects.

Method: Two approaches: 1) Parametric model using Chebyshev polynomials for elastic segments with min/max volumes for interpretable hourly curve forecasting; 2) Generative models learning joint distribution of 24-hour order-level submissions conditioned on weather/fuel variables to create synthetic daily scenarios.

Result: Parametric model enables daily use with low error and clear interpretability; generative models produce synthetic order scenarios that aggregate to hourly curves; storage optimization reveals price compression effects with lower peaks, higher off-peak levels, and diminishing returns with capacity expansion.

Conclusion: The paper presents complementary ML frameworks for electricity market analysis - a practical parametric model for daily operations and comprehensive generative models for strategic analysis, enabling optimized storage strategies and revealing important market dynamics like price compression.

Abstract: We present two machine learning frameworks for forecasting aggregated curves and optimizing storage in the EPEX SPOT day-ahead market. First, a fast parametric model forecasts hourly demand and supply curves in a low-dimensional and grid-robust representation, with minimum and maximum volumes combined with a Chebyshev polynomial for the elastic segment. The model enables daily use with low error and clear interpretability. Second, for a more comprehensive analysis, though less suited to daily operation, we employ generative models that learn the joint distribution of 24-hour order-level submissions given weather and fuel variables. These models generate synthetic daily scenarios of individual buy and sell orders, which, once aggregated, yield hourly supply and demand curves. Based on these forecasts, we optimize a price-making storage strategy, quantify revenue distributions, and highlight the price-compression effect with lower peaks, higher off-peak levels, and diminishing returns as capacity expands.
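
The parametric representation is easy to reproduce with NumPy's Chebyshev utilities. A sketch under our reading of the abstract (the degree and the domain mapping are assumptions):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def fit_curve(volumes, prices, degree=5):
    """Encode a bid curve as (v_min, v_max, Chebyshev coefficients)."""
    v_min, v_max = volumes.min(), volumes.max()
    x = 2.0 * (volumes - v_min) / (v_max - v_min) - 1.0   # map to [-1, 1]
    return v_min, v_max, C.chebfit(x, prices, degree)

def eval_curve(v, v_min, v_max, coeffs):
    x = np.clip(2.0 * (v - v_min) / (v_max - v_min) - 1.0, -1.0, 1.0)
    return C.chebval(x, coeffs)
```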

[368] Order-Optimal Sample Complexity of Rectified Flows

Hari Krishna Sahoo, Mudit Gaur, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Rectified flow models achieve optimal sample complexity of Õ(ε⁻²) by constraining transport trajectories to be linear, enabling fast sampling with single Euler steps.

DetailsMotivation: Flow-based generative models are more efficient than diffusion models, but existing flow matching models have suboptimal O(ε⁻⁴) sample complexity. Rectified flows offer linear transport trajectories that accelerate sampling while potentially achieving better theoretical sample complexity.

Method: Rectified flow models constrain transport trajectories to be linear from base distribution to data distribution. The velocity field is parameterized by neural networks, trained with squared loss along these linear paths. The analysis exploits the model’s structure to control localized Rademacher complexity.

Result: Rectified flows achieve sample complexity Õ(ε⁻²), improving on the best known O(ε⁻⁴) bounds for flow matching models and matching the optimal rate for mean estimation. This explains their strong empirical performance and enables high-quality generation with single Euler steps.

Conclusion: Rectified flow models provide both practical efficiency (fast sampling) and theoretical optimality (optimal sample complexity), making them superior to previous flow-based approaches and competitive with diffusion models.

Abstract: Recently, flow-based generative models have shown superior efficiency compared to diffusion models. In this paper, we study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single Euler step. Under standard assumptions on the neural network classes used to parameterize the velocity field and data distribution, we prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$. This improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching models and matches the optimal rate for mean estimation. Our analysis exploits the particular structure of rectified flows: because the model is trained with a squared loss along linear paths, the associated hypothesis class admits a sharply controlled localized Rademacher complexity. This yields the improved, order-optimal sample complexity and provides a theoretical explanation for the strong empirical performance of rectified flow models.
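
The training objective the analysis rests on is the textbook rectified-flow loss: a squared error against the constant velocity of a linear path. A minimal sketch (the signature of the velocity network `v_theta` is an assumption):

```python
import torch

def rectified_flow_loss(v_theta, x0, x1):
    """x0 ~ base (e.g., Gaussian), x1 ~ data; paths are x_t = (1 - t) x0 + t x1."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0                             # constant velocity of the linear path
    return ((v_theta(x_t, t) - target) ** 2).mean()

# Single-Euler-step sampling, the regime the paper highlights:
#   x1_hat = x0 + v_theta(x0, t_zero)
```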

[369] HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-BENCH

Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu

Main category: cs.LG

TL;DR: The paper introduces HE-SNR, a novel entropy-based metric for guiding LLM mid-training on software engineering tasks, addressing limitations of standard metrics like perplexity.

DetailsMotivation: Current metrics like perplexity are compromised by the "Long-Context Tax" and have weak correlation with downstream software engineering performance in benchmarks like SWE-bench, creating a gap in effective mid-training guidance.

Method: Proposes Entropy Compression Hypothesis and HE-SNR metric based on fine-grained entropy analysis, plus a rigorous data filtering strategy. Validated on industrial-scale Mixture-of-Experts models with varying context windows (32K/128K).

Result: HE-SNR demonstrates superior robustness and predictive power compared to standard metrics, effectively guiding LLM mid-training for complex software engineering tasks.

Conclusion: Provides theoretical foundation and practical tools for optimizing LLM potential in complex engineering domains through entropy-based mid-training guidance.

Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the “Long-Context Tax” and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders (“reasonable hesitation”). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), our approach demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.
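
The abstract does not give the HE-SNR formula, so the following is only a hypothetical stand-in illustrating the ingredients it names: per-token predictive entropies and a signal-to-noise statistic over the high-entropy ("reasonable hesitation") states.

```python
import torch

def high_entropy_snr(logits, tau=2.0):
    """logits: (seq_len, vocab). tau: entropy threshold in nats (assumed)."""
    logp = torch.log_softmax(logits, dim=-1)
    H = -(logp.exp() * logp).sum(dim=-1)        # per-token predictive entropy
    high = H[H > tau]                           # candidate 'hesitation' states
    if high.numel() < 2:
        return torch.tensor(0.0)
    return high.mean() / high.std()             # hypothetical SNR, not the paper's formula
```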

[370] Robust SDE Parameter Estimation Under Missing Time Information Setting

Long Van Tran, Truyen Tran, Phuoc Nguyen

Main category: cs.LG

TL;DR: Novel framework that simultaneously recovers temporal order and estimates SDE parameters when temporal ordering information is corrupted or missing, using score-matching to infer order and maximum likelihood for parameter estimation.

DetailsMotivation: Parameter estimation for SDEs typically requires accurately timestamped observational sequences, but fails when temporal ordering information is corrupted, missing, or deliberately hidden (e.g., for privacy). This limits applicability in sensitive domains where temporal information might be obscured.

Method: 1) Exploit asymmetries between forward and backward SDE processes to derive a score-matching criterion that infers correct temporal order between pairs of observations. 2) Recover total temporal order via a sorting procedure. 3) Estimate SDE parameters from reconstructed sequence using maximum likelihood.

Result: Extensive experiments on synthetic and real-world datasets demonstrate the method’s effectiveness in recovering temporal order and estimating SDE parameters, extending parameter estimation to settings with missing temporal order.

Conclusion: The framework successfully addresses the challenge of SDE parameter estimation when temporal ordering is corrupted or missing, broadening applicability in sensitive domains where temporal information might be deliberately hidden for privacy reasons.

Abstract: Recent advances in stochastic differential equations (SDEs) have enabled robust modeling of real-world dynamical processes across diverse domains, such as finance, health, and systems biology. However, parameter estimation for SDEs typically relies on accurately timestamped observational sequences. When temporal ordering information is corrupted, missing, or deliberately hidden (e.g., for privacy), existing estimation methods often fail. In this paper, we investigate the conditions under which temporal order can be recovered and introduce a novel framework that simultaneously reconstructs temporal information and estimates SDE parameters. Our approach exploits asymmetries between forward and backward processes, deriving a score-matching criterion to infer the correct temporal order between pairs of observations. We then recover the total order via a sorting procedure and estimate SDE parameters from the reconstructed sequence using maximum likelihood. Finally, we conduct extensive experiments on synthetic and real-world datasets to demonstrate the effectiveness of our method, extending parameter estimation to settings with missing temporal order and broadening applicability in sensitive domains.
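
Schematically, the pipeline is a learned pairwise comparator followed by a sort and a standard likelihood fit. In the sketch below the comparator is a stub; the real criterion is the paper's score-matching asymmetry between forward and backward dynamics.

```python
import functools

def forward_score(a, b):
    """Stub for the score-matching criterion: positive if a likely precedes b."""
    return float(b.sum() - a.sum())             # placeholder asymmetry signal

def recover_order(observations):
    """Sort unordered observations (e.g., numpy arrays) with the comparator."""
    cmp = lambda a, b: -1 if forward_score(a, b) > 0 else 1
    return sorted(observations, key=functools.cmp_to_key(cmp))

# SDE parameters are then estimated on the reordered sequence by maximum likelihood.
```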

[371] C2:Cross learning module enhanced decision transformer with Constraint-aware loss for auto-bidding

Jinren Ding, Xuejian Xu, Shen Jiang, Zhitong Hao, Jinhui Yang, Peng Jiang

Main category: cs.LG

TL;DR: C2 enhances Decision Transformer for auto-bidding by adding cross-attention for better sequence correlation modeling and constraint-aware loss for selective learning of optimal behaviors.

DetailsMotivation: Decision Transformer has limitations in auto-bidding: insufficient cross-correlation modeling among state, action, and RTG sequences, and indiscriminate learning of both optimal and suboptimal behaviors.

Method: Proposes C2 framework with two innovations: 1) Cross Learning Block (CLB) using cross-attention to strengthen inter-sequence correlation modeling, and 2) Constraint-aware Loss (CL) incorporating budget and CPA constraints for selective learning of optimal trajectories.

Result: Extensive offline evaluations on AuctionNet dataset show consistent performance gains (up to 3.23% over state-of-the-art GAVE) across diverse budget settings; ablation studies confirm complementary synergy of CLB and CL.

Conclusion: C2 demonstrates superiority in auto-bidding by addressing Decision Transformer’s limitations through enhanced cross-correlation modeling and selective learning of optimal behaviors.

Abstract: Decision Transformer (DT) shows promise for generative auto-bidding by capturing temporal dependencies, but suffers from two critical limitations: insufficient cross-correlation modeling among state, action, and return-to-go (RTG) sequences, and indiscriminate learning of optimal/suboptimal behaviors. To address these, we propose C2, a novel framework enhancing DT with two core innovations: (1) a Cross Learning Block (CLB) via cross-attention to strengthen inter-sequence correlation modeling; (2) a Constraint-aware Loss (CL) incorporating budget and Cost-Per-Acquisition (CPA) constraints for selective learning of optimal trajectories. Extensive offline evaluations on the AuctionNet dataset demonstrate consistent performance gains (up to 3.23% over state-of-the-art GAVE) across diverse budget settings; ablation studies verify the complementary synergy of CLB and CL, confirming C2’s superiority in auto-bidding. The code for reproducing our results is available at: https://github.com/Dingjinren/C2.
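
A cross-attention block in the spirit of the CLB is sketched below (our construction; the exact wiring of the state, action, and RTG streams is not specified in the abstract):

```python
import torch.nn as nn

class CrossLearningBlock(nn.Module):
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        # e.g., action tokens attend over concatenated state/RTG tokens
        out, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + out)       # residual connection + norm
```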

[372] The Forecast After the Forecast: A Post-Processing Shift in Time Series

Daojun Liang, Qi Li, Yinglong Wang, Jing Chen, Hu Zhang, Xiaoxiao Cui, Qizheng Wang, Shuo Li

Main category: cs.LG

TL;DR: δ-Adapter is a lightweight post-processing method that boosts deployed time series forecasters without retraining, using input nudging and output correction modules with theoretical guarantees.

DetailsMotivation: As forecasting models approach diminishing accuracy returns, there's an underexplored opportunity in post-processing to improve accuracy and uncertainty without retraining or modifying deployed backbones.

Method: δ-Adapter learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. It provides local descent guarantees, O(δ) drift bounds, and compositional stability. It can also act as a feature selector via sparse horizon-aware masks and as a distribution calibrator with Quantile Calibrator and Conformal Corrector.
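
A minimal sketch of the two-interface design, assuming a frozen backbone `forecast` and hypothetical tiny adapter matrices `W_in`/`W_out`; both edits are clipped to an L2 ball of radius delta, echoing the O(δ) drift bounds:

```python
import numpy as np

def bounded(v, delta):
    # Clip an adapter edit to an L2 ball of radius delta.
    n = np.linalg.norm(v)
    return v if n <= delta else v * (delta / n)

def adapted_forecast(forecast, x, W_in, W_out, delta=0.1):
    """Post-process a frozen forecaster: forecast(x + nudge(x)) + residual(x).

    W_in and W_out are hypothetical tiny learned matrices; the deployed
    backbone `forecast` is never retrained or modified.
    """
    x_nudged = x + bounded(W_in @ x, delta)   # input nudging (soft covariate edit)
    y = forecast(x_nudged)
    return y + bounded(W_out @ x, delta)      # output residual correction

# Toy usage with a fixed "deployed" forecaster mapping 8 inputs to 4 outputs.
rng = np.random.default_rng(1)
backbone = lambda x: x[:4] * 1.05
W_in = 0.01 * rng.normal(size=(8, 8))
W_out = 0.01 * rng.normal(size=(4, 8))
y_hat = adapted_forecast(backbone, rng.normal(size=8), W_in, W_out)
```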

Result: Experiments across diverse backbones and datasets show δ-Adapter improves accuracy and calibration with negligible compute and no interface changes.

Conclusion: δ-Adapter addresses the last-mile gap in time-series forecasting by providing a lightweight, architecture-agnostic way to boost deployed forecasters without retraining, offering improved accuracy, interpretability, and uncertainty calibration.

Abstract: Time series forecasting has long been dominated by advances in model architecture, with recent progress driven by deep learning and hybrid statistical techniques. However, as forecasting models approach diminishing returns in accuracy, a critical yet underexplored opportunity emerges: the strategic use of post-processing. In this paper, we address the last-mile gap in time-series forecasting, which is to improve accuracy and uncertainty without retraining or modifying a deployed backbone. We propose $\delta$-Adapter, a lightweight, architecture-agnostic way to boost deployed time series forecasters without retraining. $\delta$-Adapter learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. We provide local descent guarantees, $O(\delta)$ drift bounds, and compositional stability for combined adapters. Meanwhile, it can act as a feature selector by learning a sparse, horizon-aware mask over inputs to select important features, thereby improving interpretability. In addition, it can also be used as a distribution calibrator to measure uncertainty. Thus, we introduce a Quantile Calibrator and a Conformal Corrector that together deliver calibrated, personalized intervals with finite-sample coverage. Our experiments across diverse backbones and datasets show that $\delta$-Adapter improves accuracy and calibration with negligible compute and no interface changes.

[373] Cheap2Rich: A Multi-Fidelity Framework for Data Assimilation and System Identification of Multiscale Physics – Rotating Detonation Engines

Yuxuan Bao, Jan Zajac, Megan Powers, Venkat Raman, J. Nathan Kutz

Main category: cs.LG

TL;DR: Cheap2Rich is a multi-scale data assimilation framework that reconstructs high-fidelity states from sparse sensor data by combining fast low-fidelity models with learned, interpretable discrepancy corrections, demonstrated on rotating detonation engines.

Motivation: Bridging the sim2real gap between computationally inexpensive models and complex physical systems, especially in multi-scale settings where reduced-order models only capture dominant dynamics.

Method: Multi-scale data assimilation framework that combines fast low-fidelity priors with learned, interpretable discrepancy corrections to reconstruct high-fidelity state spaces from sparse sensor histories.
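
A schematic of the multi-fidelity decomposition, with synthetic paired snapshots and a plain least-squares fit standing in for the paper's learned, interpretable discrepancy correction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired snapshots: cheap-solver states and high-fidelity truth.
X_low = rng.normal(size=(200, 16))
Y_high = X_low + 0.1 * X_low @ rng.normal(size=(16, 16))

# Learn only the discrepancy D with  y_high ≈ x_low + x_low @ D,
# keeping the fast low-fidelity prior as the backbone of the prediction.
D, *_ = np.linalg.lstsq(X_low, Y_high - X_low, rcond=None)

def cheap2rich_predict(x_low):
    return x_low + x_low @ D   # cheap prior plus learned discrepancy correction
```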

Result: Successfully reconstructs high-fidelity RDE states from sparse measurements while isolating physically meaningful discrepancy dynamics associated with injector-driven effects.

Conclusion: Provides a general multi-fidelity framework for data assimilation and system identification in complex multi-scale systems, enabling rapid design exploration, real-time monitoring/control with interpretable discrepancy dynamics.

Abstract: Bridging the sim2real gap between computationally inexpensive models and complex physical systems remains a central challenge in machine learning applications to engineering problems, particularly in multi-scale settings where reduced-order models typically capture only dominant dynamics. In this work, we present Cheap2Rich, a multi-scale data assimilation framework that reconstructs high-fidelity state spaces from sparse sensor histories by combining a fast low-fidelity prior with learned, interpretable discrepancy corrections. We demonstrate the performance on rotating detonation engines (RDEs), a challenging class of systems that couple detonation-front propagation with injector-driven unsteadiness, mixing, and stiff chemistry across disparate scales. Our approach successfully reconstructs high-fidelity RDE states from sparse measurements while isolating physically meaningful discrepancy dynamics associated with injector-driven effects. The results highlight a general multi-fidelity framework for data assimilation and system identification in complex multi-scale systems, enabling rapid design exploration and real-time monitoring and control while providing interpretable discrepancy dynamics. Code for this project is available at: github.com/kro0l1k/Cheap2Rich.

[374] Memory Retrieval in Transformers: Insights from The Encoding Specificity Principle

Viet Hung Dinh, Ming Ding, Youyang Qu, Kanchana Thilakarathna

Main category: cs.LG

TL;DR: This paper investigates how attention layers in transformer LLMs function as memory mechanisms, proposing that keywords serve as retrieval cues similar to human memory processes, and identifies specific neurons that encode these keywords for applications like machine unlearning.

Motivation: The motivation stems from increasing regulatory pressures for transparency and accountability in AI, combined with the underexplored role of attention layers in transformer-based LLMs. The paper aims to bridge computational models with psychological theories of human memory to better understand how LLMs retrieve information.

Method: The study draws on psychological and computational psycholinguistics research, particularly the Encoding Specificity Principle, to analyze attention layers as memory mechanisms. It hypothesizes that keywords serve as retrieval cues and provides converging evidence for this hypothesis. The method involves isolating specific neurons within attention layers that selectively encode and facilitate retrieval of context-defining keywords.
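
The cue-based-retrieval reading maps directly onto the standard attention formula, with queries as retrieval cues, keys as memory-trace indices, and attention weights as cue-trace similarities; a minimal numpy rendering:

```python
import numpy as np

def attention_as_retrieval(Q, K, V):
    """Cue-based retrieval view of attention: softmax(Q K^T / sqrt(d)) V.

    Q holds retrieval cues, K indexes candidate memory traces, and V carries
    the encoded content; the softmax weights are cue-trace similarities.
    """
    sim = Q @ K.T / np.sqrt(Q.shape[-1])               # cue-trace similarity
    w = np.exp(sim - sim.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)                 # retrieval strengths
    return w @ V                                       # retrieved content

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
context = attention_as_retrieval(Q, K, V)
```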

Result: The research demonstrates that attention layers instantiate memory mechanisms where keywords function as retrieval cues. It identifies specific neurons that encode these keywords, enabling keyword extraction from identified neurons. These findings contribute to understanding LLM memory processes and have practical applications in downstream tasks like machine unlearning.

Conclusion: The study successfully bridges psychological memory theories with transformer architectures, showing that attention layers operate as cue-based retrieval systems similar to human memory. The identification of keyword-encoding neurons provides a foundation for explainable AI applications, particularly in machine unlearning, while advancing our understanding of LLM internal mechanisms.

Abstract: While explainable artificial intelligence (XAI) for large language models (LLMs) remains an evolving field with many unresolved questions, increasing regulatory pressures have spurred interest in its role in ensuring transparency, accountability, and privacy-preserving machine unlearning. Although recent advances in XAI have provided some insights, the specific role of attention layers in transformer-based LLMs remains underexplored. This study investigates the memory mechanisms instantiated by attention layers, drawing on prior research in psychology and computational psycholinguistics that links Transformer attention to cue-based retrieval in human memory. In this view, queries encode the retrieval context, keys index candidate memory traces, attention weights quantify cue-trace similarity, and values carry the encoded content, jointly enabling the construction of a context representation that precedes and facilitates memory retrieval. Guided by the Encoding Specificity Principle, we hypothesize that the cues used in the initial stage of retrieval are instantiated as keywords. We provide converging evidence for this keywords-as-cues hypothesis. In addition, we isolate neurons within attention layers whose activations selectively encode and facilitate the retrieval of context-defining keywords. Consequently, these keywords can be extracted from identified neurons and further contribute to downstream applications such as unlearning.

[375] Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

Tianyi Alex Qiu, Micah Carroll, Cameron Allen

Main category: cs.LG

TL;DR: Peer prediction method for LLM evaluation/training uses game theory to elicit honest answers without ground truth, works better with larger capability gaps between models.

Motivation: LLM evaluation and training often lack strong supervision for difficult tasks, leading to deceptive results when models exploit imperfect supervision. Need methods that work with weak supervision.

Method: Peer prediction method from mechanism design literature, rewards honest/informative answers using mutual predictability metric without ground truth labels. Uses game-theoretic incentive compatibility principles.
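
A toy sketch of the mutual-predictability idea, assuming a hypothetical `predict_prob(a_j, a_i)` that scores how well one participant's answer predicts another's; honest, informative answers tend to be mutually predictable, so they earn higher reward without any ground-truth labels:

```python
import numpy as np

def peer_prediction_rewards(answers, predict_prob):
    """Reward each answer by how well peers' answers predict it, with no labels.

    predict_prob(a_j, a_i) is a hypothetical stand-in for the learned
    predictor behind the paper's mutual-predictability metric.
    """
    n = len(answers)
    rewards = np.zeros(n)
    for i in range(n):
        rewards[i] = np.mean([np.log(predict_prob(answers[j], answers[i]))
                              for j in range(n) if j != i])
    return rewards

# Toy usage: answers as probability vectors, predictability as their overlap.
answers = [np.array([0.7, 0.3]), np.array([0.6, 0.4]), np.array([0.1, 0.9])]
rewards = peer_prediction_rewards(answers, lambda a, b: float(a @ b))
```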

Result: Method effective and resistant to deception with theoretical/empirical validation up to 405B params. Training 8B model with peer prediction recovers truthfulness drop from malicious finetuning. Shows inverse scaling: resistance strengthens with larger capability gaps between experts/participants.

Conclusion: Peer prediction enables reliable LLM evaluation with weak supervision, outperforms LLM-as-a-Judge which fails with deceptive models 5-20x larger, while peer prediction thrives with 100x+ size differences.

Abstract: The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models have been shown to exploit evaluations built on such imperfect supervision, leading to deceptive results. However, a wealth of mechanism design research, so far underutilized in LLM research, focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method’s effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception is strengthened as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge’s size, while peer prediction thrives when such gaps are large, including in cases with over 100x size difference.

[376] A Learning-based Framework for Spatial Impulse Response Compensation in 3D Photoacoustic Computed Tomography

Kaiyi Yang, Seonyeong Park, Gangwon Jeong, Hsuan-Kai Huang, Alexander A. Oraevsky, Umberto Villa, Mark A. Anastasio

Main category: cs.LG

TL;DR: A deep learning framework for compensating spatial impulse response effects in 3D photoacoustic computed tomography, enabling accurate reconstruction while maintaining computational efficiency.

Motivation: Current PACT reconstruction faces a trade-off: analytic methods are fast but ignore transducer spatial impulse responses (SIRs), compromising resolution, while optimization-based methods account for SIRs but are computationally expensive, especially in 3D.

Method: Developed a learned SIR compensation framework using deep learning to map SIR-corrupted measurement data to compensated data for idealized point-like transducers. Two models were investigated: U-Net and a physics-inspired Deconv-Net, with a fast analytical training data generation procedure.

Result: The framework demonstrated resolution improvement, robustness to noise, object complexity, and sound speed heterogeneity in virtual studies. In in-vivo breast imaging, it revealed fine structures previously obscured by SIR artifacts.

Conclusion: This is the first demonstration of learned SIR compensation in 3D PACT imaging, providing an accurate yet computationally efficient solution that bridges the gap between analytic and optimization-based reconstruction methods.

Abstract: Photoacoustic computed tomography (PACT) is a promising imaging modality that combines the advantages of optical contrast with ultrasound detection. Utilizing ultrasound transducers with larger surface areas can improve detection sensitivity. However, when computationally efficient analytic reconstruction methods that neglect the spatial impulse responses (SIRs) of the transducer are employed, the spatial resolution of the reconstructed images will be compromised. Although optimization-based reconstruction methods can explicitly account for SIR effects, their computational cost is generally high, particularly in three-dimensional (3D) applications. To address the need for accurate but rapid 3D PACT image reconstruction, this study presents a framework for establishing a learned SIR compensation method that operates in the data domain. The learned compensation method maps SIR-corrupted PACT measurement data to compensated data that would have been recorded by idealized point-like transducers. Subsequently, the compensated data can be used with a computationally efficient reconstruction method that neglects SIR effects. Two variants of the learned compensation model are investigated that employ a U-Net model and a specifically designed, physics-inspired model, referred to as Deconv-Net. A fast and analytical training data generation procedure is also a component of the presented framework. The framework is rigorously validated in virtual imaging studies, demonstrating resolution improvement and robustness to noise variations, object complexity, and sound speed heterogeneity. When applied to in-vivo breast imaging data, the learned compensation models revealed fine structures that had been obscured by SIR-induced artifacts. To our knowledge, this is the first demonstration of learned SIR compensation in 3D PACT imaging.

[377] Delayed Feedback Modeling for Post-Click Gross Merchandise Volume Prediction: Benchmark, Insights and Approaches

Xinyu Li, Sishuo Chen, Guipeng Xv, Li Zhang, Mingxuan Luo, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Chen Lin

Main category: cs.LG

TL;DR: TRACE benchmark for GMV prediction with delayed feedback modeling, and READER model with repurchase-aware dual-branch architecture for improved accuracy.

Motivation: Online ad ranking models are shifting from CVR to GMV prediction, but delayed feedback modeling for continuous GMV targets remains unexplored. GMV prediction is more challenging due to continuous targets and potential multiple purchases from single clicks.

Method: Established TRACE benchmark with complete transaction sequences for streaming delayed feedback modeling. Proposed READER model with repurchase-aware dual-branch predictor that selectively activates expert parameters based on repurchase predictions and dynamically calibrates regression targets.
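
A schematic of the repurchase-aware dual-branch design (names and the threshold are illustrative): a router estimates repurchase probability and selectively activates one expert's parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reader_predict(x, w_router, expert_single, expert_repurchase, tau=0.5):
    """Repurchase-aware dual-branch GMV prediction (schematic).

    The router's repurchase probability selectively activates one expert's
    parameters; the threshold tau and all names are illustrative.
    """
    p_repurchase = sigmoid(float(w_router @ x))
    expert = expert_repurchase if p_repurchase > tau else expert_single
    return expert(x), p_repurchase

rng = np.random.default_rng(0)
x = rng.normal(size=16)
gmv, p = reader_predict(x, rng.normal(size=16),
                        lambda v: float(v.sum()),        # single-purchase head
                        lambda v: float(2.0 * v.sum()))  # repurchase head
```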

Result: READER achieves 2.19% improvement in accuracy over baselines on TRACE benchmark. Analysis reveals two key insights: GMV label distribution evolves rapidly requiring online streaming training, and repurchase samples have substantially different label distributions than single-purchase samples.

Conclusion: The study opens new avenue for online delayed feedback modeling in GMV prediction. TRACE benchmark and READER model facilitate future research in this direction, addressing unique challenges of continuous targets and multiple purchases from single clicks.

Abstract: The prediction objectives of online advertisement ranking models are evolving from probabilistic metrics like conversion rate (CVR) to numerical business metrics like post-click gross merchandise volume (GMV). Unlike the well-studied delayed feedback problem in CVR prediction, delayed feedback modeling for GMV prediction remains unexplored and poses greater challenges, as GMV is a continuous target, and a single click can lead to multiple purchases that cumulatively form the label. To bridge the research gap, we establish TRACE, a GMV prediction benchmark containing complete transaction sequences arising from each user click, which supports delayed feedback modeling in an online streaming manner. Our analysis and exploratory experiments on TRACE reveal two key insights: (1) the rapid evolution of the GMV label distribution necessitates modeling delayed feedback under online streaming training; (2) the label distribution of repurchase samples substantially differs from that of single-purchase samples, highlighting the need for separate modeling. Motivated by these findings, we propose RepurchasE-Aware Dual-branch prEdictoR (READER), a novel GMV modeling paradigm that selectively activates expert parameters according to repurchase predictions produced by a router. Moreover, READER dynamically calibrates the regression target to mitigate under-estimation caused by incomplete labels. Experimental results show that READER yields superior performance on TRACE over baselines, achieving a 2.19% improvement in terms of accuracy. We believe that our study will open up a new avenue for studying online delayed feedback modeling for GMV prediction, and our TRACE benchmark with the gathered insights will facilitate future research and application in this promising direction. Our code and dataset are available at https://github.com/alimama-tech/OnlineGMV.

[378] Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xvehai Zhou

Main category: cs.LG

TL;DR: Window-Diffusion: A window-based token pruning and caching method that achieves up to 99× inference speedup for diffusion language models by exploiting structural locality and reducing redundant computation.

Motivation: Diffusion language models suffer from substantial redundant computation during inference due to full-sequence attention at every iteration, even though analysis reveals that decoding is driven by a small set of prefix-localized active tokens with diminishing influence from distant context.

Method: Proposes a window-based token pruning and caching approach: maintains a sliding local computation window, partitions undecoded tokens into active tokens (computed online), buffer tokens (KV states cached and periodically refreshed), and far-field tokens (pruned outside window). Computation is restricted to active and buffer tokens within the window.
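
A minimal sketch of the token partition, with illustrative window sizes rather than the paper's tuned values; only active and buffer tokens inside the window are computed, and far-field tokens are pruned:

```python
def partition_tokens(seq_len, frontier, n_active=32, n_buffer=96):
    """Split undecoded positions into active / buffer / far-field sets.

    frontier is the first undecoded position; the window sizes here are
    illustrative, not the paper's tuned values.
    """
    a_end = min(frontier + n_active, seq_len)
    b_end = min(a_end + n_buffer, seq_len)
    active = list(range(frontier, a_end))      # computed online every step
    buffer_ = list(range(a_end, b_end))        # KV cached, periodically refreshed
    far_field = list(range(b_end, seq_len))    # pruned outside the window
    return active, buffer_, far_field

active, buf, far = partition_tokens(seq_len=1024, frontier=100)
```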

Result: Experiments on LLaDA and Dream show up to 99× inference speedup while largely preserving generation performance under matched compute budgets.

Conclusion: The method effectively exploits structural locality in DLM inference to dramatically reduce computational cost without significant performance degradation, making it directly applicable to pretrained models without retraining.

Abstract: Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion (source code: https://github.com/vhicrgit/Window-Diffusion), a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to $99\times$ inference speedup while largely preserving generation performance.

[379] TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee

Main category: cs.LG

TL;DR: TABED is a training-free, plug-and-play method that dynamically ensembles multiple draft tokens via batch inference to accelerate Large Vision-Language Model inference, achieving 1.74x speedup over autoregressive decoding.

Motivation: Speculative decoding has been effective for LLM acceleration but remains unexplored for Large Vision-Language Models (LVLMs). Existing methods show scenario-specific performance fluctuations, requiring a more robust approach.

Method: Test-time Adaptive Batched Ensemble Drafting (TABED) dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in speculative decoding settings.
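
A toy sketch of the adaptive idea, assuming hypothetical per-drafter acceptance statistics that play the role of TABED's deviations from past ground truths; drafts come from one batched forward pass and the best-tracked drafter is selected at test time:

```python
import numpy as np

def select_draft(drafts, accept_rates):
    # Use the draft from the drafter with the best recent acceptance record.
    return drafts[int(np.argmax(accept_rates))]

def update_rates(rates, accepted, proposed, momentum=0.9):
    """Exponential moving average of per-drafter acceptance.

    accepted/proposed counts come from the verifier at each step and stand
    in for TABED's deviation-from-past-ground-truth statistics.
    """
    step = accepted / np.maximum(proposed, 1)
    return momentum * rates + (1.0 - momentum) * step

rates = np.full(3, 0.5)                                  # three drafters
rates = update_rates(rates, np.array([4, 2, 5]), np.array([6, 6, 6]))
best = select_draft(["draft_a", "draft_b", "draft_c"], rates)
```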

Result: Achieves average robust walltime speedup of 1.74x over autoregressive decoding and 5% improvement over single drafting methods, with negligible ensembling costs through parameter sharing.

Conclusion: TABED provides an effective, training-free solution for accelerating LVLM inference with plug-and-play compatibility that can be enhanced with advanced verification and alternative drafting methods.

Abstract: Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.

[380] TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs

Chen-Yang Dai, Che-Chia Chang, Te-Sheng Lin, Ming-Chih Lai, Chieh-Hsin Lai

Main category: cs.LG

TL;DR: TINNs improve PINNs by making network weights time-dependent, enabling evolving spatial representations for better accuracy and faster convergence on time-dependent PDEs.

Motivation: Standard PINNs use shared weights across all times, forcing the same features to represent different dynamics, which degrades accuracy and can destabilize training when enforcing PDE, boundary, and initial constraints jointly.

Method: Propose Time-Induced Neural Networks (TINNs) that parameterize network weights as a learned function of time, allowing spatial representation to evolve over time while maintaining shared structure. Formulated as nonlinear least-squares problem optimized with Levenberg-Marquardt method.
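
A minimal sketch of the time-induced parameterization, with an affine-in-t weight generator standing in for the paper's learned function of time (names and sizes are illustrative):

```python
import numpy as np

def tinn_forward(x, t, theta):
    """Evaluate u(x, t) with network weights that are a function of time.

    Here W(t) and b(t) are affine in t purely for illustration; the paper
    learns the time-to-weights map rather than fixing its form like this.
    """
    W = theta["W0"] + t * theta["W1"]   # time-induced weights W(t)
    b = theta["b0"] + t * theta["b1"]
    h = np.tanh(W @ x + b)              # spatial features evolve with t
    return float(theta["v"] @ h)

rng = np.random.default_rng(0)
theta = {name: rng.normal(size=shape) for name, shape in
         [("W0", (16, 2)), ("W1", (16, 2)), ("b0", 16), ("b1", 16), ("v", 16)]}
u = tinn_forward(np.array([0.3, 0.7]), t=0.5, theta=theta)
```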

Result: Experiments on various time-dependent PDEs show up to 4× improved accuracy and 10× faster convergence compared to PINNs and strong baselines.

Conclusion: TINNs provide a more effective architecture for solving time-dependent PDEs by enabling time-evolving representations while maintaining computational efficiency through shared structure.

Abstract: Physics-informed neural networks (PINNs) solve time-dependent partial differential equations (PDEs) by learning a mesh-free, differentiable solution that can be evaluated anywhere in space and time. However, standard space–time PINNs take time as an input but reuse a single network with shared weights across all times, forcing the same features to represent markedly different dynamics. This coupling degrades accuracy and can destabilize training when enforcing PDE, boundary, and initial constraints jointly. We propose Time-Induced Neural Networks (TINNs), a novel architecture that parameterizes the network weights as a learned function of time, allowing the effective spatial representation to evolve over time while maintaining shared structure. The resulting formulation naturally yields a nonlinear least-squares problem, which we optimize efficiently using a Levenberg–Marquardt method. Experiments on various time-dependent PDEs show up to $4\times$ improved accuracy and $10\times$ faster convergence compared to PINNs and strong baselines.

[381] Can Continuous-Time Diffusion Models Generate and Solve Globally Constrained Discrete Problems? A Study on Sudoku

Mariia Drozdova

Main category: cs.LG

TL;DR: Continuous-time generative models can represent sparse, globally constrained discrete sets like Sudoku grids, with stochastic sampling outperforming deterministic methods and DDPM-style sampling achieving highest validity.

Motivation: To investigate whether standard continuous-time generative models can represent distributions with extremely sparse, globally constrained discrete support, using Sudoku grids as a controlled testbed.

Method: Train flow-matching and score-based models on Gaussian probability paths, comparing deterministic ODE sampling, stochastic SDE sampling, and DDPM-style discretizations from same continuous-time training. Also repurpose models for guided generation via repeated sampling under clamped clues.
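
A schematic of the guided-generation loop, assuming hypothetical `sample_completion` and `is_valid_sudoku` callables wrapping the trained sampler and the constraint checker; clamped clues plus stochastic resampling turn the generative model into a probabilistic solver:

```python
def solve_by_sampling(clues, sample_completion, is_valid_sudoku, max_tries=10_000):
    """Rejection-style probabilistic solving: resample until constraints hold.

    sample_completion(clues) and is_valid_sudoku(grid) are hypothetical
    stand-ins for the trained sampler (with clue cells clamped at every
    denoising step) and the Sudoku validity check.
    """
    for _ in range(max_tries):
        grid = sample_completion(clues)   # clues stay clamped during sampling
        if is_valid_sudoku(grid):
            return grid                   # constraint-satisfying sample found
    return None                           # sampling budget exhausted
```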

Result: Stochastic sampling substantially outperforms deterministic flows; score-based samplers are most reliable among continuous-time methods; DDPM-style ancestral sampling achieves highest overall validity. Models can act as probabilistic Sudoku solvers through guided generation.

Conclusion: Classic diffusion/flow formulations can assign non-zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search, though less sample-efficient than specialized methods.

Abstract: Can standard continuous-time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow-matching and score-based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM-style discretizations derived from the same continuous-time training. Unconditionally, stochastic sampling substantially outperforms deterministic flows; score-based samplers are the most reliable among continuous-time methods, and DDPM-style ancestral sampling achieves the highest validity overall. We further show that the same models can be repurposed for guided generation: by repeatedly sampling completions under clamped clues and stopping when constraints are satisfied, the model acts as a probabilistic Sudoku solver. Although far less sample-efficient than classical solvers and discrete-geometry-aware diffusion methods, these experiments demonstrate that classic diffusion/flow formulations can assign non-zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.

[382] Unsupervised Anomaly Detection in Multi-Agent Trajectory Prediction via Transformer-Based Models

Qing Lyu, Zhe Fu, Alexandre Bayen

Main category: cs.LG

TL;DR: Unsupervised anomaly detection framework using multi-agent Transformer to identify safety-critical driving scenarios, validated through dual evaluation of detection stability and physical alignment with safety metrics.

Motivation: Safety-critical scenarios are rare in autonomous driving, making supervised labeling impractical. Traditional rule-based metrics like Time-to-Collision are too simplistic, and existing methods lack systematic verification of whether statistical anomalies reflect actual physical danger.

Method: Proposes an unsupervised anomaly detection framework based on a multi-agent Transformer that models normal driving behavior and measures deviations through prediction residuals. Uses a dual evaluation scheme: (1) Stability measured using Kendall Rank Correlation Coefficient and Jaccard index, (2) Physical alignment assessed through correlations with established Surrogate Safety Measures (SSM).
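
A sketch of the residual scoring and the stability half of the dual evaluation, using scipy's Kendall tau for rank agreement and a plain Jaccard index for top-K consistency (the max aggregator and K are illustrative):

```python
import numpy as np
from scipy.stats import kendalltau

def anomaly_scores(residuals):
    # Max-aggregate prediction residuals per scene (the best-aligned choice).
    return residuals.max(axis=(1, 2))

def stability(scores_a, scores_b, k=50):
    tau, _ = kendalltau(scores_a, scores_b)                # rank agreement
    top_a = set(np.argsort(scores_a)[-k:])
    top_b = set(np.argsort(scores_b)[-k:])
    return tau, len(top_a & top_b) / len(top_a | top_b)    # top-K Jaccard

rng = np.random.default_rng(0)
res = rng.gamma(2.0, size=(500, 8, 30))        # [scenes, agents, horizon]
s1 = anomaly_scores(res)
s2 = anomaly_scores(res + rng.normal(scale=0.1, size=res.shape))
tau, jac = stability(s1, s2)
```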

Result: Maximum residual aggregator achieves highest physical alignment while maintaining stability. Framework identifies 388 unique anomalies missed by Time-to-Collision and statistical baselines, capturing subtle multi-agent risks like reactive braking under lateral drift. Detected anomalies are clustered into four interpretable risk types.

Conclusion: The proposed framework effectively identifies safety-critical driving scenarios through unsupervised anomaly detection, providing actionable insights for simulation and testing while addressing limitations of traditional rule-based and statistical methods.

Abstract: Identifying safety-critical scenarios is essential for autonomous driving, but the rarity of such events makes supervised labeling impractical. Traditional rule-based metrics like Time-to-Collision are too simplistic to capture complex interaction risks, and existing methods lack a systematic way to verify whether statistical anomalies truly reflect physical danger. To address this gap, we propose an unsupervised anomaly detection framework based on a multi-agent Transformer that models normal driving and measures deviations through prediction residuals. A dual evaluation scheme has been proposed to assess both detection stability and physical alignment: Stability is measured using standard ranking metrics in which Kendall Rank Correlation Coefficient captures rank agreement and Jaccard index captures the consistency of the top-K selected items; Physical alignment is assessed through correlations with established Surrogate Safety Measures (SSM). Experiments on the NGSIM dataset demonstrate our framework’s effectiveness: We show that the maximum residual aggregator achieves the highest physical alignment while maintaining stability. Furthermore, our framework identifies 388 unique anomalies missed by Time-to-Collision and statistical baselines, capturing subtle multi-agent risks like reactive braking under lateral drift. The detected anomalies are further clustered into four interpretable risk types, offering actionable insights for simulation and testing.

[383] LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei

Main category: cs.LG

TL;DR: LLM-AutoDP is a framework that uses LLMs as agents to automatically generate and optimize data processing strategies for domain-specific fine-tuning, addressing privacy concerns and reducing manual effort.

Motivation: Domain-specific fine-tuning requires effective data processing, but current manual approaches are labor-intensive and raise privacy concerns in sensitive domains like healthcare where direct human access to raw data is problematic.

Method: LLM-AutoDP uses LLMs as agents to automatically generate multiple candidate data processing strategies and iteratively refines them through feedback signals and comparative evaluations. Three acceleration techniques are introduced: Distribution Preserving Sampling, Processing Target Selection, and Cache-and-Reuse Mechanism.
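
A schematic of the propose-evaluate-refine loop, assuming hypothetical `llm_propose` and `evaluate` callables standing in for the agent's generation call and the downstream fine-tune-and-score step:

```python
def auto_dp(llm_propose, evaluate, n_candidates=4, n_rounds=5):
    """Iterative strategy search: propose, score, refine from feedback.

    llm_propose(feedback) and evaluate(strategy) are hypothetical stand-ins
    for the agent's generation call and the fine-tune-and-score step; the
    loop itself is the iterative in-context refinement pattern.
    """
    best, best_score, feedback = None, float("-inf"), None
    for _ in range(n_rounds):
        candidates = [llm_propose(feedback) for _ in range(n_candidates)]
        score, strategy = max(((evaluate(s), s) for s in candidates),
                              key=lambda pair: pair[0])
        if score > best_score:
            best, best_score = strategy, score
        feedback = {"best_strategy": best, "score": best_score}  # comparative signal
    return best
```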

Result: Models trained on data processed by LLM-AutoDP achieve over 80% win rates against models trained on unprocessed data, and approximately 65% win rate against AutoML baselines. Acceleration techniques reduce total searching time by up to 10 times.

Conclusion: LLM-AutoDP provides an effective and efficient automated solution for data processing in domain-specific LLM fine-tuning, addressing privacy concerns while reducing manual effort and computational costs.

Abstract: Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

[384] FedRD: Reducing Divergences for Generalized Federated Learning via Heterogeneity-aware Parameter Guidance

Kaile Wang, Jiannong Cao, Yu Yang, Xiaoyin Li, Mingjin Zhang

Main category: cs.LG

TL;DR: FedRD is a heterogeneity-aware federated learning algorithm that addresses optimization and performance divergences in federated domain generalization, enabling effective generalization to unseen clients through parameter-guided global aggregation and local debiased classification.

Motivation: The paper addresses the challenge of generalizing federated learning models to unseen clients under heterogeneous data distributions. New clients require significant adjustments and training to align with existing systems, highlighting two unsolved issues: Optimization Divergence and Performance Divergence in federated domain generalization.

Method: FedRD uses parameter-guided global generalization aggregation combined with local debiased classification to reduce divergences. The approach collaboratively utilizes these techniques to obtain an optimal global model that works well for both participating and unseen clients in heterogeneous federated learning settings.

Result: Extensive experiments on public multi-domain datasets demonstrate that FedRD exhibits substantial performance advantages over competing baselines in addressing the federated domain generalization problem with heterogeneous data.

Conclusion: FedRD effectively tackles the challenges of optimization and performance divergences in heterogeneous federated learning, providing a solution that generalizes well to unseen clients while maintaining privacy-preserving collaboration among different entities.

Abstract: Heterogeneous federated learning (HFL) aims to ensure effective and privacy-preserving collaboration among different entities. As newly joined clients require significant adjustments and additional training to align with the existing system, the problem of generalizing federated learning models to unseen clients under heterogeneous data has become progressively crucial. Consequently, we highlight two unsolved challenging issues in federated domain generalization: Optimization Divergence and Performance Divergence. To tackle the above challenges, we propose FedRD, a novel heterogeneity-aware federated learning algorithm that collaboratively utilizes parameter-guided global generalization aggregation and local debiased classification to reduce divergences, aiming to obtain an optimal global model for participating and unseen clients. Extensive experiments on public multi-domain datasets demonstrate that our approach exhibits a substantial performance advantage over competing baselines in addressing this specific problem.

[385] ScatterFusion: A Hierarchical Scattering Transform Framework for Enhanced Time Series Forecasting

Wei Li

Main category: cs.LG

TL;DR: ScatterFusion integrates scattering transforms with hierarchical attention for multi-scale time series forecasting, outperforming existing methods on benchmark datasets.

Motivation: Time series forecasting is challenging due to complex temporal dependencies at multiple scales, requiring methods that can capture both local and global patterns effectively.

Method: Four-component framework: 1) Hierarchical Scattering Transform Module for multi-scale feature extraction, 2) Scale-Adaptive Feature Enhancement for dynamic feature importance adjustment, 3) Multi-Resolution Temporal Attention for learning dependencies at different time horizons, and 4) TSR decomposition-guided structure-aware loss function.

Result: Extensive experiments on seven benchmark datasets show ScatterFusion outperforms other common methods with significant error reductions across various prediction horizons.

Conclusion: ScatterFusion provides an effective framework for robust time series forecasting by synergistically integrating scattering transforms with hierarchical attention mechanisms to handle complex multi-scale temporal dependencies.

Abstract: Time series forecasting presents significant challenges due to the complex temporal dependencies at multiple time scales. This paper introduces ScatterFusion, a novel framework that synergistically integrates scattering transforms with hierarchical attention mechanisms for robust time series forecasting. Our approach comprises four key components: (1) a Hierarchical Scattering Transform Module (HSTM) that extracts multi-scale invariant features capturing both local and global patterns; (2) a Scale-Adaptive Feature Enhancement (SAFE) module that dynamically adjusts feature importance across different scales; (3) a Multi-Resolution Temporal Attention (MRTA) mechanism that learns dependencies at varying time horizons; and (4) a Trend-Seasonal-Residual (TSR) decomposition-guided structure-aware loss function. Extensive experiments on seven benchmark datasets demonstrate that ScatterFusion outperforms other common methods, achieving significant reductions in error metrics across various prediction horizons.

[386] AWGformer: Adaptive Wavelet-Guided Transformer for Multi-Resolution Time Series Forecasting

Wei Li

Main category: cs.LG

TL;DR: AWGformer integrates adaptive wavelet decomposition with cross-scale attention for efficient multi-scale time series forecasting, achieving state-of-the-art performance.

Motivation: Time series forecasting needs to capture patterns across multiple temporal scales while maintaining computational efficiency, requiring better methods to handle multi-scale and non-stationary signals.

Method: Four key components: 1) Adaptive Wavelet Decomposition Module for dynamic wavelet basis selection, 2) Cross-Scale Feature Fusion for frequency band interactions, 3) Frequency-Aware Multi-Head Attention with frequency-selective weighting, 4) Hierarchical Prediction Network for multi-resolution forecasting.

Result: Significant average improvements over state-of-the-art methods on benchmark datasets, with particular effectiveness on multi-scale and non-stationary time series.

Conclusion: AWGformer successfully integrates wavelet decomposition with attention mechanisms for superior multi-scale time series forecasting, with theoretical convergence guarantees and connections to classical signal processing principles.

Abstract: Time series forecasting requires capturing patterns across multiple temporal scales while maintaining computational efficiency. This paper introduces AWGformer, a novel architecture that integrates adaptive wavelet decomposition with cross-scale attention mechanisms for enhanced multi-variate time series prediction. Our approach comprises: (1) an Adaptive Wavelet Decomposition Module (AWDM) that dynamically selects optimal wavelet bases and decomposition levels based on signal characteristics; (2) a Cross-Scale Feature Fusion (CSFF) mechanism that captures interactions between different frequency bands through learnable coupling matrices; (3) a Frequency-Aware Multi-Head Attention (FAMA) module that weights attention heads according to their frequency selectivity; (4) a Hierarchical Prediction Network (HPN) that generates forecasts at multiple resolutions before reconstruction. Extensive experiments on benchmark datasets demonstrate that AWGformer achieves significant average improvements over state-of-the-art methods, with particular effectiveness on multi-scale and non-stationary time series. Theoretical analysis provides convergence guarantees and establishes the connection between our wavelet-guided attention and classical signal processing principles.

[387] Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi

Main category: cs.LG

TL;DR: The paper introduces Concept Component Analysis (ConCA), a theory-backed framework for extracting interpretable concepts from LLM representations using linear unmixing of log-posteriors, addressing theoretical ambiguities in existing sparse autoencoder methods.

Motivation: Current sparse autoencoder (SAE) approaches for mechanistic interpretability lack theoretical grounding, creating ambiguity in how LLM representations correspond to human-interpretable concepts and causing methodological challenges in design and evaluation.

Method: Proposes Concept Component Analysis (ConCA) which treats concepts as latent variables and shows LLM representations can be approximated as linear mixtures of log-posteriors over concepts. Introduces sparse ConCA variant using sparsity prior to address ill-posedness of the linear unmixing problem.
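
A rough sketch of the unmixing step, using generic sparse dictionary learning from scikit-learn as a stand-in for the paper's 12 sparse ConCA variants; the sparse codes play the role of the recovered per-concept log-posteriors:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 64))   # stand-in for LLM residual-stream activations

# Sparse linear unmixing: H ≈ codes @ dictionary, with sparse codes read as
# (transformed) log-posteriors over concepts. Generic sparse coding here is
# an illustrative proxy, not one of the paper's actual ConCA variants.
unmixer = DictionaryLearning(n_components=32, alpha=1.0,
                             max_iter=100, random_state=0)
codes = unmixer.fit_transform(H)     # per-token concept activations
concept_dirs = unmixer.components_   # rows = candidate concept directions
```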

Result: Implemented 12 sparse ConCA variants that successfully extract meaningful concepts across multiple LLMs, demonstrating theory-backed advantages over traditional sparse autoencoders.

Conclusion: ConCA provides a principled theoretical framework for concept extraction from LLMs, addressing fundamental ambiguities in existing methods and enabling more reliable interpretability through unsupervised linear unmixing of log-posteriors.

Abstract: Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to mitigate these issues by extracting human-interpretable processes and concepts from LLMs’ activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.

[388] Nonlinear Dimensionality Reduction with Diffusion Maps in Practice

Sönke Beier, Paula Pirker-Díaz, Friedrich Pagenkopf, Karoline Wiesner

Main category: cs.LG

TL;DR: Diffusion Map is a spectral dimensionality reduction method for uncovering nonlinear manifolds in high-dimensional data, but its application is sensitive to preprocessing, parameter settings, and component selection, which hasn’t been comprehensively addressed.

Motivation: Diffusion Map is widely used across scientific disciplines but lacks comprehensive guidance on critical implementation aspects like data preprocessing, parameter settings, and component selection, which significantly affect the resulting manifold structure.

Method: The paper provides a practice-oriented review of Diffusion Map technique, illustrates common pitfalls, and showcases a recently introduced technique for identifying the most relevant components in the dimensionality reduction process.
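
For orientation, the core Diffusion Map computation is compact; a minimal numpy version follows, where the kernel bandwidth `eps` is exactly the kind of influential parameter the review discusses, and simply keeping the top eigenvectors (as below) is what the showcased component-selection technique improves upon:

```python
import numpy as np

def diffusion_map(X, eps=1.0, n_components=2):
    """Basic Diffusion Map embedding of the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    K = np.exp(-d2 / eps)                                # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)                 # Markov transition matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    idx = order[1:n_components + 1]        # drop the trivial constant eigenvector
    return vecs[:, idx].real * vals[idx].real            # diffusion coordinates

rng = np.random.default_rng(0)
emb = diffusion_map(rng.normal(size=(200, 3)), eps=2.0)
```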

Result: The results demonstrate that the first components (typically assumed to be most important) are not necessarily the most relevant ones for capturing meaningful manifold structure, highlighting the need for proper component selection methods.

Conclusion: Proper implementation of Diffusion Map requires careful attention to preprocessing, parameter tuning, and component selection, with the first components not always being the most informative, necessitating systematic approaches for optimal manifold discovery.

Abstract: Diffusion Map is a spectral dimensionality reduction technique that is able to uncover nonlinear submanifolds in high-dimensional data, and it is increasingly applied across a wide range of scientific disciplines, such as biology, engineering, and the social sciences. However, data preprocessing, parameter settings, and component selection have a significant influence on the resulting manifold, something that has not been comprehensively discussed in the literature so far. We provide a practice-oriented review of the Diffusion Map technique, illustrate pitfalls, and showcase a recently introduced technique for identifying the most relevant components. Our results show that the first components are not necessarily the most relevant ones.

[389] TimeCatcher: A Variational Framework for Volatility-Aware Forecasting of Non-Stationary Time Series

Zhiyu Chen, Minhao Liu, Yanru Zhang

Main category: cs.LG

TL;DR: TimeCatcher: A volatility-aware variational forecasting framework that improves long-term forecasting of highly non-stationary time series by capturing latent dynamic patterns and amplifying significant local variations.

Motivation: Current lightweight MLP-based models assume local stationarity, making them prone to errors when forecasting highly non-stationary series with abrupt fluctuations, especially in domains like web traffic monitoring.

Method: Extends linear architectures with a variational encoder to capture latent dynamic patterns from historical data and a volatility-aware enhancement mechanism to detect and amplify significant local variations.

Result: Outperforms state-of-the-art baselines on nine real-world datasets from traffic, financial, energy, and weather domains, with particularly large improvements in long-term forecasting scenarios with high volatility and sudden fluctuations.

Conclusion: TimeCatcher effectively addresses the limitations of existing MLP-based models for non-stationary time series forecasting, demonstrating superior performance especially in challenging long-term forecasting scenarios with high volatility.

Abstract: Recent lightweight MLP-based models have achieved strong performance in time series forecasting by capturing stable trends and seasonal patterns. However, their effectiveness hinges on an implicit assumption of local stationarity, making them prone to errors in long-term forecasting of highly non-stationary series, especially when abrupt fluctuations occur, a common challenge in domains like web traffic monitoring. To overcome this limitation, we propose TimeCatcher, a novel Volatility-Aware Variational Forecasting framework. TimeCatcher extends linear architectures with a variational encoder to capture latent dynamic patterns hidden in historical data and a volatility-aware enhancement mechanism to detect and amplify significant local variations. Experiments on nine real-world datasets from traffic, financial, energy, and weather domains show that TimeCatcher consistently outperforms state-of-the-art baselines, with particularly large improvements in long-term forecasting scenarios characterized by high volatility and sudden fluctuations. Our code is available at https://github.com/ColaPrinceCHEN/TimeCatcher.

[390] Fair Recourse for All: Ensuring Individual and Group Fairness in Counterfactual Explanations

Fatima Ezzeddine, Obaida Ammar, Silvia Giordano, Omran Ayoub

Main category: cs.LG

TL;DR: A novel reinforcement learning approach for generating fair counterfactual explanations that ensures both individual and group fairness while maintaining explanation quality.

Motivation: Counterfactual explanations are crucial for transparent ML but need fairness guarantees - similar individuals should get similar recourse, and different protected groups should get equitable recourse options.

Method: Proposes a model-agnostic reinforcement learning approach to generate CFs satisfying fairness constraints, formulating the problem as an optimization task with three fairness levels: individual, group, and hybrid fairness.

Result: The approach effectively ensures individual and group fairness while preserving CF quality (proximity and plausibility) on three benchmark datasets, with separate quantification of fairness costs.

Conclusion: Opens broader discussion on hybrid fairness in XAI, demonstrating that individual and group fairness (often treated as orthogonal) can be jointly addressed in counterfactual explanation generation.

Abstract: Explainable Artificial Intelligence (XAI) is becoming increasingly essential for enhancing the transparency of machine learning (ML) models. Among the various XAI techniques, counterfactual explanations (CFs) hold a pivotal role due to their ability to illustrate how changes in input features can alter an ML model’s decision, thereby offering actionable recourse to users. Ensuring that individuals with comparable attributes and those belonging to different protected groups (e.g., demographic) receive similar and actionable recourse options is essential for trustworthy and fair decision-making. In this work, we address this challenge directly by focusing on the generation of fair CFs. Specifically, we start by defining and formulating fairness at three levels: 1) individual fairness, ensuring that similar individuals receive similar CFs; 2) group fairness, ensuring equitable CFs across different protected groups; and 3) hybrid fairness, which accounts for both individual and broader group-level fairness. We formulate the problem as an optimization task and propose a novel model-agnostic, reinforcement learning-based approach to generate CFs that satisfy fairness constraints at both the individual and group levels, two objectives that are usually treated as orthogonal. As fairness metrics, we extend existing metrics commonly used for auditing ML models, such as equal choice of recourse and equal effectiveness across individuals and groups. We evaluate our approach on three benchmark datasets, showing that it effectively ensures individual and group fairness while preserving the quality of the generated CFs in terms of proximity and plausibility, and quantify the cost of fairness in the different levels separately. Our work opens a broader discussion on hybrid fairness and its role and implications for XAI and beyond CFs.

[391] Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations

Kadircan Aksoy, Peter Jung, Protim Bhattacharjee

Main category: cs.LG

TL;DR: Neural classifiers improve generalization by increasingly aligning with optimal binary hypothesis testing rules during training, with monotonic KL divergence improvements relating to error rate exponents.

Motivation: To understand supervised training dynamics of neural classifiers through the lens of binary hypothesis testing, examining how networks develop optimal decision rules during training.

Method: Model classification as binary tests between class-conditional distributions of representations, empirically analyze training trajectories to study alignment with Neyman-Pearson optimal decision rules.
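
The bridge from KL divergence to error-rate exponents invoked here is the classical Chernoff-Stein lemma, restated as background rather than as a result of the paper:

```latex
% Chernoff-Stein lemma: for a binary test between class-conditional
% distributions P_0 and P_1 with type-I error held below a fixed alpha,
% the optimal type-II error beta_n over n samples satisfies
\lim_{n \to \infty} -\frac{1}{n} \log \beta_n = D_{\mathrm{KL}}(P_0 \,\|\, P_1),
% so a monotone increase in the class-conditional KL divergence during
% training corresponds directly to an improving error-rate exponent.
```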

Result: Well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules during training via monotonic improvements in KL divergence that relate to error rate exponents.

Conclusion: This perspective yields explanations and suggests possible training or regularization strategies for different classes of neural networks based on hypothesis testing principles.

Abstract: We study the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. We model classification as a set of binary tests between class-conditional distributions of representations and empirically show that, along training trajectories, well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules via monotonic improvements in KL divergence that relate to error rate exponents. We finally discuss how this yields an explanation and possible training or regularization strategies for different classes of neural networks.

[392] CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes

Jiawen Chen, Qi Shao, Mingtong Zhou, Duxin Chen, Wenwu Yu

Main category: cs.LG

TL;DR: CCMamba is a new topological deep learning framework using state-space models for efficient learning on combinatorial complexes, achieving linear-time processing and outperforming existing methods.

Motivation: Existing topological deep learning methods on combinatorial complexes rely on attention mechanisms with quadratic complexity, limiting scalability and rank-aware information aggregation in higher-order structures.

Method: Reformulates message passing as selective state-space modeling by organizing multi-rank incidence relations into structured sequences processed by rank-aware state-space models, enabling adaptive, directional, and long-range information propagation in linear time without self-attention.

Result: Theoretical analysis shows expressive power upper-bound is the 1-Weisfeiler-Lehman test. Experiments on graph, hypergraph, and simplicial benchmarks demonstrate consistent outperformance of existing methods with improved scalability and robustness to depth.

Conclusion: CCMamba provides the first unified Mamba-based neural framework for combinatorial complexes, offering efficient linear-time processing while maintaining strong performance across various topological structures.

Abstract: Topological deep learning has emerged for modeling higher-order relational structures beyond pairwise interactions that standard graph neural networks fail to capture. Although combinatorial complexes offer a unified topological framework, most existing topological deep learning methods rely on local message passing via attention mechanisms, which incur quadratic complexity and remain low-dimensional, limiting scalability and rank-aware information aggregation in higher-order complexes. We propose Combinatorial Complex Mamba (CCMamba), the first unified Mamba-based neural framework for learning on combinatorial complexes. CCMamba reformulates message passing as a selective state-space modeling problem by organizing multi-rank incidence relations into structured sequences processed by rank-aware state-space models. This enables adaptive, directional, and long-range information propagation in linear time without self-attention. We further establish theoretically that the expressive power of CCMamba message passing is upper-bounded by the 1-Weisfeiler-Lehman test. Experiments on graph, hypergraph, and simplicial benchmarks demonstrate that CCMamba consistently outperforms existing methods while exhibiting improved scalability and robustness to depth.

[393] An explainable framework for the relationship between dementia and glucose metabolism patterns

C. Vázquez-García, F. J. Martínez-Murcia, F. Segovia Román, A. Forte, J. Ramírez, I. Illán, A. Hernández-Segura, C. Jiménez-Mesa, Juan M. Górriz

Main category: cs.LG

TL;DR: Semi-supervised VAE framework with similarity regularization aligns latent variables with clinical biomarkers to extract disease-related patterns from neuroimaging data, demonstrated on ADNI PET scans.

DetailsMotivation: High-dimensional neuroimaging data has complex non-linear relationships that challenge neurodegenerative disease assessment. Need for methods that can encode scans into interpretable latent spaces aligned with disease progression biomarkers.

Method: Proposed semi-supervised VAE framework with flexible similarity regularization term that aligns selected latent variables with clinical/biomarker measures of dementia progression. Allows adaptation of similarity metric and supervised variables to specific goals.

Result: Demonstrated on ADNI PET scans by aligning first latent dimension with cognitive score. Generated average reconstructions across cognitive impairment levels. Voxel-wise GLM analysis revealed reduced metabolism in hippocampus and within Default Mode and Central Executive Networks. Remaining latent variables captured confounds like inter-subject variability and site effects.

Conclusion: Framework effectively extracts disease-related patterns aligned with established Alzheimer’s biomarkers, offering an interpretable and adaptable tool for studying neurodegenerative progression.

Abstract: High-dimensional neuroimaging data presents challenges for assessing neurodegenerative diseases due to complex non-linear relationships. Variational Autoencoders (VAEs) can encode scans into lower-dimensional latent spaces capturing disease-relevant features. We propose a semi-supervised VAE framework with a flexible similarity regularization term that aligns selected latent variables with clinical or biomarker measures of dementia progression. This allows adapting the similarity metric and supervised variables to specific goals or available data. We demonstrate the approach using PET scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), guiding the first latent dimension to align with a cognitive score. Using this supervised latent variable, we generate average reconstructions across levels of cognitive impairment. Voxel-wise GLM analysis reveals reduced metabolism in key regions, mainly the hippocampus, and within major Resting State Networks, particularly the Default Mode and Central Executive Networks. The remaining latent variables encode affine transformations and intensity variations, capturing confounds such as inter-subject variability and site effects. Our framework effectively extracts disease-related patterns aligned with established Alzheimer’s biomarkers, offering an interpretable and adaptable tool for studying neurodegenerative progression.
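
The similarity-regularization idea fits in a few lines. Below is a minimal PyTorch sketch (not the authors' code) in which the first latent dimension is tied to a standardized cognitive score via a squared-error similarity metric; `recon`, `mu`, `logvar`, and `score` are hypothetical names for the usual VAE quantities and the supervision signal, and the metric itself is one of several choices the framework leaves configurable.

```python
import torch
import torch.nn.functional as F

def semi_supervised_vae_loss(recon, x, mu, logvar, score, lam=1.0):
    """ELBO plus a similarity penalty tying latent dim 0 to a clinical score."""
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    # Standard Gaussian KL term of the VAE.
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Align the first latent dimension with the (standardized) score.
    sim = F.mse_loss(mu[:, 0], score)
    return recon_loss + kld + lam * sim
```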

[394] Unsupervised Ensemble Learning Through Deep Energy-based Models

Ariel Maymon, Yanir Buznah, Uri Shaham

Main category: cs.LG

TL;DR: A novel deep energy-based method for unsupervised ensemble learning that combines multiple learners’ predictions without ground truth labels, learner features, or problem-specific information.

DetailsMotivation: Address the challenge of combining multiple learners' predictions when ground truth labels are unavailable and evaluating individual classifier performance is difficult due to limited information.

Method: Deep energy-based method for constructing an accurate meta-learner using only the predictions of individual learners, capable of capturing complex dependence structures between them.

Result: Superior performance across diverse ensemble scenarios including challenging mixture of experts settings, demonstrated on standard ensemble datasets and curated datasets designed to test expertise fusion.

Conclusion: The approach shows potential for unsupervised ensemble learning to harness collective intelligence in data-scarce or privacy-sensitive environments, with theoretical guarantees for conditionally independent learners.

Abstract: Unsupervised ensemble learning emerged to address the challenge of combining multiple learners’ predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifiers’ performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy-based method for constructing an accurate meta-learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem-specific information, and comes with theoretical guarantees when the learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data-scarce or privacy-sensitive environments.

[395] Robust Distributed Learning under Resource Constraints: Decentralized Quantile Estimation via (Asynchronous) ADMM

Anna van Elst, Igor Colin, Stephan Clémençon

Main category: cs.LG

TL;DR: AsylADMM: A lightweight gossip algorithm for decentralized median/quantile estimation requiring only 2 variables per node, enabling robust trimming and geometric median estimation.

DetailsMotivation: Need communication-efficient, robust, and memory-light algorithms for decentralized learning on resource-constrained edge devices. Existing gossip methods lack robustness, while ADMM-based median estimation methods require memory scaling with node degree.

Method: Propose AsylADMM - a novel gossip algorithm for decentralized median and quantile estimation designed for asynchronous updates. Uses only two variables per node regardless of network degree. Analyze synchronous variant for theoretical guarantees.

Result: Empirically demonstrates fast convergence for asynchronous algorithm. Enables quantile-based trimming, geometric median estimation, and depth-based trimming. Quantile-based trimming outperforms existing rank-based methods. Provides novel theoretical analysis of rank-based trimming via Markov chain theory.

Conclusion: AsylADMM addresses key requirements for edge device learning: communication efficiency, robustness to data corruption, and lightweight memory usage (only 2 variables per node), making it practical for resource-constrained environments.

Abstract: Decentralized learning on resource-constrained edge devices requires algorithms that are communication-efficient, robust to data corruption, and lightweight in memory usage. While state-of-the-art gossip-based methods satisfy the first requirement, achieving robustness remains challenging. Asynchronous decentralized ADMM-based methods have been explored for estimating the median, a statistical centrality measure that is notoriously more robust than the mean. However, existing approaches require memory that scales with node degree, making them impractical when memory is limited. In this paper, we propose AsylADMM, a novel gossip algorithm for decentralized median and quantile estimation, primarily designed for asynchronous updates and requiring only two variables per node. We analyze a synchronous variant of AsylADMM to establish theoretical guarantees and empirically demonstrate fast convergence for the asynchronous algorithm. We then show that our algorithm enables quantile-based trimming, geometric median estimation, and depth-based trimming, with quantile-based trimming empirically outperforming existing rank-based methods. Finally, we provide a novel theoretical analysis of rank-based trimming via Markov chain theory.

[396] Reinforcement Unlearning via Group Relative Policy Optimization

Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.LG

TL;DR: PURGE is a novel LLM unlearning method that uses intrinsic rewards to penalize forbidden concepts, achieving efficient unlearning while maintaining model utility and robustness.

DetailsMotivation: LLMs memorize sensitive/copyrighted data during pretraining, creating compliance issues with GDPR and EU AI Act. Existing unlearning methods leak data, sacrifice fluency/robustness, or require costly external reward models.

Method: PURGE (Policy Unlearning through Relative Group Erasure) uses Group Relative Policy Optimization framework with intrinsic reward signals that penalize mentions of forbidden concepts, formulating unlearning as a verifiable problem.

Result: Reduces token usage per target by 46x vs SotA, improves fluency by 5.48%, adversarial robustness by 12.02%, achieves 11% unlearning effectiveness while preserving 98% of original utility on RWKU benchmark.

Conclusion: Framing LLM unlearning as a verifiable task enables reliable, efficient, and scalable forgetting, offering a promising direction combining theoretical guarantees, improved safety, and practical deployment efficiency.

Abstract: During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach reduces token usage per target by up to a factor of 46 compared with SotA methods, while improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the base model. On the Real World Knowledge Unlearning (RWKU) benchmark, PURGE achieves 11 percent unlearning effectiveness while preserving 98 percent of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
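
Because the reward is intrinsic and verifiable, no reward model is needed: a rule penalizes mentions of forbidden concepts, and GRPO normalizes rewards within each group of sampled completions. A minimal sketch of that idea follows; the exact reward shaping in PURGE may differ.

```python
import re
import statistics

def intrinsic_unlearning_reward(response, forbidden):
    """Verifiable reward: 1.0 when no forbidden concept appears, lower per hit."""
    hits = sum(bool(re.search(rf"\b{re.escape(t)}\b", response, re.IGNORECASE))
               for t in forbidden)
    return 1.0 - hits / max(len(forbidden), 1)

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of completions."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]
```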

[397] Ranking-aware Reinforcement Learning for Ordinal Ranking

Aiming Hao, Chen Zhu, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu

Main category: cs.LG

TL;DR: RARL is a reinforcement learning framework for ordinal regression/ranking that combines regression and Learning-to-Rank with a unified objective and ranking-aware reward, enhanced by Response Mutation Operations for better exploration.

DetailsMotivation: Conventional methods struggle to model inherent ordinal dependencies in ordinal regression and ranking tasks, creating a need for approaches that can explicitly learn these relationships.

Method: RARL uses a unified objective integrating regression and Learning-to-Rank with a ranking-aware verifiable reward for policy optimization. Response Mutation Operations inject controlled noise to improve exploration and prevent stagnation.

Result: The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks, demonstrating its capability to handle ordinal regression and ranking tasks.

Conclusion: RARL provides a novel RL framework that successfully addresses ordinal dependencies in regression and ranking through synergistic integration of tasks and enhanced exploration mechanisms.

Abstract: Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.
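
As a rough illustration of a ranking-aware verifiable reward, the sketch below combines a regression-precision term with a rank-correlation term, plus an RMO-style mutation that injects controlled noise. The weighting and the Gaussian noise model are assumptions of this sketch, not the paper's exact choices.

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_aware_reward(preds, targets, w=0.5):
    """Joint reward: regression precision (negative MAE) plus ranking
    accuracy (Spearman rank correlation)."""
    preds, targets = np.asarray(preds), np.asarray(targets)
    reg = -np.abs(preds - targets).mean()
    rank = spearmanr(preds, targets).correlation
    return w * reg + (1.0 - w) * rank

def response_mutation(preds, sigma=0.1, rng=np.random.default_rng(0)):
    """RMO-style exploration: perturb responses with controlled noise."""
    return np.asarray(preds) + rng.normal(0.0, sigma, size=len(preds))
```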

[398] Regularized Gradient Temporal-Difference Learning

Hyunjun Na, Donghwan Lee

Main category: cs.LG

TL;DR: Proposes R-GTD, a regularized GTD algorithm that guarantees convergence even when the feature interaction matrix is singular, addressing instability issues in existing off-policy policy evaluation methods.

DetailsMotivation: Existing GTD algorithms for off-policy policy evaluation rely on the restrictive assumption that the feature interaction matrix (FIM) is nonsingular. In practice, FIM can become singular, leading to instability or degraded performance, creating a need for more robust methods.

Method: Proposes a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields regularized GTD algorithms (R-GTD) that guarantee convergence even with singular FIM.

Result: Establishes theoretical convergence guarantees and explicit error bounds for R-GTD. Empirical experiments validate its effectiveness, showing it converges to a unique solution even when FIM is singular.

Conclusion: R-GTD provides a robust solution to the singularity problem in GTD learning, ensuring stable convergence for off-policy policy evaluation with function approximation, overcoming limitations of existing methods.

Abstract: Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular, leading to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
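
One plausible reading of the approach is a GTD2-style update with a ridge term, which stays well-posed even when the feature interaction matrix E[φφ^T] is singular. The sketch below is illustrative only; the paper's exact regularized objective may differ.

```python
import numpy as np

def r_gtd2_step(theta, w, phi, phi_next, r, gamma, alpha, beta, eta):
    """One ridge-regularized GTD2-style update (a sketch, not R-GTD itself).

    eta > 0 keeps both updates well-posed even when E[phi phi^T] is singular.
    """
    delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
    w = w + beta * ((delta - phi @ w) * phi - eta * w)   # auxiliary tracker
    theta = theta + alpha * ((phi - gamma * phi_next) * (phi @ w) - eta * theta)
    return theta, w
```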

[399] CoBA: Integrated Deep Learning Model for Reliable Low-Altitude UAV Classification in mmWave Radio Networks

Junaid Sajid, Ivo Müürsepp, Luca Reggiani, Davide Scazzoli, Federico Francesco Luigi Mariani, Maurizio Magarini, Rizwan Ahmad, Muhammad Mahtab Alam

Main category: cs.LG

TL;DR: CoBA: A deep learning model combining CNN, BiLSTM, and attention mechanisms that uses 5G mmWave radio measurements to classify UAV operations in authorized vs. restricted low-altitude airspaces, achieving superior accuracy over conventional methods.

DetailsMotivation: With increasing UAV use in civilian/industrial applications, secure low-altitude operations are crucial. Current challenges include accurately classifying UAVs in dense mmWave environments, requiring models that handle complex propagation and signal variability for airspace regulation.

Method: Proposes CoBA model integrating CNN (for spatial patterns), BiLSTM (for temporal patterns), and attention mechanisms. Uses 5G mmWave radio measurements from controlled UAV flights. Validated with dataset collected from TalTech’s 5G mmWave network with authorized/restricted flight scenarios.

Result: CoBA achieves superior accuracy in classifying UAV operations, significantly outperforming conventional ML models and fingerprinting-based benchmarks. Demonstrates potential for reliable and regulated UAV airspace monitoring.

Conclusion: The integrated CNN-BiLSTM-Attention approach effectively captures spatial-temporal patterns in UAV radio measurements, providing a robust solution for low-altitude UAV airspace classification and regulation using 5G mmWave networks.

Abstract: Uncrewed Aerial Vehicles (UAVs) are increasingly used in civilian and industrial applications, making secure low-altitude operations crucial. In dense mmWave environments, accurately classifying low-altitude UAVs as either inside authorized or restricted airspaces remains challenging, requiring models that handle complex propagation and signal variability. This paper proposes a deep learning model, referred to as CoBA (integrated Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Attention), which leverages Fifth Generation (5G) millimeter-wave (mmWave) radio measurements to classify UAV operations in authorized and restricted airspaces at low altitude. The proposed CoBA model integrates convolutional, bidirectional recurrent, and attention layers to capture both spatial and temporal patterns in UAV radio measurements. To validate the model, a dedicated dataset is collected using the 5G mmWave network at TalTech, with controlled low-altitude UAV flights in authorized and restricted scenarios. The model is evaluated against conventional ML models and a fingerprinting-based benchmark. Experimental results show that CoBA achieves superior accuracy, significantly outperforming all baseline models and demonstrating its potential for reliable and regulated UAV airspace monitoring.
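
A structural sketch of the CoBA-style pipeline in PyTorch, showing how convolutional, bidirectional-recurrent, and attention layers compose over windows of radio measurements. Layer widths and the attention pooling are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CoBALike(nn.Module):
    """CNN -> BiLSTM -> attention classifier over measurement windows."""
    def __init__(self, n_features, n_classes=2, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # spatial patterns
        h, _ = self.lstm(h)                # temporal patterns: (batch, time, 2H)
        a = torch.softmax(self.attn(h), dim=1)
        ctx = (a * h).sum(dim=1)           # attention-weighted summary
        return self.head(ctx)
```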

[400] WFR-MFM: One-Step Inference for Dynamic Unbalanced Optimal Transport

Xinyu Wang, Ruoyu Wang, Qiangwei Peng, Peijie Zhou, Tiejun Li

Main category: cs.LG

TL;DR: WFR-MFM enables fast one-step generation for dynamic unbalanced optimal transport without trajectory simulation, achieving orders-of-magnitude faster inference while maintaining accuracy.

DetailsMotivation: Reconstructing dynamical evolution from limited observations is a fundamental challenge in single-cell biology, where existing approaches rely on trajectory simulation at inference time, making inference a bottleneck for scalable applications.

Method: Proposes a mean-flow framework for unbalanced flow matching that summarizes transport and mass-growth dynamics using mean velocity and mass-growth fields. Develops Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM) to solve dynamic unbalanced optimal transport under Wasserstein-Fisher-Rao geometry.

Result: WFR-MFM achieves orders-of-magnitude faster inference than existing baselines while maintaining high predictive accuracy across synthetic and real single-cell RNA sequencing datasets. Enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.

Conclusion: The proposed mean-flow framework provides a scalable solution for dynamic unbalanced optimal transport in single-cell biology, overcoming the inference bottleneck of trajectory-based methods while preserving accuracy.

Abstract: Reconstructing dynamical evolution from limited observations is a fundamental challenge in single-cell biology, where dynamic unbalanced optimal transport provides a principled framework for modeling coupled transport and mass variation. However, existing approaches rely on trajectory simulation at inference time, making inference a key bottleneck for scalable applications. In this work, we propose a mean-flow framework for unbalanced flow matching that summarizes both transport and mass-growth dynamics over arbitrary time intervals using mean velocity and mass-growth fields, enabling fast one-step generation without trajectory simulation. To solve dynamic unbalanced optimal transport under the Wasserstein-Fisher-Rao geometry, we further build on this framework to develop Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM). Across synthetic and real single-cell RNA sequencing datasets, WFR-MFM achieves orders-of-magnitude faster inference than a range of existing baselines while maintaining high predictive accuracy, and enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.

[401] ACFormer: Mitigating Non-linearity with Auto Convolutional Encoder for Time Series Forecasting

Gawon Lee, Hanbyeol Park, Minseop Kim, Dohee Kim, Hyerim Bae

Main category: cs.LG

TL;DR: ACFormer is a novel time series forecasting architecture that combines linear projections with convolutional feature extraction to better capture both global trends and non-linear signals.

DetailsMotivation: Current linear TSF models are efficient at capturing global trends but struggle with non-linear signals, while convolutional models have untapped potential for handling complex temporal dependencies and inter-channel correlations.

Method: Systematic receptive field analysis of CNN models, introducing “individual receptive field” concept, then proposing ACFormer architecture with shared compression module, gated attention for temporal locality, and independent patch expansion layer for variable-specific patterns.

Result: ACFormer achieves state-of-the-art performance on multiple benchmark datasets, effectively mitigating linear models’ drawbacks in capturing high-frequency components.

Conclusion: ACFormer successfully reconciles linear efficiency with convolutional non-linear feature extraction, demonstrating superior robustness to non-linear fluctuations while maintaining forecasting accuracy.

Abstract: Time series forecasting (TSF) faces challenges in modeling complex intra-channel temporal dependencies and inter-channel correlations. Although recent research has highlighted the efficiency of linear architectures in capturing global trends, these models often struggle with non-linear signals. To address this gap, we conducted a systematic receptive field analysis of convolutional neural network (CNN) TSF models. We introduce the “individual receptive field” to uncover granular structural dependencies, revealing that convolutional layers act as feature extractors that mirror channel-wise attention while exhibiting superior robustness to non-linear fluctuations. Based on these insights, we propose ACFormer, an architecture designed to reconcile the efficiency of linear projections with the non-linear feature-extraction power of convolutions. ACFormer captures fine-grained information through a shared compression module, preserves temporal locality via gated attention, and reconstructs variable-specific temporal patterns using an independent patch expansion layer. Extensive experiments on multiple benchmark datasets demonstrate that ACFormer consistently achieves state-of-the-art performance, effectively mitigating the inherent drawbacks of linear models in capturing high-frequency components.

[402] DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration

Gilles Eerlings, Brent Zoomers, Jori Liesenborgs, Gustavo Rovelo Ruiz, Kris Luyten

Main category: cs.LG

TL;DR: DIVERSE is a framework that efficiently explores the Rashomon set of neural networks (models with similar accuracy but different predictions) by adding FiLM layers to pretrained models and using CMA-ES optimization to generate diverse model variants without retraining.

DetailsMotivation: The Rashomon set contains multiple high-performing models with different predictive behaviors, but exploring this set through retraining is computationally expensive. There's a need for efficient methods to discover functionally distinct models without retraining.

Method: DIVERSE augments pretrained models with Feature-wise Linear Modulation (FiLM) layers and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, generating diverse model variants without requiring retraining or gradient access.

Result: Across MNIST, PneumoniaMNIST, and CIFAR-10 datasets, DIVERSE successfully uncovers multiple high-performing yet functionally distinct models. It achieves comparable diversity to retraining baselines at significantly reduced computational cost while maintaining robustness and performance.

Conclusion: DIVERSE provides a competitive and efficient approach for exploring the Rashomon set, enabling construction of diverse model sets that maintain performance while supporting well-balanced model multiplicity, making Rashomon set exploration more feasible in practice.

Abstract: We propose DIVERSE, a framework for systematically exploring the Rashomon set of deep neural networks, the collection of models that match a reference model’s accuracy while differing in their predictive behavior. DIVERSE augments a pretrained model with Feature-wise Linear Modulation (FiLM) layers and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, generating diverse model variants without retraining or gradient access. Across MNIST, PneumoniaMNIST, and CIFAR-10, DIVERSE uncovers multiple high-performing yet functionally distinct models. Our experiments show that DIVERSE offers a competitive and efficient exploration of the Rashomon set, making it feasible to construct diverse sets that maintain robustness and performance while supporting well-balanced model multiplicity. While retraining remains the baseline to generate Rashomon sets, DIVERSE achieves comparable diversity at reduced computational cost.
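
A minimal sketch of the two ingredients: FiLM layers inserted into a frozen pretrained model, and a CMA-ES loop (using the `cma` package) over the flattened modulation parameters. The `fitness` function, which should reward variants that keep reference accuracy while maximizing prediction disagreement, is left to the caller and is an assumption of this sketch.

```python
import numpy as np
import torch
import torch.nn as nn
import cma  # pip install cma

class FiLM(nn.Module):
    """Feature-wise linear modulation: h -> gamma * h + beta."""
    def __init__(self, n_channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_channels), requires_grad=False)
        self.beta = nn.Parameter(torch.zeros(n_channels), requires_grad=False)

    def forward(self, h):
        return self.gamma * h + self.beta

def load_modulation(film_layers, z):
    """Write a flat CMA-ES candidate vector into the FiLM parameters."""
    i = 0
    for f in film_layers:
        n = f.gamma.numel()
        f.gamma.copy_(1.0 + torch.as_tensor(z[i:i + n], dtype=torch.float32))
        f.beta.copy_(torch.as_tensor(z[i + n:i + 2 * n], dtype=torch.float32))
        i += 2 * n

def explore_rashomon(model, film_layers, fitness, sigma0=0.1, iters=50):
    """CMA-ES search over modulation space; `fitness` (lower is better) should
    reward matching reference accuracy while disagreeing on predictions."""
    dim = sum(2 * f.gamma.numel() for f in film_layers)
    es = cma.CMAEvolutionStrategy(np.zeros(dim), sigma0)
    for _ in range(iters):
        candidates = es.ask()
        losses = []
        for z in candidates:
            load_modulation(film_layers, np.asarray(z))
            losses.append(fitness(model))
        es.tell(candidates, losses)
    return es.result.xbest
```

Note that nothing here requires gradients or retraining: the base weights stay frozen, and only the low-dimensional modulation vector is searched.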

[403] Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability

Rohan Asthana, Vasileios Belagiannis

Main category: cs.LG

TL;DR: The paper proposes a new memorization detection metric for diffusion models that combines isotropic norm and anisotropic alignment, enabling faster detection without denoising steps.

DetailsMotivation: Current memorization detection methods for diffusion models rely on norm-based metrics that only work well under isotropic assumptions at high/medium noise levels, but fail in anisotropic low-noise regimes where memorization actually occurs.

Method: Developed a memorization detection metric that integrates both isotropic norm components and anisotropic alignment between guidance vectors and unconditional scores. The metric can be computed directly on pure noise inputs via two forward passes (conditional and unconditional), eliminating costly denoising steps.

Result: The proposed metric outperforms existing denoising-free detection methods on Stable Diffusion v1.4 and v2, while being at least 5x faster than previous best approaches. Also demonstrated effectiveness through a mitigation strategy that adapts memorized prompts based on the detection metric.

Conclusion: The paper presents an efficient and effective memorization detection approach for diffusion models that works across different noise regimes, enabling practical detection without expensive denoising while maintaining high accuracy.

Abstract: Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods are primarily based on the norm of score difference as indicators of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two forward passes (one conditional, one unconditional), eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric.
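
Schematically, the metric needs only the guidance vector from two forward passes on pure noise. The sketch below combines the guidance norm (isotropic cue) with its alignment to the unconditional score (anisotropic cue); the model call signature and the mixing weight `w` are assumptions, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def memorization_score(model, noise, prompt_emb, null_emb, w=0.5):
    """Denoising-free detection on pure noise: two forward passes only."""
    eps_cond = model(noise, prompt_emb)   # conditional noise prediction
    eps_null = model(noise, null_emb)     # unconditional noise prediction
    guidance = eps_cond - eps_null
    norm_term = guidance.flatten(1).norm(dim=1)            # isotropic cue
    align_term = F.cosine_similarity(                      # anisotropic cue
        guidance.flatten(1), eps_null.flatten(1), dim=1
    )
    return w * norm_term + (1 - w) * align_term.abs()
```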

[404] A Foundation Model for Virtual Sensors

Leon Götz, Lars Frederik Peiss, Erik Sauer, Andreas Udo Sass, Thorsten Bagdonat, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: First foundation model for virtual sensors that simultaneously predicts diverse sensors efficiently, learns relevant inputs automatically, and scales to hundreds of sensors with constant parameters.

DetailsMotivation: Existing virtual sensor approaches require application-specific models with hand-selected inputs, cannot leverage task synergies, lack consistent benchmarks, and emerging time series foundation models are computationally expensive and limited to predicting input signals (incompatible with virtual sensors).

Method: Introduces a unified foundation model architecture that can simultaneously predict diverse virtual sensors while maintaining computational efficiency. The model learns relevant input signals for each virtual sensor automatically, eliminating expert knowledge requirements while adding explainability.

Result: Achieves 415x reduction in computation time and 951x reduction in memory requirements compared to baselines, while maintaining or even improving predictive quality. Scales gracefully to hundreds of virtual sensors with nearly constant parameter count.

Conclusion: The proposed foundation model addresses limitations of existing virtual sensor approaches and time series foundation models, enabling practical deployment in large-scale sensor networks through computational efficiency, automatic input selection, and scalability.

Abstract: Virtual sensors use machine learning to predict target signals from available measurements, replacing expensive physical sensors in critical applications. Existing virtual sensor approaches require application-specific models with hand-selected inputs for each sensor, cannot leverage task synergies, and lack consistent benchmarks. At the same time, emerging time series foundation models are computationally expensive and limited to predicting their input signals, making them incompatible with virtual sensors. We introduce the first foundation model for virtual sensors addressing both limitations. Our unified model can simultaneously predict diverse virtual sensors exploiting synergies while maintaining computational efficiency. It learns relevant input signals for each virtual sensor, eliminating expert knowledge requirements while adding explainability. In our large-scale evaluation on a standard benchmark and an application-specific dataset with over 18 billion samples, our architecture achieves 415x reduction in computation time and 951x reduction in memory requirements, while maintaining or even improving predictive quality compared to baselines. Our model scales gracefully to hundreds of virtual sensors with nearly constant parameter count, enabling practical deployment in large-scale sensor networks.

[405] Learning Contextual Runtime Monitors for Safe AI-Based Autonomy

Alejandro Luque-Cerpa, Mengyuan Wang, Emil Carlsson, Sanjit A. Seshia, Devdatt Dubhashi, Hazem Torfah

Main category: cs.LG

TL;DR: A framework for context-aware runtime monitors that selects the best ML controller for current conditions, improving safety and performance over traditional ensemble methods.

DetailsMotivation: ML controllers in autonomous systems degrade in unfamiliar environments, creating safety concerns. Traditional ensemble methods dilute individual controller strengths by averaging outputs rather than leveraging contextual expertise.

Method: Reformulates safe AI control ensembles as contextual monitoring problem. Uses contextual multi-armed bandits for monitor learning, where monitor observes system context and selects best-suited controller for current conditions.

Result: Provides theoretical safety guarantees during controller selection and improved utilization of controller diversity. Validated in simulated autonomous driving scenarios, showing significant safety and performance improvements over non-contextual baselines.

Conclusion: Context-aware monitoring framework effectively exploits controller diversity by selecting context-appropriate controllers, offering both theoretical safety guarantees and practical performance benefits for AI-based control ensembles.

Abstract: We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system’s context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
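
To illustrate the contextual-bandit formulation, here is a textbook LinUCB-style monitor that observes a context vector and selects a controller, updating its estimates from observed rewards. This is a standard sketch, not the paper's exact algorithm or its safety mechanism.

```python
import numpy as np

class LinUCBMonitor:
    """Contextual-bandit monitor: pick the controller best suited to the
    observed context (standard LinUCB)."""
    def __init__(self, n_controllers, ctx_dim, alpha=1.0):
        self.A = [np.eye(ctx_dim) for _ in range(n_controllers)]
        self.b = [np.zeros(ctx_dim) for _ in range(n_controllers)]
        self.alpha = alpha

    def select(self, ctx):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Mean estimate plus an optimism bonus for under-explored arms.
            scores.append(theta @ ctx + self.alpha * np.sqrt(ctx @ A_inv @ ctx))
        return int(np.argmax(scores))

    def update(self, arm, ctx, reward):
        self.A[arm] += np.outer(ctx, ctx)
        self.b[arm] += reward * ctx
```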

[406] An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems

Panayiotis Ioannou, Pietro Liò, Pietro Cicuta

Main category: cs.LG

TL;DR: NODEs can extrapolate to new boundary conditions with dynamic similarity, SR recovers equations from noisy data with correct variable selection, and SR can recover governing equations from NODE-generated data trained on limited samples.

DetailsMotivation: To explore the capabilities of Neural ODEs for modeling complex system dynamics and Symbolic Regression for discovering governing differential equations from noisy data, with potential applications for scientific discovery.

Method: Used noisy synthetic data from two damped oscillatory systems to test NODE extrapolation capabilities and SR equation recovery. Trained NODEs on limited data (10% of full simulation) and used SR to recover equations from both ground-truth and NODE-generated data.

Result: 1) NODEs effectively extrapolate to new boundary conditions when trajectories share dynamic similarity with training data. 2) SR successfully recovers equations from noisy ground-truth data with correct variable selection. 3) SR recovers 2 out of 3 governing equations plus approximation for the third from NODE-generated data trained on only 10% of full simulation.

Conclusion: Using NODEs to enrich limited data and enable symbolic regression to infer physical laws represents a promising new approach for scientific discovery, though further work is needed to improve equation recovery from NODE-generated data.

Abstract: Accurately modelling the dynamics of complex systems and discovering their governing differential equations are critical tasks for accelerating scientific discovery. Using noisy, synthetic data from two damped oscillatory systems, we explore the extrapolation capabilities of Neural Ordinary Differential Equations (NODEs) and the ability of Symbolic Regression (SR) to recover the underlying equations. Our study yields three key insights. First, we demonstrate that NODEs can extrapolate effectively to new boundary conditions, provided the resulting trajectories share dynamic similarity with the training data. Second, SR successfully recovers the equations from noisy ground-truth data, though its performance is contingent on the correct selection of input variables. Finally, we find that SR recovers two out of the three governing equations, along with a good approximation for the third, when using data generated by a NODE trained on just 10% of the full simulation. While this last finding highlights an area for future work, our results suggest that using NODEs to enrich limited data and enable symbolic regression to infer physical laws represents a promising new approach for scientific discovery.

[407] MuRAL-CPD: Active Learning for Multiresolution Change Point Detection

Stefano Bertolasi, Diego Carrera, Diego Stucchi, Pasqualina Fragneto, Luigi Amedeo Bianchi

Main category: cs.LG

TL;DR: MuRAL-CPD is a semi-supervised change point detection method that combines wavelet-based multiresolution analysis with active learning to incorporate user feedback and optimize hyperparameters for better alignment with user-defined change concepts.

DetailsMotivation: Traditional CPD methods are unsupervised and lack adaptability to task-specific definitions of change. They cannot incorporate user knowledge or benefit from supervision, limiting their practical effectiveness in real-world applications where users have specific notions of what constitutes meaningful change.

Method: MuRAL-CPD integrates active learning into a multiresolution CPD framework. It uses wavelet-based multiresolution decomposition to detect changes across multiple temporal scales, then incorporates user feedback to iteratively optimize key hyperparameters, allowing the model to align its change detection with user expectations.

Result: Experimental results on several real-world datasets demonstrate that MuRAL-CPD outperforms state-of-the-art methods, particularly in scenarios with minimal supervision available. The method shows improved accuracy and interpretability.

Conclusion: MuRAL-CPD successfully addresses limitations of traditional CPD methods by incorporating user feedback through active learning, enabling better alignment with task-specific change definitions and improving performance in practical applications with limited supervision.

Abstract: Change Point Detection (CPD) is a critical task in time series analysis, aiming to identify moments when the underlying data-generating process shifts. Traditional CPD methods often rely on unsupervised techniques, which lack adaptability to task-specific definitions of change and cannot benefit from user knowledge. To address these limitations, we propose MuRAL-CPD, a novel semi-supervised method that integrates active learning into a multiresolution CPD algorithm. MuRAL-CPD leverages a wavelet-based multiresolution decomposition to detect changes across multiple temporal scales and incorporates user feedback to iteratively optimize key hyperparameters. This interaction enables the model to align its notion of change with that of the user, improving both accuracy and interpretability. Our experimental results on several real-world datasets show the effectiveness of MuRAL-CPD against state-of-the-art methods, particularly in scenarios where minimal supervision is available.
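
The multiresolution idea can be sketched with an off-the-shelf wavelet decomposition (via `pywt`): large detail coefficients at a given level flag candidate change points at that temporal scale. MuRAL-CPD's actual scoring, and its active-learning loop over hyperparameters, are more involved than this sketch.

```python
import numpy as np
import pywt  # pip install PyWavelets

def multiresolution_change_scores(x, wavelet="db4", level=4):
    """Per-scale change evidence from a discrete wavelet decomposition."""
    coeffs = pywt.wavedec(np.asarray(x, dtype=float), wavelet, level=level)
    # coeffs[0] is the coarse approximation; the rest are detail coefficients.
    return {f"level_{k}": np.abs(c) for k, c in enumerate(coeffs[1:], start=1)}
```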

[408] Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions

Raul de la Rosa, Ivana Dusparic, Nicolas Cardozo

Main category: cs.LG

TL;DR: MORPHIN is a self-adaptive Q-learning framework that enables RL agents to adapt to non-stationary environments with changing reward functions and expanding action spaces without full retraining.

DetailsMotivation: RL agents struggle in real-world non-stationary environments where reward functions shift and action spaces expand, requiring adaptation without catastrophic forgetting of prior knowledge.

Method: Integrates concept drift detection with dynamic adjustments to learning and exploration hyperparameters, enabling on-the-fly adaptation to reward function changes and action space expansions while preserving prior policy knowledge.

Result: MORPHIN achieves superior convergence speed and continuous adaptation compared to standard Q-learning, improving learning efficiency by up to 1.7x in Gridworld and traffic signal control simulations.

Conclusion: MORPHIN provides an effective framework for RL agents to adapt to non-stationary environments with changing reward functions and expanding action spaces while preventing catastrophic forgetting.

Abstract: Reinforcement Learning (RL) agents often struggle in real-world applications where environmental conditions are non-stationary, particularly when reward functions shift or the available action space expands. This paper introduces MORPHIN, a self-adaptive Q-learning framework that enables on-the-fly adaptation without full retraining. By integrating concept drift detection with dynamic adjustments to learning and exploration hyperparameters, MORPHIN adapts agents to changes in both the reward function and on-the-fly expansions of the agent’s action space, while preserving prior policy knowledge to prevent catastrophic forgetting. We validate our approach using a Gridworld benchmark and a traffic signal control simulation. The results demonstrate that MORPHIN achieves superior convergence speed and continuous adaptation compared to a standard Q-learning baseline, improving learning efficiency by up to 1.7x.
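
As a simple illustration of drift-triggered adaptation, the sketch below compares running rewards over two windows and, on a sharp drop, re-opens exploration and raises the learning rate without wiping the learned Q-values. The crude detector and the `agent` fields are hypothetical stand-ins for MORPHIN's mechanism.

```python
def adapt_on_drift(agent, recent_rewards, window=100, drop=0.3):
    """Crude drift proxy: if mean reward over the newest window falls well
    below the previous window, adapt hyperparameters instead of retraining."""
    rewards = list(recent_rewards)
    if len(rewards) < 2 * window:
        return
    old = sum(rewards[-2 * window:-window]) / window
    new = sum(rewards[-window:]) / window
    if new < (1.0 - drop) * old:
        agent.epsilon = max(agent.epsilon, 0.5)  # explore again
        agent.alpha = max(agent.alpha, 0.3)      # adapt faster
```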

[409] Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models

Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Xiaobo Xia, Ming-Kun Xie, Dong-Dong Wu, Biao Liu, Yuheng Jia, Xin Geng, Masashi Sugiyama, Tat-Seng Chua

Main category: cs.LG

TL;DR: PU RL distillation enables on-premise small models to achieve RL alignment without human preferences or reward models by distilling teacher preferences from black-box generations through anchor-conditioned self-ranking.

DetailsMotivation: On-premise deployment of small models is growing due to privacy, cost, and latency constraints, but practical pipelines often stop at SFT and miss RL alignment because it requires expensive human preference annotation or heavy reliance on high-quality reward models with API usage and maintenance, which are unsuitable for on-premise settings.

Method: Proposes positive-unlabeled RL distillation: query teacher once per prompt for anchor response, locally sample multiple student candidates, perform anchor-conditioned self-ranking to induce pairwise or listwise preferences, enabling fully local training via direct preference optimization or group relative policy optimization.

Result: Theoretical analysis shows induced preference signal is order-consistent and concentrates on near-optimal candidates, supporting stability for preference optimization. Experiments demonstrate consistently strong performance under low-cost setting.

Conclusion: The method bridges the gap for on-premise small-model deployment by enabling RL alignment without human-labeled preferences or reward models, making alignment practical for constrained environments.

Abstract: Due to constraints on privacy, cost, and latency, on-premise deployment of small models is increasingly common. However, most practical pipelines stop at supervised fine-tuning (SFT) and fail to reach the reinforcement learning (RL) alignment stage. The main reason is that RL alignment typically requires either expensive human preference annotation or heavy reliance on high-quality reward models with large-scale API usage and ongoing engineering maintenance, both of which are ill-suited to on-premise settings. To bridge this gap, we propose a positive-unlabeled (PU) RL distillation method for on-premise small-model deployment. Without human-labeled preferences or a reward model, our method distills the teacher’s preference-optimization capability from black-box generations into a locally trainable student. For each prompt, we query the teacher once to obtain an anchor response, locally sample multiple student candidates, and perform anchor-conditioned self-ranking to induce pairwise or listwise preferences, enabling a fully local training loop via direct preference optimization or group relative policy optimization. Theoretical analysis justifies that the induced preference signal by our method is order-consistent and concentrates on near-optimal candidates, supporting its stability for preference optimization. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
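
The anchor-conditioned self-ranking step is easy to sketch: rank locally sampled student candidates by similarity to the teacher's single anchor response, then pair the top half against the bottom half as (chosen, rejected) preferences for DPO. The `similarity` scorer is an assumption of this sketch.

```python
def anchor_conditioned_pairs(anchor, candidates, similarity):
    """Induce pairwise preferences from one teacher anchor per prompt."""
    ranked = sorted(candidates, key=lambda c: similarity(anchor, c), reverse=True)
    k = len(ranked) // 2
    # Pair the i-th best candidate against the i-th worst.
    return [(ranked[i], ranked[-(i + 1)]) for i in range(k)]
```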

[410] Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Minwu Kim, Safal Shrestha, Keith Ross

Main category: cs.LG

TL;DR: Failure-prefix conditioning improves RLVR training on saturated problems by exposing models to rare failure states through conditioning on incorrect reasoning prefixes.

DetailsMotivation: RLVR improves LLM reasoning but training stalls on saturated problems due to poor accessibility of informative failures - learning signals exist but are rarely encountered during standard rollouts.

Method: Failure-prefix conditioning: reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, exposing models to failure-prone states instead of starting from original questions.

Result: Performance gains match training on medium-difficulty problems while preserving token efficiency. Method reduces performance degradation under misleading failure prefixes (with mild trade-off in adherence to correct early reasoning). Iterative approach with refreshed failure prefixes unlocks additional gains after plateaus.

Conclusion: Failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems by strategically exposing models to informative failure states.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model’s robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
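
A minimal sketch of the prefix construction: truncate rare incorrect reasoning traces and prepend them to the question, so rollouts start from failure-prone states rather than from scratch. The fixed-fraction truncation rule is an assumption; the paper's prefix selection may differ.

```python
def failure_prefix_prompts(question, failure_traces, frac=0.5):
    """Build rollout prompts that condition on truncated failure trajectories."""
    prompts = []
    for trace in failure_traces:
        cut = max(1, int(len(trace) * frac))  # keep the leading portion
        prompts.append(question + "\n" + trace[:cut])
    return prompts
```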

[411] Continual GUI Agents

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, Tao Feng

Main category: cs.LG

TL;DR: GUI-AiF: A reinforcement fine-tuning framework for GUI agents that maintains stable performance across shifting domains/resolutions via novel anchoring rewards, establishing the first continual learning framework for GUI agents.

DetailsMotivation: GUI agents trained on static environments deteriorate when faced with changing digital environments (new domains/resolutions over time), creating a need for continual learning approaches that can adapt to distribution shifts.

Method: Introduces GUI-Anchoring in Flux (GUI-AiF), a reinforcement fine-tuning framework with two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide agents to align with shifting interaction points and regions rather than over-adapting to static grounding cues.

Result: Extensive experiments show GUI-AiF surpasses state-of-the-art baselines, demonstrating superior performance in maintaining stable grounding as GUI distributions shift over time.

Conclusion: Establishes the first continual learning framework for GUI agents and reveals the untapped potential of reinforcement fine-tuning for continual GUI agents, addressing the challenge of performance deterioration in digital environments that are in flux.

Abstract: As digital environments (i.e., data distributions) are in flux, with new GUI data arriving over time and introducing new domains or resolutions, agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in fluxing scenarios. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI agents.
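
As a rough illustration of a point-anchoring reward, the sketch below grants full reward when a predicted click lands within a tolerance of the target interaction point (in normalized screen coordinates) and decays with distance outside it. This is one plausible form only; APR-iF's actual definition is not reproduced here, and the tolerance is hypothetical.

```python
def anchoring_point_reward(pred_xy, target_xy, tol=0.05):
    """Point-anchoring reward sketch over normalized (x, y) coordinates."""
    dx = pred_xy[0] - target_xy[0]
    dy = pred_xy[1] - target_xy[1]
    dist = (dx * dx + dy * dy) ** 0.5
    return 1.0 if dist <= tol else max(0.0, 1.0 - dist)
```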

[412] HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

Guoan Wang, Feiyu Wang, Zongwei Lv, Yikun Zong, Tong Yang

Main category: cs.LG

TL;DR: Hestia: Hessian-guided differentiable QAT framework for extremely low-bit LLMs that uses softmax relaxation and Hessian-guided temperature annealing to improve 1.58-bit quantization performance.

DetailsMotivation: Deployment of large language models is bottlenecked by memory constraints, requiring extremely low-bit quantization. Traditional QAT methods use hard rounding and STE from the start, which prematurely discretizes the optimization landscape and causes gradient mismatch, hindering effective optimization of quantized models.

Method: Hestia replaces rigid step functions with temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. It uses tensor-wise Hessian trace as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model.

Result: Evaluations on Llama-3.2 show Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models respectively.

Conclusion: Hessian-guided relaxation effectively recovers representational capacity and establishes a more robust training path for 1.58-bit LLMs, addressing the limitations of traditional QAT methods for extremely low-bit quantization.

Abstract: As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
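
The core relaxation is compact: replace hard rounding with a temperature-controlled softmax over the quantization levels, so gradients flow at high temperature and hard ternary rounding is recovered as the temperature anneals toward zero. The sketch below shows the relaxation only; Hestia's Hessian-trace-driven annealing schedule is not reproduced here.

```python
import torch

def soft_ternary(w, temperature, levels=(-1.0, 0.0, 1.0)):
    """Differentiable ternary quantization via softmax relaxation.

    Each weight becomes a convex combination of the levels, weighted by a
    softmax over negative squared distances; as temperature -> 0 the weights
    concentrate on the nearest level, recovering hard rounding.
    """
    lv = torch.tensor(levels, device=w.device, dtype=w.dtype)
    logits = -((w.unsqueeze(-1) - lv) ** 2) / temperature
    return (torch.softmax(logits, dim=-1) * lv).sum(dim=-1)
```

In training, `temperature` would be annealed per tensor, with more sensitive tensors (larger Hessian trace) hardened more carefully.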

[413] Optimal Transport Group Counterfactual Explanations

Enrique Valero-Leal, Bernd Bischl, Pedro Larrañaga, Concha Bielza, Giuseppe Casalicchio

Main category: cs.LG

TL;DR: Learning an optimal transport map for group counterfactual explanations that generalizes to new group members without re-optimization, minimizing transport cost while preserving group geometry.

DetailsMotivation: Existing group counterfactual explanation methods have limitations: they don't generalize to new group members, rely on strong model assumptions like linearity, and poorly control group geometry distortion during counterfactual generation.

Method: Learn an explicit optimal transport map that sends any group instance to its counterfactual without re-optimization, minimizing the group’s total transport cost. For linear classifiers, derive functions via mathematical optimization (identifying convex optimization types like QP, QCQP).

Result: The method accurately generalizes to new group members, preserves group geometry well, and incurs only negligible additional transport cost compared to baselines. Even when model linearity cannot be exploited, it significantly outperforms baseline methods.

Conclusion: The proposed optimal transport approach provides efficient, generalizable group counterfactual explanations with better geometry preservation and lower transport costs, offering interpretable common actionable recourse.

Abstract: Group counterfactual explanations find a set of counterfactual instances to explain a group of input instances contrastively. However, existing methods either (i) optimize counterfactuals only for a fixed group and do not generalize to new group members, (ii) strictly rely on strong model assumptions (e.g., linearity) for tractability or/and (iii) poorly control the counterfactual group geometry distortion. We instead learn an explicit optimal transport map that sends any group instance to its counterfactual without re-optimization, minimizing the group’s total transport cost. This enables generalization with fewer parameters, making it easier to interpret the common actionable recourse. For linear classifiers, we prove that functions representing group counterfactuals are derived via mathematical optimization, identifying the underlying convex optimization type (QP, QCQP, …). Experiments show that they accurately generalize, preserve group geometry, and incur only negligible additional transport cost compared to baseline methods. Even when model linearity cannot be exploited, our approach still significantly outperforms the baselines.

[414] Reward Models Inherit Value Biases from Pretraining

Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska

Main category: cs.LG

TL;DR: Reward models inherit value biases from their base LLMs, with Llama-based RMs preferring “agency” and Gemma-based RMs preferring “communion” regardless of identical preference data and training.

DetailsMotivation: Reward models are crucial for aligning LLMs with human values but are understudied compared to pre-trained and post-trained LLMs. The influence of base model representations on RM behavior remains poorly understood, despite RMs being initialized from these LLMs.

Method: Comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora. Analyzed using the “Big Two” psychological axes (agency vs communion). Derived implicit reward scores from base model logits. Conducted ablation experiments varying preference data source and quantity.
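
The implicit-reward construction can be sketched as a log-probability gap between the tuned model and its base. Model names and the exact scoring rule below are placeholders, not the paper's setup; the sketch only shows the general shape of such a score.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")          # placeholder models
tuned = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

@torch.no_grad()
def seq_logprob(model, text):
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                      # predict token t+1 from prefix
    logp = logits.log_softmax(-1)
    return logp.gather(-1, ids[:, 1:, None]).sum().item()   # sum of next-token log-probs

def implicit_reward(text):
    # score = log-probability gap between tuned and base model (illustrative)
    return seq_logprob(tuned, text) - seq_logprob(base, text)

print(implicit_reward("Helping others is its own reward."))
```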

Result: RMs exhibit significant value differences based on their base model: Llama RMs robustly prefer “agency” while Gemma RMs robustly prefer “communion.” This effect persists even with identical preference data and training processes. Implicit reward scores from base models show the same agency/communion differences. The effect is repeatable and surprisingly durable across ablations.

Conclusion: RM outputs are significantly influenced by their pretrained base LLMs, not just human preference data. This underscores the importance of safety/alignment efforts at pretraining stage and shows that base model choice involves value considerations beyond just performance.

Abstract: Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the “Big Two” psychological axes, we show a robust preference of Llama RMs for “agency” and a corresponding robust preference of Gemma RMs for “communion.” This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers’ choice of base model is as much a consideration of values as of performance.

[415] C3Box: A CLIP-based Class-Incremental Learning Toolbox

Hao Sun, Da-Wei Zhou

Main category: cs.LG

TL;DR: C3Box is a modular Python toolbox that unifies CLIP-based Class-Incremental Learning methods into a standardized framework for reproducible benchmarking.

DetailsMotivation: Existing CLIP-based CIL methods are scattered across different codebases with inconsistent configurations, making fair comparisons, reproducibility, and practical adoption difficult.

Method: Developed C3Box, a modular toolbox that integrates traditional CIL methods, ViT-based CIL methods, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework with JSON-based configuration and standardized execution pipeline.
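
A hypothetical, PyCIL-style JSON configuration written from Python; the key names below are illustrative guesses, so consult the C3Box repository for the actual schema.

```python
import json

# Hypothetical C3Box-style experiment config (key names are illustrative).
config = {
    "model_name": "clip_cil_method",   # which CLIP-based CIL method to run
    "backbone": "ViT-B/16",
    "dataset": "cifar100",
    "init_cls": 10,                    # classes in the first task
    "increment": 10,                   # classes added per incremental task
    "seed": [1993],
    "device": ["0"],
}
with open("c3box_config.json", "w") as f:
    json.dump(config, f, indent=2)
# Assumed PyCIL-style entry point: python main.py --config c3box_config.json
```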

Result: C3Box provides a reliable benchmark platform for continual learning research with low engineering overhead, supporting reproducible experimentation across major operating systems using widely-used open-source libraries.

Conclusion: C3Box addresses the fragmentation in CLIP-based CIL research by offering a comprehensive, user-friendly toolbox that enables fair comparisons and reproducible experimentation, advancing continual learning research.

Abstract: Traditional machine learning systems are typically designed for static data distributions, which suffer from catastrophic forgetting when learning from evolving data streams. Class-Incremental Learning (CIL) addresses this challenge by enabling learning systems to continuously learn new classes while preserving prior knowledge. With the rise of pre-trained models (PTMs) such as CLIP, leveraging their strong generalization and semantic alignment capabilities has become a promising direction in CIL. However, existing CLIP-based CIL methods are often scattered across disparate codebases and rely on inconsistent configurations, hindering fair comparisons, reproducibility, and practical adoption. Therefore, we propose C3Box (CLIP-based Class-inCremental learning toolBOX), a modular and comprehensive Python toolbox. C3Box integrates representative traditional CIL methods, ViT-based CIL methods, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework. By inheriting the streamlined design of PyCIL, C3Box provides a JSON-based configuration and standardized execution pipeline. This design enables reproducible experimentation with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research. Designed to be user-friendly, C3Box relies only on widely used open-source libraries and supports major operating systems. The code is available at https://github.com/LAMDA-CL/C3Box.

[416] Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?

Hao Liang, Jiayu Cheng, Sean R. Sinclair, Yali Du

Main category: cs.LG

TL;DR: Pure Exploitation Learning (PEL) achieves first finite-sample regret bounds for exploitation-only algorithms in Exogenous MDPs, showing exploration is unnecessary when uncertainty comes only from exogenous inputs.

DetailsMotivation: Despite empirical evidence that greedy methods work well in operations research applications (inventory control, energy storage, resource allocation) where uncertainty comes from exogenous inputs, theory has lagged behind with all existing regret guarantees requiring explicit exploration or tabular assumptions.

Method: Proposes Pure Exploitation Learning (PEL) with two variants: tabular PEL achieving $\widetilde{O}(H^2|\Xi|\sqrt{K})$ regret, and LSVI-PE for large continuous state spaces using linear approximation with polynomial regret in feature dimension, exogenous state space, and horizon (independent of endogenous state/action spaces). Uses counterfactual trajectories and Bellman-closed feature transport for accurate value estimates without optimism.
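
To make "exploitation-only" concrete, the sketch below runs least-squares value iteration on logged data with no optimism bonus and acts greedily. It is a generic stand-in that omits the paper's counterfactual trajectories and feature-transport machinery; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, gamma = 8, 1000, 1.0, 0.9
phi = rng.normal(size=(n, d))            # features phi(s, a) of logged transitions
r = rng.normal(size=n)                   # observed rewards
phi_next = rng.normal(size=(n, d))       # features of the greedy next action (fixed here for brevity)

theta = np.zeros(d)
A = phi.T @ phi + lam * np.eye(d)        # ridge-regularized Gram matrix
for _ in range(50):                      # value iteration on the logged batch
    target = r + gamma * (phi_next @ theta)   # Bellman target, no exploration bonus
    theta = np.linalg.solve(A, phi.T @ target)

def greedy_action(feats_per_action):     # feats_per_action: (num_actions, d)
    return int(np.argmax(feats_per_action @ theta))
```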

Result: First general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. PEL consistently outperforms baselines in experiments on synthetic and resource-management tasks.

Conclusion: Overturns conventional wisdom that exploration is required, demonstrating that in Exogenous MDPs, pure exploitation is sufficient for learning with theoretical guarantees.

Abstract: Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner’s actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves $\widetilde{O}(H^2|\Xi|\sqrt{K})$. For large, continuous endogenous state spaces, we introduce LSVI-PE, a simple linear-approximation method whose regret is polynomial in the feature dimension, exogenous state space, and horizon, independent of the endogenous state and action spaces. Our analysis introduces two new tools: counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to have accurate value estimates without optimism. Experiments on synthetic and resource-management tasks show that PEL consistently outperforms baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.

[417] Evolutionary Strategies lead to Catastrophic Forgetting in LLMs

Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli

Main category: cs.LG

TL;DR: ES shows comparable performance to GRPO on math/reasoning tasks but suffers from severe forgetting in continual learning due to less sparse, larger-norm updates.

DetailsMotivation: Address the challenge of continual learning in AI systems, specifically the memory limitations of gradient-based algorithms for LLMs, by exploring gradient-free Evolutionary Strategies as an alternative.

Method: Comprehensive analysis of Evolutionary Strategies (ES), evaluating forgetting curves during increasing update steps, comparing performance with GRPO on math and reasoning tasks, and analyzing update characteristics (sparsity and ℓ₂ norm).
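
For reference, a minimal textbook ES update together with the two update statistics the paper analyzes (sparsity and ℓ₂ norm). This is an illustrative estimator, not the authors' training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta):                          # stand-in for the task reward
    return -np.sum(theta ** 2)

theta = rng.normal(size=1000)
sigma, lr, pop = 0.02, 0.01, 32

eps = rng.normal(size=(pop, theta.size))    # antithetic pairs are common; omitted here
R = np.array([reward(theta + sigma * e) for e in eps])
R = (R - R.mean()) / (R.std() + 1e-8)       # normalize returns
grad_est = (eps * R[:, None]).mean(0) / sigma
update = lr * grad_est
theta += update

sparsity = np.mean(np.abs(update) < 1e-6)   # fraction of (near-)zero coordinates
print(f"update l2 norm = {np.linalg.norm(update):.4f}, sparsity = {sparsity:.3f}")
```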

Result: ES achieves performance close to GRPO with comparable compute budget, but exhibits significant forgetting of prior abilities. ES updates are much less sparse and have orders of magnitude larger ℓ₂ norm compared to GRPO updates.

Conclusion: ES suffers from severe forgetting in continual learning contexts, limiting its applicability for online training. The study highlights forgetting issues in gradient-free algorithms and calls for future work to mitigate these problems.

Abstract: One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems presents several challenges, one of which is the large memory requirement of gradient-based algorithms that are used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger $\ell_2$ norm compared to corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.

[418] Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

Kaito Baba, Yoshihiko Ozaki, Shuhei Watanabe

Main category: cs.LG

TL;DR: condPED-ANOVA extends PED-ANOVA to estimate hyperparameter importance in conditional search spaces where hyperparameters can depend on each other’s presence or domain.

DetailsMotivation: The original PED-ANOVA framework assumes fixed, unconditional search spaces and cannot properly handle conditional hyperparameters, which are common in real-world machine learning configurations where some hyperparameters only become relevant when others take specific values.

Method: Introduces conditional HPI for top-performing regions and derives a closed-form estimator that accurately reflects conditional activation and domain changes, addressing the limitations of naive adaptations of existing HPI estimators.

Result: Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importance estimates in conditional settings, while condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure.

Conclusion: condPED-ANOVA provides a principled framework for estimating hyperparameter importance in conditional search spaces, overcoming the limitations of existing methods that assume unconditional search spaces.

Abstract: We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importance estimates in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure.

[419] Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Melika Mobini, Vincent Holst, Floriano Tori, Andres Algaba, Vincent Ginis

Main category: cs.LG

TL;DR: LLM-generated bibliographies closely mimic human citation structure but leave detectable semantic fingerprints in content embeddings, making them distinguishable from human references.

DetailsMotivation: To determine if LLM-generated reference lists are distinguishable from human ones, as LLMs are increasingly used for bibliography curation.

Method: Built paired citation graphs (ground truth vs GPT-4o-generated) for 10,000 papers, added random baseline. Compared structure-only features vs title/abstract embeddings using Random Forest on graph aggregates and Graph Neural Networks.
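
The two feature regimes can be sketched as follows with synthetic stand-in data; real features would be graph-level aggregates and title/abstract embeddings as described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_graphs = 200
y = rng.integers(0, 2, n_graphs)                        # 0 = human, 1 = LLM-generated

X_struct = rng.normal(size=(n_graphs, 5))               # e.g. mean degree/centrality/clustering
X_embed = rng.normal(size=(n_graphs, 3072)) + 0.2 * y[:, None]  # mean title/abstract embedding

for name, X in [("structure", X_struct), ("embeddings", X_embed)]:
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: CV accuracy = {acc:.2f}")
```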

Result: Structure alone barely separates GPT from ground truth (RF accuracy ≈0.60), while embeddings sharply increase separability to ≈0.83-0.93. Results robust across Claude Sonnet 4.5 and multiple embedding models.

Conclusion: LLM bibliographies mimic human citation topology but leave detectable semantic fingerprints; detection should target content signals rather than global graph structure.

Abstract: Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using an RF on graph-level aggregates and Graph Neural Networks with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89–0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude $\approx 0.77$ and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology, but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.

[420] Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause

Main category: cs.LG

TL;DR: SDPO is a self-distillation method that leverages rich textual feedback (like error messages) to improve RL training in verifiable domains, outperforming traditional scalar-reward RL methods.

DetailsMotivation: Current RL methods in verifiable domains (code, math) only use scalar outcome rewards, creating a credit assignment bottleneck. Many environments actually provide rich textual feedback (error messages, judge evaluations) that could be leveraged for better learning.

Method: Self-Distillation Policy Optimization (SDPO) treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed next-token predictions back into the policy. It converts tokenized feedback into dense learning signals without external teachers or explicit reward models.
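
A conceptual sketch of the self-distillation signal, assuming an HF-style causal LM with a `.logits` output; the context layout and loss details are simplifications, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def sdpo_loss(model, prompt_ids, attempt_ids, feedback_ids):
    T = attempt_ids.size(1)
    # Teacher sees prompt + failed attempt + feedback, then re-predicts the attempt.
    teacher_ctx = torch.cat([prompt_ids, attempt_ids, feedback_ids, attempt_ids], dim=1)
    with torch.no_grad():
        t_logits = model(teacher_ctx).logits[:, -T - 1:-1]   # positions predicting the attempt
    # Student (the policy) sees the prompt only.
    student_ctx = torch.cat([prompt_ids, attempt_ids], dim=1)
    s_logits = model(student_ctx).logits[:, -T - 1:-1]
    # Distill the feedback-informed next-token distribution into the policy.
    return F.kl_div(F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")
```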

Result: SDPO improves sample efficiency and final accuracy over strong RL baselines across scientific reasoning, tool use, and competitive programming. It also outperforms baselines in standard RL environments by using successful rollouts as implicit feedback. On difficult binary-reward tasks, SDPO achieves same discovery probability as best-of-k sampling with 3x fewer attempts.

Conclusion: SDPO effectively leverages rich textual feedback through self-distillation, addressing credit assignment bottlenecks in RL for verifiable domains and demonstrating superior performance across multiple challenging benchmarks.

Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model’s ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.

[421] Deep Semi-Supervised Survival Analysis for Predicting Cancer Prognosis

Anchen Sun, Zhibin Chen, Xiaodong Cai

Main category: cs.LG

TL;DR: Cox-MT: A deep semi-supervised learning approach using Mean Teacher framework to improve ANN-based Cox survival models by leveraging both labeled and unlabeled data, showing significant performance gains over supervised-only methods in cancer prognosis prediction.

DetailsMotivation: ANN-based Cox-PH models require large labeled datasets for training, but labeled survival data is often limited. This constraint limits the performance of existing ANN-based Cox models, creating a need for methods that can effectively utilize both labeled and unlabeled data.

Method: Developed Cox-MT using deep semi-supervised learning based on Mean Teacher framework. Built single-modal models using TCGA RNA-seq data or whole slide images, and multi-modal models combining both. The approach leverages both labeled and unlabeled data during training.
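
A simplified sketch of the Mean Teacher recipe adapted to a Cox objective: supervised negative partial likelihood on labeled samples plus a consistency loss to an EMA teacher on unlabeled samples. Function and argument names are illustrative, not the authors' code.

```python
import torch

def cox_neg_partial_likelihood(risk, time, event):
    # risk: (n,) predicted log-hazard; assumes no tied event times for brevity
    order = torch.argsort(time, descending=True)        # risk sets via cumulative logsumexp
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_cumsum) * event).sum() / event.sum()

def mean_teacher_step(student, teacher, x_lab, t_lab, e_lab, x_unlab, opt, ema=0.99, w=1.0):
    sup = cox_neg_partial_likelihood(student(x_lab).squeeze(-1), t_lab, e_lab)
    noise = 0.1 * torch.randn_like(x_unlab)             # input perturbation for consistency
    cons = ((student(x_unlab + noise) - teacher(x_unlab).detach()) ** 2).mean()
    loss = sup + w * cons
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                               # EMA teacher update
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1 - ema)
    return loss.item()
```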

Result: Single-modal Cox-MT models significantly outperformed existing Cox-nnet across four cancer types. Performance improved significantly with more unlabeled data. Multi-modal Cox-MT showed considerably better performance than single-modal models.

Conclusion: Cox-MT effectively leverages both labeled and unlabeled data to significantly enhance prediction accuracy compared to existing ANN-based Cox models trained solely on labeled data, demonstrating the value of semi-supervised learning in survival analysis.

Abstract: The Cox Proportional Hazards (PH) model is widely used in survival analysis. Recently, artificial neural network (ANN)-based Cox-PH models have been developed. However, training these Cox models with high-dimensional features typically requires a substantial number of labeled samples containing information about time-to-event. The limited availability of labeled data for training often constrains the performance of ANN-based Cox models. To address this issue, we employed a deep semi-supervised learning (DSSL) approach to develop single- and multi-modal ANN-based Cox models based on the Mean Teacher (MT) framework, which utilizes both labeled and unlabeled data for training. We applied our model, named Cox-MT, to predict the prognosis of several types of cancer using data from The Cancer Genome Atlas (TCGA). Our single-modal Cox-MT models, utilizing TCGA RNA-seq data or whole slide images, significantly outperformed the existing ANN-based Cox model, Cox-nnet, using the same data set across four types of cancer considered. As the number of unlabeled samples increased, the performance of Cox-MT significantly improved with a given set of labeled data. Furthermore, our multi-modal Cox-MT model demonstrated considerably better performance than the single-modal model. In summary, the Cox-MT model effectively leverages both labeled and unlabeled data to significantly enhance prediction accuracy compared to existing ANN-based Cox models trained solely on labeled data.

[422] GNN Explanations that do not Explain and How to find Them

Steve Azzolin, Stefano Teso, Bruno Lepri, Andrea Passerini, Sagar Malhotra

Main category: cs.LG

TL;DR: SE-GNN explanations can be degenerate - completely unrelated to how models make predictions, yet current faithfulness metrics fail to detect this, allowing potential misuse of sensitive attributes.

DetailsMotivation: Self-explainable GNN explanations are crucial for transparency and detecting misuse of sensitive attributes, but current explanations can be misleading without clear characterization of failure cases.

Method: Identified critical failure where SE-GNN explanations are unrelated to actual inference process, showing models can achieve optimal performance while producing degenerate explanations, and current faithfulness metrics fail to detect this.

Result: Degenerate explanations can be maliciously planted to hide sensitive attribute usage or emerge naturally, highlighting auditing reliability issues. Proposed novel faithfulness metric that reliably identifies degenerate explanations in both malicious and natural settings.

Conclusion: Current SE-GNN explanations have critical failure modes requiring better auditing; new faithfulness metric addresses this gap by reliably detecting degenerate explanations.

Abstract: Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model’s inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.

[423] SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Dawit Kiros Redie, Reza Arablouei, Stefan Werner

Main category: cs.LG

TL;DR: SA-PEF combines step-ahead correction with partial error feedback to accelerate federated learning convergence under non-IID data by controlling residual error contraction.

DetailsMotivation: Standard error feedback (EF) in federated learning suffers from slow residual error decay under non-IID data, causing gradient mismatch and stalled progress in early training rounds.

Method: Proposes step-ahead partial error feedback (SA-PEF) that integrates step-ahead correction with partial error feedback, recovering EF when α=0 and step-ahead EF (SAEF) when α=1.
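
For orientation, standard error feedback with a contractive top-k compressor looks as follows. SA-PEF interpolates between this (α = 0) and step-ahead EF (α = 1); the precise step-ahead correction follows the paper and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]     # keep the k largest-magnitude coordinates
    out[idx] = v[idx]
    return out

def ef_step(grad, residual, k):
    corrected = grad + residual          # add accumulated compression error
    msg = top_k(corrected, k)            # what the client actually transmits
    return msg, corrected - msg          # new residual carries the rest forward

residual = np.zeros(100)
for _ in range(10):
    grad = rng.normal(size=100)
    msg, residual = ef_step(grad, residual, k=10)
```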

Result: Establishes convergence guarantees matching standard Fed-SGD up to constant factors, with O((ηη₀TR)⁻¹) convergence to the variance/heterogeneity floor. Shows step-ahead-controlled residual contraction explains early-phase acceleration.

Conclusion: SA-PEF consistently reaches target accuracy faster than EF across diverse architectures and datasets by balancing SAEF’s rapid warm-up with EF’s long-term stability through optimal α selection.

Abstract: Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under non-IID data, the residual error can decay slowly, causing gradient mismatch and stalled progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which integrates step-ahead (SA) correction with partial error feedback (PEF). SA-PEF recovers EF when the step-ahead coefficient $\alpha=0$ and step-ahead EF (SAEF) when $\alpha=1$. For non-convex objectives and $\delta$-contractive compressors, we establish a second-moment bound and a residual recursion that guarantee convergence to stationarity under heterogeneous data and partial client participation. The resulting rates match standard non-convex Fed-SGD guarantees up to constant factors, achieving $O((\eta\eta_0 TR)^{-1})$ convergence to a variance/heterogeneity floor with a fixed inner step size. Our analysis reveals a step-ahead-controlled residual contraction $\rho_r$ that explains the observed acceleration in the early training phase. To balance SAEF’s rapid warm-up with EF’s long-term stability, we select $\alpha$ near its theory-predicted optimum. Experiments across diverse architectures and datasets show that SA-PEF consistently reaches target accuracy faster than EF.

[424] $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

Main category: cs.LG

TL;DR: The paper studies Minimal Embeddable Dimension (MED) for embedding subset memberships, finding tight bounds and showing logarithmic dependency, suggesting retrieval limitations are due to learnability not geometry.

DetailsMotivation: To understand the fundamental geometric constraints of embedding subset memberships (m elements and their subsets) into vector spaces, and to determine whether embedding-based retrieval limitations come from geometric constraints or learnability challenges.

Method: Theoretical derivation of tight bounds for MED under various distance/similarity measures (ℓ₂ metric, inner product, cosine similarity), plus numerical simulations where subset embeddings are centroids of contained elements’ embeddings.
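
The simulation protocol can be sketched as below: subset embeddings are centroids of member embeddings, and retrieval succeeds if the subset's own members are the top-k matches. Note that random (unoptimized) element embeddings will generally need a larger dimension than the theoretical 2k bound; the sketch only illustrates the evaluation loop.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, d = 1000, 4, 2 * 4                           # d = 2k, the title's theoretical bound

E = rng.normal(size=(m, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)      # unit-norm element embeddings

hits = 0
for _ in range(200):
    S = rng.choice(m, size=k, replace=False)
    q = E[S].mean(0)                               # subset embedding = centroid of members
    top = np.argsort(E @ q)[-k:]                   # top-k retrieval by inner product
    hits += set(top) == set(S)
print(f"exact top-{k} recovery rate for random embeddings: {hits / 200:.2f}")
```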

Result: Established tight bounds for MED theoretically and empirically, with simulations showing logarithmic dependency between MED and number of elements, easily achieving this in practice.

Conclusion: Embedding-based retrieval limitations primarily stem from learnability challenges rather than geometric constraints, providing guidance for future algorithm design.

Abstract: This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of “distances” or “similarities,” including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

[425] GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Zhiheng Jiang, Yunzhe Wang, Ryan Marr, Ellen Novoseller, Benjamin T. Files, Volkan Ustun

Main category: cs.LG

TL;DR: GraphAllocBench: A new graph-based benchmark for Preference-Conditioned Policy Learning in Multi-Objective RL, featuring scalable resource allocation tasks with novel evaluation metrics.

DetailsMotivation: Existing PCPL benchmarks are limited to toy tasks and fixed environments, lacking realism and scalability needed to evaluate MORL approaches in complex scenarios.

Method: Introduces GraphAllocBench with CityPlannerEnv - a graph-based resource allocation sandbox inspired by city management. Includes diverse objectives, preference conditions, and high-dimensional scalability. Proposes two new metrics: Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS).
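
A plausible reading of the PNDS metric, assuming maximization of all objectives (the exact definition is given in the paper): the fraction of returned solutions not Pareto-dominated by any other.

```python
import numpy as np

def dominates(a, b):
    # a dominates b if it is at least as good everywhere and strictly better somewhere
    return np.all(a >= b) and np.any(a > b)

def pnds(points):                         # points: (n, num_objectives)
    n = len(points)
    nd = [not any(dominates(points[j], points[i]) for j in range(n) if j != i)
          for i in range(n)]
    return float(np.mean(nd))

pts = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]])  # last point is dominated
print(pnds(pts))  # 0.75
```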

Result: Benchmark exposes limitations of existing MORL approaches (MLPs) and shows potential for graph-based methods like GNNs in complex combinatorial allocation tasks. Enables flexible variation of objectives, preferences, and allocation rules.

Conclusion: GraphAllocBench establishes a versatile, extensible benchmark for advancing PCPL research, particularly for complex, high-dimensional combinatorial problems where graph-aware methods show promise.

Abstract: Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics – Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) – that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: https://anonymous.4open.science/r/GraphAllocBench

[426] Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation

Weixin Chen, Li Chen, Yuhan Zhao

Main category: cs.LG

TL;DR: Cofair enables post-training fairness control in recommender systems without retraining, allowing dynamic adjustment of fairness requirements through fairness-conditioned adapters and user-level regularization.

DetailsMotivation: Existing fairness-aware recommender systems fix fairness requirements at training time, but real-world scenarios require flexibility as different stakeholders may demand varying fairness levels over time. Retraining for each new requirement is impractical.

Method: Introduces a single-train framework with shared representation layer and fairness-conditioned adapter modules to produce user embeddings for varied fairness levels. Includes user-level regularization term that guarantees monotonic fairness improvements across fairness levels.
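
A conceptual sketch of fairness-conditioned adapters: one shared user encoder plus a small residual adapter per fairness level, selected at serving time without retraining. The architecture details are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FairnessConditionedEncoder(nn.Module):
    def __init__(self, n_users, dim, n_levels):
        super().__init__()
        self.shared = nn.Embedding(n_users, dim)          # shared representation layer
        self.adapters = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim))
             for _ in range(n_levels)])                   # one adapter per fairness level

    def forward(self, user_ids, level):
        h = self.shared(user_ids)
        return h + self.adapters[level](h)                # residual, level-specific embedding

enc = FairnessConditionedEncoder(n_users=1000, dim=64, n_levels=5)
u = torch.arange(8)
emb_strict = enc(u, level=4)   # pick the fairness level post-training, at serving time
emb_loose = enc(u, level=0)
```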

Result: Theoretical analysis shows adversarial objective upper bounds demographic parity and regularization enforces progressive fairness. Experiments on multiple datasets demonstrate dynamic fairness at different levels, achieving comparable or better fairness-accuracy curves than state-of-the-art baselines without retraining.

Conclusion: Cofair provides a practical solution for dynamic fairness control in recommender systems, enabling post-training adjustment of fairness requirements to accommodate diverse stakeholder needs without the computational cost of retraining.

Abstract: Despite growing efforts to mitigate unfairness in recommender systems, existing fairness-aware methods typically fix the fairness requirement at training time and provide limited post-training flexibility. However, in real-world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single-train framework that enables post-training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user-level regularization term that guarantees user-wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness-accuracy curves than state-of-the-art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.

[427] Supervised Guidance Training for Infinite-Dimensional Diffusion Models

Elizabeth L. Baker, Alexander Denker, Jes Frellsen

Main category: cs.LG

TL;DR: This paper extends diffusion models to infinite-dimensional function spaces for Bayesian inverse problems, providing theoretical foundations for conditioning diffusion priors on observations using Doob’s h-transform, and introduces a practical simulation-free training method for posterior sampling.

DetailsMotivation: While diffusion models have been extended to function spaces as expressive priors, there's no theoretical framework for conditioning them to sample from posterior distributions in Bayesian inverse problems. The authors aim to bridge this gap between diffusion models and Bayesian inference in infinite-dimensional settings.

Method: The authors prove that diffusion models can be conditioned using an infinite-dimensional extension of Doob’s h-transform, decomposing the conditional score into unconditional score plus guidance term. They propose Supervised Guidance Training, a simulation-free score matching objective to approximate the intractable guidance term for efficient posterior sampling.
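
A finite-dimensional stand-in for the training objective: freeze an unconditional score model and regress only a guidance term so that their sum matches a conditional score-matching target. All components below are illustrative placeholders for the paper's function-space construction.

```python
import torch
import torch.nn as nn

d = 16
net = nn.Sequential(nn.Linear(2 * d + 1, 64), nn.SiLU(), nn.Linear(64, d))

def s_uncond(x_t, t):                                  # frozen unconditional score (placeholder)
    return -x_t

def guidance(x_t, y, t):                               # trainable guidance term
    return net(torch.cat([x_t, y, t.expand(x_t.size(0), 1)], dim=1))

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):
    x0 = torch.randn(32, d)
    y = x0 + 0.1 * torch.randn(32, d)                  # noisy observation of x0
    t = torch.rand(1) * 0.9 + 0.1
    noise = torch.randn_like(x0)
    x_t = x0 + t.sqrt() * noise                        # simple Gaussian forward process
    target = -noise / t.sqrt()                         # score of the perturbation kernel
    with torch.no_grad():
        s_u = s_uncond(x_t, t)
    loss = ((s_u + guidance(x_t, y, t) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```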

Result: The paper provides the first theoretical framework for conditioning diffusion models in function spaces, enabling accurate posterior sampling for Bayesian inverse problems. The proposed Supervised Guidance Training method allows stable and efficient fine-tuning of trained diffusion models to sample from posterior distributions.

Conclusion: This work establishes the theoretical foundations for using diffusion models as expressive priors in Bayesian inverse problems and provides a practical method for posterior sampling, opening new possibilities for solving inverse problems in function spaces using diffusion-based approaches.

Abstract: Score-based diffusion models have recently been extended to infinite-dimensional function spaces, with uses such as inverse problems arising from partial differential equations. In the Bayesian formulation of inverse problems, the aim is to sample from a posterior distribution over functions obtained by conditioning a prior on noisy observations. While diffusion models provide expressive priors in function space, the theory of conditioning them to sample from the posterior remains open. We address this, assuming that either the prior lies in the Cameron-Martin space, or is absolutely continuous with respect to a Gaussian measure. We prove that the models can be conditioned using an infinite-dimensional extension of Doob’s $h$-transform, and that the conditional score decomposes into an unconditional score and a guidance term. As the guidance term is intractable, we propose a simulation-free score matching objective (called Supervised Guidance Training) enabling efficient and stable posterior sampling. We illustrate the theory with numerical examples on Bayesian inverse problems in function spaces. In summary, our work offers the first function-space method for fine-tuning trained diffusion models to accurately sample from a posterior.

[428] Exploring Transformer Placement in Variational Autoencoders for Tabular Data Generation

Aníbal Silva, Moisés Santos, André Restivo, Carlos Soares

Main category: cs.LG

TL;DR: VAEs struggle with tabular data feature relationships; Transformers help capture interactions. Study shows Transformers in VAE components create fidelity-diversity trade-off and reveal linear relationships in decoder blocks.

DetailsMotivation: Standard VAEs with MLPs struggle to model relationships between features in tabular data, especially with mixed data types. Transformers, with their attention mechanism, are better suited for capturing complex feature interactions, but their integration into VAE architectures needs empirical investigation.

Method: Empirically investigate integrating Transformers into different components of a VAE architecture. Conduct experiments on 57 datasets from the OpenML CC18 suite to analyze the impact of Transformer placement in VAE components.

Result: Two main findings: 1) Positioning Transformers to leverage latent and decoder representations creates a trade-off between fidelity and diversity. 2) High similarity between consecutive Transformer blocks in all components, with decoder showing approximately linear relationships between input and output.

Conclusion: Transformers can improve VAE performance on tabular data by better capturing feature interactions, but their placement affects fidelity-diversity balance. The linear relationships in decoder blocks suggest potential architectural simplifications.

Abstract: Tabular data remains a challenging domain for generative models. In particular, the standard Variational Autoencoder (VAE) architecture, typically composed of multilayer perceptrons, struggles to model relationships between features, especially when handling mixed data types. In contrast, Transformers, through their attention mechanism, are better suited for capturing complex feature interactions. In this paper, we empirically investigate the impact of integrating Transformers into different components of a VAE. We conduct experiments on 57 datasets from the OpenML CC18 suite and draw two main conclusions. First, results indicate that positioning Transformers to leverage latent and decoder representations leads to a trade-off between fidelity and diversity. Second, we observe a high similarity between consecutive blocks of a Transformer in all components. In particular, in the decoder, the relationship between the input and output of a Transformer is approximately linear.

[429] Less is More: Clustered Cross-Covariance Control for Offline RL

Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren

Main category: cs.LG

TL;DR: C^4 addresses offline RL distributional shift by mitigating harmful TD cross-covariance effects through partitioned buffer sampling and gradient-based corrective penalties.

DetailsMotivation: Standard squared error objectives in offline RL induce harmful TD cross-covariance that amplifies in out-of-distribution areas, biasing optimization and degrading policy learning, especially with scarce data or OOD-dominated datasets.

Method: Two complementary strategies: 1) Partitioned buffer sampling that restricts updates to localized replay partitions to attenuate irregular covariance effects and align update directions (C^4), and 2) Explicit gradient-based corrective penalty that cancels covariance-induced bias within each update.
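
A simplified reading of the partitioned-sampling idea: cluster the offline buffer once, then draw each minibatch from a single partition so that update directions stay locally aligned. Not the authors' code; names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
states = rng.normal(size=(10000, 17))                 # state portion of the replay buffer

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(states)
partitions = [np.flatnonzero(labels == c) for c in range(8)]

def sample_batch(batch_size=256):
    part = partitions[rng.integers(len(partitions))]  # one partition per update
    return rng.choice(part, size=batch_size, replace=True)

batch_idx = sample_batch()                            # indices into the replay buffer
```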

Result: Theoretical proof that buffer partitioning preserves lower bound property of maximization objective, mitigates excessive conservatism in extreme OOD areas without altering core behavior of policy constrained offline RL. Empirically shows higher stability and up to 30% improvement in returns over prior methods.

Conclusion: C^4 effectively addresses distributional shift in offline RL by controlling harmful TD cross-covariance effects, particularly beneficial for small datasets and OOD-emphasized splits, while maintaining theoretical guarantees and practical implementability.

Abstract: A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross-covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance-induced bias within each update. We prove that buffer partitioning preserves the lower bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy-constrained offline reinforcement learning. Empirically, our method showcases higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.

[430] COMET-SG1: Lightweight Autoregressive Regressor for Edge and Embedded AI

Shakhyar Gogoi

Main category: cs.LG

TL;DR: COMET-SG1 is a lightweight autoregressive regression model for edge AI time-series prediction that prioritizes stability and reduced long-horizon drift through linear behavior-space encoding and deterministic state updates.

DetailsMotivation: Need for stable time-series prediction models on edge/embedded systems where prediction errors accumulate over time, requiring bounded long-horizon behavior under fully autoregressive inference.

Method: Uses linear behavior-space encoding, memory-anchored transition estimation, and deterministic state updates instead of RNNs or transformers. Designed for compact parameter footprint and fixed-point arithmetic compatibility.

Result: Achieves competitive short-horizon accuracy while showing significantly reduced long-horizon drift compared to MLP, LSTM, and k-nearest neighbor baselines on non-stationary synthetic time-series data.

Conclusion: COMET-SG1 provides a practical, interpretable approach for stable autoregressive prediction in edge/embedded AI applications with its lightweight design and stability-oriented architecture.

Abstract: COMET-SG1 is a lightweight, stability-oriented autoregressive regression model designed for time-series prediction on edge and embedded AI systems. Unlike recurrent neural networks or transformer-based sequence models, COMET-SG1 operates through linear behavior-space encoding, memory-anchored transition estimation, and deterministic state updates. This structure prioritizes bounded long-horizon behavior under fully autoregressive inference, a critical requirement for edge deployment where prediction errors accumulate over time. Experiments on non-stationary synthetic time-series data demonstrate that COMET-SG1 achieves competitive short-horizon accuracy while exhibiting significantly reduced long-horizon drift compared to MLP, LSTM, and k-nearest neighbor baselines. With a compact parameter footprint and operations compatible with fixed-point arithmetic, COMET-SG1 provides a practical and interpretable approach for stable autoregressive prediction in edge and embedded AI applications.

[431] Smoothing the Black-Box: Signed-Distance Supervision for Black-Box Model Copying

Rubén Jiménez, Oriol Pujol

Main category: cs.LG

TL;DR: Black-box model copying using signed distances instead of hard labels improves boundary reconstruction and generalization for legacy model upgrades.

DetailsMotivation: Machine learning systems need continuous evolution as data, architectures, and regulations change, often without access to original training data or model internals. Black-box copying provides a practical refactoring mechanism for upgrading legacy models using only input-output queries.

Method: Proposes a distance-based copying framework that replaces hard-label supervision with signed distances to the teacher’s decision boundary, converting copying into a smooth regression problem. Develops an α-governed smoothing and regularization scheme with Hölder/Lipschitz control, and introduces two model-agnostic algorithms to estimate signed distances under label-only access.
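
One of the simplest label-only estimators is bisection along a segment whose endpoints receive opposite teacher labels; the resulting value is an upper bound on the true boundary distance along that direction. A model-agnostic sketch (the paper's two algorithms may differ):

```python
import numpy as np

def signed_distance(teacher, x, z, iters=30):
    """Bisect the segment [x, z] to locate the label flip, assuming
    teacher(x) != teacher(z); sign the distance by x's own label."""
    y_x = teacher(x)
    lo, hi = x.copy(), z.copy()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if teacher(mid) == y_x:
            lo = mid
        else:
            hi = mid
    boundary = 0.5 * (lo + hi)
    d = np.linalg.norm(boundary - x)
    return d if y_x == 1 else -d

teacher = lambda p: int(p.sum() > 0)     # stand-in black-box classifier
print(signed_distance(teacher, np.array([1.0, 1.0]), np.array([-2.0, -1.0])))
```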

Result: Experiments on synthetic problems and UCI benchmarks show consistent improvements in fidelity and generalization accuracy over hard-label baselines, while enabling distance outputs as uncertainty-related signals for black-box replicas.

Conclusion: Distance-based copying provides a more effective approach for black-box model replication by exploiting local geometry through signed distance supervision, overcoming limitations of hard-label approaches for boundary reconstruction.

Abstract: Deployed machine learning systems must continuously evolve as data, architectures, and regulations change, often without access to original training data or model internals. In such settings, black-box copying provides a practical refactoring mechanism, i.e., upgrading legacy models by learning replicas from input-output queries alone. When restricted to hard-label outputs, copying turns into a discontinuous surface reconstruction problem from pointwise queries, severely limiting the ability to recover boundary geometry efficiently. We propose a distance-based copying (distillation) framework that replaces hard-label supervision with signed distances to the teacher’s decision boundary, converting copying into a smooth regression problem that exploits local geometry. We develop an $\alpha$-governed smoothing and regularization scheme with Hölder/Lipschitz control over the induced target surface, and introduce two model-agnostic algorithms to estimate signed distances under label-only access. Experiments on synthetic problems and UCI benchmarks show consistent improvements in fidelity and generalization accuracy over hard-label baselines, while enabling distance outputs as uncertainty-related signals for black-box replicas.

[432] When More Data Doesn’t Help: Limits of Adaptation in Multitask Learning

Steve Hanneke, Mingyue Xu

Main category: cs.LG

TL;DR: The paper establishes a stronger impossibility result for multitask learning, showing that adaptation hardness persists even with arbitrarily large sample sizes per task, going beyond previous no-free-lunch theorems.

DetailsMotivation: To understand the fundamental statistical limits of multitask learning beyond what was shown in prior work (arXiv:2006.15785), which demonstrated that without distributional information, no algorithm based on sample aggregation alone can guarantee optimal risk when sample size per task is bounded.

Method: The authors establish a stronger impossibility result of adaptation that holds for arbitrarily large sample size per task, going beyond the previous no-free-lunch theorem. They analyze the statistical limits of multitask learning and discuss optimal adaptivity concepts.

Result: The paper shows that the hardness of multitask learning cannot be overcome by having abundant data per task: adaptation difficulties persist even with arbitrarily large sample sizes per task, establishing a stronger impossibility result than previous work.

Conclusion: Multitask learning faces fundamental statistical limitations that cannot be resolved simply by increasing sample sizes per task. The paper establishes a stronger impossibility result and introduces concepts of optimal adaptivity that may guide future research in this area.

Abstract: Multitask learning and related frameworks have achieved tremendous success in modern applications. In the multitask learning problem, we are given a set of heterogeneous datasets collected from related source tasks and hope to enhance the performance above what we could hope to achieve by solving each of them individually. The recent work of arXiv:2006.15785 showed that, without access to distributional information, no algorithm based on aggregating samples alone can guarantee optimal risk as long as the sample size per task is bounded. In this paper, we focus on understanding the statistical limits of multitask learning. We go beyond the no-free-lunch theorem of arXiv:2006.15785 by establishing a stronger impossibility result of adaptation that holds for arbitrarily large sample size per task. This improvement conveys an important message: the hardness of multitask learning cannot be overcome by having abundant data per task. We also discuss the notion of optimal adaptivity, which may be of future interest.

[433] Active Learning for Decision Trees with Provable Guarantees

Arshia Soltani Moakhar, Tanapoom Laoaron, Faraz Ghahremani, Kiarash Banihashem, MohammadTaghi Hajiaghayi

Main category: cs.LG

TL;DR: First theoretical analysis of active learning label complexity for decision trees, showing polylogarithmic label queries under specific assumptions, and presenting first general active learning algorithm with multiplicative error guarantee.

DetailsMotivation: To advance theoretical understanding of active learning label complexity for decision trees as binary classifiers, addressing the gap in analysis of key parameters like the disagreement coefficient and developing algorithms with provable guarantees.

Method: 1) Analyze disagreement coefficient for decision trees under two natural assumptions: distinct feature dimensions per root-to-leaf path and regular grid-like data structure. 2) Develop first general active learning algorithm for binary classification with multiplicative (1+ε)-approximate guarantee. 3) Combine results to design active learning algorithm for decision trees with polylogarithmic label queries.

Result: 1) Showed assumptions are essential - relaxing them leads to polynomial label complexity. 2) Designed active learning algorithm for decision trees using only polylogarithmic number of label queries under stated assumptions. 3) Established label complexity lower bound showing algorithm’s ε-dependence is near-optimal.

Conclusion: This paper provides foundational theoretical analysis for active learning with decision trees, identifying necessary assumptions for efficient learning and developing near-optimal algorithms with provable guarantees, advancing the theoretical understanding of label complexity in this domain.

Abstract: This paper advances the theoretical understanding of active learning label complexity for decision trees as binary classifiers. We make two main contributions. First, we provide the first analysis of the disagreement coefficient for decision trees, a key parameter governing active learning label complexity. Our analysis holds under two natural assumptions required for achieving polylogarithmic label complexity: (i) each root-to-leaf path queries distinct feature dimensions, and (ii) the input data has a regular, grid-like structure. We show these assumptions are essential, as relaxing them leads to polynomial label complexity. Second, we present the first general active learning algorithm for binary classification that achieves a multiplicative error guarantee, producing a $(1+ε)$-approximate classifier. By combining these results, we design an active learning algorithm for decision trees that uses only a polylogarithmic number of label queries in the dataset size, under the stated assumptions. Finally, we establish a label complexity lower bound, showing our algorithm’s dependence on the error tolerance $ε$ is close to optimal.

[434] PatchFormer: A Patch-Based Time Series Foundation Model with Hierarchical Masked Reconstruction and Cross-Domain Transfer Learning for Zero-Shot Multi-Horizon Forecasting

Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz

Main category: cs.LG

TL;DR: PatchFormer: A patch-based time series foundation model using hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer, achieving state-of-the-art zero-shot forecasting with 27.3% MSE reduction and 94% less task-specific data.

DetailsMotivation: Existing time series forecasting approaches require domain-specific feature engineering and substantial labeled data for each task, which limits their applicability across different domains and increases development costs.

Method: PatchFormer segments time series into patches and learns multiscale temporal representations with learnable aggregation across scales. It uses masked patch reconstruction with dynamic masking and objectives that encourage both local accuracy and global consistency, followed by cross-domain knowledge distillation. Lightweight adapters enable efficient transfer learning.
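
A minimal sketch of masked patch reconstruction for a univariate series; patch size, masking ratio, and the encoder are illustrative, not PatchFormer's hierarchical multiscale design.

```python
import torch
import torch.nn as nn

B, L, P = 8, 512, 16                                   # batch, series length, patch size
N = L // P                                             # number of patches

series = torch.randn(B, L)
patches = series.view(B, N, P)                         # (B, N, P)

embed = nn.Linear(P, 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(64, P)
mask_token = nn.Parameter(torch.zeros(64))

mask = torch.rand(B, N) < 0.4                          # dynamic masking ratio
tokens = embed(patches)
tokens[mask] = mask_token                              # replace masked patches with a token
recon = head(encoder(tokens))                          # (B, N, P)

loss = ((recon - patches) ** 2)[mask].mean()           # reconstruct masked patches only
loss.backward()
```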

Result: Experiments on 24 benchmark datasets across weather, energy, traffic, finance, and healthcare show state-of-the-art zero-shot multi-horizon forecasting with 27.3% MSE reduction relative to strong baselines, requiring 94% less task-specific training data. The model scales near log-linearly with pretraining data up to 100 billion points and processes length-512 sequences 3.8x faster than full-sequence transformers.

Conclusion: PatchFormer demonstrates that patch-based time series foundation models with hierarchical masked reconstruction and efficient adapters can achieve superior zero-shot forecasting performance across diverse domains while significantly reducing the need for task-specific data and computational resources.

Abstract: Time series forecasting is a fundamental problem with applications in climate, energy, healthcare, and finance. Many existing approaches require domain-specific feature engineering and substantial labeled data for each task. We introduce PatchFormer, a patch-based time series foundation model that uses hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer. PatchFormer segments time series into patches and learns multiscale temporal representations with learnable aggregation across temporal scales. Pretraining uses masked patch reconstruction with dynamic masking and objectives that encourage both local accuracy and global consistency, followed by cross-domain knowledge distillation. Experiments on 24 benchmark datasets spanning weather, energy, traffic, finance, and healthcare demonstrate state-of-the-art zero-shot multi-horizon forecasting, reducing mean squared error by 27.3 percent relative to strong baselines while requiring 94 percent less task-specific training data. The model exhibits near log-linear scaling with more pretraining data up to 100 billion points and processes length-512 sequences 3.8x faster than full-sequence transformers.

[435] Membership Privacy Risks of Sharpness Aware Minimization

Young In Kim, Andrea Agiollo, Pratiksha Agrawal, Johannes O. Royset, Rajiv Khanna

Main category: cs.LG

TL;DR: SAM improves generalization but worsens membership privacy, making models more vulnerable to membership inference attacks than SGD despite better test performance.

DetailsMotivation: To investigate whether optimization algorithms that find flatter minima (like SAM) provide better membership privacy protection, given their known benefits for generalization and robustness.

Method: Analyzed SAM vs SGD across multiple datasets and attack methods, examined memorization and influence scores, modeled SAM under linear interpolation regime, and conducted theoretical analysis of sharpness regularization effects.
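
For readers unfamiliar with SAM, the sketch below shows the generic two-step update the paper analyzes (ascend to a worst-case point within a radius-rho neighborhood, then apply the gradient from that point to the original weights); it is the standard formulation, not the authors' code, and assumes every parameter receives a gradient:

```python
# One SAM update step: perturb weights toward higher loss, take the
# gradient there, then apply it at the original weights.
import torch

def sam_step(model, loss_fn, data, target, optimizer, rho=0.05):
    optimizer.zero_grad()
    loss_fn(model(data), target).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / grad_norm)       # ascent: w + eps(w)
    optimizer.zero_grad()
    loss_fn(model(data), target).backward()   # gradient at perturbed point
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / grad_norm)       # restore original weights
    optimizer.step()
```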

Result: SAM is more vulnerable to membership inference attacks than SGD, despite achieving lower test error. SAM captures atypical subpatterns better, leading to higher memorization scores, while SGD relies more on majority features. SAM’s lower variance in prediction confidence amplifies membership signals.

Conclusion: The geometric mechanism of SAM that improves generalization simultaneously exacerbates membership leakage, creating a privacy-generalization trade-off. Sharpness regularization inherently reduces variance, guaranteeing higher MIA advantage for confidence and likelihood ratio attacks.

Abstract: Optimization algorithms that seek flatter minima, such as Sharpness-Aware Minimization (SAM), are credited with improved generalization and robustness to noise. We ask whether such gains impact membership privacy. Surprisingly, we find that SAM is more prone to Membership Inference Attacks (MIA) than classical SGD across multiple datasets and attack methods, despite achieving lower test error. This suggests that the geometric mechanism of SAM that improves generalization simultaneously exacerbates membership leakage. We investigate this phenomenon through extensive analysis of memorization and influence scores. Our results reveal that SAM is more capable of capturing atypical subpatterns, leading to higher memorization scores of samples. Conversely, SGD depends more heavily on majority features, exhibiting worse generalization on atypical subgroups and lower memorization. Crucially, this characteristic of SAM can be linked to lower variance in the prediction confidence of unseen samples, thereby amplifying membership signals. Finally, we model SAM under a perfectly interpolating linear regime and theoretically show that sharpness regularization inherently reduces variance, guaranteeing a higher MIA advantage for confidence and likelihood ratio attacks.

[436] MAnchors: Memorization-Based Acceleration of Anchors via Rule Reuse and Transformation

Haonan Yu, Junhao Liu, Xin Zhang

Main category: cs.LG

TL;DR: Memorization-based framework accelerates Anchors explanation technique by reusing intermediate results through rule transformations, reducing computation time while preserving fidelity.

DetailsMotivation: Anchors is computationally inefficient, limiting its practical adoption despite being a popular local model-agnostic explanation technique. The paper aims to address this efficiency problem.

Method: Proposes a memorization-based framework that stores and reuses intermediate results from prior explanations. Uses two rule transformations: horizontal transformation adapts pre-trained explanations to new inputs by replacing features, and vertical transformation refines general explanations until they are precise enough for specific inputs.
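
The two transformations can be pictured with a toy rule representation (a dict of feature-value predicates); the sketch below is purely illustrative, and `precision_fn` stands in for the sampling-based precision estimate that Anchors actually performs:

```python
# Toy sketch of rule reuse: rules are {feature: value} predicate sets.

def horizontal_transform(rule, new_input):
    """Adapt a memorized rule to a new input by re-binding each
    predicate's value to the new input's feature value."""
    return {feat: new_input[feat] for feat in rule}

def vertical_transform(rule, new_input, precision_fn, tau=0.95):
    """Refine a general (high-coverage, low-precision) rule by adding
    predicates until it is precise enough for this input."""
    rule = dict(rule)
    for feat, val in new_input.items():
        if precision_fn(rule) >= tau:
            break
        rule.setdefault(feat, val)   # specialize the rule
    return rule
```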

Result: The method significantly reduces explanation generation time across tabular, text, and image datasets while maintaining explanation fidelity and interpretability.

Conclusion: The proposed framework enables practical adoption of Anchors in time-sensitive applications by accelerating the explanation process without compromising quality.

Abstract: Anchors is a popular local model-agnostic explanation technique whose applicability is limited by its computational inefficiency. To address this limitation, we propose a memorization-based framework that accelerates Anchors while preserving explanation fidelity and interpretability. Our approach leverages the iterative nature of Anchors’ algorithm, which gradually refines an explanation until it is precise enough for a given input, by storing and reusing intermediate results obtained during prior explanations. Specifically, we maintain a memory of low-precision, high-coverage rules and introduce a rule transformation framework to adapt them to new inputs: the horizontal transformation adapts a pre-trained explanation to the current input by replacing features, and the vertical transformation refines the general explanation until it is precise enough for the input. We evaluate our method across tabular, text, and image datasets, demonstrating that it significantly reduces explanation generation time while maintaining fidelity and interpretability, thereby enabling the practical adoption of Anchors in time-sensitive applications.

[437] BAGEL: Projection-Free Algorithm for Adversarially Constrained Online Convex Optimization

Yiyang Lu, Mohammad Pedramfar, Vaneet Aggarwal

Main category: cs.LG

TL;DR: BAGEL algorithm achieves optimal O(√T) regret and constraint violation in constrained online convex optimization using separation oracles instead of projections, matching projection-based rates at lower computational cost.

DetailsMotivation: Projection-based COCO algorithms achieve optimal regret but are computationally expensive due to projections. Projection-free methods using linear optimization oracles are more scalable but achieve slower O(T^{3/4}) regret rates. The paper investigates whether optimal O(√T) rates can be recovered in projection-free settings by strengthening oracle assumptions.

Method: Introduces BAGEL algorithm that uses a Separation Oracle (SO) instead of projections. The method leverages infeasible projections via SO to achieve optimal regret rates with Õ(T) oracle calls, with dependence on the geometry of the action set.
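
The oracle access model is the interesting part: a separation oracle never computes a projection onto the feasible set, it only returns a violated hyperplane. Below is a minimal sketch of an "infeasible projection" built from such cuts, with the unit ball as a stand-in feasible set; BAGEL's actual procedure is more refined:

```python
# Approximate projection onto a convex set K using only a separation
# oracle: the oracle returns None if x is in K, otherwise a halfspace
# (a, b) with a @ x > b but a @ y <= b for all y in K.
import numpy as np

def oracle_ball(x, radius=1.0):
    """Separation oracle for the Euclidean unit ball (stand-in example)."""
    if np.linalg.norm(x) <= radius:
        return None
    a = x / np.linalg.norm(x)
    return a, radius

def infeasible_projection(x, oracle, n_iters=100):
    """Move x toward K by projecting onto each violated halfspace in
    turn; no full projection onto K is ever computed."""
    for _ in range(n_iters):
        cut = oracle(x)
        if cut is None:
            return x
        a, b = cut
        x = x - (a @ x - b) / (a @ a) * a   # project onto {y : a @ y = b}
    return x

print(infeasible_projection(np.array([3.0, 4.0]), oracle_ball))  # [0.6 0.8]
```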

Result: BAGEL achieves O(√T) regret and Õ(√T) cumulative constraint violation for convex cost functions, matching the time-horizon dependence of projection-based methods while using separation oracles instead of projections.

Conclusion: Establishes a regime where projection-free methods can attain the same convergence rates as projection-based counterparts by using stronger separation oracles, providing a computationally efficient alternative to projection-based COCO algorithms.

Abstract: Projection-based algorithms for Constrained Online Convex Optimization (COCO) achieve optimal $\mathcal{O}(T^{1/2})$ regret guarantees but face scalability challenges due to the computational complexity of projections. To circumvent this, projection-free methods utilizing Linear Optimization Oracles (LOO) have been proposed, albeit typically achieving slower $\mathcal{O}(T^{3/4})$ regret rates. In this work, we examine whether the $\mathcal{O}(T^{1/2})$ rate can be recovered in the projection-free setting by strengthening the oracle assumption. We introduce BAGEL, an algorithm utilizing a Separation Oracle (SO) that achieves $\mathcal{O}(T^{1/2})$ regret and $\tilde{\mathcal{O}}(T^{1/2})$ cumulative constraint violation (CCV) for convex cost functions. Our analysis shows that by leveraging an infeasible projection via SO, we can match the time-horizon dependence of projection-based methods with $\tilde{\mathcal{O}}(T)$ oracle calls, up to a dependence on the geometry of the action set. This establishes a specific regime where projection-free methods can attain the same convergence rates as projection-based counterparts.

[438] Compositional Reasoning with Transformers, RNNs, and Chain of Thought

Gilad Yehudai, Noah Amsel, Joan Bruna

Main category: cs.LG

TL;DR: The paper analyzes expressive power of transformers, RNNs, and transformers with chain-of-thought on Compositional Reasoning Questions, showing none can solve them without hyperparameter growth, but each has different tradeoffs.

DetailsMotivation: To understand whether there's a single best neural architecture for given tasks by comparing expressive power of different architectures on compositional reasoning problems.

Method: Analyze Compositional Reasoning Questions (CRQs) - multi-step problems with tree-like structure like Boolean formulas. Prove impossibility results under hardness assumptions and provide constructive solutions for each architecture.
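
A quick example makes the problem class concrete: evaluating a Boolean formula given as a tree is a CRQ whose recursion depth is exactly the compositional depth the three architectures must pay for in depth, width, or chain-of-thought tokens:

```python
# Evaluate a Boolean formula represented as a tree of (op, children).

def evaluate(node):
    op, children = node
    if op == "leaf":
        return children                  # a literal True/False
    vals = [evaluate(c) for c in children]
    return all(vals) if op == "and" else any(vals)

# (x AND (y OR z)) with x=True, y=False, z=True
formula = ("and", [("leaf", True),
                   ("or", [("leaf", False), ("leaf", True)])])
print(evaluate(formula))                 # True
```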

Result: Transformers need logarithmic depth, RNNs need logarithmic embedding dimension (with specific input order), transformers with CoT need n tokens for input size n. No architecture is strictly better; each has different tradeoffs.

Conclusion: Even for a single problem class, different architectures have distinct strengths/weaknesses. CRQs are inherently hard but solvable in multiple ways, showing no universal best architecture.

Abstract: It is well understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQ). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that under standard hardness assumptions, \emph{none} of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We then provide constructions for solving CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with chain of thought, our construction uses $n$ CoT tokens for input size $n$. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.

[439] Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

Main category: cs.LG

TL;DR: Transformers can solve graph problems with constant depth if width grows linearly, showing width can compensate for depth for faster inference and training.

DetailsMotivation: To understand the trade-off between transformer depth and width for solving graph-based algorithmic problems, specifically what happens when width grows linearly while depth is kept fixed.

Method: Theoretical analysis of transformer architectures for graph-based tasks, examining minimal size requirements (depth vs width) for implementing various algorithms.

Result: Surprisingly, linear width allows constant depth to suffice for many graph problems, while some problems require quadratic width. Wider models can achieve same accuracy as deep models with faster training/inference.

Conclusion: Transformer implementations of graph algorithms show complex depth-width trade-offs, where moderate width increases enable shallower models with practical advantages for inference and training time.

Abstract: Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

[440] Physics-Guided Multimodal Transformers are the Necessary Foundation for the Next Generation of Meteorological Science

Jing Han, Hanting Chen, Kai Han, Xiaomeng Huang, Wenjun Xu, Dacheng Tao, Ping Zhang

Main category: cs.LG

TL;DR: The paper argues for transitioning from fragmented hybrid AI approaches to a unified physics-guided multimodal transformer paradigm in meteorological and climate sciences to create more scientifically consistent and physically-constrained AI systems.

DetailsMotivation: Current purely data-driven models treat atmospheric processes as visual patterns, producing results that lack scientific consistency and violate physical laws. Existing hybrid approaches are ad-hoc and don't scale well across heterogeneous meteorological data types (satellite imagery, sparse sensor measurements).

Method: Proposes using transformer architecture as the foundation for systematic integration of domain knowledge through physical constraint embedding and physics-informed loss functions, leveraging transformers’ inherent capacity for cross-modal alignment.

Result: This is a position paper advocating for an architectural shift rather than presenting specific experimental results. The argument is that this approach provides the only viable foundation for creating scientifically grounded AI systems.

Conclusion: The community should move away from “black-box” curve fitting toward AI systems that are inherently falsifiable, scientifically grounded, and robust enough to address extreme weather and climate change challenges through unified physics-guided multimodal transformers.

Abstract: This position paper argues that the next generation of artificial intelligence in meteorological and climate sciences must transition from fragmented hybrid heuristics toward a unified paradigm of physics-guided multimodal transformers. While purely data-driven models have achieved significant gains in predictive accuracy, they often treat atmospheric processes as mere visual patterns, frequently producing results that lack scientific consistency or violate fundamental physical laws. We contend that current "hybrid" attempts to bridge this gap remain ad-hoc and struggle to scale across the heterogeneous nature of meteorological data ranging from satellite imagery to sparse sensor measurements. We argue that the transformer architecture, through its inherent capacity for cross-modal alignment, provides the only viable foundation for a systematic integration of domain knowledge via physical constraint embedding and physics-informed loss functions. By advocating for this unified architectural shift, we aim to steer the community away from "black-box" curve fitting and toward AI systems that are inherently falsifiable, scientifically grounded, and robust enough to address the existential challenges of extreme weather and climate change.

[441] FC-PINO: High Precision Physics-Informed Neural Operators via Fourier Continuation

Adarsh Ganeshram, Haydn Maust, Valentin Duruisseaux, Zongyi Li, Yixuan Wang, Daniel Leibovici, Oscar Bruno, Thomas Hou, Anima Anandkumar

Main category: cs.LG

TL;DR: FC-PINO extends PINO to handle non-periodic PDEs by integrating Fourier continuation methods, overcoming spectral differentiation’s periodicity assumption and Gibbs phenomena issues.

DetailsMotivation: Standard PINO uses spectral differentiation that assumes periodicity, causing significant errors (Gibbs phenomena) near boundaries for non-periodic PDEs, limiting its applicability to real-world problems.

Method: Integrates Fourier continuation (FC-Legendre and FC-Gram approaches) into PINO framework to transform non-periodic signals into periodic functions on extended domains, enabling accurate derivative computations without periodicity constraints.
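
The numerical issue and the fix are easy to demonstrate. The NumPy sketch below differentiates a non-periodic function spectrally, first on the raw signal (where the implicit periodic wrap creates a jump and large Gibbs errors) and then on a simple even-mirror extension; the mirror removes the jump but still leaves a kink at the seam, which is precisely why FC-Gram and FC-Legendre construct smoother, well-conditioned continuations instead:

```python
# Spectral differentiation of f(x) = x^2 on [0, 1): raw vs. a periodic
# (even-mirror) extension. The mirror trick is a crude stand-in for
# Fourier continuation, used only to illustrate the effect.
import numpy as np

n, L = 256, 1.0
x = np.linspace(0.0, L, n, endpoint=False)
f = x ** 2                                   # non-periodic on [0, 1)

def spectral_derivative(g, length):
    k = 2j * np.pi * np.fft.fftfreq(len(g), d=length / len(g))
    return np.real(np.fft.ifft(k * np.fft.fft(g)))

naive = spectral_derivative(f, L)            # jump at the wrap -> Gibbs
ext = np.concatenate([f, f[::-1]])           # continuous, period 2L
continued = spectral_derivative(ext, 2 * L)[:n]

true = 2 * x
print(np.abs(naive - true).max())            # large boundary error
print(np.abs(continued - true).max())        # orders of magnitude smaller
```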

Result: FC-PINO substantially outperforms standard PINO on non-periodic and non-smooth PDE benchmarks, providing accurate, robust, and scalable solutions where PINO fails or struggles even with padding.

Conclusion: Fourier continuation is critical for extending PINO’s applicability to a wider range of PDE problems requiring high-precision solutions, overcoming limitations of spectral differentiation for non-periodic cases.

Abstract: The physics-informed neural operator (PINO) is a machine learning paradigm that has demonstrated promising results for learning solutions to partial differential equations (PDEs). It leverages the Fourier Neural Operator to learn solution operators in function spaces and leverages physics losses during training to penalize deviations from known physics laws. Spectral differentiation provides an efficient way to compute derivatives for the physics losses, but it inherently assumes periodicity. When applied to non-periodic functions, this assumption can lead to significant errors, including Gibbs phenomena near domain boundaries which degrade the accuracy of both function representations and derivative computations. To overcome this limitation, we introduce the FC-PINO (Fourier-Continuation-based Physics-Informed Neural Operator) architecture which extends the accuracy and efficiency of PINO and spectral differentiation to non-periodic and non-smooth PDEs. In FC-PINO, we propose integrating Fourier continuation into the PINO framework, and test two different continuation approaches: FC-Legendre and FC-Gram. By transforming non-periodic signals into periodic functions on extended domains in a well-conditioned manner, Fourier continuation enables fast and accurate derivative computations. This approach avoids the discretization sensitivity of finite differences and the memory overhead of automatic differentiation. We demonstrate that standard PINO fails (without padding) or struggles (even with padding) to solve non-periodic and non-smooth PDEs with high precision, across challenging benchmarks. In contrast, the proposed FC-PINO provides accurate, robust, and scalable solutions, substantially outperforming PINO alternatives, and demonstrating that Fourier continuation is critical for extending PINO to a wider range of PDE problems when high-precision solutions are needed.

[442] NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

Main category: cs.LG

TL;DR: NoWag is a unified framework for one-shot shape-preserving compression of large language models, applying vector quantization and pruning to compress Llama models with state-of-the-art results.

DetailsMotivation: Large language models have excellent performance but suffer from huge computational and memory requirements, making deployment difficult in resource-constrained environments.

Method: NoWag (Normalized Weight and Activation Guided Compression) is a unified framework for one-shot shape preserving compression algorithms. It applies two techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P).
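
To make "shape-preserving" concrete, here is an illustrative NumPy sketch in the spirit of the method: weight rows are normalized, rescaled by activation statistics so quantization error is measured where it matters, and then vector-quantized. The normalization scheme, codebook size, and k-means quantizer here are assumptions, not the paper's exact algorithm:

```python
# Illustrative normalized, activation-guided vector quantization of a
# weight matrix (shape is preserved: W_hat has the same shape as W).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))            # a weight matrix
act_scale = rng.uniform(0.5, 2.0, 64)     # per-input-channel activation norms

row_norms = np.linalg.norm(W, axis=1, keepdims=True)
V = (W / row_norms) * act_scale           # normalized, activation-weighted

def kmeans(X, k=32, iters=20):
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return C, assign

codebook, assign = kmeans(V)              # store: codebook, ids, scales
W_hat = (codebook[assign] / act_scale) * row_norms
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```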

Result: NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. The framework was successfully applied to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models.

Conclusion: The findings highlight underlying commonalities between different compression paradigms and suggest promising directions for future research in model compression.

Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag

[443] Toward Highly Efficient and Private Submodular Maximization via Matrix-Based Acceleration

Boyu Liu, Lianke Qin, Zhao Song, Yitan Wang, Jiale Zhao

Main category: cs.LG

TL;DR: A framework for efficient and private submodular function maximization with improved computational complexity and privacy guarantees.

DetailsMotivation: Submodular function maximization is widely used in applications like document summarization and sensor placement, but faces computational bottlenecks (O(knd²)) and privacy concerns in sensitive optimization tasks.

Method: Three key components: 1) Novel matrix-based computation paradigm for faster function evaluations, 2) Approximate data structures to streamline optimization, 3) Integration of (ε,δ)-differential privacy guarantees.

Result: Achieves theoretical complexity of O(ε⁻²(nd+kn+kd²)log(k/δ)), significantly improving over the O(knd²) bottleneck while providing formal privacy guarantees.

Conclusion: The proposed integrated framework simultaneously addresses efficiency and privacy challenges in submodular function maximization, making it more practical for real-world applications with sensitive data.

Abstract: Submodular function maximization is a critical building block for diverse tasks, such as document summarization, sensor placement, and image segmentation. Yet its practical utility is often limited by the $O(knd^2)$ computational bottleneck. In this paper, we propose an integrated framework that addresses efficiency and privacy simultaneously. First, we introduce a novel matrix-based computation paradigm that accelerates function evaluations. Second, we develop approximate data structures that further streamline the optimization process, achieving a theoretical complexity of $O(ε^{-2}(nd+kn+kd^2)\log(k/δ))$. Third, we integrate $(ε, δ)$-DP guarantees to address the privacy concerns inherent in sensitive optimization tasks.

[444] Comparing Time-Series Analysis Approaches Utilized in Research Papers to Forecast COVID-19 Cases in Africa: A Literature Review

Ali Ebadi, Ebrahim Sahafizadeh

Main category: cs.LG

TL;DR: Literature review comparing time-series analysis approaches for forecasting COVID-19 cases in Africa, covering papers from 2020-2023.

DetailsMotivation: To systematically compare and evaluate different time-series analysis methodologies used for COVID-19 forecasting in Africa, identifying effective approaches and their limitations.

Method: Systematic literature review of English-language papers (Jan 2020-July 2023) from PubMed, Google Scholar, Scopus, and Web of Science, focusing on time-series analysis of COVID-19 data in Africa, with evaluation of model implementation and performance.

Result: The review identified various time-series methodologies used for COVID-19 forecasting in Africa, highlighting their effectiveness and limitations in predicting virus spread.

Conclusion: The findings provide insights for improving time-series models and suggest integrating different approaches for better public health decision-making in future research.

Abstract: This literature review aimed to compare various time-series analysis approaches utilized in forecasting COVID-19 cases in Africa. The study involved a methodical search for English-language research papers published between January 2020 and July 2023, focusing specifically on papers that utilized time-series analysis approaches on COVID-19 datasets in Africa. A variety of databases including PubMed, Google Scholar, Scopus, and Web of Science were utilized for this process. The research papers underwent an evaluation process to extract relevant information regarding the implementation and performance of the time-series analysis models. The study highlighted the different methodologies employed, evaluating their effectiveness and limitations in forecasting the spread of the virus. The result of this review could contribute deeper insights into the field, and future research should consider these insights to improve time series analysis models and explore the integration of different approaches for enhanced public health decision-making.

[445] Robust MAE-Driven NAS: From Mask Reconstruction to Architecture Innovation

Yiming Hu, Xiangxiang Chu, Yong Wang

Main category: cs.LG

TL;DR: Unsupervised NAS using Masked Autoencoders eliminates need for labeled data while maintaining performance and addressing DARTS collapse issues.

DetailsMotivation: Traditional NAS methods heavily rely on labeled data which is expensive and time-consuming to obtain. There's a need for NAS approaches that can work without labeled supervision while maintaining performance and generalization capabilities.

Method: Proposes an unsupervised NAS method based on Masked Autoencoders (MAE), replacing supervised learning with image reconstruction tasks. Introduces a hierarchical decoder to address performance collapse issues in Differentiable Architecture Search (DARTS) in unsupervised settings.

Result: Extensive experiments across various datasets demonstrate the method’s effectiveness and robustness, showing superiority over counterpart approaches.

Conclusion: The proposed unsupervised NAS method using MAE successfully eliminates the need for labeled data while maintaining performance and generalization ability, with a hierarchical decoder solving DARTS collapse problems in unsupervised settings.

Abstract: Neural Architecture Search (NAS) relies heavily on labeled data, which is labor-intensive and time-consuming to obtain. In this paper, we propose a novel NAS method based on an unsupervised paradigm, specifically Masked Autoencoders (MAE), thereby eliminating the need for labeled data. By replacing the supervised learning objective with an image reconstruction task, our approach enables the efficient discovery of network architectures without compromising performance and generalization ability. Additionally, we address the problem of performance collapse encountered in the widely-used Differentiable Architecture Search (DARTS) in the unsupervised setting by designing a hierarchical decoder. Extensive experiments across various datasets demonstrate the effectiveness and robustness of our method, offering empirical evidence of its superiority over the counterparts.

[446] GPT2MEG: Quantizing MEG for Autoregressive Generation

Richard Csaky, Mats W. J. van Es, Oiwi Parker Jones, Mark Woolrich

Main category: cs.LG

TL;DR: GPT2MEG: A GPT-2-style Transformer adapted for autoregressive generation of realistic MEG neural time series, outperforming WaveNet variants and linear baselines in forecasting, generation, and downstream decoding tasks.

DetailsMotivation: While foundation models with self-supervised objectives are increasingly applied to brain recordings, autoregressive generation of realistic multichannel neural time series (particularly MEG) remains underexplored compared to other modalities.

Method: Two approaches studied: (1) modified multichannel WaveNet variants, and (2) a GPT-2-style Transformer trained autoregressively via next-step prediction on unlabelled MEG. For the Transformer, proposed a quantization/tokenization and embedding scheme (channel, subject, and task-condition embeddings) that adapts language-model architecture for continuous, high-rate multichannel time series, enabling conditional simulation of task-evoked activity.
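
The summary does not spell out the paper's exact tokenizer, but mu-law companding followed by uniform binning (the WaveNet recipe for audio) is the standard way to turn a continuous channel into discrete token ids, and gives a feel for the quantization step:

```python
# Mu-law quantization: continuous signal in [-1, 1] -> integer token ids.
import numpy as np

def mu_law_encode(x, n_tokens=256, mu=255.0):
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # in [-1, 1]
    return ((y + 1) / 2 * (n_tokens - 1)).round().astype(np.int64)

def mu_law_decode(tokens, n_tokens=256, mu=255.0):
    y = tokens.astype(np.float64) / (n_tokens - 1) * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

signal = 0.8 * np.sin(np.linspace(0, 8 * np.pi, 1000))
tokens = mu_law_encode(signal)        # feed these ids to the Transformer
print(np.abs(mu_law_decode(tokens) - signal).max())  # small round-trip error
```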

Result: GPT2MEG more faithfully reproduces temporal, spectral, and task-evoked statistics of real MEG than WaveNet variants and linear autoregressive baselines across forecasting, long-horizon generation, and downstream decoding tasks. Scales to multiple subjects via subject embeddings.

Conclusion: GPT-2-style Transformers with appropriate quantization and embedding schemes can effectively model and generate realistic MEG neural time series, outperforming existing approaches and enabling conditional simulation of task-evoked brain activity across multiple subjects.

Abstract: Foundation models trained with self-supervised objectives are increasingly applied to brain recordings, but autoregressive generation of realistic multichannel neural time series remains comparatively underexplored, particularly for Magnetoencephalography (MEG). We study (i) modified multichannel WaveNet variants and (ii) a GPT-2-style Transformer, autoregressively trained by next-step prediction on unlabelled MEG. For the Transformer, we propose a simple quantization/tokenization and embedding scheme (channel, subject, and task-condition embeddings) that repurposes a language-model architecture for continuous, high-rate multichannel time series and enables conditional simulation of task-evoked activity. Across forecasting, long-horizon generation, and downstream decoding, GPT2MEG more faithfully reproduces temporal, spectral, and task-evoked statistics of real MEG than WaveNet variants and linear autoregressive baselines, and scales to multiple subjects via subject embeddings. Code available at https://github.com/ricsinaruto/MEG-transfer-decoding.

[447] Lipschitz-Regularized Critics Lead to Policy Robustness Against Transition Dynamics Uncertainty

Xulin Chen, Ruipeng Liu, Zhenyu Gan, Garrett E. Katz

Main category: cs.LG

TL;DR: PPO-PGDLC: A robust RL algorithm combining PPO with Projected Gradient Descent and Lipschitz-regularized critic to handle transition uncertainties and improve policy robustness.

DetailsMotivation: Existing robust RL approaches have limitations: Lipschitz regularization typically focuses on actor or actor-critic modules without exploring critic-only regularization's impact on policy robustness, while robust Bellman operator methods lack real-world validation. Transition uncertainties in RL cause performance degradation when policies are deployed on hardware.

Method: PPO-PGDLC algorithm based on Proximal Policy Optimization (PPO) that integrates Projected Gradient Descent (PGD) with a Lipschitz-regularized critic (LC). PGD calculates adversarial states within uncertainty sets to approximate robust Bellman operator, while Lipschitz regularization improves policy smoothness.
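
Both ingredients are compact enough to sketch. Below, `pgd_adversarial_state` searches an L-infinity uncertainty set for the state that minimizes the critic's value (a common way to approximate a robust Bellman backup), and `lipschitz_penalty` is a gradient penalty that smooths the critic; the step sizes, norm choice, and penalty form are generic assumptions, not the paper's exact settings:

```python
import torch

def pgd_adversarial_state(critic, s, eps=0.05, alpha=0.01, steps=10):
    """PGD over states: find the worst-case state in an L-inf ball."""
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        grad, = torch.autograd.grad(critic(s_adv).sum(), s_adv)
        s_adv = (s_adv - alpha * grad.sign()).detach()   # descend the value
        s_adv = torch.clamp(s_adv, s - eps, s + eps)     # stay in the set
    return s_adv

def lipschitz_penalty(critic, s):
    """Gradient penalty encouraging a smooth (Lipschitz) critic."""
    s = s.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(critic(s).sum(), s, create_graph=True)
    return (grad.norm(dim=-1) ** 2).mean()

# critic_loss = td_loss + lam * lipschitz_penalty(critic, states)
```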

Result: Experimental results on two classic control tasks and one real-world robotic locomotion task show PPO-PGDLC achieves better performance and predicts smoother actions under environmental perturbations compared to several baseline algorithms.

Conclusion: The proposed PPO-PGDLC effectively addresses transition uncertainty challenges in RL by combining robust Bellman operator approximation via PGD with critic smoothness enhancement through Lipschitz regularization, demonstrating improved robustness in both simulated and real-world scenarios.

Abstract: Uncertainties in transition dynamics pose a critical challenge in reinforcement learning (RL), often resulting in performance degradation of trained policies when deployed on hardware. Many robust RL approaches follow two strategies: enforcing smoothness in actor or actor-critic modules with Lipschitz regularization, or learning robust Bellman operators. However, the first strategy does not investigate the impact of critic-only Lipschitz regularization on policy robustness, while the second lacks comprehensive validation in real-world scenarios. Building on this gap and prior work, we propose PPO-PGDLC, an algorithm based on Proximal Policy Optimization (PPO) that integrates Projected Gradient Descent (PGD) with a Lipschitz-regularized critic (LC). The PGD component calculates the adversarial state within an uncertainty set to approximate the robust Bellman operator, and the Lipschitz-regularized critic further improves the smoothness of learned policies. Experimental results on two classic control tasks and one real-world robotic locomotion task demonstrate that, compared to several baseline algorithms, PPO-PGDLC achieves better performance and predicts smoother actions under environmental perturbations.

[448] Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng

Main category: cs.LG

TL;DR: ReVisual-R1 achieves SOTA among open-source 7B MLLMs on challenging multimodal reasoning benchmarks by addressing multimodal RL training issues and implementing a staged training approach with effective cold start initialization.

DetailsMotivation: Current MLLMs struggle to achieve complex reasoning capabilities similar to Deepseek-R1's textual reasoning, even when applying reinforcement learning directly. The paper aims to understand why multimodal RL fails and develop effective training strategies for multimodal reasoning.

Method: 1) Identified three key phenomena in MLLM training: importance of cold start initialization with selected text data, gradient stagnation in standard GRPO for multimodal RL, and benefits of staged training. 2) Developed ReVisual-R1 by addressing multimodal RL issues and implementing a staged approach: text-only initialization → multimodal RL → text-only RL refinement.

Result: ReVisual-R1 achieves state-of-the-art performance among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and AIME2024/AIME2025, demonstrating superior multimodal reasoning capabilities.

Conclusion: Effective multimodal reasoning requires addressing specific training challenges: proper cold start initialization, solving gradient stagnation in multimodal RL, and staged training that balances perceptual grounding with cognitive reasoning development.

Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.

[449] HyResPINNs: A Hybrid Residual Physics-Informed Neural Network Architecture Designed to Balance Expressiveness and Trainability

Madison Cooley, Robert M. Kirby, Shandian Zhe, Varun Shankar

Main category: cs.LG

TL;DR: HyResPINNs: A two-level convex-gated PINN architecture that maximizes approximation expressiveness for fixed degrees of freedom, combining smooth basis functions with deep neural networks and gating mechanisms.

DetailsMotivation: To enhance the expressiveness of Physics-Informed Neural Networks (PINNs) for solving PDEs while maintaining computational efficiency with a fixed number of degrees of freedom, addressing limitations in current PINN architectures.

Method: Two-level convex-gated architecture: 1) Trainable per-block combination of smooth basis functions with trainable sparsity and deep neural networks, 2) Gating entire blocks (similar to ResNets/Highway Nets) for depth-wise expressivity.
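
The first gating level is easy to picture as a convex combination of a radial-basis branch and a neural branch; the sketch below is a minimal PyTorch rendering of that idea (trainable sparsity and the second, block-level gate are omitted, and all sizes are illustrative):

```python
# A convex-gated hybrid block: g * basis(x) + (1 - g) * mlp(x).
import torch
import torch.nn as nn

class ConvexGatedBlock(nn.Module):
    def __init__(self, dim, n_centers=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, dim))
        self.log_width = nn.Parameter(torch.zeros(n_centers))
        self.rbf_out = nn.Linear(n_centers, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                 nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(()))        # gate logit

    def forward(self, x):
        d2 = ((x[:, None, :] - self.centers[None]) ** 2).sum(-1)
        basis = self.rbf_out(torch.exp(-torch.exp(self.log_width) * d2))
        g = torch.sigmoid(self.alpha)                     # convex weight in (0, 1)
        return g * basis + (1 - g) * self.mlp(x)

print(ConvexGatedBlock(dim=2)(torch.randn(5, 2)).shape)   # torch.Size([5, 2])
```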

Result: HyResPINNs consistently achieve superior accuracy to baseline methods on diverse challenging PDE problems while remaining competitive in training times.

Conclusion: HyResPINNs combine desirable features from traditional scientific computing and modern machine learning, paving the way for more robust and expressive physics-informed modeling approaches.

Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful approach for solving partial differential equations (PDEs) by training neural networks with loss functions that incorporate physical constraints. In this work, we introduce HyResPINNs, a two-level convex-gated architecture designed to maximize approximation expressiveness for a fixed number of degrees of freedom (DoF). The first level involves a trainable, per-block combination of smooth basis functions with trainable sparsity, and deep neural networks; the second involves the ability to gate entire blocks (much like in ResNets or Highway Nets), allowing for expressivity along the depth dimension of the architecture. Our empirical evaluation on a diverse set of challenging PDE problems demonstrates that HyResPINNs consistently achieve superior accuracy to baseline methods while remaining competitive in training time. These results highlight the potential of HyResPINNs to combine desirable features from traditional scientific computing methods and modern machine learning, paving the way for more robust and expressive approaches to physics-informed modeling.

[450] HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang

Main category: cs.LG

TL;DR: HeuriGym is an agentic framework for evaluating LLM-generated heuristic algorithms for combinatorial optimization problems, using iterative refinement and a new Quality-Yield Index metric.

DetailsMotivation: Current LLM evaluation methods are inadequate - they either use closed-ended questions prone to saturation/memorization, or subjective comparisons lacking consistency. There's a need for better evaluation of LLMs' heuristic algorithm generation capabilities for combinatorial optimization problems.

Method: HeuriGym framework allows LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine solutions. Evaluates 9 state-of-the-art models on 9 combinatorial optimization problems across domains like computer systems, logistics, and biology.

Result: Even top models (GPT-o4-mini-high and Gemini-2.5-Pro) achieve QYI scores of only 0.6, well below expert baseline of 1. The benchmark exposes persistent limitations in LLMs’ tool use, planning, and adaptive reasoning capabilities.

Conclusion: HeuriGym provides a rigorous open-source benchmark to guide LLM development toward more effective problem-solving in scientific and engineering domains, addressing current evaluation shortcomings.

Abstract: While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.

[451] LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs

Piyush Jha, Arnav Arora, Vijay Ganesh

Main category: cs.LG

TL;DR: LLMStinger uses reinforcement learning with LLMs to generate adversarial suffixes for jailbreak attacks, achieving significant improvements in attack success rates across multiple models including LLaMA2, Claude 2, GPT-3.5, and Gemma.

DetailsMotivation: Traditional jailbreak methods require complex prompt engineering or white-box access, creating a need for more automated and effective approaches to test LLM safety measures.

Method: Uses reinforcement learning loop to fine-tune an attacker LLM that generates adversarial suffixes based on existing attacks from the HarmBench benchmark for harmful questions.

Result: Outperformed 15 latest methods with +57.2% ASR improvement on LLaMA2-7B-chat, +50.3% on Claude 2, 94.97% on GPT-3.5, and 99.4% on Gemma-2B-it.

Conclusion: LLMStinger demonstrates robust and adaptable jailbreak capabilities across both open and closed-source models, highlighting vulnerabilities in current LLM safety measures.

Abstract: We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.

[452] Universal Sequence Preconditioning

Annie Marsden, Elad Hazan

Main category: cs.LG

TL;DR: Universal preconditioning method using orthogonal polynomial convolution reduces regret in sequential prediction, achieving first sublinear dimension-independent regret bounds for marginally stable linear systems.

DetailsMotivation: The paper addresses the problem of preconditioning in sequential prediction, particularly for linear dynamical systems where standard methods may not achieve optimal regret bounds, especially for systems with marginally stable and asymmetric transition matrices.

Method: The authors propose a universal preconditioning method that convolves the target sequence with coefficients from orthogonal polynomials (Chebyshev or Legendre). This approach is based on the theoretical insight that convolving the target corresponds to applying a polynomial to the hidden transition matrix of the linear dynamical system.
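
Concretely, the preconditioner is just a short convolution. The NumPy sketch below filters a target sequence with the monomial coefficients of T_3 so that, for a linear system y_t ≈ A y_{t-1}, predicting the filtered sequence amounts to working with the polynomial T_3(A); the degree and the exact coefficient convention are illustrative assumptions:

```python
# Precondition a sequence by convolving it with Chebyshev coefficients.
import numpy as np
from numpy.polynomial import chebyshev

n = 3
c = chebyshev.cheb2poly([0] * n + [1])   # T_3 = 4x^3 - 3x -> [0, -3, 0, 4]

def precondition(y, coeffs):
    """z_t = sum_j coeffs[j] * y_{t-j} (zero-padded at the start)."""
    k = len(coeffs) - 1
    y_pad = np.concatenate([np.zeros(k), y])
    return sum(coeffs[j] * y_pad[k - j : k - j + len(y)]
               for j in range(k + 1))

y = np.sin(0.1 * np.arange(200))         # some target sequence
z = precondition(y, c)                   # the learner predicts z, not y
```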

Result: The method reduces regret for two distinct prediction algorithms and achieves the first-ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) for systems with marginally stable and asymmetric transition matrices. Experimental results show improved performance across diverse algorithms including recurrent neural networks, with generalization beyond linear dynamical systems.

Conclusion: Simple orthogonal polynomial-based preconditioning provides a universal and effective approach for sequential prediction, offering theoretical guarantees and practical improvements across various algorithms and system types.

Abstract: We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally stable and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.

[453] Mechanism of Task-oriented Information Removal in In-context Learning

Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue

Main category: cs.LG

TL;DR: ICL works by removing task-irrelevant information from hidden states through denoising heads, not by adding new information.

DetailsMotivation: To understand the inner mechanism of in-context learning (ICL) in language models, which remains unclear despite its effectiveness in few-shot learning.

Method: Investigate ICL through information removal perspective: analyze hidden states, design metrics to measure information removal, identify denoising attention heads, and conduct ablation experiments.
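
The low-rank removal operation itself is a one-liner: project hidden states off a small subspace. In the sketch below the removed directions are random just to show the shapes involved; in the paper they would be identified from data:

```python
# Remove the components of hidden states lying in span(U).
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4
H = rng.normal(size=(10, d))                   # hidden states for 10 tokens
U, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal basis to remove

H_filtered = H - (H @ U) @ U.T                 # low-rank filtering
print(np.abs(H_filtered @ U).max())            # ~0: that information is gone
```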

Result: Zero-shot LMs produce non-selective representations; ICL selectively removes task-irrelevant information via denoising heads; blocking these heads degrades ICL accuracy, especially when correct labels are absent.

Conclusion: ICL’s core mechanism is task-oriented information removal from entangled representations, facilitated by specific denoising attention heads, rather than information addition.

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify the attention heads that induce the removal operation, termed Denoising Heads, which enable ablation experiments that block the information removal operation during inference: ICL accuracy then degrades significantly, especially when the correct label is absent from the few-shot demonstrations, confirming the critical role of both the information removal mechanism and the denoising heads.

[454] DiffRatio: Training One-Step Diffusion Models Without Teacher Supervision

Wenlin Chen, Mingtian Zhang, Jiajun He, Zijing Ou, José Miguel Hernández-Lobato, Bernhard Schölkopf, David Barber

Main category: cs.LG

TL;DR: DiffRatio is a new framework for training one-step diffusion models that directly estimates score differences via learned density ratios instead of separately estimating teacher and student scores, reducing bias and improving generation quality.

DetailsMotivation: Current score-based distillation methods suffer from gradient estimation bias from two sources: biased teacher supervision due to pre-training score estimation errors, and student model score estimation errors during distillation. These biases degrade one-step diffusion model quality.

Method: Instead of independently estimating teacher and student scores and taking their difference, DiffRatio directly estimates the score difference as the gradient of a learned log density ratio between student and data distributions across diffusion time steps. Uses a lightweight density-ratio network instead of two full score networks.
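
The key computational trick is that a single scalar network suffices: parameterize the log density ratio and differentiate it with respect to the input to obtain the score difference directly. A minimal PyTorch sketch (the network size and the way time is conditioned are assumptions):

```python
# Score difference as the input-gradient of a learned log density ratio.
import torch
import torch.nn as nn

ratio_net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 1))

def score_difference(x, t):
    """grad_x log r(x, t) = score_student(x, t) - score_data(x, t)."""
    x = x.detach().requires_grad_(True)
    log_r = ratio_net(torch.cat([x, t], dim=-1)).sum()
    return torch.autograd.grad(log_r, x, create_graph=True)[0]

x = torch.randn(16, 2)                   # 2-d samples
t = torch.rand(16, 1)                    # diffusion time
print(score_difference(x, t).shape)      # torch.Size([16, 2])
```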

Result: Achieves competitive one-step generation results on CIFAR-10 and ImageNet (64x64 and 512x512), outperforming most teacher-supervised distillation methods. Reduces gradient estimation bias, simplifies training pipeline, improves computational/memory efficiency, and enables inference-time parallel scaling.

Conclusion: DiffRatio provides an effective framework for training high-quality one-step diffusion models by directly estimating score differences via density ratios, addressing key biases in existing distillation methods while offering computational benefits and enabling principled inference-time improvements.

Abstract: Score-based distillation methods (e.g., variational score distillation) train one-step diffusion models by first pre-training a teacher score model and then distilling it into a one-step student model. However, the gradient estimator in the distillation stage usually suffers from two sources of bias: (1) biased teacher supervision due to score estimation error incurred during pre-training, and (2) the student model’s score estimation error during distillation. These biases can degrade the quality of the resulting one-step diffusion model. To address this, we propose DiffRatio, a new framework for training one-step diffusion models: instead of estimating the teacher and student scores independently and then taking their difference, we directly estimate the score difference as the gradient of a learned log density ratio between the student and data distributions across diffusion time steps. This approach greatly simplifies the training pipeline, significantly reduces gradient estimation bias, and improves one-step generation quality. Additionally, it also reduces auxiliary network size by using a lightweight density-ratio network instead of two full score networks, which improves computational and memory efficiency. DiffRatio achieves competitive one-step generation results on CIFAR-10 and ImageNet (64x64 and 512x512), outperforming most teacher-supervised distillation methods. Moreover, the learned density ratio naturally serves as a verifier, enabling a principled inference-time parallel scaling scheme that further improves the generation quality without external rewards or additional sequential computation.

[455] Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

Toshinori Kitamura, Arnob Ghosh, Tadashi Kozuno, Wataru Kumagai, Kazumi Kasaura, Kenta Hoshino, Yohei Hosoe, Yutaka Matsuo

Main category: cs.LG

TL;DR: Proposes efficient linear CMDP algorithm with Õ(√K) regret and zero constraint violation, improving over prior work that either violates constraints or has exponential complexity.

DetailsMotivation: While CMDP problems are well-understood in tabular settings, theoretical results for function approximation remain scarce. There's a need for efficient algorithms that can handle linear function approximation while maintaining constraint satisfaction guarantees.

Method: Proposes a reinforcement learning algorithm for linear CMDPs that uses function approximation. The method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of state space size.

Result: Achieves Õ(√K) regret with episode-wise zero-violation guarantee. The algorithm significantly improves upon recent linear CMDP algorithms that either violate constraints or incur exponential computational costs.

Conclusion: The paper closes the gap between tabular and function approximation settings for CMDPs by providing an efficient algorithm with strong theoretical guarantees for both regret and constraint satisfaction.

Abstract: We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.

[456] Machine learning surrogate models of many-body dispersion interactions in polymer melts

Zhaoxiang Shen, Raúl I. Sosa, Jakub Lengiewicz, Alexandre Tkatchenko, Stéphane P. A. Bordas

Main category: cs.LG

TL;DR: A machine learning surrogate model based on trimmed SchNet architecture accurately predicts many-body dispersion forces in polymer melts with high computational efficiency.

DetailsMotivation: MBD interactions are crucial for understanding van der Waals forces in complex molecular systems, but their high computational cost limits application in large-scale simulations, especially for polymer melts which require accurate MBD description.

Method: A trimmed SchNet architecture that selectively retains relevant atomic connections and incorporates trainable radial basis functions for geometric encoding, specifically designed for polymer melt systems.
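
The trainable radial basis ingredient is small enough to show directly; centers start from a uniform grid, widths from a constant, and the basis count and cutoff below are illustrative values rather than the paper's settings:

```python
# Trainable Gaussian radial basis expansion of interatomic distances.
import torch
import torch.nn as nn

class TrainableRBF(nn.Module):
    def __init__(self, n_basis=64, cutoff=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(0.0, cutoff, n_basis))
        self.log_gamma = nn.Parameter(torch.zeros(n_basis))

    def forward(self, dist):                      # dist: pairwise distances
        diff = dist.unsqueeze(-1) - self.centers
        return torch.exp(-torch.exp(self.log_gamma) * diff ** 2)

print(TrainableRBF()(torch.tensor([1.2, 3.4, 7.8])).shape)  # (3, 64)
```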

Result: High predictive accuracy and robust generalization across polyethylene, polypropylene, and polyvinyl chloride melts, capturing key physical features like characteristic decay behavior of MBD interactions.

Conclusion: The computationally efficient surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations for polymer systems.

Abstract: Accurate prediction of many-body dispersion (MBD) interactions is essential for understanding the van der Waals forces that govern the behavior of many complex molecular systems. However, the high computational cost of MBD calculations limits their direct application in large-scale simulations. In this work, we introduce a machine learning surrogate model specifically designed to predict MBD forces in polymer melts, a system that demands accurate MBD description and offers structural advantages for machine learning approaches. Our model is based on a trimmed SchNet architecture that selectively retains the most relevant atomic connections and incorporates trainable radial basis functions for geometric encoding. We validate our surrogate model on datasets from polyethylene, polypropylene, and polyvinyl chloride melts, demonstrating high predictive accuracy and robust generalization across diverse polymer systems. In addition, the model captures key physical features, such as the characteristic decay behavior of MBD interactions, providing valuable insights for optimizing cutoff strategies. Characterized by high computational efficiency, our surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations.
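
The two architectural ingredients named in the method, a trainable radial-basis expansion of interatomic distances and message passing trimmed to a cutoff neighbor list, can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TrainableRBF(nn.Module):
    """Gaussian radial basis expansion with learnable centers and widths."""
    def __init__(self, n_rbf=32, cutoff=6.0):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(0.0, cutoff, n_rbf))
        self.widths = nn.Parameter(torch.full((n_rbf,), 0.5))

    def forward(self, d):                      # d: (n_edges,) distances
        diff = d.unsqueeze(-1) - self.centers  # (n_edges, n_rbf)
        return torch.exp(-(diff / self.widths) ** 2)

class TrimmedInteraction(nn.Module):
    """Continuous-filter message passing restricted to a trimmed edge list."""
    def __init__(self, n_feat=64, n_rbf=32):
        super().__init__()
        self.rbf = TrainableRBF(n_rbf)
        self.filter_net = nn.Sequential(nn.Linear(n_rbf, n_feat), nn.SiLU(),
                                        nn.Linear(n_feat, n_feat))
        self.update = nn.Linear(n_feat, n_feat)

    def forward(self, h, pos, edge_index):     # h: (n_atoms, n_feat)
        src, dst = edge_index                  # pre-trimmed neighbor list
        d = (pos[src] - pos[dst]).norm(dim=-1)
        w = self.filter_net(self.rbf(d))       # distance-dependent filters
        msg = torch.zeros_like(h).index_add_(0, dst, w * h[src])
        return h + self.update(msg)            # residual feature update
```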

[457] Demystifying the Slash Pattern in Attention: The Role of RoPE

Yuan Cheng, Fengzhuo Zhang, Yunlong Hou, Cunxiao Du, Chao Du, Tianyu Pang, Aixin Sun, Zhuoran Yang

Main category: cs.LG

TL;DR: The paper explains why slash attention patterns emerge in LLMs, showing they’re intrinsic to models and generalize to out-of-distribution prompts through empirical analysis and theoretical proof.

DetailsMotivation: To understand why slash attention patterns (where attention concentrates along sub-diagonals) emerge in LLMs, as these patterns are crucial for information flow across tokens but their emergence mechanism was unclear.

Method: Combined empirical analysis of open-source LLMs with theoretical modeling. Empirically analyzed queries, keys, and RoPE to identify characteristic conditions. Theoretically analyzed training dynamics of shallow Transformers with RoPE under these conditions using gradient descent.

Result: Found two key conditions for Slash-Dominant Heads: (1) queries and keys are almost rank-one, making them nearly identical across tokens, and (2) RoPE is dominated by medium- and high-frequency components. Proved these conditions are sufficient for SDH emergence and that SDHs generalize to out-of-distribution prompts.

Conclusion: Slash attention patterns emerge intrinsically in LLMs due to specific structural properties of queries, keys, and RoPE, and this emergence can be theoretically explained and proven under certain modeling assumptions.

Abstract: Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. In particular, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs that generalize to out-of-distribution prompts.
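
The stated mechanism is easy to reproduce numerically: with rank-one queries and keys (the same vector at every token) and only medium/high RoPE frequencies, the attention logits depend on positions solely through the offset $m-n$, so rows peak at a common sub-diagonal. A small numpy sketch (frequencies and sizes are illustrative):

```python
import numpy as np

np.random.seed(0)

def rope(x, pos, freqs):
    """Rotate consecutive feature pairs of x (n, 2f) by position-dependent angles."""
    ang = np.outer(pos, freqs)                 # (n, f)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

n, f = 64, 8
freqs = np.linspace(0.5, 2.0, f)               # medium/high frequencies only
u, w = np.random.randn(2 * f), np.random.randn(2 * f)   # rank-one Q and K:
Q = rope(np.tile(u, (n, 1)), np.arange(n), freqs)       # the same vector at
K = rope(np.tile(w, (n, 1)), np.arange(n), freqs)       # every token, rotated
logits = Q @ K.T                               # depends only on offset m - n
scores = np.where(np.tril(np.ones((n, n), bool)), logits, -np.inf)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
# Late-enough rows peak at the same relative offset: a slash pattern.
peaks = np.arange(16, n) - np.argmax(attn[16:], axis=1)
print(np.unique(peaks))                        # expected: one dominant offset
```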

[458] Understanding Post-Training Structural Changes in Large Language Models

Xinyu He, Xianghui Cao

Main category: cs.LG

TL;DR: SVD analysis reveals post-training causes uniform singular value scaling and coordinated orthogonal transformations in LLM parameters, challenging black-box views of parameter space.

DetailsMotivation: Post-training significantly changes LLM behavior but its impact on internal parameter space remains poorly understood. The authors aim to uncover regularities in how parameters evolve during post-training methods like instruction tuning and Long-CoT distillation.

Method: Systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on instruction tuning and Long-CoT distillation. Proposed a framework to describe coordinated parameter dynamics.

Result: Two robust structural changes: (1) near-uniform geometric scaling of singular values across layers, and (2) highly consistent orthogonal transformations of left/right singular vectors. Singular value scaling underpins temperature-controlled regulation, while vector rotation encodes semantic alignment.

Conclusion: Post-training relies on foundational pre-training capabilities. The findings challenge black-box views of parameter space, reveal first clear regularities in parameter evolution, and provide new perspective for investigating model parameter changes.

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two unexpected and robust structural changes: (1) a near-uniform geometric scaling of singular values across layers; and (2) highly consistent orthogonal transformations applied to the left and right singular vectors of each matrix. Based on these findings, we propose a simple yet effective framework to describe the coordinated dynamics of parameters in LLMs, which elucidates why post-training inherently relies on the foundational capabilities developed during pre-training. Further experiments demonstrate that singular value scaling underpins the temperature-controlled regulatory mechanisms of post-training, while the coordinated rotation of singular vectors encodes the essential semantic alignment. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.
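
A minimal sketch of this kind of diagnostic: for each matched linear layer, compare singular values (near-uniform scaling shows up as a small coefficient of variation of the ratio) and test whether the top left singular vectors are related by a near-orthogonal map. Function names and the choice of `r` are ours, not the paper's.

```python
import torch

def svd_shift(W_pre, W_post, r=64):
    """Compare a layer before/after post-training: singular-value scaling
    uniformity and orthogonality of the top-r left singular-vector alignment."""
    U0, S0, _ = torch.linalg.svd(W_pre.float(), full_matrices=False)
    U1, S1, _ = torch.linalg.svd(W_post.float(), full_matrices=False)
    r = min(r, S0.numel())
    scale = S1 / S0                        # ~constant if scaling is uniform
    R = U0[:, :r].T @ U1[:, :r]            # ~orthogonal if vectors are rotated
    orth_err = torch.linalg.norm(R.T @ R - torch.eye(r)) / r
    return (scale.std() / scale.mean()).item(), orth_err.item()

# Usage: for each matched linear layer of the base and post-trained checkpoints,
#   cv, err = svd_shift(W_base, W_tuned)
# a small cv suggests near-uniform singular-value scaling; a small err suggests
# the left singular vectors moved by a (near-)orthogonal transformation.
```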

[459] Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models

J. S. van Hulst, W. P. M. H. Heemels, D. J. Antunes

Main category: cs.LG

TL;DR: BUMEX: Model-based exploration using prior knowledge to accelerate RL by optimizing over model sets to bound Q-functions and guide exploration.

DetailsMotivation: RL typically requires large amounts of data for optimal policy learning. The paper aims to address this data inefficiency by incorporating prior model knowledge to guide exploration and accelerate learning.

Method: Assume access to model set containing true transition kernel and reward function. Optimize over model set to obtain upper/lower bounds on Q-function, which guide agent exploration. Introduce regularized version for convergence guarantees. Special case: BMDP framework makes optimization convex and implementable.

Result: Theoretical convergence guarantees for Q-function to optimal Q-function. Finite-time convergence to optimal policy under BMDP framework with mild assumptions. Simulation study shows BUMEX significantly accelerates learning in benchmark examples.

Conclusion: BUMEX effectively accelerates RL by leveraging prior model knowledge through model set optimization, providing theoretical guarantees and practical implementation, especially in BMDP framework.

Abstract: Reinforcement learning (RL) is a powerful framework for decision-making in uncertain environments, but it often requires large amounts of data to learn an optimal policy. We address this challenge by incorporating prior model knowledge to guide exploration and accelerate the learning process. Specifically, we assume access to a model set that contains the true transition kernel and reward function. We optimize over this model set to obtain upper and lower bounds on the Q-function, which are then used to guide the exploration of the agent. We provide theoretical guarantees on the convergence of the Q-function to the optimal Q-function under the proposed class of exploring policies. Furthermore, we also introduce a data-driven regularized version of the model set optimization problem that ensures the convergence of the class of exploring policies to the optimal policy. Lastly, we show that when the model set has a specific structure, namely the bounded-parameter MDP (BMDP) framework, the regularized model set optimization problem becomes convex and simple to implement. In this setting, we also prove finite-time convergence to the optimal policy under mild assumptions. We demonstrate the effectiveness of the proposed exploration strategy, which we call BUMEX (Bounded Uncertainty Model-based Exploration), in a simulation study. The results indicate that the proposed method can significantly accelerate learning in benchmark examples. A toolbox is available at https://github.com/JvHulst/BUMEX.
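
In the BMDP special case, Q-function bounds over the model set can be computed by classical interval value iteration: maximize (resp. minimize) the expected next value over transition distributions confined to elementwise intervals. The sketch below shows that standard construction, not the paper's regularized convex program.

```python
import numpy as np

def extremal_dist(p_lo, p_hi, v, maximize=True):
    """Choose a transition distribution inside [p_lo, p_hi] (summing to 1)
    that maximizes or minimizes the expected value of v."""
    p = p_lo.copy()
    budget = 1.0 - p.sum()
    for s in np.argsort(-v if maximize else v):   # best next-states first
        add = min(p_hi[s] - p_lo[s], budget)
        p[s] += add
        budget -= add
    return p

def interval_q(P_lo, P_hi, R, gamma=0.95, iters=200):
    """Upper/lower Q-function bounds over the model set.
    P_lo, P_hi: (S, A, S) elementwise transition bounds; R: (S, A) rewards."""
    S, A = R.shape
    Q_hi, Q_lo = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(iters):
        V_hi, V_lo = Q_hi.max(axis=1), Q_lo.max(axis=1)
        for s in range(S):
            for a in range(A):
                Q_hi[s, a] = R[s, a] + gamma * extremal_dist(
                    P_lo[s, a], P_hi[s, a], V_hi, True) @ V_hi
                Q_lo[s, a] = R[s, a] + gamma * extremal_dist(
                    P_lo[s, a], P_hi[s, a], V_lo, False) @ V_lo
    return Q_hi, Q_lo   # e.g. explore actions with a large Q_hi - Q_lo gap
```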

Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz

Main category: cs.LG

TL;DR: Proposes a training framework for discrete VAEs using natural gradient updates without reparameterization, outperforming existing methods on high-dimensional image reconstruction tasks like ImageNet.

DetailsMotivation: Discrete VAEs offer efficient multimodal search but face training challenges due to non-differentiable discrete variables. Existing approximations (Gumbel-Softmax, straight-through, REINFORCE) have limitations in high-dimensional tasks like image reconstruction.

Method: Uses natural gradient of a non-parametric encoder to update the parametric encoder without reparameterization. Combines automatic step size adaptation with transformer-based encoder for scaling to challenging datasets.

Result: Method scales to ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.

Conclusion: The proposed training framework enables effective discrete VAE training without reparameterization, achieving state-of-the-art performance on high-dimensional reconstruction tasks.

Abstract: Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.

[461] Sufficient Decision Proxies for Decision-Focused Learning

Noah Schutte, Grigorii Veviurko, Krzysztof Postek, Neil Yorke-Smith

Main category: cs.LG

TL;DR: The paper investigates when to use different decision proxies in decision-focused learning, proposing new proxies that maintain decision quality without increasing learning complexity.

DetailsMotivation: Current DFL approaches either predict single scenarios or estimate full distributions, but there's little understanding of when each approach is appropriate. The paper aims to identify problem properties that justify using specific decision proxies and develop better proxies.

Method: The paper analyzes problem properties that determine appropriate decision proxies, then proposes alternative decision proxies for DFL that maintain decision quality without increasing learning complexity. Methods are validated through experiments on continuous/discrete problems with uncertainty in objectives and constraints.

Result: The proposed approaches show effectiveness in experiments across various problem types - continuous and discrete problems, as well as problems with uncertainty in both objective functions and constraints.

Conclusion: The paper provides theoretical justification for when to use specific decision proxies in DFL and demonstrates practical alternatives that work well across different problem types without compromising learning complexity.

Abstract: When solving optimization problems under uncertainty with contextual data, utilizing machine learning to predict the uncertain parameters’ values is a popular and effective approach. Decision-focused learning (DFL) aims at learning a predictive model such that decision quality, instead of prediction accuracy, is maximized. Common practice is to predict a single scenario representing the uncertain parameters, implicitly assuming that there exists a deterministic problem approximation (proxy) that allows for optimal decision-making. The opposite has also been considered, where the underlying distribution is estimated with a parameterized distribution. However, little is known about when either choice is valid. This paper investigates for the first time problem properties that justify using a certain decision proxy. Using this, we present alternative decision proxies for DFL, with little or no compromise on the complexity of the learning task. We show the effectiveness of the presented approaches in experiments on continuous and discrete problems, as well as problems with uncertainty in the objective function and in the constraints.
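
For reference, the single-scenario ("predict-then-optimize") proxy that the paper takes as its starting point can be written as follows, with $m_\theta$ the predictive model, $z$ the context, and $c$ the realized parameters (notation ours):

```latex
% Predict \hat{c}, decide under \hat{c}, then score the decision against c.
\[
x^{*}(\hat{c}) \in \arg\min_{x \in X} f(x, \hat{c}),
\qquad
\min_{\theta}\; \mathbb{E}_{(z,\, c)}\!\left[\, f\big(x^{*}(m_{\theta}(z)),\, c\big) \,\right] .
\]
```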

[462] GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Silvia Sapora, Devon Hjelm, Alexander Toshev, Omar Attia, Bogdan Mazoure

Main category: cs.LG

TL;DR: GRACE uses LLMs and evolutionary search to generate interpretable, code-based reward functions from expert demonstrations, outperforming traditional IRL and IL methods.

DetailsMotivation: Traditional Inverse Reinforcement Learning produces black-box reward models that are difficult to interpret, debug, and verify, limiting their practical utility and trustworthiness.

Method: GRACE combines Large Language Models with evolutionary search to reverse-engineer executable code-based reward functions directly from expert trajectories, producing inspectable and verifiable reward functions.

Result: GRACE efficiently learns highly accurate rewards on MuJoCo, BabyAI and AndroidWorld benchmarks, even in complex multi-task settings, and produces policies competitive with ground-truth reward approaches.

Conclusion: GRACE successfully addresses the interpretability problem in IRL by generating code-based reward functions that are transparent, verifiable, and effective for policy learning in complex environments.

Abstract: Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield black-box models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the MuJoCo, BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
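
A hedged sketch of the LLM-in-the-loop evolutionary search: `llm_propose` is a hypothetical helper that returns source code for a candidate `reward(state, action)` function, and the fitness used here (expert trajectories should out-score non-expert ones) is one plausible instantiation, not necessarily the paper's objective.

```python
import random

def compile_reward(src):
    """Execute LLM-generated source and return its `reward(state, action)`."""
    ns = {}
    exec(src, ns)                    # sandbox this in any real setting
    return ns["reward"]

def fitness(reward_fn, expert_trajs, other_trajs):
    """Fraction of (expert, other) pairs where the expert scores higher."""
    total = lambda traj: sum(reward_fn(s, a) for s, a in traj)
    wins = sum(total(e) > total(o) for e in expert_trajs for o in other_trajs)
    return wins / (len(expert_trajs) * len(other_trajs))

def evolve_reward(llm_propose, expert_trajs, other_trajs,
                  pop_size=16, generations=20):
    score = lambda src: fitness(compile_reward(src), expert_trajs, other_trajs)
    population = [llm_propose(parents=[]) for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(population, key=score, reverse=True)[: pop_size // 4]
        children = [llm_propose(parents=random.sample(elite, 2))
                    for _ in range(pop_size - len(elite))]
        population = elite + children          # keep the elite, mutate the rest
    return max(population, key=score)          # best inspectable reward code
```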

[463] Deep-ICE: the first globally optimal algorithm for minimizing 0-1 loss in two-layer ReLU and maxout networks

Xi He, Yi Miao, Max A. Little

Main category: cs.LG

TL;DR: First globally optimal algorithm for empirical risk minimization of two-layer maxout/ReLU networks with O(N^{DK+1}) complexity, plus coreset method for large datasets achieving 20-30% error reduction.

DetailsMotivation: Existing methods for training neural networks (like gradient descent) only find local optima, not globally optimal solutions. There's a need for provably exact solutions to the empirical risk minimization problem for two-layer networks with piecewise linear activations.

Method: 1) Globally optimal algorithm with O(N^{DK+1}) complexity for exact solution of empirical risk minimization; 2) Novel coreset selection method to reduce dataset size for practical application to large datasets; 3) Extension to arbitrary computable loss functions without affecting complexity.

Result: Algorithm provides provably exact solutions for small-scale datasets. With coreset extension, achieves 20-30% reduction in misclassifications for both training and prediction compared to state-of-the-art approaches (gradient descent-trained neural networks and SVMs) on same model architectures.

Conclusion: First globally optimal algorithm for two-layer maxout/ReLU networks enables exact solutions to empirical risk minimization. The coreset extension makes it practical for large datasets, significantly outperforming existing methods while providing optimality guarantees.

Abstract: This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of $O\left(N^{DK+1}\right)$, where $K$ denotes the number of hidden neurons and $D$ represents the number of features. It can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models).
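
The problem being solved exactly, written out for the ReLU case (biases omitted for brevity; notation ours):

```latex
% 0-1 loss ERM for a two-layer ReLU network with K hidden units on N samples
% (x_i, y_i), x_i \in \mathbb{R}^{D}, y_i \in \{-1, +1\}:
\[
\min_{\{a_k,\, w_k\}_{k=1}^{K}} \;\sum_{i=1}^{N}
\mathbb{1}\!\left[\, y_i \neq \operatorname{sign}\!\Big(\sum_{k=1}^{K} a_k \max\big(0,\, w_k^{\top} x_i\big)\Big) \right],
\]
% solved exactly in worst-case time O(N^{DK+1}).
```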

[464] Deep SPI: Safe Policy Improvement via World Models

Florent Delgrange, Raphael Avalos, Willem Röpke

Main category: cs.LG

TL;DR: DeepSPI extends safe policy improvement to online deep RL with world models, providing theoretical guarantees while matching/exceeding PPO and DeepMDPs on ALE-57.

DetailsMotivation: Existing safe policy improvement guarantees are limited to offline, tabular RL settings, leaving a gap for online deep reinforcement learning with world models and representation learning.

Method: Develop theoretical framework showing policy updates restricted to neighborhood of current policy ensure monotonic improvement. Link transition/reward prediction losses to representation quality. Introduce DeepSPI algorithm coupling local transition/reward losses with regularized policy updates.

Result: On ALE-57 benchmark, DeepSPI matches or exceeds strong baselines (PPO and DeepMDPs) while retaining theoretical guarantees.

Conclusion: DeepSPI successfully extends safe policy improvement theory to online deep RL settings with world models, providing both theoretical guarantees and competitive empirical performance.

Abstract: Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, “deep” analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
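
For orientation, here is a classical offline bound of the kind the paper develops online, "deep" analogues of, in the form popularized by conservative policy iteration and TRPO, with $\epsilon = \max_{s,a} |A^{\pi}(s,a)|$ and $\alpha = \max_s D_{\mathrm{TV}}\big(\pi'(\cdot\mid s),\, \pi(\cdot\mid s)\big)$:

```latex
\[
J(\pi') \;\ge\; J(\pi)
+ \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\!\big[A^{\pi}(s,a)\big]
- \frac{4\gamma\,\epsilon\,\alpha^{2}}{(1-\gamma)^{2}} ,
\]
% so keeping \alpha small (a neighborhood of the current policy) guarantees
% that improving the surrogate improves the true return.
```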

[465] Byte Pair Encoding for Efficient Time Series Forecasting

Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: Proposes pattern-centric tokenization for time series using frequent motifs, with conditional decoding optimization, achieving 36% better forecasting and 1990% efficiency gains.

DetailsMotivation: Existing time series tokenization methods use fixed sample-to-token ratios, creating excessive tokens for simple patterns (like constant values), leading to computational inefficiency. Need adaptive compression that recognizes underlying patterns.

Method: 1) Pattern-centric tokenization using discrete vocabulary of frequent motifs (inspired by byte pair encoding). 2) Merges samples with underlying patterns into tokens for adaptive compression. 3) Conditional decoding as lightweight post-hoc optimization exploiting motif finite set and time series continuous properties (no gradients, no computational overhead).

Result: On time series foundation models: 36% improvement in forecasting performance, 1990% average efficiency boost. Conditional decoding reduces MSE by up to 44%. Tokenization adapts to diverse temporal patterns, generalizes to unseen data, and captures meaningful representations (statistical moments, trends).

Conclusion: Motif-based tokenization with conditional decoding provides efficient, adaptive compression for time series analysis, significantly improving both performance and computational efficiency while capturing meaningful temporal properties.

Abstract: Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 36% and boosts efficiency by 1990% on average. Conditional decoding further reduces MSE by up to 44%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.
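
The core idea, byte-pair-style merging over a discretized series, fits in a short self-contained sketch (the binning scheme and merge policy are illustrative, not the paper's tokenizer):

```python
from collections import Counter

def quantize(series, n_bins=16):
    """Map a real-valued series to discrete symbols (equal-width bins)."""
    lo, hi = min(series), max(series)
    step = (hi - lo) / n_bins or 1.0
    return [min(int((x - lo) / step), n_bins - 1) for x in series]

def bpe_train(seq, n_merges=100):
    """Learn merge rules: repeatedly fuse the most frequent adjacent pair."""
    seq = [(s,) for s in seq]                 # tokens are tuples of symbols
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)             # merged motif token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges

# A flat stretch like [5, 5, 5, 5, 5, 5, 5, 5] collapses into a single motif
# token after a few merges: exactly the adaptive compression described above.
```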

[466] DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

Daojun Liang, Jing Chen, Xiao Wang, Yinglong Wang, Shuo Li

Main category: cs.LG

TL;DR: DeepBooTS: A dual-stream residual-decreasing boosting method for time series forecasting that improves robustness to concept drift through block-wise ensemble learning and progressive residual correction.

DetailsMotivation: Time series exhibit pronounced non-stationarity, causing concept drift that compromises the robustness of most forecasting methods, even with instance normalization. Existing methods struggle with concept drift robustness.

Method: DeepBooTS is an end-to-end dual-stream residual-decreasing boosting method. It uses block-wise ensemble learning where each block becomes an ensemble of learners with auxiliary output branches forming highways to final predictions. The method progressively reconstructs intrinsic signals by correcting residuals of previous blocks, leading to learning-driven decomposition of both inputs and targets.

Result: Extensive experiments show DeepBooTS outperforms existing methods by a large margin, achieving 15.8% average performance improvement across various datasets, establishing a new benchmark for time series forecasting.

Conclusion: DeepBooTS enhances versatility and interpretability while substantially improving robustness to concept drift through its novel ensemble-based residual correction approach, providing a significant advancement in time series forecasting methodology.

Abstract: Time series (TS) exhibit pronounced non-stationarity. Consequently, most forecasting methods display compromised robustness to concept drift, despite the prevalent application of instance normalization. We tackle this challenge by first analysing concept drift through a bias-variance lens and proving that weighted ensemble reduces variance without increasing bias. These insights motivate DeepBooTS, a novel end-to-end dual-stream residual-decreasing boosting method that progressively reconstructs the intrinsic signal. In our design, each block of a deep model becomes an ensemble of learners with an auxiliary output branch forming a highway to the final prediction. The block-wise outputs correct the residuals of previous blocks, leading to a learning-driven decomposition of both inputs and targets. This method enhances versatility and interpretability while substantially improving robustness to concept drift. Extensive experiments, including those on large-scale datasets, show that the proposed method outperforms existing methods by a large margin, yielding an average performance improvement of 15.8% across various datasets, establishing a new benchmark for TS forecasting.
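
A minimal PyTorch sketch of the dual-stream layout described above: each block refines a hidden stream and emits an auxiliary forecast through a "highway" head, and the final prediction is the sum of the block outputs, so each block effectively fits the residual of the ensemble so far. Dimensions and names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class BoostBlock(nn.Module):
    """One block: refines the hidden stream and emits an auxiliary forecast."""
    def __init__(self, d_model, horizon):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, d_model))
        self.head = nn.Linear(d_model, horizon)   # highway to the prediction

    def forward(self, h):
        h = h + self.body(h)
        return h, self.head(h)

class ResidualBooster(nn.Module):
    def __init__(self, lookback, horizon, d_model=128, n_blocks=4):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)
        self.blocks = nn.ModuleList(BoostBlock(d_model, horizon)
                                    for _ in range(n_blocks))

    def forward(self, x):                          # x: (batch, lookback)
        h = self.embed(x)
        y_hat = 0.0
        for blk in self.blocks:                    # each block corrects the
            h, delta = blk(h)                      # residual of the ensemble
            y_hat = y_hat + delta                  # prediction so far
        return y_hat                               # (batch, horizon)
```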

[467] Spiking Brain Compression: Post-Training Second-order Compression for Spiking Neural Networks

Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier

Main category: cs.LG

TL;DR: SBC is a one-shot post-training compression framework for SNNs that extends Optimal Brain Surgeon with spike-train-based objectives, achieving state-of-the-art compression with single- to double-digit accuracy gains over ANN baselines.

DetailsMotivation: SNNs are energy-efficient for neuromorphic hardware but have limited memory/computational resources. Existing SNN compression methods require multiple compression/training iterations, which is costly for pre-trained or large SNNs. There's a need for efficient one-shot post-training compression.

Method: Proposes Spiking Brain Compression (SBC) framework that extends Optimal Brain Surgeon to SNNs. Replaces current-based objectives with spike-train-based objectives whose Hessian is cheaply computable. Uses single backward pass to compress parameters and analytically rescale remaining weights.

Result: Achieves state-of-the-art one-shot post-training compression for SNNs across event-based and static datasets (including ImageNet). Shows single- to double-digit accuracy gains over ANN compression baselines ported to SNNs. Demonstrates robust performance under sub-one-sample-per-class calibration.

Conclusion: SBC provides an efficient, one-shot post-training compression solution for SNNs that outperforms existing methods while being computationally efficient and robust to limited calibration data.

Abstract: Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computational resources, parameter pruning and quantization have recently been explored to improve the efficiency of SNNs. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a novel one-shot post-training compression framework, Spiking Brain Compression (SBC), that extends the classical Optimal Brain Surgeon method to SNNs. SBC replaces the current-based objective found in the common layer-wise compression method with a spike-train-based objective whose Hessian is cheaply computable, allowing a single backward pass to compress parameters and analytically rescale the rest. Applying SBC to SNN pruning and quantization across event-based and static datasets (up to ImageNet), including SEW-ResNet152 and spike-driven Transformers, we achieve state-of-the-art one-shot post-training compression for SNNs, with single- to double-digit accuracy gains over ANN compression baselines ported to SNNs. We further report a synaptic-operation-based energy proxy and a calibration-size ablation, demonstrating robust performance under sub-one-sample-per-class calibration.
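
For reference, the classical Optimal Brain Surgeon step that SBC builds on: under a local quadratic model of the layer objective with Hessian $H$, removing weight $w_q$ costs $L_q$ and the surviving weights are rescaled analytically. Per the abstract, SBC's change is to define the objective (and hence $H$) on spike trains rather than currents.

```latex
\[
L_q \;=\; \frac{w_q^{2}}{2\,[H^{-1}]_{qq}},
\qquad
\delta w \;=\; -\,\frac{w_q}{[H^{-1}]_{qq}}\; H^{-1} e_q ,
\]
% where e_q is the q-th standard basis vector; pruning picks the smallest L_q.
```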

[468] Tracing Mathematical Proficiency Through Problem-Solving Processes

Jungyang Park, Suho Kang, Jaewoo Park, Jaehong Kim, Jaewoo Shin, Seonjoon Park, Youngjae Yu

Main category: cs.LG

TL;DR: KT-PSP incorporates problem-solving processes into knowledge tracing, with StatusKT using a three-stage LLM pipeline to extract mathematical proficiency indicators for better prediction and interpretability.

DetailsMotivation: Traditional KT methods lack explainability and only use response correctness, ignoring rich information in students' problem-solving processes. There's a need to capture multidimensional aspects of mathematical proficiency.

Method: Propose KT-PSP incorporating problem-solving processes, create KT-PSP-25 dataset, and develop StatusKT framework with teacher-student-teacher three-stage LLM pipeline to extract mathematical proficiency indicators as intermediate signals.

Result: StatusKT improves prediction performance of existing KT methods on KT-PSP-25 dataset and provides interpretable explanations by explicitly modeling students’ mathematical proficiency.

Conclusion: Incorporating problem-solving processes and using LLMs to extract mathematical proficiency indicators enhances both prediction accuracy and explainability in knowledge tracing, addressing fundamental limitations of traditional methods.

Abstract: Knowledge Tracing (KT) aims to model a student’s knowledge state and predict future performance to enable personalized learning in Intelligent Tutoring Systems. However, traditional KT methods face fundamental limitations in explainability, as they rely solely on the response correctness, neglecting the rich information embedded in students’ problem-solving processes. To address this gap, we propose Knowledge Tracing Leveraging Problem-Solving Process (KT-PSP), which incorporates students’ problem-solving processes to capture the multidimensional aspects of mathematical proficiency. We also introduce KT-PSP-25, a new dataset specifically designed for the KT-PSP. Building on this, we present StatusKT, a KT framework that employs a teacher-student-teacher three-stage LLM pipeline to extract students’ mathematical proficiency (MP) as intermediate signals. In this pipeline, the teacher LLM first extracts problem-specific proficiency indicators, then a student LLM generates responses based on the student’s solution process, and a teacher LLM evaluates these responses to determine mastery of each indicator. The experimental results on KT-PSP-25 demonstrate that StatusKT improves the prediction performance of existing KT methods. Moreover, StatusKT provides interpretable explanations for its predictions by explicitly modeling students’ mathematical proficiency.
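
The three-stage pipeline can be sketched as plain prompt plumbing; `ask` is an assumed LLM-call helper returning text, and all prompts are illustrative, not the paper's.

```python
def status_kt_signals(problem, solution_process, ask):
    """Teacher-student-teacher pipeline producing mastery signals for KT."""
    # Stage 1 (teacher): problem-specific proficiency indicators, one per line.
    indicators = ask(
        f"List, one per line, the proficiency indicators needed to solve:\n"
        f"{problem}").splitlines()
    # Stage 2 (student): respond to each indicator probe as the student would,
    # conditioned on the student's actual solution process.
    responses = [ask(
        f"Act as a student whose written work is:\n{solution_process}\n"
        f"Respond to the probe: {ind}") for ind in indicators]
    # Stage 3 (teacher): grade each response into a mastery judgment, which a
    # downstream KT model consumes as an intermediate signal.
    mastery = [ask(
        f"Probe: {ind}\nStudent response: {resp}\n"
        f"Is this indicator mastered? Answer yes or no.").strip().lower() == "yes"
        for ind, resp in zip(indicators, responses)]
    return dict(zip(indicators, mastery))
```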

[469] In-Context Bias Propagation in LLM-Based Tabular Data Generation

Pol G. Recasens, Alberto Gutierrez, Jordi Torres, Josep. Ll Berral, Javier Carnerero-Cano, Anisa Halimi, Kieran Fraser

Main category: cs.LG

TL;DR: LLMs used for synthetic tabular data generation are vulnerable to bias propagation from in-context examples, where even mild statistical biases can cause global distortions and enable adversarial attacks that compromise downstream fairness.

DetailsMotivation: While LLMs show promise for synthetic data generation in data-scarce scenarios, real-world data is often noisy and demographically skewed. The paper investigates how statistical biases in in-context examples propagate to synthetic data and explores adversarial scenarios where malicious actors can inject bias to compromise downstream fairness.

Method: Systematically study how statistical biases within in-context examples propagate to synthetic tabular data distribution. Introduce adversarial scenario where malicious contributors inject bias via in-context examples. Evaluate mitigation strategies based on preprocessing in-context examples.

Result: Even mild in-context biases lead to global statistical distortions in synthetic data. Adversarial bias injection can compromise fairness of downstream classifiers for targeted protected subgroups. Preprocessing interventions can attenuate disparity but LLMs remain sensitive to adversarial prompts.

Conclusion: LLM-based data generation pipelines have a critical new vulnerability in sensitive domains, where bias propagation from in-context examples poses significant fairness risks that current mitigation strategies cannot fully address.

Abstract: Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance through augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Finally, we evaluate mitigation strategies based on preprocessing in-context examples, demonstrating that while such interventions can attenuate disparity, the inherent sensitivity of LLMs to adversarial prompts remains a persistent challenge. Our findings highlight a critical new vulnerability in LLM-based data generation pipelines within sensitive domains.

[470] Predictor-Free and Hardware-Aware Federated Neural Architecture Search via Pareto-Guided Supernet Training

Bostan Khan, Masoud Daneshtalab

Main category: cs.LG

TL;DR: DeepFedNAS is a novel two-phase framework for Federated Neural Architecture Search that addresses bottlenecks in supernet training and subnet discovery through Pareto-optimal caching and predictor-free search, achieving SOTA accuracy with 61x speedup.

DetailsMotivation: Current FedNAS approaches suffer from unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery, making hardware-aware FL deployments impractical.

Method: Two-phase framework: 1) Federated Pareto Optimal Supernet Training using pre-computed Pareto-optimal cache as intelligent curriculum, 2) Predictor-Free Search Method using multi-objective fitness function as direct accuracy proxy for on-demand subnet discovery.

Result: Achieves state-of-the-art accuracy (up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, 61x speedup in total pipeline time, reducing from over 20 hours to ~20 minutes, with 20-second individual subnet searches.

Conclusion: DeepFedNAS makes hardware-aware FL deployments instantaneous and practical by addressing critical bottlenecks in FedNAS through intelligent curriculum training and predictor-free search, enabling efficient on-demand model design for privacy-preserving FL.

Abstract: Federated Neural Architecture Search (FedNAS) aims to automate model design for privacy-preserving Federated Learning (FL) but currently faces two critical bottlenecks: unguided supernet training that yields suboptimal models, and costly multi-hour pipelines for post-training subnet discovery. We introduce DeepFedNAS, a novel, two-phase framework underpinned by a multi-objective fitness function that synthesizes mathematical network design with architectural heuristics. Enabled by a re-engineered supernet, DeepFedNAS introduces Federated Pareto Optimal Supernet Training, which leverages a pre-computed Pareto-optimal cache of high-fitness architectures as an intelligent curriculum to optimize shared supernet weights. Subsequently, its Predictor-Free Search Method eliminates the need for costly accuracy surrogates by utilizing this fitness function as a direct, zero-cost proxy for accuracy, enabling on-demand subnet discovery in mere seconds. DeepFedNAS achieves state-of-the-art accuracy (e.g., up to 1.21% absolute improvement on CIFAR-100), superior parameter and communication efficiency, and a substantial ~61x speedup in total post-training search pipeline time. By reducing the pipeline from over 20 hours to approximately 20 minutes (including initial cache generation) and enabling 20-second individual subnet searches, DeepFedNAS makes hardware-aware FL deployments instantaneous and practical. The complete source code and experimental scripts are available at: https://github.com/bostankhan6/DeepFedNAS

[471] Spectral Representation-based Reinforcement Learning

Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai

Main category: cs.LG

TL;DR: Spectral representations of transition operators provide a theoretically grounded framework for RL that addresses issues with neural network approximations, offering effective algorithms validated on challenging control tasks.

DetailsMotivation: Traditional RL with powerful function approximations like neural networks suffers from theoretical ambiguities, optimization instability, exploration difficulty, and high computational costs. The authors seek a more principled approach.

Method: Introduce spectral representations derived from spectral decomposition of transition operators. Show how to construct these representations for systems with latent variable structures or energy-based structures, developing corresponding learning methods that yield effective RL algorithms. Extend the framework to partially observable MDPs.

Result: Algorithms based on spectral representations achieve comparable or superior performance to state-of-the-art model-free and model-based baselines on over 20 challenging tasks from the DeepMind Control Suite.

Conclusion: Spectral representations provide a theoretically clear and practically effective alternative to neural network approximations in RL, addressing key challenges while maintaining strong empirical performance on complex tasks.

Abstract: In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
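
The central identity behind the framework: a spectral factorization of the transition operator makes the Q-function linear in the learned features $\phi$ (notation ours):

```latex
\[
P(s' \mid s, a) \;=\; \sum_{i} \sigma_i\, \phi_i(s, a)\, \mu_i(s')
\;=\; \big\langle \phi(s, a),\, \mu(s') \big\rangle,
\qquad
Q^{\pi}(s, a) \;=\; r(s, a) + \gamma\, \Big\langle \phi(s, a),\, \int \mu(s')\, V^{\pi}(s')\, \mathrm{d}s' \Big\rangle ,
\]
% so once \phi is extracted from data, policy evaluation reduces to fitting a
% weight vector in the feature space instead of a black-box value network.
```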

[472] Taxonomy of reduction matrices for Graph Coarsening

Antonin Joly, Nicolas Keriven, Aline Roumy

Main category: cs.LG

TL;DR: The paper introduces a more flexible graph coarsening framework by decoupling reduction and lifting matrices, showing that optimizing the reduction matrix alone can improve spectral approximation and GNN performance.

DetailsMotivation: Traditional graph coarsening frameworks impose fixed relationships between reduction and lifting matrices (typically as pseudo-inverses), but the authors observe these roles are not symmetric. Constraining only the lifting matrix ensures important graph objects exist, allowing for more flexible reduction matrix design to reduce information loss.

Method: The authors introduce a general notion of reduction matrix not necessarily tied to the lifting matrix’s pseudo-inverse. They establish a taxonomy of “admissible” reduction matrix families, discuss required properties, and explore examples including constrained optimization of Restricted Spectral Approximation (RSA). They also test the approach on node classification tasks with Graph Neural Networks.

Result: The paper demonstrates that for a fixed coarsening (fixed lifting matrix), the RSA can be further reduced by modifying the reduction matrix alone. Different reduction matrix families are shown to impact spectral approximation quality, and these choices affect GNN performance on node classification tasks using coarsened graphs.

Conclusion: Decoupling reduction and lifting matrices provides a more flexible graph coarsening framework that can reduce information loss (RSA) and improve downstream task performance. The taxonomy of admissible reduction matrices offers practical design choices for graph coarsening applications.

Abstract: Graph coarsening aims to diminish the size of a graph to lighten its memory footprint, and has numerous applications in graph signal processing and machine learning. It is usually defined using a reduction matrix and a lifting matrix, which, respectively, allow to project a graph signal from the original graph to the coarsened one and back. This results in a loss of information measured by the so-called Restricted Spectral Approximation (RSA). Most coarsening frameworks impose a fixed relationship between the reduction and lifting matrices, generally as pseudo-inverses of each other, and seek to define a coarsening that minimizes the RSA. In this paper, we remark that the roles of these two matrices are not entirely symmetric: indeed, putting constraints on the lifting matrix alone ensures the existence of important objects such as the coarsened graph’s adjacency matrix or Laplacian. In light of this, we introduce a more general notion of reduction matrix, one that is not necessarily the pseudo-inverse of the lifting matrix. We establish a taxonomy of “admissible” families of reduction matrices, discuss the different properties that they must satisfy, and whether they admit a closed-form description or not. We show that, for a fixed coarsening represented by a fixed lifting matrix, the RSA can be further reduced simply by modifying the reduction matrix. We explore different examples, including some based on a constrained optimization process of the RSA. Since this criterion has also been linked to the performance of Graph Neural Networks, we also illustrate the impact of these choices on different node classification tasks on coarsened graphs.
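
In symbols, with lifting matrix $P$ and reduction matrix $Q$: a signal $x$ is reduced to $Qx$ and lifted back to $PQx$, and an RSA-style criterion measures the worst-case distortion over a signal subspace. Classical frameworks fix $Q = P^{+}$; the paper lets $Q$ range over an admissible family to shrink this quantity for the same $P$. (The exact normalization of the RSA may differ from this sketch.)

```latex
\[
x_c = Q x, \qquad \tilde{x} = P Q x, \qquad
\sup_{x \in \mathcal{X},\; x \neq 0}\;
\frac{\lVert x - P Q x \rVert}{\lVert x \rVert} ,
\]
% constraining P alone already guarantees the coarsened adjacency/Laplacian
% exists, which is what frees Q to be optimized.
```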

[473] Feature-Space Adversarial Robustness Certification for Multimodal Large Language Models

Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

Main category: cs.LG

TL;DR: Feature-space Smoothing (FS) provides certified robustness guarantees for MLLMs by smoothing feature representations against adversarial perturbations, with a plug-and-play Gaussian Smoothness Booster (GSB) module to enhance robustness without retraining.

DetailsMotivation: Multimodal large language models (MLLMs) are powerful but vulnerable to adversarial perturbations that distort feature representations and cause erroneous predictions. There's a need for certified robustness guarantees at the feature level.

Method: Propose Feature-space Smoothing (FS) framework that converts feature extractors into smoothed variants with certified cosine similarity bounds under ℓ₂-bounded perturbations. Also introduce Gaussian Smoothness Booster (GSB) module to enhance Gaussian robustness scores of pretrained MLLMs without retraining.

Result: FS provides strong certified feature-space robustness for various MLLMs. The approach consistently leads to robust task-oriented performance across diverse applications, with GSB effectively enhancing guaranteed robustness.

Conclusion: Feature-space Smoothing offers a general framework for certified robustness in MLLMs at the feature representation level, with practical plug-and-play enhancement via GSB, making MLLMs more reliable against adversarial attacks.

Abstract: Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS), a general framework that provides certified robustness guarantees at the feature representation level of MLLMs. We theoretically prove that FS converts a given feature extractor into a smoothed variant that is guaranteed a certified lower bound on the cosine similarity between clean and adversarial features under $\ell_2$-bounded perturbations. Moreover, we establish that the value of this Feature Cosine Similarity Bound (FCSB) is determined by the intrinsic Gaussian robustness score of the given encoder. Building on this insight, we introduce the Gaussian Smoothness Booster (GSB), a plug-and-play module that enhances the Gaussian robustness score of pretrained MLLMs, thereby strengthening the robustness guaranteed by FS, without requiring additional MLLM retraining. Extensive experiments demonstrate that applying the FS to various MLLMs yields strong certified feature-space robustness and consistently leads to robust task-oriented performance across diverse applications.
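
One natural instantiation of the smoothed encoder, in the spirit of randomized smoothing (a definition sketch; the paper's certified FCSB expression is not reproduced here):

```latex
\[
g(x) \;=\; \mathbb{E}_{\delta \sim \mathcal{N}(0,\, \sigma^{2} I)}\big[\, f(x + \delta) \,\big],
\qquad
\cos\!\big(g(x),\, g(x + \varepsilon)\big) \;\ge\; \mathrm{FCSB}(x)
\;\;\text{for all } \lVert \varepsilon \rVert_{2} \le r ,
\]
% with the value of the bound governed by f's intrinsic Gaussian robustness
% score, which the plug-and-play GSB module is trained to increase.
```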

[474] KV Admission: Learning What to Write for Efficient Long-Context Inference

Yen-Chieh Huang, Pi-Cheng Hsiu, Rui Fang, Ming-Syan Chen

Main category: cs.LG

TL;DR: WG-KV introduces a learnable KV cache admission mechanism that predicts token utility before writing to memory, reducing memory usage by 46-68% and achieving 3-4x speedups for long-context LLM inference.

DetailsMotivation: Long-context LLM inference suffers from quadratic attention complexity and linear KV cache growth. Existing approaches use post-hoc selection/eviction but fail to address the root problem: indiscriminate writing to memory.

Method: Formalizes KV cache management as a causal system with three primitives: KV Admission, Selection, and Eviction. Instantiates KV Admission via Write-Gated KV (WG-KV), a lightweight mechanism that learns to predict token utility before cache entry, filtering low-utility states early.

Result: Reduces memory usage by 46-68%, delivers 3.03-3.70x prefill and 1.85-2.56x decode speedups on Llama and Qwen models, while maintaining compatibility with FlashAttention and Paged-KV systems.

Conclusion: Learning what to write is a principled and practical approach for efficient long-context inference, addressing the root inefficiency of indiscriminate memory writing.

Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV (WG-KV), a lightweight mechanism that learns to predict token utility before cache entry. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, WG-KV reduces memory usage by 46-68% and delivers 3.03-3.70x prefill and 1.85-2.56x decode speedups on Llama and Qwen models, while maintaining compatibility with FlashAttention and Paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV.
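
A minimal sketch of the admission idea: a small learned gate scores each token before its KV pair is written; only high-utility older tokens enter the compact global cache, while a sliding window keeps recent tokens verbatim. Shapes and names are illustrative, not the WG-KV code.

```python
import torch
import torch.nn as nn

class WriteGate(nn.Module):
    """Scores each token's utility before its KV pair is written to cache."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model // 4), nn.SiLU(),
                                 nn.Linear(d_model // 4, 1))

    def forward(self, h):                        # h: (seq, d_model)
        return self.net(h).squeeze(-1)           # (seq,) utility scores

def admit_kv(h, k, v, gate, keep_ratio=0.4, local_window=128):
    """Compact global cache (gated) plus sliding local cache (recent tokens)."""
    seq = h.size(0)
    cut = max(0, seq - local_window)             # boundary of the local window
    utility = gate(h[:cut])                      # only older tokens are gated
    n_keep = int(keep_ratio * cut)
    idx = utility.topk(n_keep).indices.sort().values   # keep token order
    k_cache = torch.cat([k[idx], k[cut:]])       # global part + local part
    v_cache = torch.cat([v[idx], v[cut:]])
    return k_cache, v_cache
```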

[475] Continuous Evolution Pool: Taming Recurring Concept Drift in Online Time Series Forecasting

Tianxiang Zhan, Ming Jin, Yuanpeng He, Yuxuan Liang, Yong Deng, Shirui Pan

Main category: cs.LG

TL;DR: CEP is a privacy-preserving framework for online time series forecasting that maintains a dynamic pool of specialized forecasters using lightweight statistical genes instead of storing raw data, addressing catastrophic forgetting under privacy constraints.

DetailsMotivation: Address the dual challenge of mitigating catastrophic forgetting in recurring concept drift while adhering to strict privacy constraints that prevent retaining historical data, overcoming limitations of existing approaches that suffer from knowledge overwriting or privacy risks.

Method: Continuous Evolution Pool (CEP) framework with three components: 1) Retrieval mechanism to identify nearest concept based on gene similarity, 2) Evolution strategy to spawn new forecasters upon detecting distribution shifts, and 3) Elimination policy to prune obsolete models under memory constraints.

Result: CEP significantly outperforms state-of-the-art baselines on real-world datasets, reducing forecasting error by over 20% without accessing historical ground truth.

Conclusion: CEP effectively addresses the privacy-preserving online forecasting challenge by decoupling concept identification from forecasting using lightweight statistical genes, enabling adaptive learning without storing sensitive historical data.

Abstract: Recurring concept drift poses a dual challenge in online time series forecasting: mitigating catastrophic forgetting while adhering to strict privacy constraints that prevent retaining historical data. Existing approaches predominantly rely on parameter updates or experience replay, which inevitably suffer from knowledge overwriting or privacy risks. To address this, we propose the Continuous Evolution Pool (CEP), a privacy-preserving framework that maintains a dynamic pool of specialized forecasters. Instead of storing raw samples, CEP utilizes lightweight statistical genes to decouple concept identification from forecasting. Specifically, it employs a Retrieval mechanism to identify the nearest concept based on gene similarity, an Evolution strategy to spawn new forecasters upon detecting distribution shifts, and an Elimination policy to prune obsolete models under memory constraints. Experiments on real-world datasets demonstrate that CEP significantly outperforms state-of-the-art baselines, reducing forecasting error by over 20% without accessing historical ground truth.
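
The Retrieval/Evolution/Elimination loop can be sketched with simple window statistics standing in for the paper's genes; the distance threshold and eviction rule here are illustrative choices, not CEP's.

```python
import numpy as np

class ContinuousEvolutionPool:
    """Pool of specialized forecasters keyed by lightweight statistical genes."""
    def __init__(self, make_forecaster, max_size=8, drift_thresh=1.0):
        self.make = make_forecaster          # factory for a fresh forecaster
        self.pool = []                       # entries: [gene, forecaster, hits]
        self.max_size, self.thresh = max_size, drift_thresh

    @staticmethod
    def gene(window):
        x = np.asarray(window, dtype=float)  # no raw samples are stored
        return np.array([x.mean(), x.std(), np.abs(np.diff(x)).mean()])

    def route(self, window):
        g = self.gene(window)
        if self.pool:
            dists = [np.linalg.norm(g - entry[0]) for entry in self.pool]
            i = int(np.argmin(dists))
            if dists[i] < self.thresh:       # Retrieval: nearest known concept
                self.pool[i][2] += 1
                return self.pool[i][1]
        if len(self.pool) >= self.max_size:  # Elimination: prune least used
            self.pool.pop(int(np.argmin([e[2] for e in self.pool])))
        forecaster = self.make()             # Evolution: spawn for new concept
        self.pool.append([g, forecaster, 1])
        return forecaster
```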

[476] NavFormer: IGRF Forecasting in Moving Coordinate Frames

Yoontae Hwang, Dongwoo Lee, Minseok Choi, Heechan Park, Yong Sup Ihn, Daham Kim, Deok-Young Lee

Main category: cs.LG

TL;DR: NavFormer uses rotation-invariant features and a Canonical SPD module to forecast IGRF total intensity from triad magnetometer data, achieving better performance than baselines across multiple flight scenarios.

DetailsMotivation: Triad magnetometer readings change with sensor attitude even when the IGRF (International Geomagnetic Reference Field) total intensity remains constant, creating challenges for accurate forecasting in autonomous navigation systems.

Method: Uses rotation-invariant scalar features and a Canonical SPD (Symmetric Positive Definite) module that stabilizes the spectrum of window-level second moments of the triads without sign discontinuities. The module builds a canonical frame from a Gram matrix per window and applies state-dependent spectral scaling in original coordinates.

Result: Experiments across five flights show lower error than strong baselines in standard training, few-shot training, and zero-shot transfer scenarios.

Conclusion: NavFormer provides robust IGRF forecasting for autonomous navigators by effectively handling attitude-dependent magnetometer variations while maintaining rotation invariance, with code publicly available for further research.

Abstract: Triad magnetometer components change with sensor attitude even when the IGRF total intensity target stays invariant. NavFormer forecasts this invariant target with rotation-invariant scalar features and a Canonical SPD module that stabilizes the spectrum of window-level second moments of the triads without sign discontinuities. The module builds a canonical frame from a Gram matrix per window and applies state-dependent spectral scaling in the original coordinates. Experiments across five flights show lower error than strong baselines in standard training, few-shot training, and zero-shot transfer. The code is available at: https://anonymous.4open.science/r/NavFormer-Robust-IGRF-Forecasting-for-Autonomous-Navigators-0765
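
A numpy sketch of the invariances involved: rotating every triad sample $b_i \mapsto R\,b_i$ maps the window's Gram matrix $G$ to $R G R^{\top}$, so its eigenvalues, like the per-sample norms (the total-intensity target), are rotation-invariant, and its eigenvectors give a canonical frame once their signs are fixed deterministically. The sign-fixing rule below is one simple choice, not necessarily NavFormer's.

```python
import numpy as np

def invariant_features(B):
    """Rotation-invariant features for a window of triad readings B (n, 3)."""
    norms = np.linalg.norm(B, axis=1)        # field intensity per sample
    G = B.T @ B / len(B)                     # window-level second moment (3, 3)
    evals, evecs = np.linalg.eigh(G)         # invariant spectrum, ascending
    # Canonical frame: make the largest-magnitude entry of each eigenvector
    # positive, so signs do not flip arbitrarily between windows.
    for j in range(3):
        k = np.argmax(np.abs(evecs[:, j]))
        evecs[:, j] *= np.sign(evecs[k, j]) or 1.0
    B_canon = B @ evecs                      # triads expressed in the frame
    return np.concatenate([evals, [norms.mean(), norms.std()]]), B_canon
```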

[477] Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Tianle Cai, Wenhao Huang

Main category: cs.LG

TL;DR: Behavioral calibration training enables LLMs to admit uncertainty by abstaining when not confident, allowing smaller models to surpass frontier models in uncertainty quantification despite lower factual accuracy.

DetailsMotivation: LLM hallucinations persist in critical domains despite scaling improvements. Current RLHF training incentivizes models to guess whenever correctness probability exceeds zero rather than being honest communicators, creating a misalignment between model behavior and accuracy.

Method: Proposes training interventions optimizing strictly proper scoring rules for models to output calibrated probability of correctness. Methods enable models to either abstain from complete responses or flag uncertain claims. Uses Qwen3-4B-Instruct with behavior-calibrated reinforcement learning.

Result: Behavior-calibrated models achieve superior uncertainty quantification: 0.806 log-scale Accuracy-to-Hallucination Ratio gain vs GPT-5’s 0.207 on math reasoning (BeyondAIME). In factual QA (SimpleQA), 4B LLM achieves zero-shot calibration error on par with frontier models (Grok-4, Gemini-2.5-Pro) despite much lower factual accuracy.

Conclusion: Behavioral calibration enables smaller models to surpass frontier models in uncertainty quantification—a transferable meta-skill decoupled from raw predictive accuracy. This addresses the fundamental misalignment in standard RLHF training that prioritizes test-taking over honest communication.

Abstract: LLM deployment in critical domains is currently impeded by persistent hallucinations: generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification, a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model’s log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5’s (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.

[478] Generalization in Reinforcement Learning for Radio Access Networks

Burak Demirel, Yu Wang, Cristian Tatino, Pablo Soldati

Main category: cs.LG

TL;DR: A generalization-focused RL framework for RAN control that improves throughput and spectral efficiency by 10-20% over baselines through robust state reconstruction, domain randomization, and distributed training.

DetailsMotivation: Traditional rule-based RRM algorithms underperform in dynamic RAN environments, while existing RL approaches overfit to training conditions and fail to generalize to diverse deployments and unpredictable radio conditions.

Method: Three key components: 1) Robust state reconstruction from partial/noisy observations with graph representations for static/semi-static network information, 2) Domain randomization to broaden training distribution, 3) Distributed data generation with centralized training in O-RAN compatible architecture.
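
For component (2), a domain-randomization step just samples a fresh network configuration per training episode; the parameter names and ranges below are illustrative stand-ins, not the paper's actual randomization set.

```python
import random

def sample_scenario(rng: random.Random) -> dict:
    """Draw one episode's RAN configuration for domain randomization.

    Every field here is a hypothetical example of what could be
    randomized to broaden the training distribution.
    """
    return {
        "num_cells": rng.choice([1, 3, 9]),
        "carrier_ghz": rng.uniform(0.7, 3.5),
        "ue_speed_kmh": rng.uniform(0.0, 120.0),
        "traffic": rng.choice(["full_buffer", "embb", "mixed"]),
        "cqi_noise_std": rng.uniform(0.0, 1.0),  # partial/noisy observations
    }
```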

Result: 10% average throughput/spectral efficiency improvement over OLLA baseline in full-buffer MIMO/mMIMO, >20% improvement under high mobility, matches specialized RL in full-buffer traffic, 4x gains in eMBB benchmarks, 2x gains in mixed-traffic benchmarks, and 30% higher throughput with GAT models over MLP baselines in nine-cell deployments.

Conclusion: The framework demonstrates effective generalization across diverse network conditions and offers a scalable path toward AI-native 6G RAN using a single, generalizable RL agent.

Abstract: Modern RANs operate in highly dynamic and heterogeneous environments, where hand-tuned, rule-based RRM algorithms often underperform. While RL can surpass such heuristics in constrained settings, the diversity of deployments and unpredictable radio conditions introduce major generalization challenges. Data-driven policies frequently overfit to training conditions, degrading performance in unseen scenarios. To address this, we propose a generalization-centered RL framework for RAN control that: (i) robustly reconstructs dynamically varying states from partial and noisy observations, while encoding static and semi-static information, such as radio nodes, cell attributes, and their topology, through graph representations; (ii) applies domain randomization to broaden the training distribution; and (iii) distributes data generation across multiple actors while centralizing training in a cloud-compatible architecture aligned with O-RAN principles. Although generalization increases computational and data-management complexity, our distributed design mitigates this by scaling data collection and training across diverse network conditions. Applied to downlink link adaptation in five 5G benchmarks, our policy improves average throughput and spectral efficiency by ~10% over an OLLA baseline (10% BLER target) in full-buffer MIMO/mMIMO and by >20% under high mobility. It matches specialized RL in full-buffer traffic and achieves up to 4- and 2-fold gains in eMBB and mixed-traffic benchmarks, respectively. In nine-cell deployments, GAT models offer 30% higher throughput over MLP baselines. These results, combined with our scalable architecture, offer a path toward AI-native 6G RAN using a single, generalizable RL agent.

[479] Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali

Main category: cs.LG

TL;DR: FailFast is a speculative decoding framework that uses diffusion LLMs as drafters to accelerate autoregressive LLMs by dynamically adapting draft lengths - failing fast in hard regions and winning big in easy regions for up to 4.9x speedup.

DetailsMotivation: Diffusion LLMs offer fast parallel token generation but suffer from an efficiency-quality tradeoff when used standalone. The authors aim to leverage dLLMs' strengths as drafters in speculative decoding, where their parallel decoding capability can lower rejection risk and enable longer drafts for greater speedups.

Method: FailFast uses dLLMs as drafters in speculative decoding with AR verifiers. It dynamically adapts speculation length: “fails fast” by using minimal compute in hard-to-speculate regions to reduce speculation latency, and “wins big” by aggressively extending draft lengths in easier regions to reduce verification latency (sometimes speculating 70+ tokens at once).
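
The control flow can be sketched as a simple loop; `draft_step` and `verify` are placeholder interfaces, and the grow/shrink multipliers stand in for FailFast's actual adaptation policy.

```python
def speculative_decode(draft_step, verify, prompt, max_new=256,
                       k_min=4, k_max=96, grow=2.0, shrink=0.5):
    """Adaptive-length speculative decoding loop (illustrative control law).

    draft_step(tokens, k): k draft tokens from the dLLM drafter.
    verify(tokens, draft): (n_accepted, next_token) from the AR verifier.
    """
    tokens, k = list(prompt), k_min
    while len(tokens) - len(prompt) < max_new:
        draft = draft_step(tokens, k)
        n_accepted, next_token = verify(tokens, draft)
        tokens += draft[:n_accepted] + [next_token]
        if n_accepted == len(draft):      # easy region: win big
            k = min(int(k * grow), k_max)
        else:                             # hard region: fail fast
            k = max(int(k * shrink), k_min)
    return tokens
```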

Result: Without fine-tuning, FailFast achieves lossless acceleration of AR LLMs with up to 4.9x speedup over vanilla decoding, 1.7x over best naive dLLM drafter, and 1.7x over EAGLE-3 across diverse models and workloads.

Conclusion: dLLMs can be effectively used as drafters in speculative decoding when their attributes are properly leveraged. The dynamic adaptation approach of FailFast successfully addresses the efficiency-quality tradeoff, enabling practical realization of lengthy drafts for substantial speedups while maintaining quality.

Abstract: Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM’s speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It “fails fast” by spending minimal compute in hard-to-speculate regions to shrink speculation latency and “wins big” by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.7$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.

[480] Sparse Equation Matching: A Derivative-Free Learning for General-Order Dynamical Systems

Jiaqiang Li, Jianbin Tan, Xueqin Wang

Main category: cs.LG

TL;DR: SEM is a unified framework for equation discovery that uses integral-based sparse regression with Green’s functions, enabling derivative-free estimation for general-order dynamical systems, validated on EEG data.

DetailsMotivation: Existing equation discovery methods rely on accurate derivative estimation and are limited to first-order systems, restricting their applicability in real-world scenarios like brain connectivity analysis, climate modeling, and physical simulation.

Method: Sparse Equation Matching (SEM) introduces an integral-based sparse regression approach using Green’s functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems.
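
Schematically, the integral reformulation trades derivative estimation for integration against a Green's function (the notation below is ours, not the paper's):

```latex
% Derivative-free matching via a Green's function (notation ours).
% An order-q linear operator \mathcal{L} with Green's function G turns
%   \mathcal{L} x(t) = f(x(t))
% into the integral equation
%   x(t) = x_h(t) + \int G(t,s)\, f(x(s))\, \mathrm{d}s .
% With a candidate library f(x) \approx \Theta(x)\,\xi, the sparse
% coefficients \xi are then fit by regression on
%   x(t_i) \approx x_h(t_i) + \Bigl[\int G(t_i,s)\, \Theta(x(s))\, \mathrm{d}s\Bigr] \xi ,
% which needs only integrals of the observed trajectory, never derivatives.
```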

Result: SEM demonstrates effectiveness through extensive simulations, outperforming derivative-based approaches. When applied to EEG data from 52 participants in oculomotor tasks, it identifies active brain regions across participants and reveals task-specific connectivity patterns.

Conclusion: SEM provides a unified framework for equation discovery that overcomes limitations of derivative-based methods, offering valuable insights into brain connectivity and neural mechanisms through its application to real-world EEG data.

Abstract: Equation discovery is a fundamental learning task for uncovering the underlying dynamics of complex systems, with wide-ranging applications in areas such as brain connectivity analysis, climate modeling, gene regulation, and physical simulation. However, many existing approaches rely on accurate derivative estimation and are limited to first-order dynamical systems, restricting their applicability in real-world scenarios. In this work, we propose Sparse Equation Matching (SEM), a unified framework that encompasses several existing equation discovery methods under a common formulation. SEM introduces an integral-based sparse regression approach using Green’s functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems. The effectiveness of SEM is demonstrated through extensive simulations, benchmarking its performance against derivative-based approaches. We then apply SEM to electroencephalographic (EEG) data recorded during multiple oculomotor tasks, collected from 52 participants in a brain-computer interface experiment. Our method identifies active brain regions across participants and reveals task-specific connectivity patterns. These findings offer valuable insights into brain connectivity and the underlying neural mechanisms.

[481] The Bayesian Geometry of Transformer Attention

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Transformers implement Bayesian inference through geometric mechanisms: residual streams store beliefs, feed-forward networks update posteriors, and attention provides routing. Small transformers achieve high accuracy on controlled tasks where true posteriors are known, while MLPs fail.

DetailsMotivation: To rigorously verify if transformers perform Bayesian reasoning, overcoming limitations of natural data (lack of analytic posteriors) and large models (conflating reasoning with memorization).

Method: Created “Bayesian wind tunnels” - controlled environments with known true posteriors where memorization is impossible. Tested small transformers on bijection elimination and HMM state tracking tasks, analyzing geometric mechanisms through diagnostics.
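
In such a wind tunnel the ground-truth posterior is available in closed form; here is a minimal sketch, assuming a generic `likelihood(obs, h)` interface, of the reference posterior that model predictions are scored against in bits:

```python
import numpy as np

def exact_posterior(hypotheses, observations, likelihood):
    """Ground-truth posterior over a finite hypothesis set.

    For bijection elimination, likelihood(obs, h) is 1 if h is consistent
    with obs and 0 otherwise, so the posterior is uniform on survivors;
    model predictions can then be compared against it in bits.
    """
    post = np.ones(len(hypotheses))
    for obs in observations:
        post *= np.array([likelihood(obs, h) for h in hypotheses])
    return post / post.sum()
```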

Result: Small transformers reproduce Bayesian posteriors with 10^-3-10^-4 bit accuracy, while capacity-matched MLPs fail by orders of magnitude. Transformers implement Bayesian inference through specific geometric mechanisms: residual streams as belief substrate, feed-forward networks for posterior updates, attention for routing.

Conclusion: Hierarchical attention realizes Bayesian inference by geometric design, explaining why attention is necessary and why flat architectures fail. Bayesian wind tunnels provide a foundation for connecting small, verifiable systems to reasoning in large language models.

Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing "Bayesian wind tunnels", controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks, bijection elimination and Hidden Markov Model (HMM) state tracking, we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a "frame-precision dissociation" predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

[482] FNODE: Flow-Matching for data-driven simulation of constrained multibody systems

Hongyu Wang, Jingquan Wang, Dan Negrut

Main category: cs.LG

TL;DR: FNODE framework learns acceleration mapping directly from trajectory data using supervised regression, eliminating ODE solver backpropagation bottleneck, while enforcing constraints via coordinate partitioning.

DetailsMotivation: Data-driven modeling of constrained multibody dynamics faces two main challenges: (1) high training cost of Neural ODEs requiring backpropagation through ODE solvers, and (2) error accumulation in rollout predictions.

Method: FNODE learns acceleration mapping directly from trajectory data by supervising accelerations rather than integrated states. Acceleration targets are obtained via hybrid FFT+FD numerical differentiation. Kinematic constraints are enforced through coordinate partitioning - learning accelerations only for independent generalized coordinates while dependent coordinates are recovered by solving position-level constraint equations.
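
A simplified stand-in for the hybrid FFT+FD scheme: spectral differentiation on the interior of each trajectory, with finite-difference fallbacks where the FFT's periodicity assumption breaks down.

```python
import numpy as np

def acceleration_targets(q: np.ndarray, dt: float) -> np.ndarray:
    """Second derivative of a sampled coordinate, FFT + FD hybrid.

    q: (T,) sampled generalized coordinate. Interior accelerations come
    from spectral differentiation, (i*omega)^2 in Fourier space; the
    endpoints, where periodicity fails, use second differences instead.
    """
    T = len(q)
    omega = 2 * np.pi * np.fft.fftfreq(T, d=dt)
    acc = np.fft.ifft(-(omega ** 2) * np.fft.fft(q)).real
    # Finite-difference fallback near the boundaries.
    acc[0] = (q[2] - 2 * q[1] + q[0]) / dt ** 2
    acc[-1] = (q[-1] - 2 * q[-2] + q[-3]) / dt ** 2
    return acc
```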

Result: FNODE achieves improved prediction accuracy and training/runtime efficiency compared to MBD-NODE, LSTM, and fully connected baselines across multiple benchmarks (mass-spring-damper systems, double pendulum, slider crank, vehicle model, cart-pole), while maintaining constraint satisfaction.

Conclusion: FNODE provides an efficient framework for constrained multibody dynamics modeling that eliminates the ODE solver backpropagation bottleneck through acceleration supervision and maintains constraint satisfaction via coordinate partitioning, with open-source code for reproducibility.

Abstract: Data-driven modeling of constrained multibody dynamics remains challenged by (i) the training cost of Neural ODEs, which typically require backpropagation through an ODE solver, and (ii) error accumulation in rollout predictions. We introduce a Flow-Matching Neural ODE (FNODE) framework that learns the acceleration mapping directly from trajectory data by supervising accelerations rather than integrated states, turning training into a supervised regression problem and eliminating the ODE-adjoint/solver backpropagation bottleneck. Acceleration targets are obtained efficiently via numerical differentiation using a hybrid fast Fourier transform (FFT) and finite-difference (FD) scheme. Kinematic constraints are enforced through coordinate partitioning: FNODE learns accelerations only for the independent generalized coordinates, while the dependent coordinates are recovered by solving the position-level constraint equations. We evaluate FNODE on single and triple mass-spring-damper systems, a double pendulum, a slider crank with and without friction, a vehicle model, and a cart-pole, and compare against MBD-NODE, LSTM, and fully connected baselines. Across these benchmarks, FNODE achieves improved prediction accuracy and training/runtime efficiency, while maintaining constraint satisfaction through the partitioning procedure. Our code and scripts are released as open source to support reproducibility and follow-on research.

[483] Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports

Jian Chen, Jiabao Dou

Main category: cs.LG

TL;DR: ABEX-RAT: A resource-efficient framework combining generative data augmentation (ABEX) with robust adversarial training (RAT) to address class imbalance in occupational accident report classification, achieving state-of-the-art results on OSHA dataset.

DetailsMotivation: Occupational accident report classification faces persistent challenges of severe class imbalance and data scarcity, which hinder workplace safety analysis. Traditional approaches using computationally expensive LLM fine-tuning are inefficient for specialized domains with limited data.

Method: Two-stage approach: 1) ABEX (Abstractive-Expansive) pipeline uses prompt-guided LLM to distill label-critical semantics into concise abstracts, then expands them into diverse synthetic samples to balance data distribution. 2) RAT (Random Adversarial Training) trains lightweight classifier with stochastic perturbations to enhance generalization without significant computational overhead.
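
A minimal PyTorch sketch of what an RAT step could look like under these assumptions: a single Gaussian perturbation of the input embeddings (no PGD-style inner loop), with `sigma` as an illustrative noise scale.

```python
import torch

def rat_step(model, loss_fn, x_embed, y, sigma=0.01):
    """One random adversarial training step on input embeddings.

    Unlike PGD there is no inner maximization: one stochastic Gaussian
    perturbation of scale sigma (illustrative default) keeps the extra
    cost near zero while regularizing the decision boundary.
    """
    noise = sigma * torch.randn_like(x_embed)
    loss = loss_fn(model(x_embed + noise), y)
    loss.backward()
    return loss.item()
```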

Result: Achieved state-of-the-art performance on OSHA dataset with Macro-F1 score of 90.32%, significantly outperforming both traditional baselines and fine-tuned large models.

Conclusion: Targeted data augmentation combined with robust adversarial training offers a superior, data-efficient alternative for specialized domain classification, demonstrating that resource-efficient approaches can outperform computationally expensive LLM fine-tuning for imbalanced datasets.

Abstract: The automatic classification of occupational accident reports is pivotal for workplace safety analysis but is persistently hindered by severe class imbalance and data scarcity. In this paper, we propose ABEX-RAT, a resource-efficient framework that synergizes generative data augmentation with robust adversarial learning. Unlike computationally expensive large language model (LLM) fine-tuning, our approach employs a two-stage abstractive-expansive (ABEX) pipeline: it first utilizes a prompt-guided LLM to distill label-critical semantics into concise abstracts, which are then expanded into diverse synthetic samples to balance the data distribution. Subsequently, we train a lightweight classifier using a random adversarial training (RAT) protocol, which stochastically injects perturbations to enhance generalization without significant computational overhead. Experimental results on the OSHA dataset demonstrate that ABEX-RAT establishes a new state-of-the-art, achieving a Macro-F1 score of 90.32% and significantly outperforming both traditional baselines and fine-tuned large models. This confirms that targeted augmentation combined with robust training offers a superior, data-efficient alternative for specialized domain classification. The source code will be made publicly available upon acceptance.

[484] Geometric Scaling of Bayesian Inference in LLMs

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Modern LLMs preserve geometric structures enabling Bayesian inference, with value representations organizing along entropy-aligned axes that encode uncertainty.

DetailsMotivation: To investigate whether geometric signatures observed in small transformers trained for Bayesian inference persist in production-grade language models, and to understand the role of this geometry in uncertainty representation.

Method: Analyzed value representations across Pythia, Phi-2, Llama-3, and Mistral families, identifying dominant axes correlated with predictive entropy. Performed targeted interventions on Pythia-410M’s entropy-aligned axis during in-context learning to test causal role.
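
The targeted intervention reduces to projecting one direction out of the value representations; a sketch where `axis` stands for the entropy-aligned direction, however it was identified, and a matched random unit vector would give the control condition.

```python
import torch

def ablate_axis(values: torch.Tensor, axis: torch.Tensor) -> torch.Tensor:
    """Remove one direction from value representations.

    values: (tokens, d) last-layer value vectors; axis: (d,) direction
    (e.g., the dominant axis most correlated with predictive entropy).
    Projecting it out implements the targeted intervention.
    """
    axis = axis / axis.norm()
    return values - (values @ axis).unsqueeze(-1) * axis
```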

Result: Found last-layer value representations organize along a single dominant axis strongly correlated with predictive entropy. Domain-restricted prompts collapse structure into low-dimensional manifolds. Interventions disrupting entropy-aligned axis selectively disrupt local uncertainty geometry, but don’t proportionally degrade Bayesian-like behavior.

Conclusion: Modern language models preserve the geometric substrate enabling Bayesian inference, organizing approximate Bayesian updates along this substrate. The geometry serves as a privileged readout of uncertainty rather than a singular computational bottleneck.

Abstract: Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate (low-dimensional value manifolds and progressively orthogonal keys) that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

[485] Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models

Ilyass Moummad, Kawtar Zaher, Lukas Rauch, Alexis Joly

Main category: cs.LG

TL;DR: Hashing-Baseline: A training-free hashing method using pretrained encoders with classical techniques (PCA, random orthogonal projection, threshold binarization) for competitive retrieval performance without fine-tuning.

DetailsMotivation: State-of-the-art hashing methods require expensive, scenario-specific training, creating a need for a strong training-free alternative that leverages powerful pretrained encoders.

Method: Combines frozen embeddings from state-of-the-art vision/audio encoders with classical training-free hashing techniques: principal component analysis, random orthogonal projection, and threshold binarization.
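
The whole recipe fits in a few lines; a NumPy sketch of PCA, random orthogonal rotation, and sign binarization applied to frozen embeddings (our rendering of the classical pipeline the paper revisits):

```python
import numpy as np

def hash_codes(emb: np.ndarray, n_bits: int = 64, seed: int = 0) -> np.ndarray:
    """Training-free hashing: PCA -> random orthogonal rotation -> sign.

    emb: (N, d) frozen encoder embeddings. PCA decorrelates and keeps
    the top n_bits directions, a random rotation balances per-bit
    variance, and thresholding at zero (after centering) binarizes.
    """
    x = emb - emb.mean(axis=0)                        # center
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    x = x @ vt[:n_bits].T                             # PCA projection
    rng = np.random.default_rng(seed)
    r, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    return (x @ r > 0).astype(np.uint8)               # threshold binarization
```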

Result: Achieves competitive retrieval performance on standard image retrieval benchmarks and a new audio hashing benchmark without any additional learning or fine-tuning.

Conclusion: Hashing-Baseline provides a strong, generalizable training-free hashing baseline that effectively leverages pretrained encoders for scalable fast search applications.

Abstract: Information retrieval with compact binary embeddings, also referred to as hashing, is crucial for scalable fast search applications, yet state-of-the-art hashing methods require expensive, scenario-specific training. In this work, we introduce Hashing-Baseline, a strong training-free hashing method leveraging powerful pretrained encoders that produce rich pretrained embeddings. We revisit classical, training-free hashing techniques: principal component analysis, random orthogonal projection, and threshold binarization, to produce a strong baseline for hashing. Our approach combines these techniques with frozen embeddings from state-of-the-art vision and audio encoders to yield competitive retrieval performance without any additional learning or fine-tuning. To demonstrate the generality and effectiveness of this approach, we evaluate it on standard image retrieval benchmarks as well as a newly introduced benchmark for audio hashing.

[486] Diagonal Linear Networks and the Lasso Regularization Path

Raphaël Berthier

Main category: cs.LG

TL;DR: Diagonal linear networks’ training trajectory closely mirrors lasso regularization path, with training time acting as inverse regularization parameter.

DetailsMotivation: To deepen theoretical understanding of implicit regularization in diagonal linear networks by connecting their full training trajectory to lasso regularization paths.

Method: Analyze diagonal linear networks (linear activation with diagonal weight matrices) from small initialization, establishing connection between training dynamics and lasso path.
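
For concreteness, one common quadratic parameterization and the correspondence in question (the paper's exact setup may differ):

```latex
% One common quadratic parameterization (the paper's may differ):
%   \beta_\theta = u \odot v, trained by gradient flow on
%   L(u,v) = \tfrac{1}{2n}\,\lVert y - X (u \odot v) \rVert_2^2
%   from small initialization u_0 = v_0 = \alpha \mathbf{1}, \alpha \to 0.
% The stated correspondence: the trajectory \beta(t) tracks the lasso path
%   \hat\beta(\lambda) = \arg\min_\beta \tfrac{1}{2n}\,\lVert y - X\beta \rVert_2^2
%                        + \lambda \lVert \beta \rVert_1 ,
% with training time acting as the inverse regularization parameter,
% \lambda \downarrow 0 as t \uparrow \infty.
```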

Result: Under monotonicity assumption, exact connection between training trajectory and lasso path; approximate connection in general case. Training time corresponds to inverse regularization parameter.

Conclusion: Diagonal linear networks’ training dynamics provide a continuous-time analog to lasso regularization, offering new insights into implicit regularization mechanisms in neural networks.

Abstract: Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis by showing that the full training trajectory of diagonal linear networks is closely related to the lasso regularization path. In this connection, the training time plays the role of an inverse regularization parameter. Both rigorous results and simulations are provided to illustrate this conclusion. Under a monotonicity assumption on the lasso regularization path, the connection is exact; in the general case, we show an approximate connection.

[487] Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

Longteng Zhang, Sen Wu, Shuai Hou, Zhengyu Qing, Zhuo Zheng, Danning Ke, Qihong Lin, Qiang Wang, Shaohuai Shi, Xiaowen Chu

Main category: cs.LG

TL;DR: SALR: A fine-tuning method that combines low-rank adaptation with sparse pruning to reduce model size by 2× while maintaining LoRA performance, achieving 50% sparsity and 1.7× inference speedup.

DetailsMotivation: Current fine-tuning approaches for large language models require substantial resources - either fine-tuning millions of parameters or using dense weight updates. LoRA reduces trainable parameters but still has high storage/computation costs from dense weights. Magnitude-based pruning can degrade LoRA performance when applied naively.

Method: SALR unifies low-rank adaptation with sparse pruning under a mean-squared-error framework. It statically prunes only frozen base weights to minimize pruning error bound, then recovers discarded residual information via truncated-SVD low-rank adapter. For hardware efficiency, it fuses multiple low-rank adapters into a single concatenated GEMM and uses bitmap-based encoding with two-stage pipelined decoding + GEMM design.
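
A minimal PyTorch sketch of the stated construction: magnitude-prune the frozen weight, then recover the discarded residual with a truncated SVD; the rank and 50% sparsity defaults are illustrative.

```python
import torch

def salr_decompose(w: torch.Tensor, sparsity: float = 0.5, rank: int = 16):
    """Split a frozen weight into a magnitude-pruned sparse part plus a
    truncated-SVD low-rank correction of the pruning residual.

    Returns (w_sparse, a, b) with w approximately w_sparse + a @ b.
    """
    k = int(w.numel() * sparsity)
    thresh = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > thresh
    w_sparse = w * mask
    residual = w - w_sparse                       # information pruning discards
    u, s, vh = torch.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]                    # (d, r)
    b = vh[:rank]                                 # (r, k)
    return w_sparse, a, b
```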

Result: Achieves 50% sparsity on various LLMs while matching LoRA performance on GSM8K and MMLU benchmarks. Reduces model size by 2× and delivers up to 1.7× inference speedup.

Conclusion: SALR provides an effective fine-tuning paradigm that combines the benefits of sparse pruning and low-rank adaptation, enabling efficient deployment of large language models in resource-constrained environments while maintaining performance.

Abstract: Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA’s performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.

[488] Effective Model Pruning: Measure The Redundancy of Model Components

Yixuan Wang, Dan Guralnik, Saiedeh Akbari, Warren Dixon

Main category: cs.LG

TL;DR: EMP (Effective Model Pruning) is a novel pruning method that automatically determines optimal sparsity from importance score distributions using effective sample size, providing theoretical guarantees on performance preservation.

DetailsMotivation: The paper addresses a fundamental question in model pruning: given importance scores for model components, how many components can be discarded without sacrificing performance? Current methods often rely on arbitrary thresholds or heuristics.

Method: EMP uses the effective sample size (inverse Simpson index) from particle filtering to derive an adaptive threshold from score distributions. It computes N_eff(s) from importance scores, then discards the N-N_eff lowest-scoring components, with theoretical bounds on preserved mass fraction.
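
The adaptive threshold is a one-line formula; a sketch computing $N_{eff}$ from raw scores and returning the retained components:

```python
import numpy as np

def effective_prune(scores):
    """EMP-style rule: keep the N_eff highest-scoring components.

    N_eff is the effective sample size (inverse Simpson index) of the
    normalized scores: (sum s)^2 / sum s^2. Uniform scores give
    N_eff = N (prune nothing); one dominant score gives N_eff near 1.
    """
    s = np.asarray(scores, dtype=float)
    n_eff = int(np.ceil(s.sum() ** 2 / (s ** 2).sum()))
    keep = np.argsort(s)[::-1][:n_eff]            # retained component indices
    return keep, n_eff
```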

Result: EMP provides provable upper bounds on loss change relative to original models. Experiments show effectiveness across diverse architectures (MLPs, CNNs, Transformers, LLMs, KAN) and various pruning criteria (weight magnitude, attention scores, feature-level signals).

Conclusion: EMP offers a universal, theoretically-grounded approach to model pruning that automatically determines optimal sparsity from score distributions, addressing multiple pruning criteria across diverse neural network architectures.

Abstract: This article initiates the study of a basic question about model pruning. Given a vector $s$ of importance scores assigned to model components, how many of the scored components could be discarded without sacrificing performance? We propose Effective Model Pruning (EMP), which derives the desired sparsity directly from the score distribution using the notion of effective sample size from particle filtering, also known as the inverse Simpson index. Rather than prescribe a pruning criterion, EMP supplies a universal adaptive threshold derived from the distribution of the score $s$ over the model components: EMP maps $s$ to a number $N_{eff}=N_{eff}(s)$, called the effective sample size. The $N-N_{eff}$ lowest scoring components are discarded. A tight lower bound on the preserved mass fraction $s_{eff}$ (the sum of retained normalized scores) in terms of $N_{eff}$ is derived. This process yields models with a provable upper bound on the loss change relative to the original dense model. Numerical experiments are performed demonstrating this phenomenon across a variety of network architectures including MLPs, CNNs, Transformers, LLMs, and KAN. It is also shown that EMP addresses a rich set of pruning criteria such as weight magnitude, attention score, KAN importance score, and even feature-level signals such as image pixels.

[489] Higher-Order Feature Attribution: Bridging Statistics, Explainable AI, and Topological Signal Processing

Kurt Butler, Guanchao Feng, Petar Djuric

Main category: cs.LG

TL;DR: The paper proposes a general theory of higher-order feature attribution that extends Integrated Gradients to handle feature interactions in machine learning models.

DetailsMotivation: Feature attributions become unclear when predictive models involve feature interactions (multiplicative relationships or joint contributions), which existing methods struggle to interpret properly.

Method: Develops a general theory of higher-order feature attribution based on Integrated Gradients (IG), establishing connections to statistics and topological signal processing.
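
For reference, the first-order Integrated Gradients being generalized, in a short PyTorch sketch (assumes `f` maps a tensor to a scalar output):

```python
import torch

def integrated_gradients(f, x, baseline, steps=64):
    """First-order Integrated Gradients.

    IG_i(x) = (x_i - x'_i) * integral over a in [0,1] of
    dF/dx_i evaluated at x' + a*(x - x'), approximated by a Riemann sum.
    """
    total = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, steps):
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()
        total += point.grad
    return (x - baseline) * total / steps
```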

Result: Provides several theoretical results establishing the theory and validates it on examples, showing how IG can be extended to handle feature interactions.

Conclusion: The proposed higher-order feature attribution theory extends existing explainable AI frameworks and provides clearer interpretation of feature interactions in complex models.

Abstract: Feature attributions are post-training analysis methods that assess how various input features of a machine learning model contribute to an output prediction. Their interpretation is straightforward when features act independently, but it becomes less clear when the predictive model involves interactions, such as multiplicative relationships or joint feature contributions. In this work, we propose a general theory of higher-order feature attribution, which we develop on the foundation of Integrated Gradients (IG). This work extends existing frameworks in the literature on explainable AI. When using IG as the method of feature attribution, we discover natural connections to statistics and topological signal processing. We provide several theoretical results that establish the theory, and we validate our theory on a few examples.

[490] The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

Ren Zhuang, Ben Wang, Shuifa Sun

Main category: cs.LG

TL;DR: TGR is a training-free framework that improves long chain-of-thought reasoning through manifold-informed latent foresight search with memory-efficient chunk-wise processing.

DetailsMotivation: Existing approaches for scaling test-time compute in long chain-of-thought reasoning face a fundamental trade-off between computational cost and coverage quality - either requiring high training expense or producing redundant trajectories.

Method: TGR performs manifold-informed latent foresight search under strict memory bounds. It scores candidate latent anchors using lightweight look-ahead estimates combined with soft geometric regularizers for smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length.
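
At each chunk boundary the anchor choice is a scored argmax; a schematic sketch with placeholder scoring functions, where the weights `lam` and `mu` are ours rather than the paper's:

```python
def select_anchor(candidates, lookahead_score, smooth_penalty,
                  diversity_bonus, prev_anchor, history, lam=0.1, mu=0.1):
    """Score candidate latent anchors at a chunk boundary (schematic).

    lookahead_score: cheap estimate of an anchor's downstream value; the
    geometric regularizers prefer anchors close to the previous one
    (smooth trajectory) yet far from already-explored ones (diversity).
    """
    def score(z):
        return (lookahead_score(z)
                - lam * smooth_penalty(z, prev_anchor)
                + mu * diversity_bonus(z, history))
    return max(candidates, key=score)
```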

Result: On challenging math and code benchmarks, TGR improves robust trajectory coverage (measured by area under Pass@k curve) by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1-1.3 times.

Conclusion: TGR provides an effective training-free solution that balances computational efficiency with high-quality trajectory coverage for long chain-of-thought reasoning tasks.

Abstract: Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@$k$ curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1–1.3 times.

[491] SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott, Rajkumar Buyya

Main category: cs.LG

TL;DR: SketchGuard reduces communication costs in decentralized federated learning by using sketch-based screening to filter Byzantine attackers before full model exchange.

DetailsMotivation: Decentralized federated learning is vulnerable to Byzantine attacks, and existing defenses require exchanging high-dimensional model vectors with all neighbors each round, creating prohibitive communication costs at scale.

Method: SketchGuard decouples Byzantine filtering from aggregation using sketch-based screening. It compresses d-dimensional models to k-dimensional sketches (k ≪ d) using Count Sketch, then fetches full models only from accepted neighbors after screening.
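
Count Sketch is the core primitive; a NumPy sketch in which a seed shared across clients keeps their sketches comparable for screening:

```python
import numpy as np

def count_sketch(model, k: int, seed: int = 0) -> np.ndarray:
    """Compress a d-dimensional model vector to a k-dim Count Sketch.

    Each coordinate is hashed to one of k buckets with a random sign;
    inner products are preserved in expectation, so neighbors can be
    screened on sketches before any full model is fetched.
    """
    m = np.asarray(model).ravel()
    rng = np.random.default_rng(seed)             # shared across clients
    buckets = rng.integers(0, k, size=m.size)
    signs = rng.choice([-1.0, 1.0], size=m.size)
    sketch = np.zeros(k)
    np.add.at(sketch, buckets, signs * m)
    return sketch
```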

Result: SketchGuard reduces communication complexity from O(d|N_i|) to O(k|N_i| + d|S_i|), maintains state-of-the-art robustness (mean TER deviation ≤0.5 percentage points), reduces computation by up to 82%, and communication by 50-70%.

Conclusion: SketchGuard provides an efficient solution for Byzantine-resilient decentralized federated learning by significantly reducing communication and computation overhead while maintaining strong security guarantees.

Abstract: Decentralized Federated Learning enables privacy-preserving collaborative training without centralized servers but remains vulnerable to Byzantine attacks. Existing defenses require exchanging high-dimensional model vectors with all neighbors each round, creating prohibitive costs at scale. We propose SketchGuard, which decouples Byzantine filtering from aggregation via sketch-based screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch, then fetches full models only from accepted neighbors, reducing communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted count. We prove convergence in strongly convex and non-convex settings, showing that approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold. Experiments demonstrate SketchGuard maintains state-of-the-art robustness (mean TER deviation $\leq$0.5 percentage points) while reducing computation by up to 82% and communication by 50-70%.

[492] EVEREST: An Evidential, Tail-Aware Transformer for Rare-Event Time-Series Forecasting

Antanas Zilinskas, Robert N. Shorten, Jakub Marecek

Main category: cs.LG

TL;DR: EVEREST: A transformer-based architecture for probabilistic rare-event forecasting with calibrated predictions, uncertainty estimation, and tail risk modeling, achieving state-of-the-art performance on space-weather flare prediction.

DetailsMotivation: Forecasting rare events in multivariate time-series is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. Existing methods struggle with these issues in high-stakes domains like space weather, industrial monitoring, and satellite diagnostics.

Method: Transformer-based architecture with four components: (1) learnable attention bottleneck for temporal dynamics aggregation, (2) evidential head for aleatoric/epistemic uncertainty via Normal-Inverse-Gamma distribution, (3) extreme-value head for tail risk using Generalized Pareto Distribution, (4) lightweight precursor head for early detection. Jointly optimized with composite loss (focal loss, evidential NLL, tail-sensitive EVT penalty).
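
One way the composite objective might be assembled, with the focal term written out and the evidential/EVT terms taken as given from the auxiliary heads; the weights `w_evid` and `w_tail` are placeholders:

```python
import torch

def composite_loss(logits, targets, evid_nll, tail_penalty,
                   gamma=2.0, w_evid=1.0, w_tail=1.0):
    """Training-time composite objective (weights are placeholders).

    Focal loss down-weights the abundant easy negatives that dominate
    rare-event data; the evidential NLL and EVT tail penalty computed
    by the auxiliary heads are simply summed in.
    """
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)
    focal = -((1 - pt) ** gamma) * torch.log(pt.clamp_min(1e-8))
    return focal.mean() + w_evid * evid_nll + w_tail * tail_penalty
```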

Result: Achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares on decade-long space-weather data. Model is compact (0.81M parameters), efficient to train on commodity hardware, with no inference overhead.

Conclusion: EVEREST provides effective probabilistic rare-event forecasting with uncertainty quantification and tail risk estimation. Limitations include fixed-length inputs and exclusion of image modalities, suggesting future work on streaming and multimodal forecasting.

Abstract: Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.

[493] Federated k-Means over Networks

Xu Yang, Salvatore Rastelli, Alexander Jung

Main category: cs.LG

TL;DR: Federated k-means algorithm for collaborative clustering across devices using GTVMin formulation with privacy-preserving aggregated information exchange.

DetailsMotivation: Enable collaborative clustering across interconnected devices while preserving data privacy, addressing the challenge of clustering distributed private datasets without centralizing sensitive data.

Method: Formulate federated k-means as generalized total variation minimization (GTVMin), where each device updates local centroids by solving regularized k-means with consistency regularization between neighboring devices.
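
The per-device subproblem can be written schematically as follows (notation is ours; $\rho$ sets the strength of the neighbour-consistency coupling):

```latex
% Per-device GTVMin subproblem (schematic; notation ours):
\min_{C^{(i)}}\;
\sum_{x \in \mathcal{D}_i}\, \min_{c \in C^{(i)}} \lVert x - c \rVert_2^2
\;+\; \rho \sum_{j \in \mathcal{N}(i)} A_{ij}\,
\bigl\lVert C^{(i)} - C^{(j)} \bigr\rVert_F^2
% The first term is device i's local k-means cost; the second penalizes
% disagreement with neighbouring devices' centroids, and only aggregated
% centroid information crosses the network.
```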

Result: Developed a privacy-friendly federated clustering algorithm that exchanges only aggregated information between devices while maintaining clustering performance.

Conclusion: The GTVMin formulation provides an effective framework for federated k-means clustering that preserves privacy through aggregated communication and enforces consistency across distributed devices.

Abstract: We study federated clustering, where interconnected devices collaboratively cluster the data points of private local datasets. Focusing on hard clustering via the k-means principle, we formulate federated k-means as an instance of generalized total variation minimization (GTVMin). This leads to a federated k-means algorithm in which each device updates its local cluster centroids by solving a regularized k-means problem with a regularizer that enforces consistency between neighbouring devices. The resulting algorithm is privacy-friendly, as only aggregated information is exchanged.

Jinkyu Sung, Myunggeum Jee, Joonseok Lee

Main category: cs.LG

TL;DR: Proposes a scalable Gaussian copula-based method for link sign prediction on signed graphs by representing correlation matrix as Gramian of edge embeddings and reformulating conditional probability to reduce computational cost.

DetailsMotivation: Link sign prediction on signed graphs is challenging because negative edges violate the homophily assumption, making regular graph methods inapplicable without auxiliary structures. Existing methods like CopulaGNN have computational limitations when modeling edge-edge dependencies directly.

Method: Extends CopulaGNN by modeling latent statistical dependency among edges using Gaussian copula and correlation matrix. To address computational intractability: 1) represents correlation matrix as Gramian of edge embeddings to reduce parameters, and 2) reformulates conditional probability distribution to dramatically reduce inference cost.
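
The Gramian trick in isolation, as a short PyTorch sketch; row-normalizing the embeddings guarantees a unit diagonal, and $r \ll E$ keeps the parameter count linear in the number of edges:

```python
import torch

def gramian_correlation(edge_emb: torch.Tensor) -> torch.Tensor:
    """Correlation matrix as a Gramian of edge embeddings.

    edge_emb: (E, r) learnable embeddings with r << E, so the E x E
    Gaussian-copula correlation matrix needs only O(E r) parameters.
    Unit rows make the result PSD with an exact unit diagonal.
    """
    z = torch.nn.functional.normalize(edge_emb, dim=1)  # unit rows
    return z @ z.T
```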

Result: The method achieves significantly faster convergence than baselines while maintaining competitive prediction performance with state-of-the-art models. Theoretical verification shows linear convergence, proving scalability.

Conclusion: The proposed approach successfully addresses computational challenges in modeling edge dependencies for signed graph link prediction, providing an efficient and scalable solution with strong empirical performance.

Abstract: Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, naive modeling of edge-edge relations is computationally intractable even for graphs of moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify the scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines while maintaining prediction performance competitive with state-of-the-art models.

[495] Self-induced stochastic resonance: A physics-informed machine learning approach

Divyesh Savaliya, Marius E. Yamakou

Main category: cs.LG

TL;DR: Physics-informed neural network accurately models noise-induced coherence in neurons without external forcing

DetailsMotivation: To develop an efficient, interpretable surrogate model for Self-induced Stochastic Resonance (SISR) in excitable systems, overcoming limitations of purely data-driven methods and expensive stochastic simulations

Method: Physics-Informed Neural Network (PINN) with Noise-Augmented State Predictor architecture that embeds stochastic differential equations and SISR asymptotic constraints; composite loss integrates data fidelity, dynamical residuals, and barrier-based physical constraints from Kramers’ escape theory
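
A sketch of the dynamical-residual term for one common FitzHugh-Nagumo convention (the paper's exact formulation, parameter values, and noise handling may differ):

```python
import torch

def fhn_residual_loss(net, t, eps=0.1, a=1.05):
    """Dynamical-residual term of a PINN loss (drift part only).

    net: t -> (v, w) trajectories; the residuals enforce the slow-fast
    system  eps * dv/dt = v - v**3/3 - w,   dw/dt = v + a.
    """
    t = t.requires_grad_(True)
    v, w = net(t).unbind(-1)
    dv = torch.autograd.grad(v.sum(), t, create_graph=True)[0].squeeze(-1)
    dw = torch.autograd.grad(w.sum(), t, create_graph=True)[0].squeeze(-1)
    r_v = eps * dv - (v - v ** 3 / 3 - w)
    r_w = dw - (v + a)
    return (r_v ** 2).mean() + (r_w ** 2).mean()
```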

Result: PINN accurately predicts spike-train coherence dependence on noise intensity, excitability, and timescale separation; matches direct stochastic simulations with improved accuracy, generalization, and significantly less computation compared to purely data-driven methods

Conclusion: The framework provides a data-efficient, interpretable surrogate model for simulating and analyzing noise-induced coherence in multiscale stochastic systems, enabling efficient study of SISR phenomena

Abstract: Self-induced stochastic resonance (SISR) is the emergence of coherent oscillations in slow-fast excitable systems driven solely by noise, without external periodic forcing or proximity to a bifurcation. This work presents a physics-informed machine learning framework for modeling and predicting SISR in the stochastic FitzHugh-Nagumo neuron. We embed the governing stochastic differential equations and SISR-asymptotic timescale-matching constraints directly into a Physics-Informed Neural Network (PINN) based on a Noise-Augmented State Predictor architecture. The composite loss integrates data fidelity, dynamical residuals, and barrier-based physical constraints derived from Kramers’ escape theory. The trained PINN accurately predicts the dependence of spike-train coherence on noise intensity, excitability, and timescale separation, matching results from direct stochastic simulations with substantial improvements in accuracy and generalization compared with purely data-driven methods, while requiring significantly less computation. The framework provides a data-efficient and interpretable surrogate model for simulating and analyzing noise-induced coherence in multiscale stochastic systems.

[496] R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning

Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, Peng Han

Main category: cs.LG

TL;DR: R³: A reinforcement learning method for large reasoning models using cross-context replay, in-context self-reflection, and structural entropy ranking to improve training stability and efficiency on challenging tasks.

DetailsMotivation: Existing group-based policy optimization methods for large reasoning models rely on advantage gaps from high-quality samples within batches, making training fragile and inefficient when intra-group advantages collapse on challenging tasks.

Method: Proposes R³ with three components: (1) cross-context replay strategy to maintain intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) in-context self-reflection mechanism enabling models to refine outputs using past failures, and (3) structural entropy ranking reward that assigns relative rewards to truncated/failed samples by ranking responses based on token-level entropy patterns.
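
One plausible reading of the ranking reward, sketched with a mean-entropy statistic as a stand-in for the paper's structural measure; `base` and `spread` are illustrative constants:

```python
import numpy as np

def entropy_ranking_rewards(token_entropies, base=-1.0, spread=0.5):
    """Rank failed/truncated rollouts by their token-entropy profiles.

    token_entropies: one 1-D array per failed response. Mean entropy
    (lower = more stable reasoning) is our stand-in statistic; rewards
    interpolate over [base, base + spread] so that failures are no
    longer indistinguishable to the policy gradient.
    """
    stats = np.array([float(np.mean(e)) for e in token_entropies])
    order = stats.argsort()                       # most stable first
    ranks = np.empty(len(stats))
    ranks[order] = np.linspace(1.0, 0.0, len(stats))
    return base + spread * ranks
```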

Result: Achieves state-of-the-art performance on several math benchmarks with significant improvements and fewer reasoning tokens over base models when implemented on Deepseek-R1-Distill-Qwen-1.5B trained on DeepscaleR-40k.

Conclusion: R³ addresses limitations of existing group-based policy optimization methods by maintaining advantage stability through cross-context replay, enabling self-improvement through reflection, and providing meaningful rewards for partial solutions through entropy-based ranking.

Abstract: Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R³ that advances along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism enabling models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on DeepscaleR-40k in the math domain. Experiments demonstrate our method achieves SoTA performance on several math benchmarks, representing significant improvements and fewer reasoning tokens over the base models. Code and model will be released.

[497] FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge

Xinlong Zhao, Tong Jia, Minghua He, Xixuan Yang, Ying Li

Main category: cs.LG

TL;DR: FusionLog is a zero-label cross-system log anomaly detection method that fuses general and proprietary knowledge without requiring labeled target logs, achieving over 90% F1-score.

DetailsMotivation: Existing transfer learning methods for log anomaly detection focus only on transferring general knowledge while neglecting the mismatch between this general knowledge and the proprietary knowledge specific to target systems, limiting performance when deploying in new systems with insufficient labeled logs.

Method: 1) Training-free router partitions unlabeled target logs into ‘general logs’ and ‘proprietary logs’ based on semantic similarity. 2) For general logs: uses small model with system-agnostic representation meta-learning. 3) For proprietary logs: iteratively generates pseudo-labels and fine-tunes small model using multi-round collaborative knowledge distillation and fusion between LLM and small model.
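
The router in step 1 is training-free by design; a sketch using nearest-neighbour cosine similarity to source-system log embeddings, with `tau` as an illustrative threshold:

```python
import numpy as np

def route_logs(target_emb, source_emb, tau=0.8):
    """Split unlabeled target logs into 'general' vs. 'proprietary'.

    target_emb: (N, d) and source_emb: (M, d) sentence embeddings,
    assumed unit-norm so the dot product is cosine similarity. Logs
    whose nearest source log is similar enough count as 'general'.
    """
    sims = np.asarray(target_emb) @ np.asarray(source_emb).T
    nearest = sims.max(axis=1)
    general = nearest >= tau           # pattern also seen in source system
    return general, ~general
```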

Result: Experimental results on three public log datasets show FusionLog achieves over 90% F1-score under fully zero-label setting, significantly outperforming state-of-the-art cross-system log anomaly detection methods.

Conclusion: FusionLog effectively addresses the limitation of existing methods by fusing general and proprietary knowledge, enabling successful cross-system generalization without any labeled target logs.

Abstract: Log-based anomaly detection is critical for ensuring the stability and reliability of web systems. One of the key problems in this task is the lack of sufficient labeled logs, which limits the rapid deployment in new systems. Existing works usually leverage large-scale labeled logs from a mature web system and a small amount of labeled logs from a new system, using transfer learning to extract and generalize general knowledge across both domains. However, these methods focus solely on the transfer of general knowledge and neglect the disparity and potential mismatch between such knowledge and the proprietary knowledge of target system, thus constraining performance. To address this limitation, we propose FusionLog, a novel zero-label cross-system log-based anomaly detection method that effectively achieves the fusion of general and proprietary knowledge, enabling cross-system generalization without any labeled target logs. Specifically, we first design a training-free router based on semantic similarity that dynamically partitions unlabeled target logs into ‘general logs’ and ‘proprietary logs.’ For general logs, FusionLog employs a small model based on system-agnostic representation meta-learning for direct training and inference, inheriting the general anomaly patterns shared between the source and target systems. For proprietary logs, we iteratively generate pseudo-labels and fine-tune the small model using multi-round collaborative knowledge distillation and fusion based on large language model (LLM) and small model (SM) to enhance its capability to recognize anomaly patterns specific to the target system. Experimental results on three public log datasets from different systems show that FusionLog achieves over 90% F1-score under a fully zero-label setting, significantly outperforming state-of-the-art cross-system log-based anomaly detection methods.

[498] ProToken: Token-Level Attribution for Federated Large Language Models

Waris Gill, Ahmad Humayun, Ali Anwar, Muhammad Ali Gulzar

Main category: cs.LG

TL;DR: ProToken enables token-level client attribution in federated LLMs while preserving privacy, achieving 98% accuracy across diverse models and domains.

DetailsMotivation: Federated LLMs deployed in critical applications lack attribution mechanisms to identify which clients contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification.

Method: ProToken uses two key insights: (1) transformer architectures concentrate task-specific signals in later blocks for strategic layer selection, and (2) gradient-based relevance weighting filters irrelevant neural activations to focus on neurons directly influencing token generation.
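
Gradient-based relevance weighting in miniature; `acts` and `grads` are assumed to come from the strategically selected later blocks, and `top_frac` is an illustrative filter fraction:

```python
import torch

def token_relevance(acts, grads, top_frac=0.25):
    """Gradient-weighted neuron relevance for one generated token.

    acts/grads: (layers, d) activations of later transformer blocks and
    gradients of the token's logit with respect to them. |a * g| is a
    first-order estimate of each neuron's influence; keeping only the
    top fraction filters activations irrelevant to this token.
    """
    rel = (acts * grads).abs()
    k = max(1, int(rel.numel() * top_frac))
    thresh = rel.flatten().topk(k).values.min()
    return rel * (rel >= thresh)
```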

Result: ProToken achieves 98% average attribution accuracy in correctly localizing responsible clients across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding), maintaining high accuracy when scaling client numbers.

Conclusion: ProToken provides a practical provenance methodology for token-level client attribution in federated LLMs that addresses critical deployment needs while maintaining privacy constraints, validating its viability for real-world applications.

Abstract: Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy as the number of clients is scaled, validating its practical viability for real-world deployment settings.

[499] Practical Policy Distillation for Reinforcement Learning in Radio Access Networks

Sara Khosravi, Burak Demirel, Linghui Zhou, Javier Rasines, Pablo Soldati

Main category: cs.LG

TL;DR: Policy distillation enables deployment of lightweight AI models for 5G link adaptation that meet strict RAN hardware constraints while maintaining generalization across diverse network conditions.

DetailsMotivation: AI deployment in RANs faces challenges: limited link-level measurements, real-time processing constraints (<1ms), network heterogeneity, and computational/memory limitations of legacy 4G hardware lacking neural accelerators. Lightweight models (<1Mb, <100μs) are needed but struggle with generalization across diverse conditions.

Method: Investigates policy distillation for RL-based link adaptation: 1) Single-policy distillation - compresses scenario-agnostic teacher into generalized student; 2) Multi-policy distillation - consolidates multiple scenario-specific teachers into single generalist student model.
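
A minimal sketch of the distillation step common to both strategies, assuming teacher and student output logits over the same discrete action set (the temperature and loss form are conventional choices, not confirmed details of the paper):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Policy distillation via KL between softened action distributions.
    In the multi-policy variant, teacher_logits would come from the
    scenario-specific teacher that owns the current batch."""
    t = F.softmax(teacher_logits / T, dim=-1)
    s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```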

Result: Experimental evaluations in high-fidelity 5G-compliant simulator show both strategies produce compact student models that preserve teachers’ generalization capabilities while meeting computational/memory constraints of existing RAN hardware.

Conclusion: Policy distillation effectively addresses the trade-off between model complexity and generalization, enabling deployment of lightweight AI models that comply with stringent RAN hardware limitations while maintaining performance across diverse network conditions.

Abstract: Adopting artificial intelligence (AI) in radio access networks (RANs) presents several challenges, including limited availability of link-level measurements (e.g., CQI reports), stringent real-time processing constraints (e.g., sub-1 ms per TTI), and network heterogeneity (different spectrum bands, cell types, and vendor equipment). A critical yet often overlooked barrier lies in the computational and memory limitations of RAN baseband hardware, particularly in legacy 4th Generation (4G) systems, which typically lack on-chip neural accelerators. As a result, only lightweight AI models (under 1 Mb and sub-100~μs inference time) can be effectively deployed, limiting both their performance and applicability. However, achieving strong generalization across diverse network conditions often requires large-scale models with substantial resource demands. To address this trade-off, this paper investigates policy distillation in the context of a reinforcement learning-based link adaptation task. We explore two strategies: single-policy distillation, where a scenario-agnostic teacher model is compressed into one generalized student model; and multi-policy distillation, where multiple scenario-specific teachers are consolidated into a single generalist student. Experimental evaluations in a high-fidelity, 5th Generation (5G)-compliant simulator demonstrate that both strategies produce compact student models that preserve the teachers’ generalization capabilities while complying with the computational and memory limitations of existing RAN hardware.

[500] GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs

Yating Ren, Yikun Ban, Huobin Tan

Main category: cs.LG

TL;DR: GCL-OT is a graph contrastive learning framework with optimal transport that addresses multi-granular heterophily in text-attributed graphs for better structure-text alignment.

DetailsMotivation: Existing structure-text contrastive learning methods rely on homophily assumptions and hard optimization objectives, limiting their effectiveness on heterophilic graphs. They treat textual embeddings as static targets, leading to suboptimal alignment, especially when dealing with mixed, noisy, and missing semantic correlations in text-attributed graphs.

Method: Proposes GCL-OT framework with optimal transport and tailored mechanisms for three types of heterophily: 1) RealSoftMax-based similarity estimator for partial heterophily to emphasize key neighbor-word interactions, 2) prompt-based filter for complete heterophily to exclude irrelevant noise during OT alignment, and 3) OT-guided soft supervision to uncover potential neighbors with similar semantics for latent homophily.
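
At the core of all three mechanisms is an entropic OT alignment between a node's structural context and its text. A generic Sinkhorn sketch in NumPy follows; GCL-OT's RealSoftMax similarity and prompt-based filtering would modify the cost matrix below, so this shows only the vanilla base step:

```python
import numpy as np

def sinkhorn_plan(node_emb, word_emb, eps=0.1, n_iter=200):
    """Entropic OT plan between neighbor embeddings (n, d) and word
    embeddings (m, d), with uniform marginals and cosine cost."""
    nn_ = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    ww = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    C = 1.0 - nn_ @ ww.T                       # cosine distance cost
    a = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform source marginal
    b = np.full(C.shape[1], 1.0 / C.shape[1])  # uniform target marginal
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):                    # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]         # transport plan
```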

Result: Extensive experiments on nine benchmarks show GCL-OT outperforms state-of-the-art methods, demonstrating effectiveness and robustness. Theoretical analysis shows GCL-OT improves mutual information bound and Bayes error guarantees.

Conclusion: GCL-OT provides a flexible and bidirectional alignment framework that effectively handles multi-granular heterophily in text-attributed graphs through optimal transport and tailored mechanisms, achieving superior performance over existing methods.

Abstract: Recently, structure-text contrastive learning has shown promising performance on text-attributed graphs by leveraging the complementary strengths of graph neural networks and language models. However, existing methods typically rely on homophily assumptions in similarity estimation and hard optimization objectives, which limit their applicability to heterophilic graphs. Although existing methods can mitigate heterophily through structural adjustments or neighbor aggregation, they usually treat textual embeddings as static targets, leading to suboptimal alignment. In this work, we identify multi-granular heterophily in text-attributed graphs, including complete heterophily, partial heterophily, and latent homophily, which makes structure-text alignment particularly challenging due to mixed, noisy, and missing semantic correlations. To achieve flexible and bidirectional alignment, we propose GCL-OT, a novel graph contrastive learning framework with optimal transport, equipped with tailored mechanisms for each type of heterophily. Specifically, for partial heterophily, we design a RealSoftMax-based similarity estimator to emphasize key neighbor-word interactions while easing background noise. For complete heterophily, we introduce a prompt-based filter that adaptively excludes irrelevant noise during optimal transport alignment. Furthermore, we incorporate OT-guided soft supervision to uncover potential neighbors with similar semantics, enhancing the learning of latent homophily. Theoretical analysis shows that GCL-OT can improve the mutual information bound and Bayes error guarantees. Extensive experiments on nine benchmarks show that GCL-OT outperforms state-of-the-art methods, demonstrating its effectiveness and robustness.

[501] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu

Main category: cs.LG

TL;DR: CadLLM is a training-free method to accelerate diffusion-based LLM inference by dynamically adjusting generation parameters based on token confidence, achieving up to 2.28x throughput improvement.

DetailsMotivation: The motivation is to improve inference throughput of diffusion-based LLMs (dLLMs) without requiring retraining, addressing the computational inefficiency in current dLLM inference methods.

Method: CadLLM uses a lightweight adaptive approach that controls generation block size, step size, and threshold based on average confidence of unmasked tokens. It also reduces softmax overhead by dynamically using a subset of the vocabulary to regulate sampling breadth. The method is plug-and-play and model-agnostic, compatible with KV-cache-based dLLMs.
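
The concrete schedule is not public; a sketch of what confidence-aware calibration of the three knobs could look like (all constants are illustrative):

```python
def adapt_schedule(mean_conf, base_block=32, base_steps=16, thr=0.9):
    """Adjust generation parameters from the average confidence of the
    tokens unmasked in the previous block: confident regions get larger
    blocks and fewer refinement steps; uncertain ones stay conservative."""
    if mean_conf > thr:
        return {"block_size": 2 * base_block,
                "steps": max(base_steps // 2, 1),
                "threshold": thr - 0.05}
    return {"block_size": base_block, "steps": base_steps, "threshold": thr}
```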

Result: Extensive experiments on four popular tasks show CadLLM yields up to 2.28x throughput improvement over state-of-the-art baselines while maintaining competitive accuracy.

Conclusion: CadLLM provides an effective training-free solution for accelerating diffusion-based LLM inference through adaptive parameter control and vocabulary subset optimization, achieving significant speedups with minimal accuracy loss.

Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.

[502] Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification

Natnael Tilahun Sinshaw, Mengmei He, Tadesse K. Bahiru, Sudhir Kumar Mohapatra

Main category: cs.LG

TL;DR: Comparative analysis of ML algorithms (SVM, NB, LR, AdaBoost, XGBoost, SVM_R) vs neuro-symbolic AI for text classification, exploring text representations (TF-IDF, USE, RoBERTa) and feature extraction techniques (Chi-Square, Mutual Information, PCA).

DetailsMotivation: Text classification is well-studied but needs comparative analysis of traditional ML algorithms versus emerging neuro-symbolic AI approaches, especially with different text representations and feature extraction methods.

Method: Comparative analysis of SVM, Naive Bayes, Logistic Regression, AdaBoost, XGBoost, SVM_R, and neuro-symbolic AI. Evaluation of text representations (TF-IDF, Universal Sentence Encoder, RoBERTa) and feature extraction techniques (Chi-Square, Mutual Information, PCA).
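
One of the classical baselines in this comparison maps directly onto a standard scikit-learn pipeline; a sketch with illustrative hyperparameters (the paper's exact settings are not reproduced here):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# TF-IDF representation -> Chi-Square feature selection -> linear SVM
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=5_000)),
    ("svm", LinearSVC(C=1.0)),
])
# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```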

Result: Neuro-symbolic AI approach matched strong MLP results despite limited dataset, showing competitive performance with traditional machine learning methods.

Conclusion: Neuro-symbolic AI shows promise for text classification tasks. Future work will expand knowledge base, embedding types, and hyperparameter configurations to further study NeSy effectiveness.

Abstract: Text classification problems, such as gender classification from blog text, form a mature research area that has been studied extensively with machine learning algorithms, with applications in market analysis, customer recommendation, and recommender systems. This study presents a comparative analysis of widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R), against neuro-symbolic AI (NeSy). The paper also explores the effect of text representations such as TF-IDF, the Universal Sentence Encoder (USE), and RoBERTa, as well as feature extraction techniques including Chi-Square, Mutual Information, and Principal Component Analysis. Building on these, we compare the machine learning and deep learning approaches against NeSy. The experimental results show that the NeSy approach matched strong MLP results despite a limited dataset. Future work will expand the knowledge base, the scope of embedding types, and the hyperparameter configurations to further study the effectiveness of the NeSy approach.

[503] The Ensemble Schrödinger Bridge Filter for Nonlinear Data Assimilation

Hui Sun

Main category: cs.LG

TL;DR: Proposes Ensemble Schrödinger Bridge nonlinear filter combining prediction with diffusion generative modeling for analysis, achieving derivative-free, training-free filtering that outperforms EnKF and PF in nonlinear high-dimensional chaotic systems.

DetailsMotivation: To develop a nonlinear optimal filter that can handle highly nonlinear dynamics in high-dimensional chaotic environments without structural model errors, while being computationally efficient and outperforming classical methods like ensemble Kalman filter and particle filter.

Method: Combines standard prediction procedure with diffusion generative modeling for analysis step. The Ensemble Schrödinger Bridge filter is derivative-free, training-free, and highly parallelizable, using Schrödinger bridge theory to connect prior and posterior distributions.

Result: The filter performs well with highly nonlinear dynamics in dimensions up to 40+ under chaotic conditions. It shows better performance than ensemble Kalman filter and particle filter across various tests with different levels of nonlinearity.

Conclusion: The Ensemble Schrödinger Bridge nonlinear filter is an effective approach for nonlinear filtering problems, demonstrating superior performance to classical methods. Future work will focus on meteorological applications and establishing rigorous convergence analysis.

Abstract: This work puts forward a novel nonlinear optimal filter, the Ensemble Schrödinger Bridge nonlinear filter. The proposed filter combines the standard prediction procedure with diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach introduces no structural model error, and it is derivative-free, training-free, and highly parallelizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimensions up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the particle filter in numerous tests at different levels of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.

[504] Stochastic Voronoi Ensembles for Anomaly Detection

Yang Cao, Sikun Yang, Xuyun Zhang, Yujiu Yang

Main category: cs.LG

TL;DR: SVEAD is a novel anomaly detection method using ensemble random Voronoi diagrams to handle varying local densities with linear time and constant space complexity, outperforming 12 state-of-the-art methods.

DetailsMotivation: Existing anomaly detection methods struggle with datasets having varying local densities: distance-based methods miss local anomalies, while density-based approaches require careful parameter tuning and have quadratic time complexity. The authors observed that local anomalies become detectable when data space is decomposed into restricted regions and examined independently.

Method: SVEAD constructs ensemble random Voronoi diagrams and scores points using normalized cell-relative distances weighted by local scale. This geometric approach decomposes data space into restricted regions (Voronoi cells) and examines each region independently to identify local anomalies.
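
A minimal reading of the scoring rule, with the normalization details treated as assumptions (the paper's exact local-scale weighting may differ):

```python
import numpy as np

def svead_scores(X, n_ensembles=100, n_cells=32, rng=None):
    """Average, over random Voronoi diagrams, of each point's distance
    to its cell center normalized by the cell's local scale."""
    rng = np.random.default_rng(rng)
    n = len(X)
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        centers = X[rng.choice(n, size=n_cells, replace=False)]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        cell = d.argmin(axis=1)            # Voronoi cell assignment
        dist = d[np.arange(n), cell]       # distance to own cell center
        scale = np.array([max(dist[cell == c].mean(), 1e-12)
                          if (cell == c).any() else 1.0
                          for c in range(n_cells)])
        scores += dist / scale[cell]       # cell-relative normalized distance
    return scores / n_ensembles            # higher = more anomalous
```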

Result: The method achieves linear time complexity and constant space complexity. Experiments on 45 datasets demonstrate that SVEAD outperforms 12 state-of-the-art anomaly detection approaches.

Conclusion: SVEAD provides an effective solution for anomaly detection in datasets with varying local densities by leveraging geometric decomposition through ensemble random Voronoi diagrams, offering superior performance with efficient computational complexity.

Abstract: Anomaly detection aims to identify data instances that deviate significantly from the majority of the data, and has been widely used in fraud detection, network security, and industrial quality control. Existing methods struggle with datasets exhibiting varying local densities: distance-based methods miss local anomalies, while density-based approaches require careful parameter selection and incur quadratic time complexity. We observe that local anomalies, though indistinguishable under global analysis, become conspicuous when the data space is decomposed into restricted regions and each region is examined independently. Leveraging this geometric insight, we propose SVEAD (Stochastic Voronoi Ensembles Anomaly Detector), which constructs ensemble random Voronoi diagrams and scores points by normalized cell-relative distances weighted by local scale. The proposed method achieves linear time complexity and constant space complexity. Experiments on 45 datasets demonstrate that SVEAD outperforms 12 state-of-the-art approaches.

[505] Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

Main category: cs.LG

TL;DR: SPIN is a two-stage RL framework that first learns valid action structures, then trains lightweight policy heads, improving performance and convergence speed in discrete combinatorial action spaces.

DetailsMotivation: Existing RL approaches in discrete combinatorial action spaces either assume independence across sub-actions (leading to incoherent/invalid actions) or jointly learn action structure and control (slow and unstable). There's a need for a method that efficiently handles the exponential search space while ensuring coherent action combinations.

Method: Structured Policy Initialization (SPIN) uses a two-stage framework: 1) Pre-train an Action Structure Model (ASM) to capture the manifold of valid actions, 2) Freeze this representation and train lightweight policy heads for control, separating structure learning from policy optimization.
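
The two-stage recipe reduces to a short skeleton; the autoencoder stand-in for the ASM and all layer sizes below are assumptions, not the paper's architecture:

```python
import torch.nn as nn

class ASM(nn.Module):
    """Stage 1: learn the manifold of valid joint actions
    (sketched here as an autoencoder over action vectors)."""
    def __init__(self, action_dim, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, a):
        return self.dec(self.enc(a))

asm = ASM(action_dim=24)
# ... pre-train asm with a reconstruction loss on valid actions, then:
for p in asm.parameters():
    p.requires_grad_(False)            # Stage 2: freeze the structure model
policy_head = nn.Linear(128, 64)       # lightweight head: state -> ASM latent
```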

Result: On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over state-of-the-art methods while reducing time to convergence by up to 12.8×.

Conclusion: SPIN demonstrates that decoupling action structure learning from policy optimization leads to more efficient and effective reinforcement learning in discrete combinatorial action spaces, addressing the fundamental challenge of searching over exponentially many joint actions.

Abstract: Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.

[506] An interpretable data-driven approach to optimizing clinical fall risk assessment

Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi

Main category: cs.LG

TL;DR: Data-driven optimization of Johns Hopkins Fall Risk Assessment Tool (JHFRAT) using constrained score optimization improves fall prediction while maintaining interpretability and clinical workflow.

DetailsMotivation: To better align fall risk prediction with clinically meaningful measures while maintaining the tool's interpretability and existing clinical workflow, addressing the need for improved inpatient fall prevention protocols.

Method: Retrospective cohort analysis of 54,209 inpatient admissions, using constrained score optimization (CSO) models to reweight JHFRAT scoring weights while preserving its additive structure and clinical thresholds. Compared with benchmark black-box XGBoost model.
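
A generic constrained-score-optimization sketch under box constraints (the paper's actual constraints, thresholds, and solver are not reproduced):

```python
import numpy as np
from scipy.optimize import minimize

def fit_constrained_weights(X, y, lo, hi):
    """Reweight an additive risk score under clinical box constraints.
    X: (n, k) per-item scores, y: {0,1} high-risk labels,
    lo/hi: per-item weight bounds (e.g., weights kept non-negative so
    the additive structure and clinical reading are preserved)."""
    def nll(w):
        z = X @ w
        return np.mean(np.logaddexp(0.0, z) - y * z)  # logistic loss
    res = minimize(nll, x0=np.ones(X.shape[1]), bounds=list(zip(lo, hi)))
    return res.x
```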

Result: CSO model significantly improved predictive performance (AUC-ROC=0.91 vs JHFRAT’s 0.86), translating to protecting 35 additional high-risk patients per week. CSO performed similarly with and without EHR variables and showed more robustness to risk labeling variations than XGBoost (AUC-ROC=0.94).

Conclusion: Evidence-based approach provides robust foundation for health systems to enhance inpatient fall prevention using data-driven optimization while maintaining interpretability and clinical workflow, improving risk assessment and resource allocation.

Abstract: In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study’s risk labels, without changing the tool’s form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.

[507] SourceNet: Interpretable Sim-to-Real Inference on Variable-Geometry Sensor Arrays for Earthquake Source Inversion

Zhe Jia, Xiaotian Zhang, Junpeng Li

Main category: cs.LG

TL;DR: SourceNet: Transformer-based framework using Physics-Structured Domain Randomization to infer high-dimensional physical states from sparse sensor arrays, achieving SOTA precision with exceptional data efficiency and discovering optimal experimental design principles.

DetailsMotivation: Standard architectures (CNNs, DeepSets) struggle with irregular geometries and relational physics in domains like seismology when inferring high-dimensional physical states from sparse, ad-hoc sensor arrays.

Method: Transformer-based framework with Physics-Structured Domain Randomization (PSDR) that randomizes governing physical dynamics to enforce invariance to unmodeled environmental heterogeneity. Pre-trained on 100,000 synthetic events and fine-tuned on ~2,500 real-world events.

Result: Achieves state-of-the-art precision on held-out real data, demonstrates exceptional data efficiency and real-time capability compared to classical solvers. Model autonomously discovers geometric information bottlenecks and learns attention policy that prioritizes sparse sensor placements.

Conclusion: SourceNet bridges the Sim-to-Real gap effectively, shows scientific-agent-like features by recovering principles of optimal experimental design from data alone, and provides interpretable insights into sensor placement strategies.

Abstract: Inferring high-dimensional physical states from sparse, ad-hoc sensor arrays is a fundamental challenge across AI for Science, yet standard architectures like CNNs and DeepSets struggle to capture the irregular geometries and relational physics inherent to domains like seismology. To address this, we propose SourceNet, a Transformer-based framework that bridges the profound Sim-to-Real gap via Physics-Structured Domain Randomization (PSDR), a protocol that randomizes governing physical dynamics to enforce invariance to unmodeled environmental heterogeneity. By pre-training on 100,000 synthetic events and fine-tuning on ~2,500 real-world events, SourceNet achieves state-of-the-art precision on held-out real data, demonstrating exceptional data efficiency and real-time capability compared to classical solvers. Beyond prediction, interpretability analysis reveals that the model shows scientific-agent-like features: it autonomously discovers geometric information bottlenecks and learns an attention policy that prioritizes sparse sensor placements, effectively recovering principles of optimal experimental design from data alone.

[508] A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch

Guixian Xu, Jinglai Li, Junqi Tang

Main category: cs.LG

TL;DR: First convergence proof for plug-and-play proximal gradient descent (PnP-PGD) under prior mismatch, where denoiser is trained on different data distribution than inference task.

DetailsMotivation: Existing PnP algorithms lack convergence guarantees when there's prior mismatch between training and inference data distributions. Previous theoretical results require restrictive and unverifiable assumptions.

Method: Develop new convergence theory for PnP-PGD under prior mismatch, removing restrictive assumptions. Also extend theory to equivariant PnP (EPnP) in same setting.
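
For context, the PnP-PGD iteration under study has the standard form $x^{k+1} = D_\sigma\left(x^k - \gamma \nabla f(x^k)\right)$, where $f$ is the data-fidelity term, $\gamma$ the step size, and the learned denoiser $D_\sigma$ stands in for the proximal operator of the prior; under prior mismatch, $D_\sigma$ is trained on a distribution different from the one governing the inference task.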

Result: First convergence proof for PnP-PGD under prior mismatch. EPnP shown to reduce error variance and explicitly tighten convergence bounds compared to standard PnP.

Conclusion: Theoretical framework provides rigorous convergence guarantees for PnP methods in practical settings with distribution mismatch, with EPnP offering improved performance through error variance reduction.

Abstract: In this work, we provide a new convergence theory for plug-and-play proximal gradient descent (PnP-PGD) under prior mismatch, where the denoiser is trained on a data distribution different from that of the inference task at hand. To the best of our knowledge, this is the first convergence proof of PnP-PGD under prior mismatch. Compared with the existing theoretical results for PnP algorithms, our new results remove the need for several restrictive and unverifiable assumptions. Moreover, we derive the convergence theory for equivariant PnP (EPnP) under the prior mismatch setting, proving that EPnP reduces error variance and explicitly tightens the convergence bound.

[509] Constant Metric Scaling in Riemannian Computation

Kisung You

Main category: cs.LG

TL;DR: This note clarifies how constant rescaling of Riemannian metrics affects computational quantities while preserving core geometric structures, distinguishing between what changes (norms, distances, volumes) and what remains invariant (connection, geodesics, parallel transport).

DetailsMotivation: To address confusion in computational settings where constant metric scaling is often introduced but its effects are not clearly distinguished from changes in curvature, manifold structure, or coordinate representation.

Method: Provides a self-contained mathematical analysis of constant metric scaling on arbitrary Riemannian manifolds, systematically examining which quantities transform and which remain invariant under scaling.
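
Concretely, the transformation rules in question are standard: for an $n$-dimensional manifold and constant $c>0$, the scaling $\tilde g = c\,g$ gives $\|v\|_{\tilde g} = \sqrt{c}\,\|v\|_g$, $d_{\tilde g}(p,q) = \sqrt{c}\,d_g(p,q)$, $dV_{\tilde g} = c^{n/2}\,dV_g$, and $\operatorname{grad}_{\tilde g} f = \frac{1}{c}\operatorname{grad}_g f$, while the Levi-Civita connection, geodesics, exponential and logarithmic maps, and parallel transport are unchanged.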

Result: Identifies that norms, distances, volume elements, and gradient magnitudes change under scaling, while the Levi-Civita connection, geodesics, exponential/logarithmic maps, and parallel transport remain invariant.

Conclusion: Constant metric scaling can be safely introduced in Riemannian computation as a global step size rescaling without altering the underlying geometric structures, clarifying its proper interpretation in optimization methods.

Abstract: Constant rescaling of a Riemannian metric appears in many computational settings, often through a global scale parameter that is introduced either explicitly or implicitly. Although this operation is elementary, its consequences are not always made clear in practice and may be confused with changes in curvature, manifold structure, or coordinate representation. In this note we provide a short, self-contained account of constant metric scaling on arbitrary Riemannian manifolds. We distinguish between quantities that change under such a scaling, including norms, distances, volume elements, and gradient magnitudes, and geometric objects that remain invariant, such as the Levi–Civita connection, geodesics, exponential and logarithmic maps, and parallel transport. We also discuss implications for Riemannian optimization, where constant metric scaling can often be interpreted as a global rescaling of step sizes rather than a modification of the underlying geometry. The goal of this note is purely expository and is intended to clarify how a global metric scale parameter can be introduced in Riemannian computation without altering the geometric structures on which these methods rely.

[510] Trend-Adjusted Time Series Models with an Application to Gold Price Forecasting

Sina Kazemdehbashi

Main category: cs.LG

TL;DR: TATS model reframes time series forecasting as two-part task (trend prediction + quantitative forecasting) and adjusts forecasts based on predicted trends, outperforming standard LSTM/Bi-LSTM on volatile financial data.

DetailsMotivation: Existing time series forecasting approaches often treat prediction as a single task, but separating trend prediction from quantitative forecasting could improve accuracy, especially for volatile data like financial time series.

Method: Proposes Trend-Adjusted Time Series (TATS) model that: 1) Uses binary classifier to predict directional trend (up/down), 2) Uses LSTM/Bi-LSTM for quantitative value forecasting, 3) Adjusts forecasted values based on predicted trend from classifier.
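
The adjustment rule itself is the paper's contribution; one plausible version, stated as an assumption rather than the published mechanism, reflects the forecast around the last observation whenever its direction disagrees with the classifier:

```python
def tats_forecast(history, value_model, trend_clf, features):
    """Hypothetical trend adjustment: keep the forecast's magnitude of
    change but force its direction to match the predicted trend."""
    last = history[-1]
    y_hat = value_model.predict(features)   # quantitative forecast (LSTM)
    up = bool(trend_clf.predict(features))  # binary trend: True = up
    if (y_hat >= last) != up:
        y_hat = 2 * last - y_hat            # flip direction, keep magnitude
    return y_hat
```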

Result: TATS consistently outperforms standard LSTM and Bi-LSTM models on daily gold price forecasting with significantly lower forecasting error. Also shows that traditional metrics (MSE, MAE) are insufficient, so trend detection accuracy should be included.

Conclusion: Reframing time series forecasting as two-part task (trend + quantitative) with adjustment mechanism improves performance. Trend detection accuracy should be considered alongside traditional error metrics for comprehensive evaluation.

Abstract: Time series data play a critical role in various fields, including finance, healthcare, marketing, and engineering. A wide range of techniques (from classical statistical models to neural network-based approaches such as Long Short-Term Memory (LSTM)) have been employed to address time series forecasting challenges. In this paper, we reframe time series forecasting as a two-part task: (1) predicting the trend (directional movement) of the time series at the next time step, and (2) forecasting the quantitative value at the next time step. The trend can be predicted using a binary classifier, while quantitative values can be forecasted using models such as LSTM and Bidirectional Long Short-Term Memory (Bi-LSTM). Building on this reframing, we propose the Trend-Adjusted Time Series (TATS) model, which adjusts the forecasted values based on the predicted trend provided by the binary classifier. We validate the proposed approach through both theoretical analysis and empirical evaluation. The TATS model is applied to a volatile financial time series (the daily gold price) with the objective of forecasting the next day's price. Experimental results demonstrate that TATS consistently outperforms standard LSTM and Bi-LSTM models by achieving significantly lower forecasting error. In addition, our results indicate that commonly used metrics such as MSE and MAE are insufficient for fully assessing time series model performance. Therefore, we also incorporate trend detection accuracy, which measures how effectively a model captures trends in a time series.

[511] Accelerated Sinkhorn Algorithms for Partial Optimal Transport

Nghia Thu Truong, Qui Phu Pham, Quang Nguyen, Dung Luong, Mai Tran

Main category: cs.LG

TL;DR: ASPOT introduces accelerated Sinkhorn algorithm for Partial Optimal Transport with improved O(n^{7/3}ε^{-5/3}) complexity, outperforming standard Sinkhorn methods.

DetailsMotivation: Standard Sinkhorn methods for Partial Optimal Transport have suboptimal complexity bounds that limit scalability, especially when dealing with distributions of unequal size or containing outliers.

Method: ASPOT integrates alternating minimization with Nesterov-style acceleration in the POT setting, plus an informed choice of the entropic parameter γ to improve rates for classical Sinkhorn.
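
ASPOT's accelerated scheme is not reproduced here, but the baseline it speeds up can be sketched via the standard dummy-point reduction of POT to balanced OT followed by classical Sinkhorn (the corner-cost handling is a common convention, stated here as an assumption):

```python
import numpy as np

def partial_ot_sinkhorn(C, a, b, s, eps=0.05, n_iter=500):
    """Entropic partial OT transporting mass s <= min(a.sum(), b.sum()).
    A zero-cost dummy row/column absorbs the untransported mass."""
    n, m = C.shape
    Ce = np.zeros((n + 1, m + 1))
    Ce[:n, :m] = C
    Ce[n, m] = 2 * C.max() + 1.0     # discourage dummy-to-dummy transport
    ae = np.append(a, b.sum() - s)   # dummy source absorbs target surplus
    be = np.append(b, a.sum() - s)   # dummy target absorbs source surplus
    K = np.exp(-Ce / eps)
    u = np.ones(n + 1)
    for _ in range(n_iter):          # classical Sinkhorn iterations
        v = be / (K.T @ u)
        u = ae / (K @ v)
    P = u[:, None] * K * v[None, :]
    return P[:n, :m]                 # plan over real points, total mass ~ s
```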

Result: Achieves improved complexity of O(n^{7/3}ε^{-5/3}) and shows that informed γ selection improves rates for classical Sinkhorn. Experiments validate theories and demonstrate favorable performance.

Conclusion: ASPOT provides a more scalable and efficient solution for Partial Optimal Transport problems with theoretical guarantees and practical benefits.

Abstract: Partial Optimal Transport (POT) addresses the problem of transporting only a fraction of the total mass between two distributions, making it suitable when marginals have unequal size or contain outliers. While Sinkhorn-based methods are widely used, their complexity bounds for POT remain suboptimal and can limit scalability. We introduce Accelerated Sinkhorn for POT (ASPOT), which integrates alternating minimization with Nesterov-style acceleration in the POT setting, yielding a complexity of $\mathcal{O}(n^{7/3}\varepsilon^{-5/3})$. We also show that an informed choice of the entropic parameter $γ$ improves rates for the classical Sinkhorn method. Experiments on real-world applications validate our theories and demonstrate the favorable performance of our proposed methods.

[512] AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, Hong Xu

Main category: cs.LG

TL;DR: AGZO is a zeroth-order optimization method that uses activation structure to guide perturbations, outperforming isotropic ZO methods and narrowing the gap with first-order fine-tuning while maintaining low memory usage.

DetailsMotivation: Existing ZO methods use isotropic perturbations that ignore structural information from forward passes, wasting computational resources. There's a need for ZO optimization that leverages activation structure to improve efficiency and performance while maintaining memory efficiency for LLM fine-tuning.

Method: AGZO extracts a compact, activation-informed subspace during forward passes and restricts perturbations to this low-rank subspace. It leverages the insight that gradients of linear layers are confined to the subspace spanned by input activations, creating a subspace-smoothed objective.
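
A two-point zeroth-order estimator restricted to the activation span captures the core mechanism (subspace extraction via QR is an assumption; the paper extracts its subspace on the fly during the forward pass):

```python
import torch

def subspace_zo_grad(loss_fn, w, acts, mu=1e-3):
    """ZO gradient estimate for a linear layer's weight w: (d_out, d_in),
    with the perturbation confined to span(input activations).
    acts: (batch, d_in) inputs to the layer; loss_fn maps a candidate
    weight to a scalar loss (two extra forward passes, no backprop)."""
    q, _ = torch.linalg.qr(acts.T)        # orthonormal basis of the span
    z = torch.randn(w.shape[0], q.shape[1], device=w.device)
    u = z @ q.T                           # low-rank perturbation direction
    u = u / u.norm()
    with torch.no_grad():
        lp = loss_fn(w + mu * u)
        lm = loss_fn(w - mu * u)
    return (lp - lm) / (2 * mu) * u       # two-point ZO estimator
```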

Result: AGZO consistently outperforms state-of-the-art ZO baselines on Qwen3 and Pangu models across various benchmarks. It significantly narrows the performance gap with first-order fine-tuning while maintaining almost the same peak memory footprint as other ZO methods.

Conclusion: AGZO demonstrates that leveraging activation structure in ZO optimization leads to more efficient and effective LLM fine-tuning, providing a practical solution for memory-constrained environments while approaching first-order performance.

Abstract: Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.

[513] A Thermodynamic Theory of Learning I: Irreversible Ensemble Transport and Epistemic Costs

Daisuke Okanohara

Main category: cs.LG

TL;DR: Learning as an irreversible process where acquiring structured knowledge requires entropy production, with a derived Epistemic Speed Limit bounding minimal entropy needed for distributional transformations.

DetailsMotivation: To resolve the paradox that deterministic learning systems seem to create structured representations without increasing information, which contradicts classical information theory stating deterministic transformations don't increase information.

Method: Model learning as a transport process in probability distribution space using an epistemic free-energy framework, defining free-energy reduction as bookkeeping quantity, and deriving the Epistemic Speed Limit inequality.

Result: Derived the Epistemic Speed Limit (ESL) - a finite-time inequality that lower-bounds the minimal entropy production required for any learning process to achieve a given distributional transformation, depending only on Wasserstein distance between initial and final distributions.
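
The summary does not state the constant, but finite-time bounds of this family in stochastic thermodynamics typically read $\Sigma_\tau \ge W_2(p_0, p_\tau)^2 / (C\,\tau)$, with $\Sigma_\tau$ the entropy production over $[0,\tau]$, $W_2$ the 2-Wasserstein distance between initial and final ensemble distributions, and $C$ a mobility-like constant; the ESL should be read as a bound of this shape rather than as this exact statement.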

Conclusion: Learning is inherently irreversible over finite time, and acquiring epistemic structure necessarily incurs entropy production, with fundamental limits on how quickly this can occur.

Abstract: Learning systems acquire structured internal representations from data, yet classical information-theoretic results state that deterministic transformations do not increase information. This raises a fundamental question: how can learning produce abstraction and insight without violating information-theoretic limits? We argue that learning is inherently an irreversible process when performed over finite time, and that the realization of epistemic structure necessarily incurs entropy production. To formalize this perspective, we model learning as a transport process in the space of probability distributions over model configurations and introduce an epistemic free-energy framework. Within this framework, we define the free-energy reduction as a bookkeeping quantity that records the total reduction of epistemic free energy along a learning trajectory. This formulation highlights that realizing such a reduction over finite time necessarily incurs irreversible entropy production. We then derive the Epistemic Speed Limit (ESL), a finite-time inequality that lower-bounds the minimal entropy production required by any learning process to realize a given distributional transformation. This bound depends only on the Wasserstein distance between initial and final ensemble distributions and is independent of the specific learning algorithm.

[514] Variational Quantum Circuit-Based Reinforcement Learning for Dynamic Portfolio Optimization

Vincent Gurgul, Ying Chen, Stefan Lessmann

Main category: cs.LG

TL;DR: Quantum Reinforcement Learning (QRL) using Variational Quantum Circuits achieves comparable risk-adjusted performance to classical Deep RL for dynamic portfolio optimization, but practical deployment faces latency challenges.

DetailsMotivation: To explore quantum reinforcement learning for dynamic portfolio optimization in financial markets, leveraging quantum computing's potential advantages for complex, high-dimensional, non-stationary environments.

Method: Implemented quantum analogues of classical Deep RL algorithms (Deep Deterministic Policy Gradient and Deep Q-Network) using Variational Quantum Circuits, and evaluated on real-world financial data.

Result: Quantum agents achieved risk-adjusted performance comparable to or exceeding classical Deep RL models with several orders of magnitude more parameters, but practical deployment suffers from substantial latency due to cloud-based quantum system overhead.

Conclusion: QRL is theoretically competitive with state-of-the-art classical reinforcement learning and may become practically advantageous as deployment overheads diminish, positioning it as a promising paradigm for complex decision-making in financial markets.

Abstract: This paper presents a Quantum Reinforcement Learning (QRL) solution to the dynamic portfolio optimization problem based on Variational Quantum Circuits. The implemented QRL approaches are quantum analogues of the classical neural-network-based Deep Deterministic Policy Gradient and Deep Q-Network algorithms. Through an empirical evaluation on real-world financial data, we show that our quantum agents achieve risk-adjusted performance comparable to, and in some cases exceeding, that of classical Deep RL models with several orders of magnitude more parameters. However, while quantum circuit execution is inherently fast at the hardware level, practical deployment on cloud-based quantum systems introduces substantial latency, making end-to-end runtime currently dominated by infrastructural overhead and limiting practical applicability. Taken together, our results suggest that QRL is theoretically competitive with state-of-the-art classical reinforcement learning and may become practically advantageous as deployment overheads diminish. This positions QRL as a promising paradigm for dynamic decision-making in complex, high-dimensional, and non-stationary environments such as financial markets. The complete codebase is released as open source at: https://github.com/VincentGurgul/qrl-dpo-public

cs.MA

[515] Game-Theoretic Autonomous Driving: A Graphs of Convex Sets Approach

Nikolaj Käfer, Ahmed Khalil, Edward Huynh, Efstathios Bakolas, David Fridovich-Keil

Main category: cs.MA

TL;DR: IBR-GCS: Iterative Best Response planning using Graphs of Convex Sets for multi-vehicle autonomous driving, combining game theory and trajectory planning in a unified framework.

DetailsMotivation: Multi-vehicle autonomous driving requires handling strategic interactions between vehicles while planning hybrid (discrete-continuous) maneuvers under shared safety constraints. Existing approaches need to better integrate combinatorial reasoning, trajectory planning, and game-theoretic interactions.

Method: IBR-GCS uses Iterative Best Response with Graphs of Convex Sets. Each vehicle builds its own strategy-dependent GCS graph conditioned on other vehicles’ current strategies. Vertices represent lane-specific, time-varying, convex, collision-free regions; edges encode dynamically feasible transitions. This yields shortest-path problems in GCS for each best-response step, solved via efficient convex relaxation without exhaustive discrete search.

Result: The approach produces safe trajectories and strategically consistent interactive behaviors in multi-lane, multi-vehicle scenarios. The inexact best-response updates converge to approximate generalized Nash equilibrium under certain conditions.

Conclusion: IBR-GCS successfully integrates combinatorial maneuver reasoning, trajectory planning, and game-theoretic interaction in a unified framework for multi-vehicle autonomous driving, enabling efficient strategic planning without exhaustive discrete search.

Abstract: Multi-vehicle autonomous driving couples strategic interaction with hybrid (discrete-continuous) maneuver planning under shared safety constraints. We introduce IBR-GCS, an Iterative Best Response (IBR) planning approach based on the Graphs of Convex Sets (GCS) framework that models highway driving as a generalized noncooperative game. IBR-GCS integrates combinatorial maneuver reasoning, trajectory planning, and game-theoretic interaction within a unified framework. The key novelty is a vehicle-specific, strategy-dependent GCS construction. Specifically, at each best-response update, each vehicle builds its own graph conditioned on the current strategies of the other vehicles, with vertices representing lane-specific, time-varying, convex, collision-free regions and edges encoding dynamically feasible transitions. This yields a shortest-path problem in GCS for each best-response step, which admits an efficient convex relaxation that can be solved using convex optimization tools without exhaustive discrete tree search. We then apply an iterative best-response scheme in which vehicles update their trajectories sequentially and provide conditions under which the resulting inexact updates converge to an approximate generalized Nash equilibrium. Simulation results across multi-lane, multi-vehicle scenarios demonstrate that IBR-GCS produces safe trajectories and strategically consistent interactive behaviors.

[516] Interpreting Emergent Extreme Events in Multi-Agent Systems

Ling Tang, Jilin Mei, Dongrui Liu, Chen Qian, Dawei Cheng, Jing Shao, Xia Hu

Main category: cs.MA

TL;DR: A framework for explaining emergent extreme events in LLM-powered multi-agent systems using Shapley value attribution to identify when events originate, who drives them, and what behaviors contribute to them.

DetailsMotivation: Multi-agent systems often experience extreme events whose origins are obscured by emergence, making interpretation critical for system safety. There's a need to understand when events originate, who drives them, and what behaviors contribute to them.

Method: Adapts Shapley value to attribute extreme event occurrence to each agent action at different time steps, aggregates attribution scores across time, agent, and behavior dimensions, and designs metrics based on contribution scores to characterize event features.
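
A permutation-sampling sketch of the Shapley attribution step, assuming a replay oracle `event_prob(subset)` that re-simulates the system with only the given agent actions active (this oracle is the system-specific, expensive part):

```python
import random

def mc_shapley(actions, event_prob, n_perm=200):
    """Monte Carlo Shapley values: each action's average marginal effect
    on the extreme event's probability across random action orderings.
    actions: list of hashable (agent, time, action) identifiers."""
    phi = {a: 0.0 for a in actions}
    for _ in range(n_perm):
        order = random.sample(actions, len(actions))  # random permutation
        included, prev = [], event_prob(frozenset())
        for a in order:
            included.append(a)
            cur = event_prob(frozenset(included))
            phi[a] += cur - prev        # marginal contribution of action a
            prev = cur
    return {a: v / n_perm for a, v in phi.items()}
```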

Result: Experiments across diverse multi-agent scenarios (economic, financial, social) demonstrate framework effectiveness and provide general insights into extreme phenomena emergence.

Conclusion: The proposed framework successfully explains emergent extreme events in multi-agent systems by quantifying risk contributions across temporal, agent, and behavioral dimensions, enabling better understanding and safety analysis.

Abstract: Large language model-powered multi-agent systems have emerged as powerful tools for simulating complex human-like systems. The interactions within these systems often lead to extreme events whose origins remain obscured by the black box of emergence. Interpreting these events is critical for system safety. This paper proposes the first framework for explaining emergent extreme events in multi-agent systems, aiming to answer three fundamental questions: When does the event originate? Who drives it? And what behaviors contribute to it? Specifically, we adapt the Shapley value to faithfully attribute the occurrence of extreme events to each action taken by agents at different time steps, i.e., assigning an attribution score to the action to measure its influence on the event. We then aggregate the attribution scores along the dimensions of time, agent, and behavior to quantify the risk contribution of each dimension. Finally, we design a set of metrics based on these contribution scores to characterize the features of extreme events. Experiments across diverse multi-agent system scenarios (economic, financial, and social) demonstrate the effectiveness of our framework and provide general insights into the emergence of extreme phenomena.

[517] Modelleme ve Simulasyon (Modeling and Simulation)

Serdar Abut

Main category: cs.MA

TL;DR: This paper surveys modeling and simulation approaches developed since the 1970s, covering applications in social sciences, risk management, cloud-based systems, and agent-based modeling.

DetailsMotivation: To provide a comprehensive overview of modeling and simulation approaches that have emerged since the 1970s, highlighting their applications across different domains including social sciences, risk management, and information systems.

Method: The paper presents a survey and summary of existing modeling and simulation approaches, including descriptive and predictive modes, with specific focus on simulation models in social sciences, risk management, cloud-based information systems, and agent-based modeling.

Result: A systematic presentation of modeling and simulation approaches developed since the 1970s, with summaries of applications in various domains and information about agent-based modeling techniques.

Conclusion: The paper serves as a comprehensive reference for understanding the evolution and current state of modeling and simulation approaches across multiple disciplines, providing valuable insights into their applications and methodologies.

Abstract: Computer modeling and simulation is used to analyze system behavior and evaluate strategies for operating in descriptive or predictive modes. This part of the book presents modeling and simulation approaches that have been proposed since the 1970s. Simulation models used in the social sciences, risk management, and cloud-based information systems are summarized, and the agent-based modeling and simulation approach is introduced.

[518] LLM Multi-Agent Systems: Challenges and Open Problems

Shanshan Han, Qifan Zhang, Weizhao Jin, Zhaozhuo Xu

Main category: cs.MA

TL;DR: This paper examines challenges in multi-agent systems, focusing on task allocation, reasoning, context management, and memory, with applications in blockchain and distributed systems.

DetailsMotivation: To identify and address inadequately solved challenges in multi-agent systems that leverage agent collaboration for complex tasks, and explore their potential in real-world distributed applications like blockchain.

Method: The paper discusses optimization approaches for task allocation, robust reasoning through iterative debates, management of complex layered context information, and enhanced memory management for agent interactions.

Result: Identifies key challenges in multi-agent systems and proposes solutions for improving collaboration, with exploration of blockchain applications showing potential for future distributed system development.

Conclusion: Multi-agent systems have significant potential for solving complex tasks through collaboration, but require better solutions for task allocation, reasoning, context management, and memory - particularly promising for blockchain and distributed systems applications.

Abstract: This paper explores multi-agent systems and identifies challenges that remain inadequately addressed. By leveraging the diverse capabilities and roles of individual agents, multi-agent systems can tackle complex tasks through agent collaboration. We discuss optimizing task allocation, fostering robust reasoning through iterative debates, managing complex and layered context information, and enhancing memory management to support the intricate interactions within multi-agent systems. We also explore potential applications of multi-agent systems in blockchain systems to shed light on their future development and application in real-world distributed systems.

[519] Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow

Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan

Main category: cs.MA

TL;DR: ViF addresses multi-agent visual hallucination snowballing in VLM-powered MAS by identifying diminishing visual attention patterns and introducing visual flow relay with attention reallocation.

DetailsMotivation: Multi-agent systems using visual language models suffer from hallucination snowballing where visual hallucinations from one agent get amplified by others due to over-reliance on textual communication, causing loss of visual evidence.

Method: Analyzes attention patterns to identify vision tokens with unimodal attention peaks, then proposes ViF - a plug-and-play paradigm that relays inter-agent messages with visual flow using selected visual relay tokens and applies attention reallocation.

Result: ViF significantly reduces hallucination snowballing and consistently improves performance across eight benchmarks using four MAS structures and ten base models.

Conclusion: The proposed ViF method effectively mitigates visual hallucination snowballing in multi-agent VLM systems through visual flow relay and attention reallocation, offering a lightweight solution for more reliable multi-agent visual reasoning.

Abstract: Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure mode, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by subsequent agents due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU-deep/ViF.git.

[520] VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing

Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Zhenxin Zhao, Yaqi Wang

Main category: cs.MA

TL;DR: VLM-CAD: A vision-language model collaborative agent workflow for analog circuit sizing that integrates schematic analysis with explainable optimization, balancing performance and power while maintaining physics-based explainability.

DetailsMotivation: Existing analog circuit sizing approaches ignore circuit schematics, breaking the cognitive link between design and performance. Current ML methods lack explainability needed for industrial sign-off, and LLMs risk hallucinations without ground-truth verification.

Method: Proposes VLM-CAD workflow with Image2Net for schematic annotation and structured JSON generation. Uses collaborative agents for circuit analysis, DC optimization, inference-based sizing, and external optimization. Introduces ExTuRBO for explainable Bayesian optimization with warm-start seeds and dual-granularity sensitivity analysis.

Result: Successfully sized amplifiers using 180nm, 90nm, and 45nm Predictive Technology Models. VLM-CAD met all specifications while maintaining low power consumption, with total runtime under 66 minutes across all experiments on two amplifiers.

Conclusion: VLM-CAD effectively balances power and performance while maintaining physics-based explainability, addressing the limitations of existing methods by integrating schematic analysis with explainable optimization for industrial-grade analog circuit design.

Abstract: Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.

cs.MM

[521] SFQA: A Comprehensive Perceptual Quality Assessment Dataset for Singing Face Generation

Zhilin Gao, Yunhao Li, Sijing Wu, Yucheng Zhu, Huiyu Duan, Guangtao Zhai

Main category: cs.MM

TL;DR: The paper introduces SFQA, a new dataset for Singing Face Generation quality assessment, created using 12 generation methods with 5,184 videos, and benchmarks existing quality assessment algorithms.

DetailsMotivation: Singing face generation is important but often underestimated due to lack of datasets and quality assessment methods. Current SFG quality varies significantly across methods and scenarios, requiring effective quality assessment to ensure user experience.

Method: Created SFQA dataset using 12 representative generation methods, 100 photographs/portraits, and 36 music clips from 7 styles to generate 5,184 singing face videos. Conducted subjective quality assessment with evaluators and benchmarked existing objective quality assessment algorithms.

Result: Subjective ratings revealed significant quality variation among different generation methods. The SFQA dataset enables comprehensive benchmarking of current objective quality assessment algorithms for singing face generation.

Conclusion: The SFQA dataset addresses critical gaps in singing face generation research by providing a standardized benchmark for quality assessment, which is essential for improving SFG methods and ensuring better user experience across applications.

Abstract: The Talking Face Generation task has enormous potential for various applications such as digital humans and agents. Singing, as a common facial movement second only to talking, can be regarded as a universal language across ethnicities and cultures. However, it is often underestimated in the field due to a lack of singing face datasets and the domain gap between singing and talking in rhythm and amplitude. More significantly, the quality of Singing Face Generation (SFG) often falls short and is uneven across methods and applicable scenarios, which prompts us to propose timely and effective quality assessment methods to ensure user experience. To address existing gaps in this domain, this paper introduces a new SFG content quality assessment dataset SFQA, built using 12 representative generation methods. During the construction of the dataset, 100 photographs or portraits, as well as 36 music clips from 7 different styles, are utilized to generate 5,184 singing face videos that constitute the SFQA dataset. To further explore the quality of SFG methods, subjective quality assessment is conducted by evaluators, whose ratings reveal a significant variation in quality among different generation methods. Based on our proposed SFQA dataset, we comprehensively benchmark the current objective quality assessment algorithms.

[522] Block Erasure-Aware Semantic Multimedia Compression via JSCC Autoencoder

Homa Esfahanizadeh, Nargis Fayaz, Jinfeng Du, Harish Viswanathan

Main category: cs.MM

TL;DR: AI-based semantic transmission framework for multimedia over lossy channels using joint source-channel coding to achieve graceful quality degradation without retransmissions.

DetailsMotivation: Address the challenge of transmitting multimedia data over band-limited, time-varying channels where packets may be dropped, especially for latency-sensitive applications like video conferencing and robotic control where retransmissions cause unacceptable delays.

Method: Uses joint source-channel coding (JSCC) framework that splits large content into multiple packets and enables semantic reconstruction even with packet loss. Features tunable design parameter for balancing robustness vs. fidelity, compatibility with existing protocols, and support for intelligent congestion control and unequal error protection.
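
A toy sketch of the core training idea: a JSCC autoencoder whose latent packets are randomly zeroed during training, so the decoder learns to reconstruct from whatever packets survive. The layer sizes and drop probability are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ErasureAwareJSCC(nn.Module):
    """JSCC autoencoder trained under a simulated block-erasure channel,
    yielding graceful degradation as more packets are dropped."""

    def __init__(self, dim=3072, num_packets=8, packet_dim=64):
        super().__init__()
        self.num_packets, self.packet_dim = num_packets, packet_dim
        self.encoder = nn.Linear(dim, num_packets * packet_dim)
        self.decoder = nn.Sequential(
            nn.Linear(num_packets * packet_dim, 1024), nn.ReLU(),
            nn.Linear(1024, dim))

    def forward(self, x, drop_prob=0.3):
        z = self.encoder(x).view(-1, self.num_packets, self.packet_dim)
        if self.training:                        # simulate the erasure channel
            keep = (torch.rand(z.size(0), self.num_packets, 1,
                               device=z.device) > drop_prob).float()
            z = z * keep                         # dropped packets arrive as zeros
        return self.decoder(z.flatten(1))
```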

Result: Demonstrates significant robustness improvement over state-of-the-art baselines in both image and video domains, achieving reliable semantic reconstruction with graceful quality degradation as channel conditions worsen.

Conclusion: The proposed AI-based semantic transmission framework effectively handles packet loss in time-varying channels without retransmissions, making it suitable for latency-sensitive applications while maintaining compatibility with existing network infrastructure.

Abstract: We present an AI-based framework for semantic transmission of multimedia data over band-limited, time-varying channels. The method targets scenarios where large content is split into multiple packets, with an unknown number potentially dropped due to channel impairments. Using joint source-channel coding (JSCC), our approach achieves reliable semantic reconstruction with graceful quality degradation as channel conditions worsen, eliminating the need for retransmissions that cause unacceptable delays in latency-sensitive applications such as video conferencing and robotic control. The framework is compatible with existing network protocols and further enables intelligent congestion control and unequal error protection. A tunable design parameter allows balancing robustness at low channel quality against fidelity at high channel quality. Experiments demonstrate significant robustness improvement over state-of-the-art baselines in both image and video domains.

[523] Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues

Junchen Fu, Wenhao Deng, Kaiwen Zheng, Ioannis Arapakis, Yu Ye, Yongxin Ni, Joemon M. Jose, Xuri Ge

Main category: cs.MM

TL;DR: MLLMs struggle with fine-grained multimodal completion for e-commerce products despite capturing high-level semantics, with performance varying by category and no clear correlation to model size.

DetailsMotivation: Missing product modalities (images/text) on e-commerce platforms impair product presentation and downstream applications like recommendations; MLLMs' generative capabilities offer potential solutions but their effectiveness for this specific task is unexplored.

Method: Proposed MMPCBench with two sub-benchmarks (Content Quality Completion and Recommendation), evaluated 6 SOTA MLLMs from Qwen2.5-VL and Gemma-3 families across 9 e-commerce categories for image-to-text and text-to-image completion tasks, and explored GRPO for task alignment.

Result: MLLMs capture high-level semantics but struggle with fine-grained word/pixel alignment; performance varies substantially across categories; no trivial correlation between model size and performance; GRPO improves image-to-text but not text-to-image completion.

Conclusion: Current MLLMs have limitations in real-world cross-modal generation for e-commerce product completion, exposing challenges in fine-grained alignment and category-specific performance, representing an early step toward more effective missing-modality solutions.

Abstract: Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.

eess.AS

[524] MK-SGC-SC: Multiple Kernel guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization

Nikhil Raghav, Avisek Gupta, Swagatam Das, Md Sahidullah

Main category: eess.AS

TL;DR: The paper proposes a fully unsupervised speaker diarization method using multiple kernel similarities to create sparse graphs for spectral clustering, achieving SOTA results without pretraining or supervision.

DetailsMotivation: Unsupervised speaker diarization is challenging but valuable for identifying speaker regions without pretraining or weak supervision. The research is motivated by the need for effective clustering techniques that don't rely on labeled data.

Method: The method measures multiple kernel similarities (four polynomial kernels and a degree one arccosine kernel) of speaker embeddings to construct sparse graphs in a principled manner. These graphs emphasize local similarities and are used for spectral clustering.
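
A minimal sketch of the pipeline as described: average four polynomial kernels and a degree-one arccosine kernel over speaker embeddings, sparsify the affinity to local neighbors, and feed it to spectral clustering. The simple kNN sparsification here stands in for the paper's principled construction.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def arccosine_kernel(X):
    """Degree-one arc-cosine kernel (Cho & Saul, 2009)."""
    norms = np.linalg.norm(X, axis=1) + 1e-9
    cos = np.clip(X @ X.T / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def mk_sparse_affinity(X, degrees=(1, 2, 3, 4), knn=20):
    """Average polynomial and arc-cosine kernel similarities, then keep
    only each embedding's knn strongest links to emphasize local structure."""
    G = X @ X.T
    K = sum((1.0 + G) ** d for d in degrees) + arccosine_kernel(X)
    K /= len(degrees) + 1
    A = np.zeros_like(K)
    for i, row in enumerate(K):
        nbrs = np.argsort(row)[-knn:]            # local similarities only
        A[i, nbrs] = row[nbrs]
    return np.maximum(A, A.T)                    # symmetrize

# labels = SpectralClustering(n_clusters=num_speakers, affinity="precomputed"
#                             ).fit_predict(mk_sparse_affinity(embeddings))
```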

Result: The approach achieves state-of-the-art performances in fully unsupervised speaker diarization across challenging environments in DIHARD-III, AMI, and VoxConverse corpora.

Conclusion: Measuring multiple kernel similarities to craft sparse graphs for spectral clustering is sufficient for achieving excellent unsupervised speaker diarization performance, demonstrating the effectiveness of principled graph construction without supervision.

Abstract: Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings and then crafting a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performance in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree-one arccosine kernel to measure similarities in speaker embeddings, from which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at https://github.com/nikhilraghav29/MK-SGC-SC.

[525] RIR-Mega-Speech: A Reverberant Speech Corpus with Comprehensive Acoustic Metadata and Reproducible Evaluation

Mandip Goswami

Main category: eess.AS

TL;DR: RIR-Mega-Speech: A 117.5-hour corpus created by convolving LibriSpeech with simulated room impulse responses, providing standardized acoustic annotations (RT60, DRR, C50) and reproducible evaluation scripts.

DetailsMotivation: Addressing the difficulty in comparing reverberant speech methods due to lack of per-file acoustic annotations and limited documentation for reproduction in existing corpora.

Method: Created corpus by convolving LibriSpeech utterances with ~5,000 simulated room impulse responses from RIR-Mega collection, providing RT60, DRR, and C50 annotations with reproducible procedures and rebuild scripts.
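
The construction step and the acoustic annotations follow standard definitions and are easy to sketch: convolve clean speech with an RIR, and compute C50 and DRR directly from the source RIR. The corpus' exact onset handling and window lengths may differ from the choices below.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    """Convolve a clean utterance with a room impulse response."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

def clarity_c50(rir, fs):
    """C50: early (first 50 ms after the direct path) vs. late RIR energy, in dB."""
    onset = np.argmax(np.abs(rir))               # direct-path arrival
    split = onset + int(0.05 * fs)
    early = np.sum(rir[:split] ** 2)
    late = np.sum(rir[split:] ** 2) + 1e-12
    return 10 * np.log10(early / late)

def drr(rir, fs, window_ms=2.5):
    """Direct-to-reverberant ratio around the direct-path peak, in dB."""
    onset = np.argmax(np.abs(rir))
    w = int(window_ms * 1e-3 * fs)
    direct = np.sum(rir[max(onset - w, 0): onset + w] ** 2)
    reverb = np.sum(rir ** 2) - direct + 1e-12
    return 10 * np.log10(direct / reverb)
```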

Result: Using Whisper small on 1,500 paired utterances: 5.20% WER on clean speech vs 7.70% on reverberant versions (48% relative degradation). WER increases with RT60 and decreases with DRR, consistent with prior perceptual studies.

Conclusion: Provides standardized resource with transparent acoustic conditions and independently verifiable results to support reproducible research in reverberant speech processing, with one-command rebuild instructions for Windows and Linux.

Abstract: Despite decades of research on reverberant speech, comparing methods remains difficult because most corpora lack per-file acoustic annotations or provide limited documentation for reproduction. We present RIR-Mega-Speech, a corpus of approximately 117.5 hours created by convolving LibriSpeech utterances with roughly 5,000 simulated room impulse responses from the RIR-Mega collection. Every file includes RT60, direct-to-reverberant ratio (DRR), and clarity index ($C_{50}$) computed from the source RIR using clearly defined, reproducible procedures. We also provide scripts to rebuild the dataset and reproduce all evaluation results. Using Whisper small on 1,500 paired utterances, we measure 5.20% WER (95% CI: 4.69–5.78) on clean speech and 7.70% (7.04–8.35) on reverberant versions, corresponding to a paired increase of 2.50 percentage points (2.06–2.98). This represents a 48% relative degradation. WER increases monotonically with RT60 and decreases with DRR, consistent with prior perceptual studies. While the core finding that reverberation harms recognition is well established, we aim to provide the community with a standardized resource where acoustic conditions are transparent and results can be verified independently. The repository includes one-command rebuild instructions for both Windows and Linux environments.

[526] VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

Yuxiang Wang, Hongyu Liu, Dekun Chen, Xueyao Zhang, Zhizheng Wu

Main category: eess.AS

TL;DR: VoxPrivacy is the first benchmark for evaluating interactional privacy in Speech Language Models (SLMs) across three difficulty tiers, revealing widespread vulnerabilities in current models and providing a solution through fine-tuning.

DetailsMotivation: As SLMs move to multi-user environments like smart homes, they need to distinguish between users to prevent privacy failures (interactional privacy). Current benchmarks overlook speaker identity and contextual privacy-sensitive information.

Method: Introduces VoxPrivacy benchmark with three tiers of increasing difficulty, evaluates nine SLMs on a 32-hour bilingual dataset, validates on Real-VoxPrivacy (human-recorded subset), and demonstrates improvement through fine-tuning on a 4,000-hour training set.

Result: Most open-source models perform near random chance (~50% accuracy) on conditional privacy decisions, while even strong closed-source systems struggle with proactive privacy inference. Failures persist in real speech. Fine-tuning improves privacy-preserving abilities while maintaining robustness.

Conclusion: SLMs have significant interactional privacy vulnerabilities that current benchmarks miss. VoxPrivacy addresses this gap and provides a path forward through fine-tuning. The benchmark, training set, and fine-tuned model are released to foster safer, context-aware SLMs.

Abstract: As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user’s confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextual privacy-sensitive information (e.g., a user’s private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.

[527] Do we really need Self-Attention for Streaming Automatic Speech Recognition?

Youness Dkhissi, Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher

Main category: eess.AS

TL;DR: Transformers may not be suitable for constrained/streaming tasks due to high computational costs; deformable convolution can replace self-attention in streaming ASR with minimal performance loss.

DetailsMotivation: The paper questions the direct application of transformer architectures to constrained tasks like streaming applications, arguing that their high computational requirements and latency issues don't align well with real-time streaming needs.

Method: The study examines transformer suitability for constrained environments and proposes using deformable convolution as an alternative to self-attention mechanisms in streaming Automatic Speech Recognition (ASR).
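
A minimal 1-D deformable convolution sketch in PyTorch: a small convolution predicts a fractional offset per kernel tap and time step, and taps are gathered by linear interpolation. This illustrates the mechanism only; it ignores the causality constraints a streaming ASR front-end would add, and it is not the paper's exact module.

```python
import torch
import torch.nn as nn

class DeformableConv1d(nn.Module):
    """1-D deformable convolution with learned, input-dependent tap offsets."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.offset_net = nn.Conv1d(channels, self.k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, self.k) * 0.02)

    def forward(self, x):                        # x: (B, C, T)
        B, C, T = x.shape
        offs = self.offset_net(x)                # (B, k, T) fractional offsets
        base = torch.arange(T, device=x.device, dtype=x.dtype)
        out = 0
        for j in range(self.k):
            pos = (base + j - self.k // 2 + offs[:, j]).clamp(0, T - 1)
            lo = pos.floor().long()              # integer neighbors for
            hi = (lo + 1).clamp(max=T - 1)       # linear interpolation
            frac = (pos - lo.to(x.dtype)).unsqueeze(1)           # (B, 1, T)
            g = (torch.gather(x, 2, lo.unsqueeze(1).expand(B, C, T)) * (1 - frac)
                 + torch.gather(x, 2, hi.unsqueeze(1).expand(B, C, T)) * frac)
            out = out + torch.einsum('bct,oc->bot', g, self.weight[:, :, j])
        return out                               # (B, C, T)
```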

Result: Shows that computational cost for streaming ASR can be reduced using deformable convolution instead of self-attention, and that self-attention mechanisms can be entirely removed without significant degradation in Word Error Rate.

Conclusion: Promotes searching for alternative strategies to improve efficiency without sacrificing performance in constrained/streaming applications, challenging the automatic use of transformers in all domains.

Abstract: Transformer-based architectures are the most widely used architectures in many deep learning fields, such as Natural Language Processing, Computer Vision, and Speech Processing. This popularity may encourage the direct use of Transformers in constrained tasks, without questioning whether they will yield the same benefits as in standard tasks. Given specific constraints, it is essential to evaluate the relevance of transformer models. This work questions the suitability of transformers for specific domains. We argue that the high computational requirements and latency issues associated with these models do not align well with streaming applications. Our study promotes the search for alternative strategies to improve efficiency without sacrificing performance. In light of this observation, our paper critically examines the usefulness of the transformer architecture in such constrained environments. As a first attempt, we show that the computational cost of Streaming Automatic Speech Recognition (ASR) can be reduced using deformable convolution instead of Self-Attention. Furthermore, we show that Self-Attention mechanisms can be entirely removed and not replaced, without observing significant degradation in the Word Error Rate.

[528] T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS

Haibin Wu, Bach Viet Do, Naveen Suda, Julian Chan, Madhavan C R, Gene-Ping Yang, Yi-Chiao Wu, Naoyuki Kanda, Yossef Adi, Xin Lei, Yue Liu, Florian Metze, Yuzong Liu

Main category: eess.AS

TL;DR: T-Mimi replaces Mimi codec’s convolutional decoder with transformer architecture, reducing TTS latency from 42.1ms to 4.4ms on edge devices, with quantization insights for maintaining audio quality.

DetailsMotivation: Mimi's hybrid transformer-convolution decoder causes significant latency bottlenecks on edge devices due to compute-intensive deconvolution layers that are not mobile-CPU friendly, particularly with frameworks like XNNPACK.

Method: Introduces T-Mimi, a modified Mimi codec decoder that replaces convolutional components with a purely transformer-based architecture inspired by TS3-Codec, plus quantization aware training to identify sensitive layers.

Result: Dramatically reduces on-device TTS latency from 42.1ms to just 4.4ms. Quantization analysis reveals that the final two transformer layers and concluding linear layers near the waveform output are highly sensitive and must remain at full precision.

Conclusion: Transformer-only decoder architecture enables significant latency reduction for edge device TTS while maintaining audio quality through careful quantization strategy that preserves sensitive layers at full precision.

Abstract: Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi providing high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi’s decoder, which employs a hybrid transformer and convolution architecture, introduces significant latency bottlenecks on edge devices due to the compute-intensive nature of deconvolution layers, which are not friendly to mobile-CPU inference frameworks such as XNNPACK. This paper introduces T-Mimi, a novel modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change dramatically reduces on-device TTS latency from 42.1ms to just 4.4ms. Furthermore, we conduct quantization-aware training and derive a crucial finding: the final two transformer layers and the concluding linear layers of the decoder, which are close to the waveform, are highly sensitive to quantization and must be preserved at full precision to maintain audio quality.

[529] ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy

Ya-Tse Wu, Chi-Chun Lee

Main category: eess.AS

TL;DR: Emotional speech synthesis and targeted generative strategies improve ASR performance on emotional speech without harming clean speech recognition.

DetailsMotivation: To understand how emotional speech and different generative strategies affect automatic speech recognition (ASR) performance, particularly for building emotion-aware ASR systems.

Method: Analyzed speech from three emotional TTS models, identified substitution errors as dominant, then introduced two generative strategies: one based on transcription correctness and another based on emotional salience to create fine-tuning subsets.
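
A sketch of how the two selection strategies could be combined into fine-tuning subsets. The thresholds, their direction, and the `asr_transcribe`/`emo_salience` callables are assumptions for illustration; the `jiwer` package computes the word error rate.

```python
import jiwer

def build_finetune_subsets(samples, asr_transcribe, emo_salience,
                           wer_max=0.2, salience_min=0.7):
    """Index synthetic emotional TTS utterances by the two criteria.

    samples:        iterable of (audio, reference_text) pairs.
    asr_transcribe: audio -> hypothesis text.
    emo_salience:   audio -> emotional-salience score in [0, 1].
    """
    keep_wer, keep_emo = set(), set()
    for i, (audio, ref_text) in enumerate(samples):
        if jiwer.wer(ref_text, asr_transcribe(audio)) <= wer_max:
            keep_wer.add(i)                      # strategy 1: transcription correctness
        if emo_salience(audio) >= salience_min:
            keep_emo.add(i)                      # strategy 2: emotional salience
    combined = keep_wer & keep_emo               # combined strategy
    return keep_wer, keep_emo, combined
```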

Result: Consistent WER improvements on real emotional datasets without degradation on clean LibriSpeech. Combined strategy achieved strongest gains, especially for expressive speech.

Conclusion: Targeted augmentation strategies are crucial for developing effective emotion-aware ASR systems, with combined generative approaches yielding the best performance improvements.

Abstract: This work investigates how emotional speech and generative strategies affect ASR performance. We analyze speech synthesized from three emotional TTS models and find that substitution errors dominate, with emotional expressiveness varying across models. Based on these insights, we introduce two generative strategies: one using transcription correctness and another using emotional salience, to construct fine-tuning subsets. Results show consistent WER improvements on real emotional datasets without noticeable degradation on clean LibriSpeech utterances. The combined strategy achieves the strongest gains, particularly for expressive speech. These findings highlight the importance of targeted augmentation for building emotion-aware ASR systems.

[530] Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving

Ziang Guo, Feng Yang, Xuefeng Zhang, Jiaqi Guo, Kun Zhao, Yixiao Zhou, Peng Lu, Zufeng Zhang, Sifa Zheng

Main category: eess.AS

TL;DR: EchoVLA is a user-aware Vision Language Action model for autonomous driving that incorporates real-time audio instructions with emotional cues, enabling more responsive and emotionally adaptive driving behavior.

DetailsMotivation: Current VLA models treat language as static prior, forcing them to infer continuously shifting objectives from pixels alone, resulting in delayed or overly conservative maneuvers. There's a need for online channels where users can influence driving with specific intentions.

Method: Augmented nuScenes dataset with temporally aligned, intent-specific synthetic speech commands. Created multimodal Chain-of-Thought (CoT) with emotional speech-trajectory pairs. Fine-tuned Qwen2.5-Omni MLM to interpret both semantic content and emotional context from audio commands.

Result: Reduced average L2 error by 59.4% and collision rate by 74.4% compared to vision-only baseline. Validated on nuScenes that EchoVLA steers trajectories through audio instructions and modulates driving behavior based on detected emotions in user’s speech.

Conclusion: EchoVLA demonstrates that incorporating real-time audio instructions with emotional context enables more responsive, nuanced, and emotionally adaptive autonomous driving behavior, addressing limitations of static language priors in current VLA models.

Abstract: Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously shifting objectives from pixels alone, yielding delayed or overly conservative maneuvers. We argue that effective VLAs for autonomous driving need an online channel in which users can influence driving with specific intentions. To this end, we present EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. We augment the nuScenes dataset with temporally aligned, intent-specific speech commands generated by converting ego-motion descriptions into synthetic audios. Further, we compose emotional speech-trajectory pairs into a multimodal Chain-of-Thought (CoT) for fine-tuning a Multimodal Large Model (MLM) based on Qwen2.5-Omni. Specifically, we synthesize the audio-augmented dataset with different emotion types paired with corresponding driving behaviors, leveraging the emotional cues embedded in tone, pitch, and speech tempo to reflect varying user states, such as urgent or hesitant intentions, thus enabling our EchoVLA to interpret not only the semantic content but also the emotional context of audio commands for more nuanced and emotionally adaptive driving behavior. In open-loop benchmarks, our approach reduces the average L2 error by 59.4% and the collision rate by 74.4% compared to the baseline of vision-only perception. Further experiments on the nuScenes dataset validate that EchoVLA not only steers the trajectory through audio instructions, but also modulates driving behavior in response to the emotions detected in the user’s speech.

[531] Erasing Your Voice Before It’s Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech

Myungjin Lee, Eunji Shin, Jiyoung Lee

Main category: eess.AS

TL;DR: TruS is a training-free framework for speaker unlearning in TTS systems that prevents generation of specific speaker identities at inference time without retraining.

DetailsMotivation: Zero-shot TTS models pose crime risks by synthesizing voices without consent. Existing speaker unlearning methods require costly retraining and only work for speakers seen during training.

Method: TruS uses inference-time control to steer identity-specific hidden activations, suppressing target speakers while preserving other attributes like prosody and emotion.
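
A generic activation-steering sketch of the inference-time control idea: project an identity direction out of the hidden states. How TruS actually estimates the direction and which layers it steers are specific to the paper; `voice_dir` is an assumed input here.

```python
import torch

def suppress_identity(hidden, voice_dir, alpha=1.0):
    """Remove a target speaker's identity component from hidden activations.

    hidden:    (B, T, D) hidden states of the TTS model at some layer.
    voice_dir: (D,) direction estimated from the opt-out speaker's
               embeddings (an assumption for this sketch).
    """
    v = voice_dir / voice_dir.norm()
    coeff = hidden @ v                           # (B, T) projection coefficients
    return hidden - alpha * coeff.unsqueeze(-1) * v
```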

Result: TruS effectively prevents voice generation for both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis.

Conclusion: TruS provides a practical, training-free solution for speaker unlearning that addresses privacy and consent concerns in modern TTS systems.

Abstract: Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious risks of criminal misuse, as they can synthesize voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches, reliant on retraining, are costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation on both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available on http://mmai.ewha.ac.kr/trus.

[532] Decoding Speech Envelopes from Electroencephalogram with a Contrastive Pearson Correlation Coefficient Loss

Yayun Liang, Yuanming Zhang, Fei Chen, Jing Lu, Zhibin Lin

Main category: eess.AS

TL;DR: Proposes contrastive PCC loss for EEG-based auditory attention decoding to improve separation between attended and unattended speech envelope reconstructions.

DetailsMotivation: Existing DNN-based EEG envelope reconstruction methods focus on maximizing correlation with attended speech envelope (attended PCC), but the difference between attended and unattended PCC is crucial for accurate attention decoding.

Method: Introduces contrastive PCC loss that explicitly represents the difference between attended PCC and unattended PCC, evaluated on three public EEG datasets using four DNN architectures.
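
The proposed objective is direct to write down: maximize the PCC with the attended envelope while minimizing the PCC with the unattended one. A PyTorch sketch follows; the equal weighting of the two terms is an assumption.

```python
import torch

def pcc(x, y, eps=1e-8):
    """Pearson correlation coefficient along the time axis."""
    xc = x - x.mean(dim=-1, keepdim=True)
    yc = y - y.mean(dim=-1, keepdim=True)
    return (xc * yc).sum(-1) / (xc.norm(dim=-1) * yc.norm(dim=-1) + eps)

def contrastive_pcc_loss(reconstructed, attended, unattended):
    """Minimize -(attended PCC - unattended PCC), i.e., push the
    reconstruction toward the attended envelope and away from the
    unattended one."""
    return -(pcc(reconstructed, attended)
             - pcc(reconstructed, unattended)).mean()
```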

Result: Proposed objective improves envelope separability and AAD accuracy across many settings, but reveals dataset- and architecture-dependent failure cases.

Conclusion: Contrastive PCC loss effectively enhances auditory attention decoding performance by better separating attended and unattended speech envelope reconstructions, though performance varies with datasets and model architectures.

Abstract: Recent advances in reconstructing speech envelopes from Electroencephalogram (EEG) signals have enabled continuous auditory attention decoding (AAD) in multi-speaker environments. Most Deep Neural Network (DNN)-based envelope reconstruction models are trained to maximize the Pearson correlation coefficients (PCC) between the attended envelope and the reconstructed envelope (attended PCC). While the difference between the attended PCC and the unattended PCC plays an essential role in auditory attention decoding, existing methods often focus on maximizing the attended PCC. We therefore propose a contrastive PCC loss which represents the difference between the attended PCC and the unattended PCC. The proposed approach is evaluated on three public EEG AAD datasets using four DNN architectures. Across many settings, the proposed objective improves envelope separability and AAD accuracy, while also revealing dataset- and architecture-dependent failure cases.

[533] Confidence intervals for forced alignment boundaries using model ensembles

Matthew C. Kelley

Main category: eess.AS

TL;DR: Neural network ensemble method creates confidence intervals for forced alignment boundaries, improving accuracy and providing uncertainty estimates.

DetailsMotivation: Traditional forced alignment tools only provide single boundary estimates without uncertainty measures, making it difficult to assess reliability or identify boundaries needing review.

Method: Uses ensemble of 10 pre-trained neural network segment classifiers, repeats alignment with each model, places boundary at median of ensemble boundaries, and constructs 97.85% confidence intervals using order statistics.
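
The order-statistic construction is concrete enough to sketch: with 10 ensemble boundaries, the interval spanned by the 2nd smallest and 2nd largest values is a sign-test confidence interval with coverage 1 - 2 * P(Binomial(10, 0.5) <= 1) = 1 - 22/1024 ≈ 97.85%, matching the paper's figure. A minimal sketch:

```python
import numpy as np

def boundary_with_ci(ensemble_boundaries):
    """Median boundary plus a 97.85% order-statistic confidence interval
    from a 10-model ensemble of forced-alignment boundaries (seconds)."""
    b = np.sort(np.asarray(ensemble_boundaries))
    assert len(b) == 10
    return {"boundary": float(np.median(b)),
            "ci_low": float(b[1]),               # 2nd order statistic
            "ci_high": float(b[-2])}             # 9th order statistic
```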

Result: Ensemble boundaries show slight overall improvement over single models on Buckeye and TIMIT corpora, and confidence intervals provide uncertainty estimates for boundary placement.

Conclusion: The method successfully provides confidence intervals for forced alignment boundaries, enabling uncertainty estimation and boundary review while slightly improving accuracy, with outputs available in both JSON and Praat TextGrid formats.

Abstract: Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. Having confidence intervals provides an estimate of the uncertainty in the boundary placement, facilitating tasks like finding boundaries that should be reviewed. As a bonus, on the Buckeye and TIMIT corpora, the ensemble boundaries show a slight overall improvement over using just a single model. The confidence intervals can be emitted during the alignment process as JSON files and a main table for programmatic and statistical analysis. For familiarity, they are also output as Praat TextGrids using a point tier to represent the intervals.

[534] Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Shinji Watanabe, Hung-yi Lee

Main category: eess.AS

TL;DR: Full-Duplex-Bench v1.5 is the first automated benchmark for evaluating how spoken dialogue systems handle overlapping speech in four scenarios, revealing two distinct strategies used by state-of-the-art agents.

DetailsMotivation: Current spoken dialogue systems operate in rigid turn-taking protocols, but true fluid conversations require handling overlapping speech. Existing systems are critically under-evaluated for managing speech overlap, which is central to achieving natural full-duplex interaction.

Method: Introduces Full-Duplex-Bench v1.5, an automated benchmark that simulates four overlap scenarios: user interruption, user backchannel, talking to others, and background speech. The framework works with both open-source and commercial API-based models and provides comprehensive metrics for categorical dialogue behaviors, stop/response latency, and prosodic adaptation.
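
A toy sketch of the stop/response latency metrics computed from frame-level voice activity on the agent channel; the frame length and the exact event definitions are assumptions here, not the benchmark's specification.

```python
def stop_and_response_latency(agent_voiced, overlap_onset, frame_s=0.02):
    """Stop latency: time from the overlap event until the agent stops
    speaking. Response latency: time until the agent speaks again.

    agent_voiced:  list of per-frame voice-activity booleans for the agent.
    overlap_onset: time (s) at which the overlapping user speech begins.
    """
    start = int(overlap_onset / frame_s)
    stop = next((i for i in range(start, len(agent_voiced))
                 if not agent_voiced[i]), None)
    resp = None if stop is None else next(
        (i for i in range(stop, len(agent_voiced)) if agent_voiced[i]), None)

    def _sec(i):
        return None if i is None else i * frame_s - overlap_onset

    return _sec(stop), _sec(resp)
```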

Result: Benchmarking five state-of-the-art agents revealed two divergent strategies: 1) responsive approach prioritizing rapid response to user input, and 2) floor-holding approach that preserves conversational flow by filtering overlapping events.

Conclusion: The open-source Full-Duplex-Bench v1.5 framework enables practitioners to accelerate development of robust full-duplex systems by providing reproducible evaluation tools for assessing how models handle overlapping speech in natural conversations.

Abstract: Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.

[535] Query-Based Asymmetric Modeling with Decoupled Input-Output Rates for Speech Restoration

Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong, Hyung-Min Park

Main category: eess.AS

TL;DR: TF-Restormer is a query-based asymmetric modeling framework for speech restoration that handles decoupled input-output rates without redundant resampling, enabling consistent performance across arbitrary rate pairs.

DetailsMotivation: Real-world speech restoration faces challenges due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume fixed, shared input-output rates and rely on external resampling, which causes redundant computation and limits generality.

Method: Proposes TF-Restormer with a query-based asymmetric modeling framework. Uses a time-frequency dual-path encoder to concentrate analysis on observed input bandwidth, and a lightweight decoder that reconstructs missing spectral content via frequency extension queries.

Result: Experiments across diverse sampling rates, degradations, and operating modes show TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios.

Conclusion: TF-Restormer enables a single model to operate consistently across arbitrary input-output rate pairs without redundant resampling, addressing the limitations of existing speech restoration systems.

Abstract: Speech restoration in real-world conditions is challenging due to compounded distortions and mismatches between input and desired output rates. Most existing systems assume a fixed and shared input-output rate, relying on external resampling that incurs redundant computation and limits generality. We address this setting by formulating speech restoration under decoupled input-output rates, and propose TF-Restormer, a query-based asymmetric modeling framework. The encoder concentrates analysis on the observed input bandwidth using a time-frequency dual-path architecture, while a lightweight decoder reconstructs missing spectral content via frequency extension queries. This design enables a single model to operate consistently across arbitrary input-output rate pairs without redundant resampling. Experiments across diverse sampling rates, degradations, and operating modes show that TF-Restormer maintains stable restoration behavior and balanced perceptual quality, including in real-time streaming scenarios. Code and demos are available at https://tf-restormer.github.io/demo.

[536] WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen

Main category: eess.AS

TL;DR: Parameter-efficient speech deepfake detection using prompt-tuning fused with signal processing transforms, achieving SOTA performance with low trainable parameters.

DetailsMotivation: Full fine-tuning of large pre-trained models for speech deepfake detection is parameter-inefficient and may generalize poorly to realistic in-the-wild data.

Method: Propose parameter-efficient front-ends fusing prompt-tuning with signal processing transforms (FourierPT-XLSR, WSPT-XLSR, Partial-WSPT-XLSR), and WaveSP-Net combining Partial-WSPT-XLSR front-end with bidirectional Mamba-based back-end.

Result: WaveSP-Net outperforms SOTA models on Deepfake-Eval-2024 and SpoofCeleb benchmarks with low trainable parameters and notable performance gains.

Conclusion: The proposed parameter-efficient approach effectively enhances localization of synthetic artifacts without altering frozen pre-trained parameters, achieving superior performance on challenging benchmarks.

Abstract: Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with low trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.

[537] Adaptive Per-Channel Energy Normalization Front-end for Robust Audio Signal Processing

Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah, Qiquan Zhang, Haizhou Li

Main category: eess.AS

TL;DR: Adaptive audio front-end with neural controller dynamically tunes parameters during inference, outperforming static front-ends in various acoustic conditions.

DetailsMotivation: Current learnable audio front-ends have fixed parameters after training, lacking flexibility during inference and limiting robustness in dynamic, complex acoustic environments.

Method: Simplify LEAF architecture and integrate a neural controller that dynamically tunes Per-Channel Energy Normalization using current and buffered past subband energies for input-dependent adaptation.
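
PCEN itself has a standard closed form, out = (E / (eps + M)^alpha + delta)^r - delta^r, where M is a smoothed copy of the subband energy E. The sketch below takes alpha, delta, and r as inputs so that a controller network (omitted here, and predicted from current and buffered past energies in the paper) can retune them per input.

```python
import torch

def adaptive_pcen(E, alpha, delta, r, s=0.04, eps=1e-6):
    """Per-Channel Energy Normalization with externally supplied parameters.

    E:                 (B, C, T) subband energies.
    alpha, delta, r:   (B, C, 1) tensors, e.g. predicted by a small
                       controller network (an assumption in this sketch).
    s:                 smoothing coefficient of the IIR energy tracker.
    """
    M = torch.zeros_like(E)
    M[..., 0] = E[..., 0]
    for t in range(1, E.shape[-1]):              # first-order IIR smoother
        M[..., t] = (1 - s) * M[..., t - 1] + s * E[..., t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```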

Result: Consistently outperforms prior fixed and learnable front-ends on multiple audio classification tasks under both clean and complex acoustic conditions.

Conclusion: Neural adaptability represents a promising direction for next-generation audio front-ends, enabling dynamic parameter adjustment during inference for improved robustness.

Abstract: In audio signal processing, learnable front-ends have shown strong performance across diverse tasks by optimizing task-specific representation. However, their parameters remain fixed once trained, lacking flexibility during inference and limiting robustness under dynamic complex acoustic environments. In this paper, we introduce a novel adaptive paradigm for audio front-ends that replaces static parameterization with a closed-loop neural controller. Specifically, we simplify the learnable front-end LEAF architecture and integrate a neural controller for adaptive representation via dynamically tuning Per-Channel Energy Normalization. The neural controller leverages both the current and the buffered past subband energies to enable input-dependent adaptation during inference. Experimental results on multiple audio classification tasks demonstrate that the proposed adaptive front-end consistently outperforms prior fixed and learnable front-ends under both clean and complex acoustic conditions. These results highlight neural adaptability as a promising direction for the next generation of audio front-ends.

eess.IV

[538] Orthogonal Plane-Wave Transmit-Receive Isotropic-Focusing Micro-Ultrasound (OPTIMUS) with Bias-Switchable Row-Column Arrays

Darren Dahunsi, Randy Palamar, Tyler Henry, Mohammad Rahim Sobhani, Negar Majidi, Joy Wang, Afshin Kashani Ilkhechi, Roger Zemp

Main category: eess.IV

TL;DR: OPTIMUS is a novel ultrasound imaging scheme using TOBE arrays to achieve nearly isotropic focusing across large volumes, surpassing conventional row-column arrays and approaching ideal matrix probe performance.

DetailsMotivation: Current ultrasound imaging faces limitations: matrix probes have limited field of view and element counts, while row-column arrays provide insufficient focusing. There's a need for volumetric imaging with both wide field of view and high-quality isotropic focusing.

Method: Developed OPTIMUS (Orthogonal Plane-Wave Transmit-Receive Isotropic-Focusing Micro-Ultrasound) using TOBE (Top-Orthogonal-to-Bottom-Electrode) arrays. Extended HERCULES scheme for volumetric imaging beyond aperture shadow. Used simulations with scatterer grids, experimental validation with commercial phantom, custom TOBE array, biasing electronics, and research ultrasound system, plus ex-vivo tissue imaging.

Result: TOBE arrays enable isotropic focusing comparable to ideal matrix probes with wider field of view than conventional RCAs. OPTIMUS achieves nearly isotropic focusing throughout expansive volumes and can image beyond aperture shadow. Experimental validation confirmed resolution performance, and ex-vivo imaging demonstrated ability to discern structural tissue information.

Conclusion: OPTIMUS with TOBE arrays provides a promising solution for high-quality structural volumetric ultrasound imaging, combining wide field of view with excellent focusing capabilities that bridge the gap between matrix probes and row-column arrays.

Abstract: High-quality structural volumetric imaging is a challenging goal to achieve with modern ultrasound transducers. Matrix probes have limited fields of view and element counts, whereas row-column arrays (RCAs) provide insufficient focusing. In contrast, Top-Orthogonal-to-Bottom-Electrode (TOBE) arrays, also known as bias-switchable RCAs, can enable isotropic focusing on par with ideal matrix probes, with a field of view surpassing conventional RCAs. Orthogonal Plane-Wave Transmit-Receive Isotropic-Focusing Micro-Ultrasound (OPTIMUS) is a novel imaging scheme that can use TOBE arrays to achieve nearly isotropic focusing throughout an expansive volume. This approach builds upon a similar volumetric imaging scheme, Hadamard Encoded Row Column Ultrasonic Expansive Scanning (HERCULES), and is even able to image beyond the shadow of the aperture, much like typical 2D matrix probes. We simulate a grid of scatterers to evaluate how the resolution varies across the volume, and validate these simulations experimentally using a commercial calibration phantom. Experimental measurements were done with a custom-fabricated TOBE array, custom biasing electronics, and a research ultrasound system. Finally, we performed ex-vivo imaging to assess our ability to discern structural tissue information.

[539] SegRap2025: A Benchmark of Gross Tumor Volume and Lymph Node Clinical Target Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

Jia Fu, Litingyu Wang, He Li, Zihao Luo, Huamin Wang, Chenyuan Bian, Zijun Gao, Chunbin Gu, Xin Weng, Jianghao Wu, Yicheng Wu, Jin Ye, Linhao Li, Yiwen Ye, Yong Xia, Elias Tappeiner, Fei He, Abdul qayyum, Moona Mazher, Steven A Niederer, Junqiang Chen, Chuanyi Huang, Lisheng Wang, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Shichuan Zhang, Shaoting Zhang, Wenjun Liao, Guotai Wang

Main category: eess.IV

TL;DR: SegRap2025 challenge builds on SegRap2023 to improve generalization of NPC radiotherapy segmentation models across imaging centers and modalities, with two tasks: GTV segmentation (cross-center) and LN CTV segmentation (cross-center and cross-modality).

DetailsMotivation: Accurate delineation of GTV, LN CTV, and OAR from CT scans is crucial for precise radiotherapy planning in Nasopharyngeal Carcinoma, but existing models need better generalization across different imaging centers and modalities.

Method: SegRap2025 challenge with two tasks: Task01 for GTV segmentation using paired CT from SegRap2023 dataset plus external testing for cross-center evaluation; Task02 for LN CTV segmentation using multi-center training data with unseen external testing, including both paired CT and single-modality cases.

Result: Top models achieved average DSC of 74.61% (internal) and 56.79% (external) for GTV segmentation; for LN CTV segmentation, best DSC values were 60.24% (paired CT), 60.50% (ceCT-only), and 57.23% (ncCT-only).

Conclusion: SegRap2025 establishes a large-scale multi-center, multi-modality benchmark for evaluating generalization and robustness in radiotherapy target segmentation, providing insights toward clinically applicable automated radiotherapy planning systems.

Abstract: Accurate delineation of Gross Tumor Volume (GTV), Lymph Node Clinical Target Volume (LN CTV), and Organ-at-Risk (OAR) from Computed Tomography (CT) scans is essential for precise radiotherapy planning in Nasopharyngeal Carcinoma (NPC). Building upon SegRap2023, which focused on OAR and GTV segmentation using single-center paired non-contrast CT (ncCT) and contrast-enhanced CT (ceCT) scans, the SegRap2025 challenge aims to enhance the generalizability and robustness of segmentation models across imaging centers and modalities. SegRap2025 comprises two tasks: Task01 addresses GTV segmentation using paired CT from the SegRap2023 dataset, with an additional external testing set to evaluate cross-center generalization, and Task02 focuses on LN CTV segmentation using multi-center training data and an unseen external testing set, where each case contains paired CT scans or a single modality, emphasizing both cross-center and cross-modality robustness. This paper presents the challenge setup and provides a comprehensive analysis of the solutions submitted by ten participating teams. For GTV segmentation task, the top-performing models achieved average Dice Similarity Coefficient (DSC) of 74.61% and 56.79% on the internal and external testing cohorts, respectively. For LN CTV segmentation task, the highest average DSC values reached 60.24%, 60.50%, and 57.23% on paired CT, ceCT-only, and ncCT-only subsets, respectively. SegRap2025 establishes a large-scale multi-center, multi-modality benchmark for evaluating the generalization and robustness in radiotherapy target segmentation, providing valuable insights toward clinically applicable automated radiotherapy planning systems. The benchmark is available at: https://hilab-git.github.io/SegRap2025_Challenge.

[540] Task-Based Adaptive Transmit Beamforming for Efficient Ultrasound Quantification

Oisín Nolan, Wessel L. van Nierop, Louis D. van Harten, Tristan S. W. Stevens, Ruud J. G. van Sloun

Main category: eess.IV

TL;DR: Task-based adaptive ultrasound beamforming reduces transmit events by 98% while maintaining accurate ventricular dimension measurements.

DetailsMotivation: Wireless/wearable ultrasound monitoring faces power consumption and data throughput challenges; reducing transmit events per second directly addresses both issues.

Method: Bayesian active perception formulation for adaptive transmit beamforming that selectively chooses where to scan based on information gain for downstream quantitative measurements, avoiding redundant transmissions.
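
A linear-Gaussian toy version of task-based acquisition, choosing the scan line whose observation most reduces the variance of a scalar task quantity; this is a stand-in for the paper's information-gain criterion, not its generative model.

```python
import numpy as np

def pick_next_scan_line(Sigma, w, noise_var=1e-3):
    """Greedy task-based acquisition under a linear-Gaussian belief.

    Sigma: (N, N) current posterior covariance over the N scan lines.
    w:     (N,) linearized task gradient, so the task quantity is t = w @ x
           (e.g., a ventricular dimension as a function of the lines).
    """
    Sw = Sigma @ w
    # Observing line i as y = x_i + noise shrinks var(t) by
    # (Sigma w)_i^2 / (Sigma_ii + noise_var); pick the biggest reduction.
    gain = Sw ** 2 / (np.diag(Sigma) + noise_var)
    return int(np.argmax(gain))
```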

Result: TBIG recovers accurate ventricular dimensions using fewer than 2% of typical scan lines, enabling large reductions in power usage and data rates for monitoring.

Conclusion: Task-based adaptive ultrasound scanning shows significant potential for making continuous ultrasound monitoring feasible through dramatic reductions in power and data requirements.

Abstract: Wireless and wearable ultrasound devices promise to enable continuous ultrasound monitoring, but power consumption and data throughput remain critical challenges. Reducing the number of transmit events per second directly impacts both. We propose a task-based adaptive transmit beamforming method, formulated as a Bayesian active perception problem, that adaptively chooses where to scan in order to gain information about downstream quantitative measurements, avoiding redundant transmit events. Our proposed Task-Based Information Gain (TBIG) strategy applies to any differentiable downstream task function. When applied to recovering ventricular dimensions from echocardiograms, TBIG recovers accurate results using fewer than 2% of scan lines typically used, showing potential for large reductions in the power usage and data rates necessary for monitoring. Code is available at https://github.com/tue-bmd/task-based-ulsa.

[541] Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence

Yichi Zhang, Fengqing Zhu

Main category: eess.IV

TL;DR: SOAP, a second-order quasi-Newton optimizer, dramatically improves training efficiency and final performance for learned image compression models by resolving gradient conflicts in the rate-distortion trade-off.

DetailsMotivation: Standard first-order optimizers (SGD, Adam) struggle with gradient conflicts arising from competing rate and distortion objectives in learned image compression, leading to slow convergence and suboptimal performance.

Method: Proposes SOAP - a simple utilization of a second-order quasi-Newton optimizer as a drop-in replacement for first-order optimizers, using Newton preconditioning to resolve intra-step and inter-step update conflicts in the R-D objective.

Result: SOAP dramatically improves both training efficiency and final performance across diverse LICs, with faster and more stable convergence. Additionally, second-order trained models exhibit significantly fewer activation and latent outliers, enhancing robustness to post-training quantization.

Conclusion: Second-order optimization is established as a powerful, practical tool for advancing the efficiency and real-world readiness of learned image compression models, offering both training acceleration and improved deployability through better quantization robustness.

Abstract: Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with gradient conflicts arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that a simple utilization of a second-order quasi-Newton optimizer, SOAP, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond acceleration, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers. This substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement of the imported optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.

[542] X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second

Guofeng Zhang, Ruyi Zha, Hao He, Yixun Liang, Alan Yuille, Hongdong Li, Yuanhao Cai

Main category: eess.IV

TL;DR: X-LRM is a novel X-ray Large Reconstruction Model for sparse-view 3D CT reconstruction that handles <10 input views using a Transformer-based encoder and triplane representation, trained on a new large-scale dataset Torso-16K.

DetailsMotivation: Existing feedforward methods for sparse-view CT reconstruction are limited by: 1) scarcity of large-scale training datasets, and 2) absence of direct and consistent 3D representations.

Method: X-LRM consists of two components: 1) X-former, which uses an MLP-based image tokenizer and a Transformer encoder to handle an arbitrary number of input views, and 2) X-triplane, which upsamples the output tokens into a triplane representation that models 3D radiodensity as an implicit neural field. Trained on the Torso-16K dataset with over 16K volume-projection pairs.

Result: Outperforms the state-of-the-art method by 1.5 dB and achieves 27× faster speed with better flexibility. Evaluation on lung segmentation tasks demonstrates practical value.

Conclusion: X-LRM provides an effective solution for extremely sparse-view CT reconstruction with superior performance, speed, and flexibility, enabled by the novel architecture and large-scale dataset.

Abstract: Sparse-view 3D CT reconstruction aims to recover volumetric structures from a limited number of 2D X-ray projections. Existing feedforward methods are constrained by the scarcity of large-scale training datasets and the absence of direct and consistent 3D representations. In this paper, we propose an X-ray Large Reconstruction Model (X-LRM) for extremely sparse-view (<10 views) CT reconstruction. X-LRM consists of two key components: X-former and X-triplane. X-former can handle an arbitrary number of input views using an MLP-based image tokenizer and a Transformer-based encoder. The output tokens are then upsampled into our X-triplane representation, which models the 3D radiodensity as an implicit neural field. To support the training of X-LRM, we introduce Torso-16K, a large-scale dataset comprising over 16K volume-projection pairs of various torso organs. Extensive experiments demonstrate that X-LRM outperforms the state-of-the-art method by 1.5 dB and achieves 27× faster speed with better flexibility. Furthermore, the evaluation of lung segmentation tasks also suggests the practical value of our approach. Our code and dataset will be released at https://github.com/Richard-Guofeng-Zhang/X-LRM
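
The abstract doesn't detail X-triplane's internals, but a triplane implicit field is a standard construction: project each 3D query point onto three axis-aligned feature planes, bilinearly sample, aggregate, and decode with a small MLP. Below is a minimal PyTorch sketch; the feature dimension, sum aggregation, and decoder depth are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Queries radiodensity at 3D points from three axis-aligned feature planes."""
    def __init__(self, feat_dim=32, res=128):
        super().__init__()
        # Three learned feature planes: XY, XZ, YZ.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, pts):                       # pts: (N, 3) in [-1, 1]^3
        coords = torch.stack([pts[:, [0, 1]],     # XY plane
                              pts[:, [0, 2]],     # XZ plane
                              pts[:, [1, 2]]])    # YZ plane
        grid = coords.unsqueeze(2)                # (3, N, 1, 2) for grid_sample
        feats = F.grid_sample(self.planes, grid, align_corners=True)
        feats = feats.squeeze(-1).sum(dim=0).T    # (N, feat_dim), planes summed
        return self.mlp(feats)                    # (N, 1) radiodensity
```

Because the same three planes answer every query, the representation stays consistent in 3D regardless of how many input views produced it, which is what makes it attractive for the extremely sparse-view regime.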

[543] BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification

Amirreza Fateh, Yasin Rezvani, Sara Moayedi, Sadjad Rezvani, Fatemeh Fateh, Mansoor Fateh, Vahid Abolghasemi

Main category: eess.IV

TL;DR: BRISC is a new brain tumor MRI dataset with 6,000 expert-annotated scans for segmentation and classification, addressing data quality gaps in medical imaging.

DetailsMotivation: Current brain tumor MRI analysis suffers from lack of high-quality, balanced, diverse datasets with expert annotations, hindering accurate segmentation and classification.

Method: Created BRISC dataset by collecting 6,000 contrast-enhanced T1-weighted MRI scans from multiple public sources, then having certified radiologists and physicians provide expert annotations with high-resolution segmentation masks for three tumor types and non-tumorous cases across three imaging planes.

Result: Produced a comprehensive dataset with expert annotations, benchmark results for segmentation and classification tasks using standard deep learning models, and made the dataset publicly available on Kaggle.

Conclusion: BRISC addresses critical data quality gaps in brain tumor analysis and provides a valuable resource for robust model development and cross-view generalization in medical image analysis.

Abstract: Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis, primarily due to the lack of high-quality, balanced, and diverse datasets with expert annotations. In this work, we address this gap by introducing BRISC, a dataset designed for brain tumor segmentation and classification tasks, featuring high-resolution segmentation masks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans, which were collated from multiple public datasets that lacked segmentation labels. Our primary contribution is the subsequent expert annotation of these images, performed by certified radiologists and physicians. The dataset includes three major tumor types, namely glioma, meningioma, and pituitary, as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we provide benchmark results for both tasks using standard deep learning models. The BRISC dataset is made publicly available at https://www.kaggle.com/datasets/briscdataset/brisc2025/

[544] Random forest-based out-of-distribution detection for robust lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: RF-Deep uses a random forest classifier over deep features from a pretrained transformer encoder to detect OOD CT scans and improve cancer segmentation reliability.

DetailsMotivation: Transformer-based segmentation models perform well on in-distribution data but degrade on out-of-distribution datasets, creating reliability issues for clinical applications like cancer treatment planning.

Method: Combines a Swin Transformer encoder, pretrained with SimMIM on 10,432 unlabeled 3D CT scans, with a convolutional decoder for segmentation. A random forest classifier (RF-Deep) trained on deep features from the pretrained encoder detects OOD scans.

Result: RF-Deep achieved an FPR95 of 18.26% on PE, 27.66% on COVID-19, and <0.1% on abdominal CTs, consistently outperforming established OOD detection methods. It enhances segmentation reliability across both ID and OOD scenarios.

Conclusion: RF-Deep provides a simple and effective approach to detect OOD cases and improve reliability of cancer segmentation models, making them more robust for clinical applications with diverse datasets.

Abstract: Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, and a convolutional decoder trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT scans from public datasets, comprising one ID dataset and four OOD datasets: chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with an FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, respectively, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance the reliability of cancer segmentation in ID and OOD scenarios.
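
The summary describes RF-Deep only as a random forest over the frozen encoder's deep features, so the sketch below shows one plausible instantiation: pool each scan's feature map to a vector, fit a forest on ID scans versus auxiliary outlier scans, and score new scans by predicted outlier probability. The feature shape, the `encoder` callable, and the training-set construction are all assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pooled_features(encoder, volume):
    """Global-average-pool the frozen encoder's deep feature map into one vector.
    `encoder` is assumed to return a numpy array of shape (C, D, H, W)."""
    feats = encoder(volume)
    return feats.reshape(feats.shape[0], -1).mean(axis=1)

def fit_rf_deep(encoder, id_scans, outlier_scans, n_trees=500):
    """One plausible training setup: ID scans vs. auxiliary outlier scans.
    RF-Deep's actual label construction is not given in the summary."""
    X = np.stack([pooled_features(encoder, v) for v in id_scans + outlier_scans])
    y = np.array([0] * len(id_scans) + [1] * len(outlier_scans))
    return RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)

def ood_score(rf, encoder, volume):
    """Higher score means more likely OOD; threshold e.g. at 95% ID TPR (FPR95)."""
    return rf.predict_proba(pooled_features(encoder, volume)[None])[0, 1]
```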

[545] A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT

Wanqi Wang, Chun Yang, Jianbo Shao, Yaokai Zhang, Xuehua Peng, Jin Sun, Chao Xiong, Long Lu, Lianting Hu

Main category: eess.IV

TL;DR: A multi-stage deep learning framework using multi-phase CT scans for non-invasive diagnosis of pediatric liver tumors, achieving high accuracy in distinguishing benign vs malignant and classifying subtypes.

DetailsMotivation: Current invasive biopsy procedures for pediatric liver tumors have significant limitations: high complication risks due to vascular liver tissue and fragile tumors, need for anesthesia in young children, increased costs, and psychological trauma. There's a gap in AI applications specifically for pediatric liver tumors despite AI's growing role in clinical settings.

Method: Developed a multi-stage DL framework using multi-phase contrast-enhanced CT scans. Used a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. Trained a tumor detection model to extract ROIs, then implemented a two-stage diagnosis pipeline with three backbones using ROI-masked images. Included ablation studies, Shapley-value, and CAM interpretability analyses.

Result: Tumor detection model achieved mAP=0.871. First-stage benign vs malignant classification reached AUC=0.989. Final diagnosis models showed robustness: benign subtype classification AUC=0.915, malignant subtype classification AUC=0.979. Framework demonstrated effectiveness through multi-level comparative analyses.

Conclusion: The framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible non-invasive pediatric liver tumor diagnosis, potentially replacing risky invasive biopsies.

Abstract: Pediatric liver tumors are among the most common solid tumors in children, and differentiating benign from malignant status and determining the pathological classification are critical for clinical treatment. While pathological examination is the gold standard, invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding, and young children with poor compliance require anesthesia for biopsy, increasing medical costs and risking psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most research has overlooked pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two cohorts, one retrospective and one prospective, were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then built a two-stage diagnosis pipeline with three backbones operating on ROI-masked images. Our tumor detection model achieved high performance (mAP=0.871), and the first-stage classifier between benign and malignant tumors reached excellent performance (AUC=0.989). The final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
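
The PKCP-MixUp recipe itself is not described in this summary; for reference, the generic MixUp mechanism it extends blends pairs of images and soft labels with a Beta-distributed weight, as in this short sketch.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    """Standard MixUp: convex combination of two samples and their soft labels.
    The paper's PKCP-MixUp presumably adapts this for multi-phase CT; its exact
    recipe is not given here, so this shows only the generic mechanism."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # blend images (same shape)
    y = lam * y1 + (1.0 - lam) * y2   # blend one-hot / soft labels
    return x, y
```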

Last updated: 2026-02-13