Daily arXiv Papers - 2026-03-18

AI-enhanced summaries of 0 research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Bingzhou Li, Tao Huang

Main category: cs.MM

TL;DR: DASH is a training-free compression framework for OmniLLMs that uses audio-driven semantic chunking to dynamically align token compression with semantic structure, achieving higher compression ratios while maintaining accuracy.

DetailsMotivation: OmniLLMs process long audio-visual token sequences making inference expensive. Existing compression methods use fixed window partitioning and attention-based pruning, which ignore semantic structure and become fragile under aggressive token reduction.

Method: Uses audio embeddings as semantic anchor to detect boundary candidates via cosine-similarity discontinuities, creating dynamic variable-length segments. Projects boundaries onto video tokens for cross-modal segmentation. Within segments, token retention uses tri-signal importance estimator fusing structural boundary cues, representational distinctiveness, and attention-based salience.

Result: Extensive experiments on AVUT, VideoMME, and WorldSense show DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods.

Conclusion: DASH provides effective training-free compression for OmniLLMs by aligning token reduction with semantic structure through audio-driven chunking and multi-signal importance estimation.

Abstract: Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.

Relevance: 9/10

[2] Diffusion Models for Joint Audio-Video Generation

Alejandro Paredes La Torre

Main category: cs.SD

TL;DR: Proposes a sequential two-step pipeline for joint audio-video generation, releases two high-quality paired datasets, trains MM-Diffusion from scratch, investigates joint latent diffusion challenges, and demonstrates modular text-to-audio-video synthesis.

DetailsMotivation: While multimodal generative models have advanced in single-modality synthesis, truly joint audio-video generation remains an open challenge. The paper aims to advance this field by addressing the lack of high-quality paired datasets and developing effective methods for synchronized audio-video generation.

Method: Four key contributions: 1) Release two high-quality paired audio-video datasets (13h video-game clips, 64h concert performances, segmented into 34-second samples). 2) Train MM-Diffusion architecture from scratch on these datasets. 3) Investigate joint latent diffusion using pretrained encoders/decoders. 4) Propose sequential two-step text-to-audio-video pipeline: generate video first, then condition on both video output and original prompt to synthesize synchronized audio.

Result: Demonstrates ability to produce semantically coherent audio-video pairs with quantitative evaluation of alignment on rapid actions and musical cues. Shows challenges in joint latent diffusion decoding. The sequential modular approach yields high-fidelity generations of audio-video content.

Conclusion: The paper advances joint audio-video generation through dataset contributions, architecture training, and a practical sequential pipeline that effectively addresses synchronization challenges, providing a foundation for future multimodal generative research.

Abstract: Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.

Relevance: 9/10

[3] Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.MM

TL;DR: A pipeline for joint audio-visual editing that first edits video, then generates aligned audio using a conditional video-to-audio generation model that adapts source audio influence based on edit complexity.

DetailsMotivation: Current video editing techniques often create visual changes that break audio-visual coherence, requiring audio editing to match the edited video content and maintain multimodal consistency.

Method: Two-stage pipeline: 1) Apply video editing to produce target video, 2) Use novel video-to-audio generation model conditioned on source audio, target video, and text prompt. Model incorporates conditional audio input, uses data augmentation for training efficiency, and dynamically adjusts source audio influence based on edit complexity.

Result: Method outperforms existing approaches in maintaining audio-visual alignment and content integrity, demonstrating effective joint audio-visual editing capabilities.

Conclusion: The proposed pipeline successfully addresses audio-visual coherence in editing tasks through a conditional generation approach that adapts to edit complexity while preserving original audio structure when possible.

Abstract: We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar

Main category: cs.CL

TL;DR: SRLM introduces self-reflection with uncertainty signals (self-consistency, reasoning length, verbalized confidence) to improve program selection in recursive language models for long-context handling.

DetailsMotivation: Long-context handling remains challenging for language models, even with extended windows. While Recursive Language Models (RLM) use programmatic interaction to decompose long contexts, their success depends heavily on how context-interaction programs are selected, which has been largely unexplored.

Method: SRLM augments programmatic context interaction with uncertainty-aware self-reflection using three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of model uncertainty to evaluate and compare candidate context-interaction programs.

Result: SRLM consistently outperforms state-of-the-art baselines across diverse benchmarks, context lengths, and backbone models, yielding up to 22% improvement over RLM under the same time budget. It shows recursion itself isn’t the primary driver of RLM performance, and self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms.

Conclusion: Self-reflection provides semantic signals that better steer reasoning in semantically intensive tasks where heuristic program search is insufficient. SRLM yields consistent gains across both short and long contexts, while RLMs with recursion often degrade performance relative to base models for context lengths within the model’s window.

Abstract: Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by agentic way of decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model’s internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models, show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model’s window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in tasks with semantically intensive nature, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.

[2] MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Eric Wu, Kevin Wu, Jason Hom, Paul H. Yi, Angela Zhang, Alejandro Lozano, Jeff Nirschl, Jeff Tangney, Kevin Byram, Braydon Dymm, Narender Annapureddy, Eric Topol, David Ouyang, James Zou

Main category: cs.CL

TL;DR: MedArena is an interactive platform for clinicians to evaluate LLMs using real medical queries, revealing that top models are Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o, with clinicians valuing depth/detail and clarity over raw factual accuracy.

DetailsMotivation: Current medical LLM evaluations rely on static benchmarks that don't capture real-world clinical complexity, creating a gap between benchmark performance and actual clinical utility.

Method: Interactive platform where clinicians submit medical queries, receive responses from two randomly selected LLMs, and select preferred responses; collected 1571 preferences across 12 LLMs.

Result: Top models: Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, GPT-4o; only 1/3 of questions were factual recall, majority addressed treatment selection, documentation, patient communication; clinicians prioritized depth/detail and clarity over factual accuracy.

Conclusion: MedArena provides scalable, clinically-grounded evaluation of medical LLMs, revealing that real-world utility depends more on readability and clinical nuance than benchmark performance.

Abstract: Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.

[3] MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu

Main category: cs.CL

TL;DR: MiroThinker-1.7 is a research agent for complex long-horizon reasoning, with MiroThinker-H1 adding heavy-duty reasoning capabilities through verification mechanisms for more reliable multi-step problem solving.

DetailsMotivation: To develop research agents capable of complex, multi-step reasoning tasks that require sustained interaction, structured planning, and reliable problem-solving across domains like open-web research, scientific reasoning, and financial analysis.

Method: MiroThinker-1.7 uses agentic mid-training emphasizing structured planning, contextual reasoning, and tool interaction. MiroThinker-H1 adds verification mechanisms at local (intermediate decision evaluation/refinement) and global (reasoning trajectory auditing) levels during inference.

Result: MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. MiroThinker-1.7 and its mini version are released as open-source models with competitive research-agent capabilities and improved efficiency.

Conclusion: The MiroThinker framework advances research agent capabilities through structured reasoning and verification mechanisms, enabling more reliable complex problem-solving while providing open-source models for broader accessibility.

Abstract: We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

[4] Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki, Sawsan Alqahtani

Main category: cs.CL

TL;DR: LLMs and tokenizers struggle to capture Arabic root-pattern morphology, showing tokenizer morphological alignment doesn’t guarantee morphological generation capabilities.

DetailsMotivation: To investigate how LLMs and their tokenization schemes represent and generate Arabic root-pattern morphology, testing whether they capture genuine morphological structure or rely on surface memorization.

Method: Evaluated morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, then analyzed LLM performance in productive root-pattern generation using a newly developed test set across seven Arabic-centric and multilingual LLMs.

Result: Tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, questioning the role of morphological tokenization in downstream performance.

Conclusion: LLMs struggle with genuine Arabic morphological structure, suggesting current tokenization approaches may not effectively capture complex non-concatenative morphology.

Abstract: This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is not necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.

[5] COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives

Azwad Anjum Islam, Tisa Islam Erana

Main category: cs.CL

TL;DR: LLM ensemble system for word sense plausibility rating using multiple prompting strategies achieves strong performance on SemEval-2026 Task 5

DetailsMotivation: Address the challenge of rating word sense plausibility in homonyms within short stories, dealing with subjective semantic evaluation and inter-annotator variation

Method: Three prompting strategies with commercial LLMs: zero-shot baseline, Chain-of-Thought with structured reasoning, and comparative prompting; ensemble averaging across models and strategies

Result: Official system placed 4th with 0.88 accuracy and 0.83 Spearman’s rho (0.86 average); post-competition improvements to 0.92 accuracy and 0.85 Spearman’s rho (0.89 average)

Conclusion: Comparative prompting consistently improves performance, and LLM ensembles are well-suited for subjective semantic evaluation tasks with multiple annotators

Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman’s rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman’s rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.

[6] RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

Abhishek Kumar, Aashraya Sachdeva

Main category: cs.CL

TL;DR: RECOVER: An agentic correction framework for ASR entity recognition that uses multiple ASR hypotheses, entity retrieval, and LLM correction to improve rare/domain-specific entity recognition.

DetailsMotivation: Entity recognition in ASR is challenging for rare and domain-specific terms (finance, medicine, air traffic control), and errors are costly. When entities are entirely absent from ASR output, post-ASR correction becomes difficult.

Method: RECOVER is a tool-using agent framework that leverages multiple ASR hypotheses as evidence, retrieves relevant entities, and applies LLM correction under constraints. Uses four strategies: 1-Best, Entity-Aware Select, ROVER Ensemble, and LLM-Select.

Result: Achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points across five diverse datasets. LLM-Select achieves best overall performance in entity correction while maintaining overall WER.

Conclusion: RECOVER effectively improves entity recognition in ASR for rare and domain-specific terms through multi-hypothesis evidence, entity retrieval, and LLM-based correction.

Abstract: Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are used using different strategies, namely, 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. The LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.

[7] Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies

Nathaniel Imel, Richard Futrell, Michael Franke, Noga Zaslavsky

Main category: cs.CL

TL;DR: Evolutionary game theory combined with Information Bottleneck framework explains how near-optimal vocabulary compression emerges through imprecise strategy imitation in signaling games.

DetailsMotivation: To understand the social dynamics that drive language evolution toward efficient compression of meanings into words, bridging evolutionary game theory with Information Bottleneck framework.

Method: Unified model integrating evolutionary game theory with Information Bottleneck framework, analyzing imprecise strategy imitation dynamics in signaling games.

Result: Near-optimal compression emerges through evolutionary game dynamics; key parameters regulating precision and confusion of similar states constrain tradeoff variation.

Conclusion: Evolutionary game dynamics provide mechanistic basis for evolution of vocabularies with information-theoretically optimal properties.

Abstract: Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language’s vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model – namely, those that regulate precision in these games, as well as players’ tendency to confuse similar states – lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

[8] CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses

Jeffery L. Painter, François Haguinet, Andrew Bate

Main category: cs.CL

TL;DR: CTG-DB transforms ClinicalTrials.gov data into a relational database with standardized adverse event terminology using MedDRA for systematic pharmacovigilance analytics.

DetailsMotivation: ClinicalTrials.gov has registry-oriented architecture and heterogeneous adverse event terminology that limits systematic pharmacovigilance analytics, requiring manual reconciliation of safety concepts.

Method: Created an open-source pipeline that ingests complete CT.gov XML archive, produces relational database aligned to MedDRA terminology, preserves arm-level denominators, represents placebo/comparator arms, and uses deterministic exact and fuzzy matching for terminology normalization.

Result: CTG-DB enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream pharmacovigilance signal detection.

Conclusion: The framework provides transparent and reproducible mappings for systematic pharmacovigilance analytics from clinical trial data.

Abstract: ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.

[9] BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar

Main category: cs.CL

TL;DR: BANGLASOCIALBENCH: A benchmark for evaluating sociopragmatic competence in Bangla across address terms, kinship reasoning, and social customs, revealing systematic cultural misalignment in LLMs.

DetailsMotivation: While LLMs show multilingual fluency, they lack sociopragmatic competence - the ability to use language appropriately in social contexts. Bangla presents particular challenges with its complex pronominal system, kinship-based addressing, and culturally embedded social customs that require sensitivity to hierarchy, roles, and norms.

Method: Created BANGLASOCIALBENCH with 1,719 culturally grounded instances across three domains (Bangla Address Terms, Kinship Reasoning, Social Customs), written and verified by native Bangla speakers. Evaluated 12 contemporary LLMs in zero-shot settings to assess sociopragmatic competence.

Result: LLMs show systematic patterns of cultural misalignment: defaulting to overly formal address forms, failing to recognize multiple socially acceptable address pronouns, and conflating kinship terminology across religious contexts. Sociopragmatic failures are structured and non-random.

Conclusion: Current LLMs have persistent limitations in inferring and applying culturally appropriate language use in realistic Bangladeshi social interactions, revealing gaps in sociopragmatic competence despite multilingual fluency.

Abstract: Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

[10] POLAR:A Per-User Association Test in Embedding Space

Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner MeiraJr

Main category: cs.CL

TL;DR: POLAR is a per-user lexical association test that projects author embeddings onto curated lexical axes to measure individual-level language associations, useful for detecting bots and analyzing ideological drift.

DetailsMotivation: Existing association tests operate at word, sentence, or corpus levels, obscuring author-level variation. There's a need for per-user diagnostics to analyze individual language patterns in computational social science.

Method: Uses masked language model embeddings with private deterministic tokens for authors, projects these vectors onto curated lexical axes, and reports standardized effects with permutation p-values and Benjamini-Hochberg control.

Result: Successfully separates LLM-driven bots from organic accounts on Twitter, quantifies alignment with slur lexicons on extremist forums, and reveals rightward drift over time. Modular to new attribute sets.

Conclusion: POLAR provides concise, per-author diagnostics for computational social science, offering a flexible method for analyzing individual-level language associations in embedding space.

Abstract: Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Re-port), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic to-kens; POLAR projects these vectors onto curated lexicalaxes and reports standardized effects with permutation p-values and Benjamini–Hochberg control. On a balanced bot–human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum,it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly avail-able at https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space.

[11] A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha, :, Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Main category: cs.CL

TL;DR: HAT (Hierarchical Autoregressive Transformer) architecture replaces learned tokenizers with byte-level processing using encoder-decoder structure around pre-trained LLM backbones, improving compression and robustness while maintaining performance.

DetailsMotivation: Current LLMs rely on learned tokenizers with fixed vocabularies that have poor adaptability to new domains/languages and handle intra-word variations poorly. The authors aim to create more flexible, byte-level models that can leverage existing pre-trained LLMs while avoiding tokenizer limitations.

Method: Propose HAT architecture: encoder transformer aggregates bytes into word embeddings, feeds to pre-trained LLM backbone (classical autoregressive transformer), then decoder cross-attends to backbone outputs and converts back to bytes. Convert Llama 3.1 models by adapting pre-trained backbones to handle word embeddings instead of tokens, with encoder/decoder trained from scratch. Also train 7B model from scratch on 4 trillion words.

Result: HAT improves text compression by reducing sequence positions needed and enhances robustness to intra-word variations (spelling differences). Models show strong proficiency in English and German, improving on original Llama 3.1 in most benchmarks. Released models include 200 pre-training checkpoints.

Conclusion: HAT architecture successfully addresses tokenizer limitations by enabling byte-level processing while leveraging existing pre-trained LLMs, demonstrating improved compression, robustness, and multilingual performance compared to token-based approaches.

Abstract: Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

[12] MoLoRA: Composable Specialization via Per-Token Adapter Routing

Shrey Shah, Justin Wagle

Main category: cs.CL

TL;DR: MoLoRA enables per-token routing to different adapters for multimodal generation and mixed-capability requests, allowing specialized LoRAs to be combined without retraining.

DetailsMotivation: Current multi-adapter serving systems route entire sequences to single adapters, which fails for multimodal generation (where text and image tokens need different adapters) and mixed-capability requests requiring expertise from multiple specialized adapters.

Method: Introduces per-token routing that routes individual tokens to adapters based on vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Proposes MoLoRA (Mixture of LoRA) for composable specialization where multiple domain-specific adapters can be loaded and a learned router selects appropriate adapter per-token.

Result: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller, demonstrating that specialization dramatically beats scale. Per-token routing is provably optimal with work N for N tokens versus K·N for per-sequence routing with K adapter types.

Conclusion: Enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters. Provides efficient solution for multimodal and mixed-capability scenarios.

Abstract: Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like “write code to solve this equation,” which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K \cdot N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.

[13] Robust Language Identification for Romansh Varieties

Charlotte Model, Sina Ahmadi, Jannis Vamvas

Main category: cs.CL

TL;DR: SVM-based language identification system for Romansh regional idioms and Rumantsch Grischun, achieving 97% accuracy on new benchmark dataset.

DetailsMotivation: Romansh has multiple regional varieties with limited mutual intelligibility, but lacks documented language identification systems to distinguish between these idioms and the supra-regional Rumantsch Grischun variety.

Method: Developed an SVM (Support Vector Machine) approach for language identification, evaluated on a newly curated benchmark dataset across two domains.

Result: Achieved average in-domain accuracy of 97%, enabling practical applications like idiom-aware spell checking and machine translation. The classifier is publicly available.

Conclusion: Successfully created a functional LID system for Romansh idioms that addresses the novel classification challenge of distinguishing between regional varieties and the combined supra-regional variety.

Abstract: The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.

[14] Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin

Main category: cs.CL

TL;DR: PALLM: A paralinguistics-aware speech LLM using multi-task reinforcement learning with chain-of-thought prompting to improve emotional understanding in speech by jointly optimizing sentiment classification and response generation.

DetailsMotivation: Speech LLMs need to understand paralinguistic cues (prosody, emotion, non-verbal sounds) for better intent understanding, but face challenges: limited training data, annotation difficulty, and models relying on lexical shortcuts instead of paralinguistic signals.

Method: Proposes PALLM with multi-task RL and chain-of-thought prompting that elicits explicit affective reasoning. Uses a two-stage pipeline: 1) jointly optimizes sentiment classification from audio, and 2) paralinguistics-aware response generation.

Result: Improves paralinguistics understanding by 8-12% over supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) on Expresso, IEMOCAP, and RAVDESS datasets.

Conclusion: Modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs that can better understand and respond to human speech with emotional context.

Abstract: Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds–crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

[15] NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time – A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026

David Nordfors

Main category: cs.CL

TL;DR: The paper proposes a co-attractor theory of occupations as self-reinforcing structures where shared vocabulary and practitioner cohesion mutually sustain each other, and applies this to detect occupational emergence from resume data, finding AI has vocabulary cohesion but lacks population cohesion.

DetailsMotivation: Occupations evolve faster than classification systems can track, creating a need for methods to detect genuine occupational emergence without relying on predefined taxonomies or job titles.

Method: Proposes a zero-assumption method using resume data to test vocabulary cohesion and population cohesion independently, with ablation to test if vocabulary binds the population. Applied to 8.2 million US resumes (2022-2026).

Result: Method correctly identifies established occupations. For AI: cohesive professional vocabulary formed rapidly in early 2024, but practitioner population never cohered. Pre-existing AI community dissolved as tools went mainstream, with vocabulary absorbed into existing careers rather than binding a new occupation.

Conclusion: AI appears to be a diffusing technology rather than an emerging occupation. Discussion on whether introducing “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary to complete the co-attractor.

Abstract: Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.

[16] RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

Saisha Pradeep Shetty, Roger Eric Goldman, Vladimir Filkov

Main category: cs.CL

TL;DR: RadAnnotate: LLM-based framework using retrieval-augmented synthetic reports and confidence-based automation to reduce expert effort for radiology report annotation, focusing on entity labeling for RadGraph.

DetailsMotivation: Manual annotation of radiology reports for clinical NLP is slow and expensive, creating a need for automated solutions that can reduce expert effort while maintaining quality.

Method: Three-stage approach: 1) Train entity-specific classifiers on gold-standard reports and analyze performance across anatomy/observation categories, 2) Generate synthetic reports using RAG (Retrieval-Augmented Generation) and evaluate synthetic-only vs. gold-trained models, 3) Implement confidence-based selective automation with entity-specific thresholds to automatically annotate high-confidence cases while routing uncertain ones for expert review.

Result: Synthetic-only models perform within 1-2 F1 points of gold-trained models; synthetic augmentation improves uncertain observation F1 from 0.61 to 0.70 in low-resource settings; RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score.

Conclusion: RadAnnotate demonstrates effective LLM-based automation for radiology report annotation, with synthetic data generation and confidence-based selective automation significantly reducing expert labeling effort while maintaining quality.

Abstract: Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.

[17] Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Fan Huang, Haewoon Kwak, Jisun An

Main category: cs.CL

TL;DR: Analysis of moral reasoning trajectories in LLMs reveals systematic multi-framework deliberation, framework switching patterns, and representation-level encoding of ethical frameworks.

DetailsMotivation: LLMs increasingly participate in morally sensitive decision-making, but how they organize ethical frameworks across reasoning steps remains underexplored. The paper aims to understand the dynamics of moral reasoning in LLMs.

Method: Introduced moral reasoning trajectories (sequences of ethical framework invocations), analyzed six models across three benchmarks, used linear probes to localize framework-specific encoding, performed activation steering, and proposed Moral Representation Consistency (MRC) metric.

Result: Found systematic multi-framework deliberation with 55.4-57.7% framework switching, unstable trajectories 1.29× more susceptible to attacks, localized framework encoding to specific layers, achieved 13.8-22.6% lower KL divergence, and MRC metric strongly correlates with LLM coherence ratings (r=0.715).

Conclusion: Moral reasoning in LLMs involves complex multi-framework deliberation with measurable patterns, and representation-level analysis reveals structured encoding of ethical frameworks that correlates with reasoning coherence.

Abstract: Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce \textit{moral reasoning trajectories}, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).

[18] SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

Ri Chi Ng, Aditi Kumaresan, Yujia Hu, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: SEAHateCheck introduces a culturally relevant hate speech detection benchmark dataset for four Southeast Asian languages (Indonesian, Tagalog, Thai, Vietnamese) to address resource gaps in low-resource language hate speech moderation.

DetailsMotivation: Current hate speech detection relies heavily on linguistic resources available only in high-resource languages like English and Chinese, creating barriers for developing tools for low-resource Southeast Asian languages with diverse socio-linguistic contexts that complicate online hate moderation.

Method: Built on HateCheck’s functional testing framework and refined SGHateCheck’s methods, SEAHateCheck provides culturally relevant test cases augmented by large language models and validated by local experts for accuracy across Indonesian, Tagalog, Thai, and Vietnamese.

Result: Experiments with state-of-the-art multilingual models revealed limitations: Tagalog test cases showed lowest accuracy due to linguistic complexity and limited training data; slang-based functional tests were hardest as models struggled with culturally nuanced expressions; models showed weaknesses in implicit hate detection and counter-speech expression.

Conclusion: SEAHateCheck is the first functional test suite for these Southeast Asian languages, providing researchers with a robust benchmark to advance development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

Abstract: Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck’s functional testing framework and refining SGHateCheck’s methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models’ struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

[19] ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych

Main category: cs.CL

TL;DR: ClaimFlow: A dataset and framework for analyzing scientific claim evolution in NLP literature through manual annotation of claims and their cross-paper relations (support, extend, qualify, refute, background).

DetailsMotivation: Existing citation analysis methods only capture fragments of scientific dialogue. The paper aims to make claim-level interactions explicit to understand how scientific claims evolve, are supported, extended, qualified, or refuted over time in NLP research.

Method: Created ClaimFlow dataset from 304 ACL Anthology papers (1979-2025) with manual annotation of 1,084 claims and 832 cross-paper claim relations. Defined Claim Relation Classification task to infer scientific stance toward cited claims from text and citation context.

Result: Baseline performance of 0.78 macro-F1 on claim relation classification task. Analysis of ~13k NLP papers reveals 63.5% claims are never reused, only 11.1% are ever challenged, and widely propagated claims are more often reshaped through qualification/extension than directly confirmed/refuted.

Conclusion: ClaimFlow provides a framework for examining idea evolution in NLP and assessing model capabilities in interpreting scientific argumentation, revealing patterns of claim propagation and transformation in the field.

Abstract: Scientific papers do more than report results $-$ they advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $304$ ACL Anthology papers (1979$-$2025) that are manually annotated with $1{,}084$ claims and $832$ cross-paper claim relations, indicating whether a citing paper $\textit{supports}$, $\textit{extends}$, $\textit{qualifies}$, $\textit{refutes}$, or references a claim as $\textit{background}$. Using $\texttt{ClaimFlow}$, we define a new task $-$ $\textit{Claim Relation Classification}$ $-$ which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of $0.78$ macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to $\sim$$13k$ NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that $63.5$% claims are never reused; only $11.1$% are ever challenged; meanwhile, widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than directly confirmed or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.

[20] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Tianyi Huang, Ying Kai Deng

Main category: cs.CL

TL;DR: CounterRefine is an inference-time repair layer for retrieval-grounded QA that tests provisional answers by gathering additional evidence and only accepting revisions that pass validation.

DetailsMotivation: Many factual QA errors are failures of commitment rather than access - systems retrieve relevant evidence but still give wrong answers. Current models need better mechanisms to use evidence to reconsider and repair their own answers.

Method: CounterRefine works in three steps: 1) Produce short answer from retrieved evidence, 2) Gather additional support/conflicting evidence with follow-up queries conditioned on draft answer, 3) Apply restricted refinement step with KEEP/REVISE decisions, accepting revisions only if they pass deterministic validation.

Result: On SimpleQA benchmark, CounterRefine improves matched GPT-5 Baseline-RAG by 5.8 points, reaches 73.1% correct rate, and exceeds reported one-shot GPT-5.4 score by ~40 points.

Conclusion: Knowledgeable foundation models should go beyond accessing evidence to using evidence to reconsider and repair answers. CounterRefine demonstrates a simple but effective direction for improving factual QA through evidence-based answer refinement.

Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

[21] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

Main category: cs.CL

TL;DR: ZipCal is a model-agnostic data curation method for LLM compression that selects calibration data based on lexical diversity using Zipfian power laws, outperforming random sampling and matching perplexity-based methods while being 240x faster.

DetailsMotivation: Current LLM compression methods focus on algorithms but neglect optimal calibration data selection, which is crucial for preserving model capabilities across tasks. Existing perplexity-based methods are computationally expensive for large models.

Method: ZipCal uses Zipfian power laws to maximize lexical diversity in calibration data selection. It analyzes intrinsic data properties (lexical distribution) rather than model-specific signals, making it model-agnostic with linear complexity.

Result: ZipCal consistently outperforms uniform random sampling across pruning benchmarks and performs on par with state-of-the-art perplexity-based methods while being ~240x faster due to its tractable linear complexity.

Conclusion: Lexical diversity based on Zipfian distributions is an effective, efficient criterion for calibration data selection in LLM compression, offering a practical alternative to expensive model-specific methods.

Abstract: Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called \emph{calibration data}) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce \texttt{\textbf{ZipCal}}, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while \texttt{\textbf{ZipCal}} is on average $\sim$240$\times$ faster due to its tractable linear complexity\footnote{We make the code and the experiments available at https://anonymous.4open.science/r/zipcal-71CD/.}.

[22] ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, Siu Ming Yiu

Main category: cs.CL

TL;DR: ASDA is a training-free framework that automatically generates structured skill artifacts through error analysis to improve financial reasoning in LLMs without weight modification.

DetailsMotivation: Current training-free methods for adapting LLMs to specialized domains like finance show limited gains, while fine-tuning approaches are expensive and produce model-locked expertise that can't be easily shared or audited.

Method: ASDA uses a teacher model to analyze student model failures on financial reasoning tasks, clusters errors by subfield and type, then synthesizes structured skill files containing reasoning procedures, code templates, and worked examples that are dynamically injected during inference.

Result: On the FAMMA financial reasoning benchmark, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines.

Conclusion: ASDA provides a practical, auditable path to domain adaptation without weight access or retraining, producing human-readable, version-controlled skill artifacts compatible with open standards.

Abstract: Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model’s failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.

[23] Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Lee Boyd-Graber, Aakanksha Naik

Main category: cs.CL

TL;DR: MyScholarQA is a personalized deep research tool that infers user research interests, proposes personalized actions for queries, and writes multi-section reports following user-approved actions, evaluated through both synthetic benchmarks and real user interviews.

DetailsMotivation: Current deep research tools lack understanding of their users and personalization capabilities, limiting their effectiveness in helping researchers cope with the growing volume of scientific publications.

Method: 1) Develops MyScholarQA with three components: user interest profiling, personalized action proposal, and multi-section report generation; 2) Evaluates using synthetic users and LLM judges benchmark; 3) Conducts real user interviews to identify nuanced errors missed by automated evaluation.

Result: MySQA outperforms baselines in citation metrics and personalized action-following in synthetic evaluations, but real user interviews reveal nine nuanced errors undetectable by LLM judges, highlighting limitations of automated evaluation for personalization.

Conclusion: Real progress in personalization requires real user involvement, as easy-to-use LLM judges can overlook important aspects of personalized deep research that users actually value.

Abstract: Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers’ queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user’s research interests; 2) proposes personalized actions for a user’s input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP’s standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

[24] Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki

Main category: cs.CL

TL;DR: Warmup-Stable-Only (WSO) learning rate scheduling outperforms decay-based schedulers for downstream task performance after supervised fine-tuning, despite decay-based schedulers having better pre-training loss.

DetailsMotivation: While decay-based learning rate schedulers are widely used to minimize pre-training loss in large language models, their impact on downstream performance after supervised fine-tuning remains underexplored. The paper investigates whether optimizing for pre-training metrics compromises downstream adaptability.

Method: The paper examines Warmup-Stable-Only (WSO) scheduling which maintains a constant learning rate after warmup without decay. Experiments are conducted with 1B and 8B parameter models across different training regimes (mid-training and over-training). Loss landscape analysis is performed to understand minima characteristics.

Result: WSO consistently outperforms decay-based schedulers in downstream performance after SFT, even though decay-based schedulers show better pre-training loss. This holds across model sizes and training regimes. Loss landscape analysis reveals decay-based schedulers lead to sharper minima while WSO preserves flatter minima that support adaptability.

Conclusion: Applying learning rate decay to improve pre-training metrics may compromise downstream adaptability. WSO scheduling enhances model adaptability for downstream tasks, providing practical guidance for training and model release strategies.

Abstract: We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

[25] Social Simulacra in the Wild: AI Agent Communities on Moltbook

Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Main category: cs.CL

TL;DR: First large-scale empirical comparison of AI-agent vs human online communities shows AI communities have extreme participation inequality, high author overlap, emotionally flattened content, and distinct collective dynamics.

DetailsMotivation: As autonomous LLM-based agents increasingly populate social platforms, understanding AI-agent community dynamics is essential for communication research and platform governance, especially as AI-mediated communication reshapes online discourse.

Method: Large-scale empirical comparison analyzing 73,899 Moltbook (AI-agent) and 189,838 Reddit (human) posts across five matched communities, examining structural patterns, linguistic attributes, and community dynamics.

Result: AI-agent communities show extreme participation inequality (Gini = 0.84 vs. 0.47), high cross-community author overlap (33.8% vs. 0.5%), emotionally flattened content, cognitive shift toward assertion over exploration, social detachment, and author-level identifiability through outlier stylistic profiles.

Conclusion: AI-agent communities exhibit distinct collective communication dynamics from human communities, with homogenization primarily a structural artifact of shared authorship, providing empirical foundation for understanding multi-agent interaction effects on online discourse.

Abstract: As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.

[26] SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

Han Jang, Junhyeok Lee, Kyu Sung Choi

Main category: cs.CL

TL;DR: SciZoom is a large-scale benchmark of 44,946 ML papers from 2020-2025 with hierarchical summarization targets (Abstract, Contributions, TL;DR) and temporal stratification into Pre-LLM/Post-LLM eras, enabling multi-granularity summarization research and analysis of LLM impact on scientific writing.

DetailsMotivation: Addresses limitations in existing scientific summarization benchmarks (limited scale, single granularity, outdated) and the need to analyze how LLM adoption has transformed scientific writing since ChatGPT's release, with no existing resources for this analysis.

Method: Created SciZoom dataset with 44,946 papers from top ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020-2025, stratified into Pre-LLM (2020-2022) and Post-LLM (2023-2025) eras. Provides three hierarchical summarization targets with compression ratios up to 600:1. Conducted linguistic analysis to detect writing pattern shifts.

Result: Revealed striking shifts in phrase patterns (up to 10x increase in formulaic expressions) and rhetorical style (23% decline in hedging), suggesting LLM-assisted writing produces more confident yet homogenized prose. Dataset serves as benchmark for summarization and resource for analyzing scientific discourse evolution.

Conclusion: SciZoom bridges gaps in scientific summarization benchmarks and enables analysis of LLM impact on scientific writing, revealing significant linguistic shifts toward more confident but homogenized prose in the generative AI era.

Abstract: The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (https://github.com/janghana/SciZoom) and Hugging Face (https://huggingface.co/datasets/hanjang/SciZoom), respectively.

[27] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang

Main category: cs.CL

TL;DR: SI framework integrates knowledge graphs and behavioral logs to create knowledgeable and secure e-commerce search LLMs, addressing hallucination and security issues through synthesis, injection, and alignment techniques.

DetailsMotivation: LLMs have transformative potential for e-commerce search but face two critical challenges in industrial deployment: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance.

Method: Proposes SI framework with three components: (1) Synthesizes high-quality natural language corpus combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data; (2) Parameter-efficient pre-training via Depth Up-Scaling to inject domain knowledge while preserving general capabilities; (3) Dual-path alignment through multi-task instruction tuning and adversarial training for task performance and safety robustness.

Result: Deployed at JD.com with A/B tests across five core search scenarios demonstrating significant improvements in key business metrics, validating industrial effectiveness and scalability.

Conclusion: The SI framework successfully addresses knowledge hallucination and security vulnerabilities in e-commerce search LLMs, enabling practical industrial deployment with validated business impact.

Abstract: Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SI–a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data.We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China’s largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

[28] Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: PSII framework injects parametric social identity representations into LLM hidden states to improve diversity in opinion simulation, addressing the Diversity Collapse problem where LLMs produce overly homogeneous responses across demographic groups.

DetailsMotivation: Current LLM-based public opinion simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups, which the authors identify as a Diversity Collapse phenomenon in LLM hidden representations.

Method: Proposes Parametric Social Identity Injection (PSII), a framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs, enabling fine-grained and controllable identity modulation at the representation level rather than just prompt-based persona conditioning.

Result: Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity.

Conclusion: PSII provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation by addressing the Diversity Collapse problem in LLM-based opinion simulation.

Abstract: Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.

[29] Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Quy-Anh Dang, Chris Ngo

Main category: cs.CL

TL;DR: Polyglot-Lion is a compact multilingual ASR model family for Singapore’s languages (English, Mandarin, Tamil, Malay) that achieves competitive performance with much larger models at dramatically lower training cost.

DetailsMotivation: To develop efficient multilingual ASR models for Singapore's linguistic landscape that can compete with much larger models while being cost-effective for deployment.

Method: Fine-tune Qwen3-ASR-0.6B and 1.7B models on publicly available speech corpora using balanced sampling (equal utterances per language) without language-tag conditioning, forcing implicit language identification from audio.

Result: Polyglot-Lion-1.7B achieves average error rate of 14.85 on 12 benchmarks, competitive with MERaLiON-2-10B-ASR (14.32) while being 6x smaller, training cost $81 vs $18,862, and 20x faster inference (0.10 s/sample vs 2.02 s/sample).

Conclusion: Linguistically balanced fine-tuning of moderate-scale pretrained models can produce deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.

Abstract: We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.

[30] Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen

Main category: cs.CL

TL;DR: S2C is a novel jailbreak attack framework that manipulates semantic intent reconstruction during LLM inference through contextual reframing, content fragmentation, and clue-guided camouflage to bypass safety mechanisms.

DetailsMotivation: Modern LLMs have advanced safety mechanisms that detect malicious intent even in obfuscated inputs by analyzing latent semantic representations. Existing surface-level obfuscation attacks are becoming ineffective, requiring new approaches that manipulate how semantic intent is reconstructed during inference.

Method: Structured Semantic Cloaking (S2C) uses three complementary mechanisms: 1) Contextual Reframing - embedding requests in plausible high-stakes scenarios to bias models toward compliance; 2) Content Fragmentation - dispersing semantic signatures across disjoint prompt segments; 3) Clue-Guided Camouflage - disguising residual semantic cues while embedding recoverable markers for output generation.

Result: S2C improves Attack Success Rate by 12.4% on HarmBench and 9.7% on JBB-Behaviors over state-of-the-art methods. It achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors.

Conclusion: S2C effectively bypasses modern LLM safety mechanisms by delaying and restructuring semantic consolidation, demonstrating that current safety triggers can be degraded when malicious intent is not coherently reconstructed at decoding time.

Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

[31] SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

Hang Lv, Sheng Liang, Hao Wang, Yongyue Zhang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

Main category: cs.CL

TL;DR: SpecSteer: An asymmetric collaborative inference framework that combines on-device personalization with cloud-scale reasoning using Bayesian knowledge fusion and speculative decoding to achieve private, high-quality personalized generation.

DetailsMotivation: Personalized intelligence faces a privacy-capacity dilemma: centralized LLMs risk privacy by accessing user data, while on-device small models lack reasoning capacity for high-quality generation. Current local enhancements are insufficient to bridge this gap.

Method: SpecSteer uses Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol with a Draft-Verify-Recover pipeline: on-device model drafts personalized sequences; cloud validates via ratio-based mechanism that decouples reasoning verification from private context; steering recovery injects local intent during correction upon rejection.

Result: SpecSteer successfully closes the reasoning gap, achieves superior personalized generation performance, and delivers 2.36x speedup over standard baselines while maintaining privacy.

Conclusion: SpecSteer provides an effective asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning, solving the privacy-capacity dilemma in personalized intelligence systems.

Abstract: Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft–Verify–Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.

[32] More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Song Tae-Eun

Main category: cs.CL

TL;DR: Dynamic Cross-Context Review (D-CCR) extends single-pass review by allowing multi-turn Q&A exchanges between reviewer and author, but experimental results show this degrades verification performance compared to single-pass review.

DetailsMotivation: To improve LLM verification by extending Cross-Context Review (CCR) to multi-turn interactions where reviewers can ask follow-up questions and receive responses, creating a more dynamic review process.

Method: Controlled experiment with 30 artifacts and 150 injected errors, testing four D-CCR variants against single-pass CCR baseline. Variants included different multi-turn configurations with question-and-answer exchanges and independent re-review conditions.

Result: Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants. Multi-turn review increased recall (+0.08) but generated 62% more false positives, collapsing precision from 0.30 to 0.20. Two degradation mechanisms identified: false positive pressure and Review Target Drift.

Conclusion: Multi-turn review degrades verification performance compared to single-pass review. The problem is not information amount but that reviewing again invites noise through false positive pressure and target drift, where reviewers shift from reviewing artifacts to critiquing conversations.

Abstract: Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure – reviewers in later rounds fabricate findings when the artifact’s real errors have been exhausted, and (2) Review Target Drift – reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount – within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.

[33] Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

Martina Simonotti, Ludovica Pannitto, Eleonora Zucchini, Silvia Ballarè, Caterina Mauri

Main category: cs.CL

TL;DR: ASR-assisted transcription workflows can speed up corpus creation but don’t consistently improve accuracy, with effectiveness depending on workflow configuration, conversation type, and annotator experience.

DetailsMotivation: To analyze how Automatic Speech Recognition (ASR) can be integrated into transcription workflows for spoken language corpora, specifically examining whether ASR assistance improves transcription speed and accuracy compared to manual methods.

Method: Two-phase experiment with 11 expert and novice transcribers producing both manual and ASR-assisted transcriptions of identical audio segments across three conversation types. Analysis used statistical modeling, word-level alignment, and annotation-based metrics.

Result: ASR-assisted workflows increase transcription speed but do not consistently improve overall accuracy. Effectiveness depends on workflow configuration, conversation type, and annotator experience. The study provides a systematic framework for monitoring transcription behavior.

Conclusion: ASR-assisted transcription, potentially with task-specific fine-tuning, could be integrated into corpus creation workflows to accelerate the process without compromising quality, though careful consideration of workflow design is needed.

Abstract: This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.

[34] Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

Main category: cs.CL

TL;DR: AEG is an end-to-end framework for Spoken QA that uses attention mechanisms in SpeechLLMs to ground evidence in latent space, reducing hallucinations and improving efficiency over cascaded ASR systems.

DetailsMotivation: Spoken QA faces challenges with cross-modal alignment between acoustic queries and textual knowledge, and cascaded ASR systems suffer from latency and error propagation issues.

Method: Proposes Attention-guided Evidence Grounding (AEG) framework that leverages SpeechLLM’s cross-modal attention to locate evidence in latent space, with Learning to Focus on Evidence (LFE) fine-tuning to calibrate attention mechanisms.

Result: Outperforms large-scale cascaded baselines (Whisper-Large-v3 + Reranker) on SQuAD, HotpotQA, and MuSiQue datasets, reducing hallucinations and achieving ~62% lower inference latency.

Conclusion: AEG provides an effective end-to-end solution for Spoken QA that addresses cross-modal alignment challenges while improving efficiency and reducing error propagation.

Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model’s latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model’s attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

[35] PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics

Sam Kirkham

Main category: cs.CL

TL;DR: PyPhonPlan is a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations, with applications in speech production/perception modeling.

DetailsMotivation: The paper aims to provide a computational toolkit for speech communication research that enables modeling of interactive speech dynamics using neurally-grounded, temporally-principled representations. There's a need for open-source tools that support reproducibility and cumulative development in phonetic planning research.

Method: Developed PyPhonPlan toolkit with modular components for defining planning, perception and memory fields, between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. The framework uses coupled dynamic neural fields and task dynamic simulations.

Result: Created an open-source Python toolkit with executable examples that demonstrates capabilities through simulating production/perception loops with coupled memory fields. The toolkit successfully models interactive speech dynamics with phonetically-rich representations.

Conclusion: PyPhonPlan provides a valuable computational resource for speech communication research, enabling modeling of phonetic planning with neural grounding and temporal principles while promoting reproducibility and extensibility in the field.

Abstract: We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. We illustrate the toolkit’s capabilities through an example application:~simulating production/perception loops with a coupled memory field, which demonstrates the framework’s ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.

[36] Omnilingual MT: Machine Translation for 1,600 Languages

Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà

Main category: cs.CL

TL;DR: OMT is a machine translation system supporting over 1,600 languages using LLM specialization approaches, outperforming larger baselines and enabling coherent generation for many undersupported languages.

DetailsMotivation: Current MT systems cover only about 200 languages on target side, leaving thousands of languages unsupported. There's a need for more comprehensive multilingual translation systems and better evaluation benchmarks.

Method: Two LLM specialization approaches: decoder-only (OMT-LLaMA) and encoder-decoder architecture (OMT-NLLB). Uses comprehensive data strategy integrating public multilingual corpora with newly created datasets like MeDLEY bitext.

Result: 1B to 8B parameter OMT models match/exceed 70B LLM baseline performance. OMT-LLaMA substantially expands languages with coherent generation. Models show improved cross-lingual transfer and understanding capabilities for 1,600 languages.

Conclusion: OMT demonstrates effective LLM specialization for massively multilingual MT, enabling high-quality translation for over 1,600 languages with efficient compute requirements.

Abstract: High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world’s 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundreds more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the “understanding” part of the puzzle in MT for the 1,600 evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

[37] PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Hanif Rahman

Main category: cs.CL

TL;DR: PashtoCorp: A 1.25B-word corpus for underrepresented Pashto language, assembled from multiple sources with reproducible pipeline, enabling improved NLP performance through continued pretraining.

DetailsMotivation: Pashto is spoken by 60 million people but severely underrepresented in NLP research, with existing resources being extremely limited compared to other languages.

Method: Corpus assembled from 39 sources (7 HuggingFace datasets + 32 custom web scrapers), processed through reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering.

Result: PashtoCorp is 40x larger than OSCAR Pashto subset; continued pretraining reduces perplexity by 25.1%, improves NER F1 by 10% relative, reduces training variance 7x, and achieves 64.6% accuracy on reading comprehension benchmark.

Conclusion: PashtoCorp significantly advances Pashto NLP capabilities, demonstrating the importance of high-quality corpora for underrepresented languages, with Wikipedia being particularly critical for NER performance.

Abstract: We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.

[38] Fanar 2.0: Arabic Generative AI Stack

FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus’ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang

Main category: cs.CL

TL;DR: Fanar 2.0 is Qatar’s sovereign Arabic-centric multimodal AI platform featuring a 27B LLM, speech recognition, vision models, and specialized Arabic capabilities developed with resource constraints.

DetailsMotivation: To create a sovereign Arabic-centric generative AI platform that addresses the scarcity of Arabic web data (only ~0.5% despite 400M speakers) while maintaining cultural alignment and safety, demonstrating that resource-constrained development can produce competitive systems.

Method: Adopts data quality over quantity strategy with continual pre-training from Gemma-3-27B backbone on 120B curated tokens, model merging, and introduces multimodal components: FanarGuard (4B bilingual moderation), Aura (long-form ASR), Oryx (Arabic-aware vision/video), tool-calling framework, multi-agent Islamic content system, poetry generation, translation, and multi-layer orchestrator.

Result: Fanar-27B achieved substantial benchmark improvements despite 8x fewer pre-training tokens than Fanar 1.0: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). The platform delivers comprehensive multimodal capabilities with state-of-the-art components.

Conclusion: Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce competitive multimodal systems with specialized Arabic capabilities, cultural alignment, and comprehensive safety measures.

Abstract: We present Fanar 2.0, the second generation of Qatar’s Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

[39] Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson, Steinunn Rut Friðriksdóttir, Bjarki Ármannsson, Iris Edda Nowenstein, Steinþór Steingrímsson

Main category: cs.CL

TL;DR: Paper critiques LLM benchmarking for Icelandic, showing synthetic/machine-translated benchmarks contain flawed examples that skew results, calling for better evaluation methods in low-resource languages.

DetailsMotivation: To identify problems in current LLM benchmarking for Icelandic and low/medium-resource languages, particularly issues with synthetic or machine-translated data that haven't been verified, which undermines benchmark validity.

Method: Evaluates existing LLM benchmarks for Icelandic, conducts quantitative error analysis comparing human-authored/translated benchmarks vs. synthetic/machine-translated benchmarks to demonstrate quality differences.

Result: Shows clear differences between human-authored/-translated benchmarks and synthetic/machine-translated benchmarks, with the latter containing severely flawed test examples that likely skew evaluation results.

Conclusion: Warns against using synthetic/machine-translated benchmarks without verification in low/medium-resource settings, calls for improved evaluation methods that account for translation quality limitations.

Abstract: This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way, commonly contain severely flawed test examples that are likely to skew the results and undermine the tests’ validity. We warn against the use of such methods without verification in low/medium-resource settings as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks vs. synthetic or machine-translated benchmarks.

[40] PlotTwist: A Creative Plot Generation Framework with Small Language Models

Abhinav Thorat, Ravi Kolla, Jyotin Goel, Niranjan Pedanekar

Main category: cs.CL

TL;DR: PlotTwist enables small language models (≤5B params) to generate high-quality creative plots competitive with frontier models 200× larger through structured preference alignment.

DetailsMotivation: Creative plot generation requires transforming premises into coherent narratives with global structure, character development, and emotional resonance. While LLMs need preference alignment for specialized domains, frontier model alignment is computationally prohibitive, limiting accessibility.

Method: Three-component framework: (1) Aspect Rating Reward Model trained via Positive-Negative prompting for five Narrative Quality Dimensions; (2) Mixture-of-Experts plot generator aligned via Direct Preference Optimization; (3) Agentic Evaluation module for unbiased post-hoc assessment.

Result: PlotTwist consistently outperforms frontier models across multiple narrative quality dimensions despite tighter capacity constraints. Shows strong sensitivity to narrative quality, reliably distinguishing plots from acclaimed vs. panned screenplays.

Conclusion: Structured, preference-based alignment enables resource-efficient high-quality creative plot generation with small language models, making specialized narrative generation more accessible.

Abstract: Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with $\leq$ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to $200\times$ larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

[41] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Zhenghua Bao, Yi Shi

Main category: cs.CL

TL;DR: IndexRAG shifts cross-document reasoning from online inference to offline indexing by identifying bridge entities and generating bridging facts as retrievable units, improving multi-hop QA performance with single-pass retrieval.

DetailsMotivation: Existing RAG approaches for multi-hop QA require either graph-based methods with additional online processing or iterative multi-step reasoning, which can be computationally expensive and slow at inference time.

Method: IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units during offline indexing, requiring no additional training or fine-tuning. At inference, it uses only single-pass retrieval and a single LLM call.

Result: On three multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), IndexRAG improves F1 over Naive RAG by 4.6 points on average. When combined with IRCoT, it outperforms all graph-based baselines including HippoRAG and FastGraphRAG while relying solely on flat retrieval.

Conclusion: IndexRAG demonstrates that shifting cross-document reasoning to offline indexing can achieve strong multi-hop QA performance with efficient single-pass retrieval, offering a practical alternative to graph-based or iterative reasoning methods.

Abstract: Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.

[42] EngGPT2: Sovereign, Efficient and Open Intelligence

G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo

Main category: cs.CL

TL;DR: EngGPT2-16B-A3B is an efficient Italian-focused Mixture-of-Experts LLM trained on 2.5T tokens with strong performance comparable to 8B-16B dense models while using significantly less inference power and training data.

DetailsMotivation: To create a sovereign, efficient, and open European LLM that combines performance with resource efficiency while being fully aligned with the EU AI Act, with particular focus on Italian language capabilities.

Method: Trained-from-scratch Mixture-of-Experts architecture with 16B total parameters (3B active per inference), trained on 2.5 trillion tokens including 25% Italian-language data, featuring multiple reasoning modes.

Result: Delivers performance comparable to dense 8B-16B models on benchmarks (MMLU-Pro, GSM8K, IFEval, HumanEval) while requiring 1/5 to 1/2 inference power and 1/10 to 1/6 training data/power.

Conclusion: EngGPT2 sets a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts, positioning itself as a key contributor to open-weight European models.

Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM and it’s built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3’s 36T or Llama3’s 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth to one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.

[43] VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin

Main category: cs.CL

TL;DR: VQKV introduces vector quantization for training-free KV cache compression in LLMs, achieving high compression ratios while maintaining model performance.

DetailsMotivation: The growing context length of LLMs creates large KV caches that limit deployment in resource-limited environments. Existing training-free compression methods (low-rank approximation or scalar quantization) fail to achieve both high compression ratios and high reconstruction fidelity simultaneously.

Method: Proposes VQKV, a novel training-free method using vector quantization to obtain highly compressed KV representations. It represents thousands of floating-point values with just a few integer indices, preserving model fidelity while achieving significant compression.

Result: Achieves 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of baseline performance on LongBench. Enables 4.3x longer generation length on the same memory footprint.

Conclusion: VQKV provides an effective training-free solution for KV cache compression that balances high compression ratios with model performance preservation, enabling longer context handling in resource-constrained environments.

Abstract: The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.

[44] DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

Yanyu Qian, Yue Tan, Yixin Liu, Wang Yu, Shirui Pan

Main category: cs.CL

TL;DR: DynHD is a novel hallucination detection method for Diffusion LLMs that addresses spatial token imbalance and temporal denoising dynamics to improve detection accuracy and efficiency.

DetailsMotivation: While token-level uncertainty metrics help detect hallucinations in D-LLMs, current approaches fail to account for uneven token contributions to hallucination detection and ignore valuable temporal evolution patterns in uncertainty throughout the diffusion process.

Method: Proposes DynHD with two key components: 1) semantic-aware evidence construction to filter non-informative tokens and emphasize meaningful ones, addressing spatial imbalance; 2) reference evidence generator and deviation-based detector that model expected uncertainty evolution trajectories and measure discrepancies for hallucination prediction.

Result: Extensive experiments show DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.

Conclusion: DynHD effectively addresses both spatial and temporal aspects of hallucination detection in D-LLMs, providing a more comprehensive and efficient solution that leverages the unique characteristics of diffusion-based generation.

Abstract: Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD that bridge these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.

[45] On the Emotion Understanding of Synthesized Speech

Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao

Main category: cs.CL

TL;DR: SER models fail to generalize to synthesized speech due to representation mismatch and SLMs rely on text semantics over paralinguistic cues.

DetailsMotivation: To critically examine the assumption that emotion understanding models learn fundamental representations that transfer to synthesized speech, which is important for using SER as evaluation metric for speech synthesis.

Method: Systematic evaluation of Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative/generative SER models, and diverse synthesis models.

Result: Current SER models cannot generalize to synthesized speech due to representation mismatch from speech token prediction during synthesis. Generative SLMs infer emotion from textual semantics while ignoring paralinguistic cues.

Conclusion: Existing SER models exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

Abstract: Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models can not generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

[46] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao

Main category: cs.CL

TL;DR: AdaMem is an adaptive user-centric memory framework for LLM agents that organizes dialogue history into multiple memory types and uses question-conditioned retrieval routes to improve long-horizon reasoning and user modeling.

DetailsMotivation: Existing memory systems for LLM agents face three core challenges: over-reliance on semantic similarity (missing user-centric evidence), storing experiences as isolated fragments (weakening temporal/causal coherence), and using static memory granularities that don't adapt to different question requirements.

Method: AdaMem organizes dialogue history into four memory types: working (recent context), episodic (structured long-term experiences), persona (stable user traits), and graph (relation-aware connections). At inference, it resolves target participants, builds question-conditioned retrieval routes combining semantic retrieval with relation-aware graph expansion when needed, and uses role-specialized pipeline for evidence synthesis and response generation.

Result: AdaMem achieves state-of-the-art performance on both LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling.

Conclusion: AdaMem provides an effective adaptive memory framework that addresses key limitations of existing systems for long-horizon dialogue agents, improving both reasoning and user modeling capabilities.

Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

[47] How often do Answers Change? Estimating Recency Requirements in Question Answering

Bhawna Piryani, Zehra Mert, Adam Jatowt

Main category: cs.CL

TL;DR: RecencyQA: A dataset and taxonomy for evaluating LLMs on time-sensitive questions with recency-stationarity categorization

DetailsMotivation: LLMs struggle with time-sensitive questions, giving confident but outdated answers. Existing benchmarks don't capture how often answers change or whether questions inherently need up-to-date information.

Method: Introduce recency-stationarity taxonomy categorizing questions by answer change frequency and time-invariance. Create RecencyQA dataset of 4,031 open-domain questions with recency/stationarity labels.

Result: Non-stationary questions (context-dependent recency) are significantly more challenging for LLMs, with difficulty increasing as update frequency rises.

Conclusion: RecencyQA enables fine-grained temporal reasoning benchmarking beyond binary freshness notions, providing foundation for recency-aware, context-sensitive QA systems.

Abstract: Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

[48] DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Lei Wang, Min Huang, Eduard Dragut

Main category: cs.CL

TL;DR: DanceHA is a multi-agent framework for document-level Aspect-Based Sentiment Intensity Analysis (ABSIA) that uses divide-and-conquer strategies and human-AI collaboration to handle informal writing styles and extract fine-grained ACOSI tuples.

DetailsMotivation: Current ABSIA research focuses on domain-specific, sentence-level settings, leaving document-level ABSIA underexplored, especially for complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples from informal writing styles.

Method: DanceHA framework with two components: 1) Dance - divide-and-conquer strategy decomposing long-context ABSIA into smaller sub-tasks for specialized agent collaboration, and 2) HA - Human-AI collaboration for annotation. Created Inf-ABSIA dataset with multi-domain document-level ABSIA data.

Result: Extensive experiments show effectiveness of the agentic framework, successful knowledge transfer from multi-agent system to student models, and highlight the importance of informal styles in ABSIA as they often intensify aspect-specific opinions.

Conclusion: DanceHA addresses the gap in document-level ABSIA, particularly for informal writing styles, and demonstrates that multi-agent frameworks can effectively handle complex sentiment analysis tasks while enabling knowledge transfer to smaller models.

Abstract: Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA–particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples–remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

[49] EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao

Main category: cs.CL

TL;DR: EmoLLM is a framework that integrates emotional intelligence (EQ) with cognitive intelligence (IQ) in large language models using appraisal theory and structured reasoning graphs to generate emotionally appropriate and factually reliable responses in dialogue settings.

DetailsMotivation: While LLMs demonstrate strong cognitive intelligence (IQ), real-world interactions require emotional intelligence (EQ) for responses that are both factually reliable and emotionally appropriate, especially in settings like emotional support, technical assistance, and consultation where effective dialogue depends on situational appraisal of user needs, goals, and coping capacity.

Method: Proposes EmoLLM framework using Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating replies. Trained with reinforcement learning in multi-turn role-play environment using reverse-perspective reasoning for reward signals based on predicted user-side consequences.

Result: Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

Conclusion: The appraisal-grounded framework enables effective IQ/EQ co-reasoning in dialogue systems, demonstrating that structured reasoning about emotional dimensions can enhance response appropriateness without compromising factual accuracy.

Abstract: Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user’s needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

[50] Characterizing Delusional Spirals through Human-LLM Chat Logs

Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, Stevie Chancellor, Eric Lin, Nick Haber, Desmond C. Ong

Main category: cs.CL

TL;DR: Analysis of real-world chatbot conversations where users experienced psychological harm, including delusions and suicidal thoughts, with quantitative coding of harmful patterns and recommendations for mitigation.

DetailsMotivation: To understand the real psychological harms caused by LLM chatbots through analysis of actual user-chatbot conversations, moving beyond speculation to evidence-based study of harmful interactions.

Method: Analyzed 391,562 messages from 19 users who reported psychological harm from chatbot use, developed an inventory of 28 codes to categorize harmful patterns, and examined co-occurrence of message codes across conversations.

Result: Found significant harmful patterns: 15.5% of user messages showed delusional thinking, 69 validated user messages expressed suicidal thoughts, 21.2% of chatbot messages misrepresented themselves as sentient. Romantic interest and sentience claims increased in longer conversations.

Conclusion: Provides concrete recommendations for policymakers, developers, and users to mitigate harm, with tools for analyzing harmful conversation patterns and identifying where safeguards degrade in multi-turn interactions.

Abstract: As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and AI psychosis,'' have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional spirals,’’ limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the $391,562$ messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.

[51] Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

Titus von der Malsburg, Sebastian Padó

Main category: cs.CL

TL;DR: Transformers fail to fully replicate human sentence processing patterns, showing mixed results on agreement attraction tasks with performance varying by syntactic configuration.

DetailsMotivation: To evaluate the cognitive adequacy of transformer models as models of human sentence processing, specifically testing whether they can replicate human patterns in agreement attraction phenomena.

Method: Used surprisal-based linking mechanism to systematically evaluate 11 autoregressive transformers of varying sizes and architectures on comprehensive English agreement attraction configurations, comparing model predictions to human reading time data.

Result: Mixed results: transformers align with human data for prepositional phrase configurations but perform poorly on object-extracted relative clauses, with predictions diverging across models and none replicating human asymmetric interference patterns.

Conclusion: Current transformer models do not explain human morphosyntactic processing, and evaluations must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations.

Abstract: Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.

[52] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu

Main category: cs.CL

TL;DR: BATQuant is a novel quantization method for MXFP4 formats that uses block-wise affine transformations and Kronecker decomposition to prevent cross-block outlier propagation and optimize distribution shaping for MLLMs and LLMs.

DetailsMotivation: Existing post-training quantization methods, especially rotation-based techniques designed for integer formats, suffer severe performance collapse when applied to MXFP4 formats for MLLMs and LLMs due to format mismatch issues like cross-block outlier propagation and bimodal activation distributions.

Method: Proposes BATQuant with block-wise affine transformations restricted to MXFP granularity to prevent cross-block outlier propagation, Global and Private Kronecker decomposition for parameter efficiency, and block-wise learnable clipping to suppress residual outliers.

Result: Establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Conclusion: BATQuant effectively addresses the fundamental format mismatch issues in MXFP4 quantization for MLLMs and LLMs, providing an efficient solution for deploying these models on modern accelerator architectures.

Abstract: Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduces storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

[53] Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

Mo El-Haj

Main category: cs.CL

TL;DR: Tarab Corpus is a large-scale Arabic creative text dataset containing 2.56M verses of songs and poetry spanning 14 centuries, covering multiple Arabic varieties with rich metadata for linguistic and cultural analysis.

DetailsMotivation: To create a comprehensive, open-source Arabic corpus that bridges the gap between song lyrics and poetry, enabling comparative analysis across genres, linguistic varieties, and historical periods in Arabic creative expression.

Method: Collected and processed 2.56 million verses (13.5M tokens) from Arabic songs and poetry, covering Classical Arabic, Modern Standard Arabic, and six major regional varieties. Implemented data collection, normalization, and validation pipeline with structured metadata for linguistic variety, geographic origin, and historical context.

Result: Created the largest open Arabic corpus of creative text, balanced between songs and poems, spanning 14 centuries from Pre-Islamic to 21st century. Includes baseline analyses for variety identification and genre differentiation. Dataset is publicly available on HuggingFace.

Conclusion: Tarab Corpus provides a valuable resource for computational linguistics, cultural studies, and Arabic NLP research, enabling comparative analysis of Arabic creative expression across time, geography, and genres.

Abstract: We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.

[54] Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

Main category: cs.CL

TL;DR: OmniSONAR is an omnilingual, cross-modal sentence embedding model that embeds text, speech, code, and math in a single semantic space across thousands of languages, achieving state-of-the-art performance on cross-lingual and cross-modal tasks.

DetailsMotivation: Existing cross-lingual sentence encoders cover only a few hundred languages, often trade downstream quality for alignment, and lack cross-modal capabilities, limiting their adoption and utility.

Method: Progressive training: 1) Learn foundational space for 200 languages with LLM-initialized encoder-decoder using token-level decoding, split-softmax contrastive loss, and synthetic hard negatives; 2) Expand to thousands of languages via two-stage teacher-student encoder distillation; 3) Map 177 spoken languages into the space for cross-modal extension.

Result: Halves cross-lingual similarity search error on FLORES (200 languages), reduces error by 15x on BIBLE (1,560 languages), outperforms NLLB-3B on translation, achieves 43% lower speech similarity-search error, and reaches 97% of SeamlessM4T speech-to-text quality zero-shot.

Conclusion: OmniSONAR enables high-quality, scalable cross-lingual and cross-modal semantic understanding across thousands of languages and modalities, with strong downstream performance and extensibility to complex tasks via embedding-based language models.

Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

[55] Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

Ryo Kishino, Riku Shiomi, Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: Proposes method to align base model with target model via domain mixture weighting in pretraining, using log-likelihood space alignment rather than direct distillation.

DetailsMotivation: Addresses model alignment problem by designing domain mixture weights for pretraining data to align base model with target model in distribution, providing alternative to knowledge distillation.

Method: Views models as points in log-likelihood space and aligns training update direction with direction toward target model by determining optimal domain weights for training data mixture.

Result: Experiments with NanoGPT show consistent reduction in KL divergence to target model compared to uniform weighting over the Pile; downstream task performance tends toward target model’s performance.

Conclusion: Proposed method effectively aligns models via domain weighting, though knowledge distillation remains more effective when available; provides meaningful alignment alternative.

Abstract: Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.

[56] Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

Zhaoxin Feng, Zheng Chen, Jianfei Ma, Yip Tin Po, Emmanuele Chersoni, Bo Li

Main category: cs.CL

TL;DR: Chain-of-Thought reasoning both reduces and masks sycophancy in LLMs, with models showing more sycophancy in subjective tasks and under authority bias.

DetailsMotivation: To investigate whether Chain-of-Thought reasoning mitigates or masks sycophancy in LLMs, particularly in objective vs. subjective tasks and under authority bias.

Method: Evaluated various LLMs across objective and subjective tasks, analyzed reasoning processes, and conducted mechanistic analysis on three open-source models to track sycophancy dynamics.

Result: Reasoning generally reduces sycophancy in final decisions but masks it in some samples through deceptive justifications. LLMs show more sycophancy in subjective tasks and under authority bias, with sycophancy being dynamic during reasoning rather than predetermined.

Conclusion: CoT reasoning has dual effects on sycophancy - both mitigating and masking it, with sycophancy being a dynamic process influenced by task type and authority bias.

Abstract: Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies studied this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.

[57] Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

Main category: cs.CL

TL;DR: Omanic is a multi-hop QA dataset with decomposed reasoning steps for analyzing LLM reasoning processes, containing synthetic training data and human-annotated evaluation data.

DetailsMotivation: Existing LLM evaluation focuses on final answers without exposing intermediate reasoning steps, making it hard to diagnose reasoning failures, and current multi-hop QA benchmarks lack step-level annotations.

Method: Created Omanic dataset with 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), providing decomposed sub-questions and intermediate answers for stepwise reasoning analysis.

Result: State-of-the-art LLMs achieve only 73.11% accuracy on OmanicBench, showing high difficulty. Stepwise analysis reveals CoT performance depends on factual completeness, with gains diminishing under knowledge gaps. Fine-tuning on OmanicSynth brings 7.41 average point gains across six reasoning benchmarks.

Conclusion: Omanic provides valuable resources for analyzing LLM reasoning processes and improving reasoning capabilities through stepwise evaluation and synthetic training data.

Abstract: Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT’s performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset’s quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

Aishwarya Ramasethu, Niyathi Allu, Rohin Garg, Harshwardhan Fartale, Dun Li Chan

Main category: cs.CL

TL;DR: LLMs show limited effectiveness in extremely low-resource machine translation; pivot languages and few-shot demonstrations provide modest improvements in specific configurations but gains are inconsistent and sensitive to example construction.

DetailsMotivation: To address LLMs' limitations in extremely low-resource machine translation where standard adaptation techniques requiring large-scale parallel data or extensive fine-tuning are infeasible for underrepresented languages.

Method: Investigates data-efficient setup combining linguistically related pivot languages with few-shot in-context examples without parameter updates, evaluating translation behavior under controlled conditions.

Result: Pivot-based prompting yields modest improvements in certain configurations, especially when target language is less represented in model’s vocabulary, but gains are inconsistent and sensitive to few-shot example construction.

Conclusion: Provides empirical guidance on when inference-time prompting and pivot-based examples can serve as lightweight alternatives to fine-tuning in low-resource translation settings.

Abstract: Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model’s vocabulary, the gains are often modest and sensitive to few shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

[59] Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Mohamed Adel, Bashar Alhafni, Nizar Habash

Main category: cs.CL

TL;DR: LLMs evaluated on Arabic morphosyntactic tagging and dependency parsing, showing prompt design and retrieval-based ICL significantly impact performance, with proprietary models approaching supervised baselines.

DetailsMotivation: To assess LLMs' ability to produce explicit linguistic structure, particularly for challenging languages like Arabic with rich morphology and orthographic ambiguity that create strong morphology-syntax interactions.

Method: Evaluated instruction-tuned LLMs on two structured prediction tasks: morphosyntactic tagging and labeled dependency parsing for Standard Arabic. Compared zero-shot prompting with retrieval-based in-context learning using examples from Arabic treebanks.

Result: Prompt design and demonstration selection strongly affect performance. Proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization.

Conclusion: LLMs show promising capabilities for Arabic linguistic structure prediction, but performance depends heavily on prompt engineering and demonstration selection. The analysis reveals which aspects of Arabic morphosyntax and syntax LLMs capture reliably versus which remain difficult.

Abstract: Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.

[60] Probing Cultural Signals in Large Language Models through Author Profiling

Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes

Main category: cs.CL

TL;DR: LLMs show cultural biases in zero-shot author profiling from song lyrics, with systematic alignment toward North American ethnicity and varying bias levels across models.

DetailsMotivation: As LLMs are increasingly deployed in applications with societal impact, there's growing concern about the cultural biases they encode. The paper aims to probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting.

Method: Evaluated several open-source LLMs on more than 10,000 song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. Introduced two fairness metrics: Modality Accuracy Divergence (MAD) and Recall Divergence (RD) to quantify disparities.

Result: LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. Ministral-8B displays the strongest ethnicity bias, whereas Gemma-12B shows the most balanced behavior.

Conclusion: LLMs encode cultural biases that manifest in zero-shot author profiling tasks, with systematic alignment patterns that vary across models. The findings highlight the need for better cultural representation in LLM training and evaluation.

Abstract: Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models’ prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (https://github.com/ValentinLafargue/CulturalProbingLLM).

[61] TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: A new benchmark and data pipeline for evaluating and training multi-turn conversational abilities in language models, showing that multi-turn training data significantly improves performance.

DetailsMotivation: Current open training and evaluation data focus on single-turn settings, failing to capture the complexities of multi-turn conversations which are a common and critical mode of language model interaction.

Method: Introduced TurnWiseEval benchmark for multi-turn capabilities with pairwise comparison to single-turn settings, and TurnWiseData pipeline for scalable generation of synthetic multi-turn training data.

Result: Experiments with Olmo 3 show training with multi-turn data is vital for strong multi-turn performance, with just 10k multi-turn conversations improving TurnWiseEval by 12%.

Conclusion: Multi-turn conversational ability requires specific training data, and the introduced benchmark and data pipeline address the gap between single-turn and multi-turn evaluation and training.

Abstract: Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

[62] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo

Main category: cs.CL

TL;DR: SpokenTOD: Large-scale spoken task-oriented dialogue dataset with 52k dialogues and 1k+ hours of speech, augmented with spoken user behaviors. SpokenUS: Spoken user simulator with barge-in architecture that outperforms baselines in human evaluation.

DetailsMotivation: Robust spoken dialogue agents need exposure to diverse speech interactions, but existing datasets are limited in scale and domain coverage, lacking systematic augmentation pipelines for spoken user behaviors.

Method: Created SpokenTOD dataset with 52,390 dialogues across diverse speakers/domains, augmented with four spoken behaviors: cross-turn slots, barge-in, disfluency, and emotional prosody. Built SpokenUS simulator with dedicated barge-in architecture grounded in TOD.

Result: SpokenUS achieves comparable goal coverage to larger models while substantially outperforming baselines in Human MOS. It discloses slot values gradually like humans rather than front-loading them. Analysis confirms its behaviors pose meaningful challenges to downstream agents.

Conclusion: SpokenTOD and SpokenUS provide practical tools for training and evaluating more robust spoken dialogue systems by addressing the need for large-scale, behaviorally-rich spoken TOD data and effective user simulation.

Abstract: Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce \textbf{SpokenTOD}, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors – cross-turn slots, barge-in, disfluency, and emotional prosody – across diverse speakers and domains. Building on SpokenTOD, we present \textbf{SpokenUS}, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS’s spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

[63] Mediocrity is the key for LLM as a Judge Anchor Selection

Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend

Main category: cs.CL

TL;DR: Systematic investigation reveals anchor selection significantly impacts reliability of LLM-as-a-judge evaluations, with poor anchors reducing correlation with human rankings; provides guidelines for anchor selection and benchmark sizing.

DetailsMotivation: The LLM-as-a-judge paradigm uses pairwise comparisons but faces quadratic scaling costs, leading to anchor-based evaluation methods. However, the impact of anchor selection on result reliability remains unexplored despite widespread use in benchmarks like Arena-Hard and AlpacaEval.

Method: Systematically evaluated 22 different anchors on the Arena-Hard-v2.0 dataset, analyzing how anchor choice affects correlation with human rankings. Quantified effect size of anchor selection and conducted power analysis to determine sufficient benchmark sizes for reliable evaluation.

Result: Anchor selection is critical - poor anchors dramatically reduce correlation with human rankings. Common choices (best/worst-performing models) make poor anchors as they’re consistently better/worse than all others, providing little information about relative rankings. Anchor selection effect size is comparable to judge model selection. Standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish competitive models reliably.

Conclusion: Provides actionable recommendations: 1) Power analysis showing required benchmark sizes for anchor-based evaluation, 2) Guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices in LLM-as-a-judge paradigm.

Abstract: The ``LLM-as-a-judge’’ paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

[64] Online Experiential Learning for Language Models

Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei

Main category: cs.CL

TL;DR: OEL enables LLMs to continuously improve from real-world deployment experience through experiential knowledge extraction and on-policy context distillation.

DetailsMotivation: Current LLM improvement relies on offline training with human annotations or simulations, ignoring valuable real-world deployment experience that could enable continuous learning.

Method: Two-stage framework: 1) Extract transferable experiential knowledge from user-side interaction trajectories, 2) Consolidate knowledge via on-policy context distillation without accessing user-side environment. Iterate to form online learning loop.

Result: OEL achieves consistent improvements over iterations in text-based games across model scales, enhancing task accuracy and token efficiency while preserving out-of-distribution performance. Experiential knowledge is more effective than raw trajectories.

Conclusion: OEL enables continuous LLM improvement from deployment experience, with experiential knowledge extraction and on-policy consistency being critical for effective learning.

Abstract: The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

[65] Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

Main category: cs.CL

TL;DR: Chronos is a temporal-aware memory framework for conversational AI that structures dialogue into event tuples with datetime ranges and entity aliases, enabling effective reasoning over long-term, evolving conversations through dual calendar indexing and dynamic retrieval guidance.

DetailsMotivation: Existing memory systems for conversational AI struggle with reasoning over temporally grounded facts and preferences that evolve across months of interaction, and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories.

Method: Decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. Uses dynamic prompting to generate tailored retrieval guidance for each question, directing retrieval across time ranges and multi-hop reasoning through iterative tool-calling over both calendars.

Result: Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy on LongMemEvalS benchmark (500 questions spanning six categories), setting new SOTA with 7.67% improvement over best prior system. Ablation shows events calendar accounts for 58.9% gain on baseline.

Conclusion: Chronos effectively addresses temporal reasoning challenges in long-term conversational AI by structuring dialogue into temporally-grounded events and providing intelligent retrieval mechanisms, significantly outperforming existing approaches.

Abstract: Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.

[66] Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of T2-T3 and T3-T3 tone sandhi

Yuxin Lu, Yu-Ying Chuang, R. Harald Baayen

Main category: cs.CL

TL;DR: Tone 3 sandhi in spontaneous Taiwan Mandarin shows complete assimilation to Tone 2 when word-level effects are accounted for, unlike previous findings of incomplete sandhi in controlled speech.

DetailsMotivation: Previous studies on Mandarin Tone 3 sandhi have focused on laboratory speech and formal registers, showing incomplete assimilation. This study investigates how Tone 3 sandhi operates in spontaneous Taiwan Mandarin conversations and examines contextual factors affecting tonal realization.

Method: Analyzed pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations using Generative Additive Mixed Models (GAMM) to examine F0 contours as a function of normalized time, considering factors like gender, duration, word position, bigram probability, neighboring tones, speaker, and novel predictors (word and word sense).

Result: In spontaneous Taiwan Mandarin, T3-T3 words become indistinguishable from T2-T3 words once word-level effects are accounted for, indicating complete sandhi rather than the incomplete assimilation previously observed in controlled speech.

Conclusion: Tone 3 sandhi in spontaneous Taiwan Mandarin shows complete assimilation to Tone 2 when considering word-level contextual factors, contrasting with previous findings from controlled speech environments.

Abstract: In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3. Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied using carefully controlled laboratory speech (Xu, 1997) and more formal registers of Beijing Mandarin (Yuan & Y. Chen, 2014), less is known about its realization in spontaneous speech, and about the effect of contextual factors on tonal realization. The present study investigates the pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations. Our analysis makes use of the Generative Additive Mixed Model (GAMM, Wood, 2017) to examine fundamental frequency (F0) contours as a function of normalized time. We consider various factors known to influence pitch contours, including gender, duration, word position, bigram probability, neighboring tones, speaker, and also novel predictors, word and word sense (Chuang et al., 2025). Our analyses revealed that in spontaneous Taiwan Mandarin, T3-T3 words become indistinguishable from T2-T3 words, indicating complete sandhi, once the strong effect of word (or word sense) is taken into account.

[67] LLMs as Repositories of Factual Knowledge: Limitations and Solutions

Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi

Main category: cs.CL

TL;DR: The paper studies LLMs as repositories of factual knowledge, evaluating their reliability on time-sensitive factual questions and proposing an entity-aware fine-tuning method to improve consistency and accuracy.

DetailsMotivation: LLMs are trained on data snapshots containing factual information about entities collected at different times and from different sources, leading to potential inconsistencies and inaccuracies due to temporal changes and source variations. The paper aims to assess whether LLMs are appropriate as repositories of factual knowledge.

Method: The authors evaluate 24 state-of-the-art LLMs (closed, partially open, and fully open-source) on time-sensitive factual questions, measuring accuracy and consistency when prompts are perturbed. They also test existing methods to improve LLMs’ performance and propose ENtity-Aware Fine-tuning (ENAF), a soft neurosymbolic approach that provides structured entity representations during fine-tuning.

Result: The evaluation reveals issues with LLMs’ reliability as factual knowledge repositories, showing problems with accuracy and consistency on time-sensitive questions. The proposed ENAF method demonstrates effectiveness in reducing inconsistencies and improving response stability under prompt variations.

Conclusion: LLMs have limitations as repositories of factual knowledge due to temporal inconsistencies and source variations. The proposed entity-aware fine-tuning approach offers a promising direction for improving LLMs’ reliability in factual knowledge tasks.

Abstract: LLMs’ sources of knowledge are data snapshots containing factual information about entities collected at different timestamps and from different media types (e.g. wikis, social media, etc.). Such unstructured knowledge is subject to change due to updates through time from past to present. Equally important are the inconsistencies and inaccuracies occurring in different information sources. Consequently, the model’s knowledge about an entity may be perturbed while training over the sequence of snapshots or at inference time, resulting in inconsistent and inaccurate model performance. In this work, we study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge. We consider twenty-four state-of-the-art LLMs that are either closed-, partially (weights), or fully (weight and training data) open-source. We evaluate their reliability in responding to time-sensitive factual questions in terms of accuracy and consistency when prompts are perturbed. We further evaluate the effectiveness of state-of-the-art methods to improve LLMs’ accuracy and consistency. We then propose ENtity-Aware Fine-tuning (ENAF), a soft neurosymbolic approach aimed at providing structured representation of entities during fine-tuning to reduce inconsistencies and improve response stability under prompt variations.

[68] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri

Main category: cs.CL

TL;DR: BiomedSQL is a benchmark for evaluating scientific reasoning in text-to-SQL generation over biomedical knowledge bases, requiring domain-specific inference rather than just syntactic translation.

DetailsMotivation: Current text-to-SQL systems struggle with mapping qualitative scientific questions into executable SQL when implicit domain reasoning is required, particularly in biomedical research where complex analytical tasks rely on large-scale structured databases.

Method: Created BiomedSQL benchmark with 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base integrating gene-disease associations, omics data causal inference, and drug approval records. Evaluated various LLMs across prompting strategies and interaction paradigms.

Result: Substantial performance gap observed: Gemini-3-Pro achieved 58.1% execution accuracy, custom multi-step agent BMSQL reached 62.6%, both well below expert baseline of 90.0%.

Conclusion: BiomedSQL provides a foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases, highlighting current limitations in domain-specific reasoning.

Abstract: Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: Gemini-3-Pro achieves 58.1% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.

[69] Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study

Hawau Olamide Toyin, Samar Mohamed Magdy, Hanan Aldarmaki

Main category: cs.CL

TL;DR: LLMs outperform specialized models for text diacritization in Arabic and Yoruba, but smaller models hallucinate; fine-tuning with LoRA improves Yoruba performance.

DetailsMotivation: To evaluate LLMs' effectiveness for text diacritization in typologically distinct languages (Arabic and Yoruba) and compare them against specialized diacritization models.

Method: Introduced MultiDiac dataset with diverse diacritic ambiguities; evaluated 12 LLMs varying in size/accessibility/language coverage against 4 specialized models; fine-tuned 4 small open-source models using LoRA for Yoruba.

Result: Off-the-shelf LLMs outperform specialized diacritization models; smaller models suffer from hallucinations; fine-tuning on small dataset improves diacritization performance and reduces hallucinations for Yoruba.

Conclusion: LLMs are effective for multilingual text diacritization, with fine-tuning needed for smaller models to reduce hallucinations and improve performance in low-resource languages.

Abstract: We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce a novel multilingual dataset MultiDiac, with diverse samples that capture a range of diacritic ambiguities. We evaluate 12 LLMs varying in size, accessibility, and language coverage, and benchmark them against $4$ specialized diacritization models. Additionally, we fine-tune four small open-source models using LoRA for Yoruba. Our results show that many off-the-shelf LLMs outperform specialized diacritization models, but smaller models suffer from hallucinations. We find that fine-tuning on a small dataset can help improve diacritization performance and reduce hallucinations for Yoruba.

[70] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Tianyi Zhou, Johanne Medina, Sanjay Chawla

Main category: cs.CL

TL;DR: LLMs generate confabulations (incorrect but fluent content), especially problematic in multi-turn applications. The paper investigates how in-context information influences model behavior and proposes a reliability estimation method using token-level uncertainty to guide aggregation of internal representations for detecting unreliable responses.

DetailsMotivation: LLMs are prone to generating fluent but incorrect content (confabulation), which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. There's a need to understand how in-context information influences model behavior and whether LLMs can identify their own unreliable responses.

Method: Proposes a reliability estimation method that leverages token-level uncertainty to guide aggregation of internal model representations. Computes aleatoric and epistemic uncertainty from output logits to identify salient tokens, then aggregates their hidden states into compact representations for response-level reliability prediction. Uses probing-based approach to capture shifts in model behavior.

Result: Through controlled experiments on open QA benchmarks: correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses (revealing misalignment between uncertainty and correctness). The probing-based method captures these behavioral shifts and improves detection of unreliable outputs across multiple open-source LLMs.

Conclusion: The results underscore limitations of direct uncertainty signals and highlight potential of uncertainty-guided probing for reliability-aware generation. The method shows promise for detecting unreliable LLM outputs, which is crucial for multi-turn applications where confabulations can propagate.

Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.

[71] Transformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation

Yiwen Guan, Jacob Whitehill

Main category: cs.CL

TL;DR: TET is a hierarchical encoder-only architecture using CTC for multilingual translation that shares representations among similar languages, reducing redundancy and enabling parallel decoding of all target languages in one pass.

DetailsMotivation: Address computational redundancy in multilingual translation and improve quality for low-resource languages by sharing representations among linguistically similar languages.

Method: Transformer Encoder Tree (TET) architecture with hierarchical structure, non-autoregressive encoder-only design trained with Connectionist Temporal Classification (CTC), enabling parallel decoding across all target languages.

Result: 66% parameter reduction and 60% lower inference computation compared to naive one-to-many design; 7-14x speedup in speech translation while maintaining competitive quality.

Conclusion: TET effectively reduces computational redundancy in multilingual translation while improving low-resource language performance through hierarchical language representation sharing.

Abstract: Multilingual translation suffers from computational redundancy, especially when translating into multiple languages simultaneously. In addition, translation quality can suffer for low-resource languages. To address this, we introduce Transformer Encoder Tree (TET), a hierarchical, non-autoregressive encoder-only architecture trained with Connectionist Temporal Classification (CTC) for multilingual translation. TET shares intermediate representations among linguistically similar target languages, improving accuracy on low-resource languages while reducing computational redundancy and enabling the generation of all target languages in a single forward pass. TET eliminates the sequential bottleneck of autoregressive models and supports fully parallel decoding of all tokens across all target languages. Compared to a naive one-to-many multilingual design, TET reduces the total parameter count by 66% and lowers inference computation by 60%. In speech translation, combining TET with a non-autoregressive speech recognition backbone (Wav2Vec2) shows competitive translation quality compared to autoregressive systems while speeding up inference by approximately 7-14 times.

[72] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

Hannah Liu, Junghyun Min, En-Shiun Annie Lee, Ethan Yue Heng Cheung, Shou-Yi Hung, Elsie Chan, Shiyao Qian, Runtong Liang, Kimlan Huynh, Wing Yu Yip, York Hay Ng, TSZ Fung Yau, Ka Ieng Charlotte Lo, You-Wei Wu, Richard Tzong-Han Tsai

Main category: cs.CL

TL;DR: A fine-grained dataset for machine translation error detection with span, type, and severity annotations for English to Mandarin, Cantonese, Wu Chinese, and Mandarin-Hokkien translations, with baseline LLM evaluations showing limited performance.

DetailsMotivation: Progress in machine translation remains limited for low-resource languages lacking large-scale training data and linguistic resources, creating a need for fine-grained error analysis datasets to improve translation quality estimation and error detection.

Method: Built a novel dataset by annotating existing parallel corpora with error spans, types, and severity for English to Mandarin, Cantonese, and Wu Chinese translations, plus Mandarin-Hokkien from non-parallel sources. Established baseline results by evaluating multiple open and closed source LLMs using span-level and correlation-based MQM metrics.

Result: The dataset provides comprehensive error annotations for low-resource Chinese languages. Baseline evaluations reveal that current LLMs have limited precision in translation error detection, highlighting the need for specialized datasets like this one.

Conclusion: This dataset serves as a valuable resource for the MT community to fine-tune models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation.

Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. In this paper, we introduce \dsname, a novel fine-grained dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese, along with a Mandarin-Hokkien component derived from a non-parallel source. Our dataset serves as a resource for the MT community to fine-tune models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We also establish baseline results using language models to benchmark translation error detection performance. Specifically, we evaluate multiple open source and closed source LLMs using span-level and correlation-based MQM metrics, revealing their limited precision, underscoring the need for our dataset. Finally, we report our rigorous annotation process by native speakers, with analyses on pilot studies, iterative feedback, insights, and patterns in error type and severity.

[73] Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Main category: cs.CL

TL;DR: CCA is a novel attention method that compresses queries, keys, and values into a shared latent space to reduce parameters, KV-cache, and FLOPs simultaneously, with CCGQA combining this with head-sharing for optimal compute-bandwidth tradeoffs.

DetailsMotivation: Multi-headed Attention suffers from quadratic compute complexity and linearly growing KV-cache, making long-context transformers expensive for training and serving. Existing methods like GQA and MLA only address cache size but leave compute costs largely unchanged.

Method: Compressed Convolutional Attention (CCA) down-projects queries, keys, and values into a shared latent space where the entire attention operation is performed. This reduces parameters, KV-cache, and FLOPs by a compression factor. Combined with head-sharing as CCGQA for further optimization.

Result: CCGQA outperforms GQA and MLA at equal KV-cache compression on dense and MoE models, achieves 8x KV-cache compression with no performance drop compared to MHA, and reduces FLOP costs leading to 1.7x faster prefill and 1.3x faster backward on H100 GPUs at 16k sequence length.

Conclusion: CCA and CCGQA provide efficient attention mechanisms that simultaneously reduce computational costs and memory requirements while maintaining or improving performance, offering practical solutions for scaling transformers to longer contexts.

Abstract: Multi-headed Attention’s (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache, speeding decode, but leave compute, which determines prefill and training speed, largely unchanged. We introduce Compressed Convolutional Attention (CCA), a novel attention method which down-projects queries, keys, and values and performs the entire attention operation inside the shared latent space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all at once by the desired compression factor. Because CCA is orthogonal to head-sharing, we combine the two to form Compressed Convolutional Grouped Query Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier so that users can tune compression toward either FLOP or memory limits without sacrificing quality. Experiments show that CCGQA consistently outperforms both GQA and MLA at equal KV-cache compression on dense and MoE models. Additionally, we show that CCGQA outperforms all other attention methods on MoE models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache compression with no drop in performance compared to standard MHA. CCA and CCGQA also dramatically reduce the FLOP cost of attention which leads to substantially faster training and prefill than existing methods. On H100 GPUs, our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence length of 16k relative to MHA, and accelerates backward by about 1.3x.

[74] Protecting De-identified Documents from Search-based Linkage Attacks

Pierre Lison, Mark Anderson

Main category: cs.CL

TL;DR: A method to counter search-based linkage attacks in de-identified text by using N-gram analysis and LLM-based rewriting to prevent mapping de-identified documents back to their source while preserving semantic integrity.

DetailsMotivation: Current de-identification models conceal individual identities but fail to address linkage risks - the ability to map de-identified text back to its original source through search-based attacks using extracted phrases.

Method: Two-step approach: 1) Build inverted index of N-grams to identify phrases appearing in fewer than k documents, 2) Iteratively query LLM-based rewriter to reformulate those spans until linkage is no longer possible.

Result: Experimental results on court cases and Wikipedia biographies show the rewriting method effectively prevents search-based linkages while remaining faithful to original content, though more advanced semantics-oriented approaches can still enable linkages.

Conclusion: The method successfully counters search-based linkage attacks but highlights that semantic approaches remain a challenge, suggesting need for more sophisticated privacy-preserving techniques.

Abstract: While de-identification models can help conceal the identity of the individuals mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the text collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents, either alone or in combination with other N-grams. An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on two datasets (court cases and Wikipedia biographies) show that the rewriting method can effectively prevent search-based linkages while remaining faithful to the original content. However, we also highlight that linkages remain feasible with the help of more advanced, semantics-oriented approaches.

[75] AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Main category: cs.CL

TL;DR: AdaSwitch: Adaptive switching between on-policy and off-policy knowledge distillation for small language models, using context-aware thresholds to balance generation consistency with high-quality teacher supervision.

DetailsMotivation: Small language models (SLMs) need efficient knowledge distillation from large teachers, but face a dilemma: off-policy distillation suffers from exposure bias (training-inference mismatch), while on-policy approaches are limited by low-quality student-generated outputs.

Method: AdaSwitch dynamically combines on-policy and off-policy generation via adaptive switching mechanism. It allows student to explore its predictions within capability and selectively integrates teacher guidance only when divergence exceeds a context-aware threshold.

Result: Experiments on three datasets show AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead compared to existing knowledge distillation methods.

Conclusion: AdaSwitch provides an effective solution to the knowledge distillation dilemma for small language models, preserving generation consistency while ensuring high-quality supervision through adaptive switching between on-policy and off-policy approaches.

Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods face a dilemma: off-policy distillation provides high-quality supervision but suffers from exposure bias (training inference mismatch), while on-policy approaches ensure consistency but are limited by the low quality of student-generated outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation via an adaptive switching mechanism. AdaSwitch allows the student to explore its predictions within its capability and selectively integrates teacher guidance only when divergence exceeds a context-aware threshold. This paradigm preserves generation consistency while ensuring high-quality supervision. Experiments on three datasets demonstrate that AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead.

[76] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon

Main category: cs.CL

TL;DR: Fine-tuned AI models can produce literary text preferred over human writing by both expert and general readers, while in-context prompting fails to capture author styles effectively.

DetailsMotivation: To investigate whether AI models can produce high-quality literary text that emulates authors' voices, addressing copyright concerns about AI generating derivative content from copyrighted books.

Method: Preregistered study comparing MFA-trained writers with three frontier models (ChatGPT, Claude, Gemini) writing 450-word excerpts emulating 50 award-winning authors’ styles. Used blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, comparing in-context prompting vs. fine-tuning approaches.

Result: In-context prompting was strongly disfavored by MFA readers for stylistic fidelity and quality, while fine-tuned ChatGPT reversed these results: both groups preferred fine-tuned AI, with general readers showing stronger preference. Fine-tuned outputs were rarely flagged as AI-generated (3% vs. 97% for prompting).

Conclusion: Author-specific fine-tuning enables AI writing preferred over expert human writing, with median cost of $81 per author representing 99.7% reduction versus typical writer compensation, providing evidence relevant to copyright’s fair-use considerations.

Abstract: The use of copyrighted books for training AI has sparked lawsuits from authors concerned about AI generating derivative content. Yet whether these models can produce high-quality literary text emulating authors’ voices remains unclear. We conducted a preregistered study comparing MFA-trained writers with three frontier models (ChatGPT, Claude, Gemini) writing up to 450-word excerpts emulating 50 award-winning authors’ styles. In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general readers showed no fidelity preference (OR=1.06) but favored AI for quality (OR=1.82). Fine-tuning ChatGPT on authors’ complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42). Both groups preferred fine-tuned AI, but the writer-type X reader-type interaction remained significant (p=0.021 for fidelity; p<10^-4 for quality), indicating general readers favored AI by a wider margin. Effects are robust under cluster-robust inference and generalize across authors in heterogeneity analyses. Fine-tuned outputs were rarely flagged as AI-generated (3% vs. 97% for prompting) by leading detectors. Mediation analysis shows fine-tuning eliminates detectable AI quirks that penalize in-context outputs, altering the nexus between detectability and preference. While not accounting for effort to transform AI output into publishable prose, the median fine-tuning cost of $81 per author represents a 99.7% reduction versus typical writer compensation. Author-specific fine-tuning enables non-verbatim AI writing preferred over expert human writing, providing evidence relevant to copyright’s fourth fair-use factor.

[77] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models

Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang, Lei Liang, Wen Zhang

Main category: cs.CL

TL;DR: Evontree: An ontology rule-guided method for self-evolution of LLMs in low-resource specialized domains to reduce hallucinations and improve accuracy in knowledge-intensive fields like healthcare.

DetailsMotivation: LLMs suffer from hallucinations in specialized domains like healthcare and law where high interpretability is crucial. Existing fine-tuning methods require large professional datasets that are hard to obtain due to privacy regulations, and general self-evolution methods lack knowledge constraints for specialized domains.

Method: Evontree uses ontology rules to guide LLM self-evolution: 1) extracts domain ontology knowledge from raw models, 2) detects knowledge inconsistencies using two core ontology rules, and 3) reinforces gap knowledge via self-distilled fine-tuning.

Result: Extensive evaluations on medical QA benchmarks using Llama3-8B-Instruct and Med42-V2 show Evontree outperforms both base models and strong baselines, achieving up to 3.7% improvement in accuracy. Ablation studies validate the robustness of the approach.

Conclusion: Evontree effectively enables LLM self-evolution in low-resource specialized domains by leveraging ontology rules to detect and correct knowledge inconsistencies, reducing hallucinations without requiring large professional datasets.

Abstract: Although Large Language Models (LLMs) perform exceptionally well in general domains, the problem of hallucinations poses significant risks in specialized fields such as healthcare and law, where high interpretability is essential. Existing fine-tuning methods depend heavily on large-scale professional datasets, which are often hard to obtain due to the privacy regulations. Moreover, existing self-evolution methods are primarily designed for general domains, which may struggle to adapt to knowledge-intensive domains due to the lack of knowledge constraints. In this paper, we propose an ontology rule guided method Evontree to enable self-evolution of LLMs in low-resource specialized domains. Specifically, Evontree first extracts domain ontology knowledge from raw models, then detects knowledge inconsistencies using two core ontology rules, and finally reinforces gap knowledge into model via self-distilled fine-tuning. Extensive evaluations on medical QA benchmarks using Llama3-8B-Instruct and Med42-V2 demonstrate the effectiveness of Evontree, which outperforms both the base models and strong baselines, achieving up to a 3.7% improvement in accuracy. Detailed ablation studies further validate the robustness of our approach.

[78] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata

Main category: cs.CL

TL;DR: Multilingual debate-style benchmark reveals narrative biases in LLMs across sensitive domains, showing entrenched stereotypes persist despite safety alignment, especially in low-resource languages.

DetailsMotivation: Current bias evaluations rely on English classification tasks, but real-world LLM deployment involves open-ended communication. There's a need for multilingual, generative benchmarks to assess narrative biases in realistic settings across diverse cultural contexts.

Method: Created CORPUSNAME benchmark with 8,400 structured debate prompts across four sensitive domains (Women’s Rights, Backwardness, Terrorism, Religion) in seven languages (including high-resource English/Chinese and low-resource Swahili/Nigerian Pidgin). Tested four flagship models (GPT-4o, Claude 3.5 Haiku, DeepSeek-Chat, LLaMA-3-70B), generating over 100,000 debate responses and automatically classifying demographic group assignments to stereotyped vs. modern roles.

Result: All models reproduce entrenched stereotypes despite safety alignment: Arabs linked to Terrorism and Religion (≥89%), Africans to socioeconomic “backwardness” (up to 77%), Western groups consistently framed as modern/progressive. Biases increase sharply in lower-resource languages, showing English-trained alignment doesn’t generalize globally.

Conclusion: Current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended multilingual contexts. There’s a persistent divide in multilingual fairness requiring better evaluation benchmarks and culturally inclusive alignment approaches.

Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce \corpusname, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8{,}400 structured debate prompts spanning four sensitive domains – Women’s Rights, Backwardness, Terrorism, and Religion – across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude3.5Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100{,}000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion ($\geq$89%), Africans to socioeconomic ``backwardness’’ (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our \corpusname benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

[79] Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation

Keunhyeung Park, Seunguk Yu, Youngbin Kim

Main category: cs.CL

TL;DR: DIA-REFINE framework improves dialect machine translation using iterative translation-verification-feedback loops with dialect classifiers, introducing new metrics to better evaluate dialect fidelity beyond n-gram limitations.

DetailsMotivation: Standard-to-dialect MT faces challenges due to dialect gaps in LLMs and evaluation distortions from n-gram metrics that favor source copying over authentic dialect translation.

Method: Proposes DIA-REFINE framework with iterative translation, verification using external dialect classifiers, and feedback loops. Introduces dialect fidelity score (DFS) to quantify linguistic shift and target dialect ratio (TDR) to measure dialect translation success.

Result: Experiments on Korean dialects show DIA-REFINE consistently enhances dialect fidelity across zero-shot and in-context learning baselines. New metrics distinguish False Success (high n-gram but failed dialect) from True Attempt (low n-gram but genuine dialect effort).

Conclusion: Establishes robust framework for goal-directed, inclusive dialect translation with rigorous evaluation and insights into model performance. Shows models vary in responsiveness and in-context examples further improve dialect expression translation.

Abstract: Standard-to-dialect machine translation remains challenging due to a persistent dialect gap in large language models and evaluation distortions inherent in n-gram metrics, which favor source copying over authentic dialect translation. In this paper, we propose the dialect refinement (DIA-REFINE) framework, which guides LLMs toward faithful target dialect outputs through an iterative loop of translation, verification, and feedback using external dialect classifiers. To address the limitations of n-gram-based metrics, we introduce the dialect fidelity score (DFS) to quantify linguistic shift and the target dialect ratio (TDR) to measure the success of dialect translation. Experiments on Korean dialects across zero-shot and in-context learning baselines demonstrate that DIA-REFINE consistently enhances dialect fidelity. The proposed metrics distinguish between False Success cases, where high n-gram scores obscure failures in dialectal translation, and True Attempt cases, where genuine attempts at dialectal translation yield low n-gram scores. We also observed that models exhibit varying degrees of responsiveness to the framework, and that integrating in-context examples further improves the translation of dialectal expressions. Our work establishes a robust framework for goal-directed, inclusive dialect translation, providing both rigorous evaluation and critical insights into model performance.

[80] From Passive to Persuasive: Localized Activation Injection for Empathy and Negotiation

Niranjan Chebrolu, Kokil Jaidka, Gerard Christopher Yeo

Main category: cs.CL

TL;DR: STAR method uses attribution patching to identify causal origins of complex social behaviors in LLMs, then injects contrastive activation vectors at those specific locations, outperforming global steering approaches.

DetailsMotivation: The paper challenges the assumption that complex social behaviors like empathy and strategic politeness resist directional decomposition and activation steering, which has been effective for simpler attributes like sentiment or toxicity.

Method: STAR (Steering via Attribution and Representation) uses attribution patching to identify layer-token positions where behavioral traits causally originate, then injects contrastive activation vectors at precisely those localized locations.

Result: Localized injection consistently outperforms global steering and instruction priming in emotional dialogue and negotiation tasks across single- and multi-turn settings. Human evaluation confirms genuine improvements in perceived quality rather than lexical surface changes.

Conclusion: Complex interpersonal behaviors are encoded as localized, approximately linear directions in LLM activation space, and behavioral alignment is fundamentally a localization problem rather than requiring global interventions.

Abstract: Complex social behaviors, such as empathy and strategic politeness, are widely assumed to resist the directional decomposition that makes activation steering effective for coarse attributes like sentiment or toxicity. We present STAR: Steering via Attribution and Representation, which tests this assumption by using attribution patching to identify the layer–token positions where each behavioral trait causally originates, then injecting contrastive activation vectors at precisely those locations. Evaluated on emotional dialogue and negotiation in both single- and multi-turn settings, localized injection consistently outperforms global steering and instruction priming; human evaluation confirms that gains reflect genuine improvements in perceived quality rather than lexical surface change. Our results suggest that complex interpersonal behaviors are encoded as localized, approximately linear directions in LLM activation space, and that behavioral alignment is fundamentally a localization problem.

[81] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Ramatu Oiza Abdulsalam, Segun Aroyehun

Main category: cs.CL

TL;DR: Analysis of math tutoring responses comparing expert tutors, novice tutors, and various LLMs shows LLMs approach expert quality but differ in instructional strategies and linguistic patterns.

DetailsMotivation: To understand how LLM-generated tutoring responses in mathematics compare to human expert practice, examining instructional alignment and pedagogical quality differences.

Method: Analyzed dataset of math remediation dialogues with expert tutors, novice tutors, and 7 LLMs (varying sizes, open-weight and commercial). Examined instructional strategies (uptake, pressing for accuracy/reasoning) and linguistic characteristics (lexical diversity, readability, politeness, agency).

Result: Larger LLMs approach expert performance on average but underuse discursive strategies characteristic of experts while generating longer, more lexically diverse, and more polite responses. Pressing for accuracy/reasoning, restating/revoicing, and lexical diversity positively associated with pedagogical quality, while agentic and polite language negatively associated.

Conclusion: Instructional strategies and linguistic characteristics are crucial for evaluating tutoring responses across human tutors and intelligent tutoring systems, highlighting systematic differences between LLM and human expert tutoring approaches.

Abstract: Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity, are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

[82] Toward Better Temporal Structures for Geopolitical Events Forecasting

Kian Ahrabian, Eric Boxer, Jay Pujara

Main category: cs.CL

TL;DR: The paper introduces Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs) to address limitations in existing temporal knowledge graphs for representing complex geopolitical events with multiple entities, and benchmarks LLMs on a new dataset derived from POLECAT.

DetailsMotivation: Existing temporal knowledge graphs (TKGs) and hyper-relational TKGs (HTKGs) lack expressive power for complex facts with more than two primary entities, which commonly occur in real-world geopolitical events. The authors aim to create a more expressive representation framework.

Method: 1) Formalize HTKGHs as a generalization of HTKGs with backward compatibility, 2) Create htkgh-polecat dataset from POLECAT global event database, 3) Benchmark and analyze popular LLMs on forecasting tasks using this dataset.

Result: The paper demonstrates that HTKGH formalization improves representation of complex geopolitical facts compared to existing frameworks, and provides insights into LLMs’ capabilities in complex temporal forecasting tasks.

Conclusion: HTKGHs provide a more expressive framework for representing complex temporal facts in geopolitical contexts, and LLMs show adaptability in forecasting tasks using this enhanced representation, though with varying capabilities.

Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on our dataset, providing insights into 1) the positive impact of utilizing the HTKGH formalization compared to existing ones and 2) LLMs’ adaptability and capabilities in complex forecasting tasks.

[83] LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation

Fabian Lukassen, Christoph Weisser, Michael Schlee, Manish Kumar, Anton Thielmann, Benjamin Saefken, Thomas Kneib

Main category: cs.CL

TL;DR: Ensemble changepoint detection framework combining statistical methods with LLMs for improved accuracy and interpretability of time series regime changes.

DetailsMotivation: Addresses two key limitations: 1) Individual detection methods have complementary strengths/weaknesses making method selection difficult, and 2) Lack of automated contextual explanations for detected changes.

Method: Proposes ensemble method aggregating results from ten distinct changepoint detection algorithms, plus LLM-powered explanation pipeline that generates contextual narratives linking changepoints to real-world events. Includes RAG solution for domain-specific data.

Result: Achieves superior performance and robustness compared to individual methods. Framework demonstrates practical utility in finance, political science, and environmental science domains.

Conclusion: The framework transforms raw statistical output into actionable insights for analysts and decision-makers through improved detection accuracy and automated interpretability.

Abstract: This paper introduces a novel changepoint detection framework that combines ensemble statistical methods with Large Language Models (LLMs) to enhance both detection accuracy and the interpretability of regime changes in time series data. Two critical limitations in the field are addressed. First, individual detection methods exhibit complementary strengths and weaknesses depending on data characteristics, making method selection non-trivial and prone to suboptimal results. Second, automated, contextual explanations for detected changes are largely absent. The proposed ensemble method aggregates results from ten distinct changepoint detection algorithms, achieving superior performance and robustness compared to individual methods. Additionally, an LLM-powered explanation pipeline automatically generates contextual narratives, linking detected changepoints to potential real-world historical events. For private or domain-specific data, a Retrieval-Augmented Generation (RAG) solution enables explanations grounded in user-provided documents. The open source Python framework demonstrates practical utility in diverse domains, including finance, political science, and environmental science, transforming raw statistical output into actionable insights for analysts and decision-makers.

[84] SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li

Main category: cs.CL

TL;DR: SentGraph is a sentence-level graph-based RAG framework that models fine-grained logical relationships between sentences to improve multi-hop question answering by constructing hierarchical sentence graphs with entity bridges and performing graph-guided evidence retrieval.

DetailsMotivation: Traditional RAG struggles with multi-hop QA tasks that require combining evidence from multiple documents, as chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning.

Method: Constructs hierarchical sentence graphs offline using Rhetorical Structure Theory to distinguish nucleus and satellite sentences, organizes them into topic-level subgraphs with cross-document entity bridges, and performs graph-guided evidence selection and path expansion during online retrieval.

Result: Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

Conclusion: SentGraph addresses limitations of traditional RAG for multi-hop QA by modeling fine-grained logical relationships at the sentence level, enabling more coherent evidence retrieval and reasoning.

Abstract: Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

[85] Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset

Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah

Main category: cs.CL

TL;DR: A pipeline for constructing multimodal fact-checking datasets in French and German using ClaimReview feeds, scraping debunking articles, normalizing verdicts, and enriching with structured metadata and visual content, enhanced by LLMs for evidence extraction and justification generation.

DetailsMotivation: Addressing the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources, as existing datasets are limited in scope, lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts.

Method: Developed a comprehensive data collection and processing pipeline that aggregates ClaimReview feeds, scrapes full debunking articles, normalizes heterogeneous claim verdicts, and enriches them with structured metadata and aligned visual content. Used state-of-the-art LLMs and multimodal LLMs for evidence extraction under predefined categories and justification generation linking evidence to verdicts.

Result: Evaluation with G-Eval and human assessment demonstrates the pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates development of more interpretable and evidence-grounded fact-checking models, and lays groundwork for future research on multilingual, multimodal misinformation verification.

Conclusion: The pipeline successfully addresses limitations of existing fact-checking datasets by providing comprehensive multimodal resources in French and German, with structured annotations and explainable links between claims, evidence, and verdicts, supporting advanced research in misinformation verification.

Abstract: The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.

[86] From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text

Shinwoo Park, Yo-Sub Han

Main category: cs.CL

TL;DR: LREAD is a Korean-specific rubric-based framework for human attribution of LLM-generated text, showing significant improvement in detection accuracy through expert calibration.

DetailsMotivation: Distinguishing human-written Korean text from fluent LLM outputs is challenging even for trained readers who may over-trust surface well-formedness, creating a need for reliable human attribution methods.

Method: Three-phase blind longitudinal study with linguistically trained annotators: Phase 1 measures intuition-only attribution, Phase 2 introduces criterion-anchored scoring with explicit justifications, and Phase 3 evaluates a limited held-out elementary-persona subset using majority-vote accuracy and Fleiss’ kappa.

Result: Majority-vote accuracy improved from 0.60 in Phase 1 to 0.90 in Phase 2, reaching 10/10 on Phase 3 subset; agreement increased from Fleiss’ κ = -0.09 to 0.82. Calibration primarily reduces false negatives on AI essays rather than inducing generalized over-detection.

Conclusion: Rubric-scaffolded human judgment can complement automated detectors by making attribution reasoning explicit, auditable, and adaptable. LREAD serves as pilot evidence for within-panel calibration in Korean argumentative-essay settings.

Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness. We present LREAD, a Korean-specific instantiation of a rubric-based expert-calibration framework for human attribution of LLM-generated text. In a three-phase blind longitudinal study with three linguistically trained annotators, Phase 1 measures intuition-only attribution, Phase 2 introduces criterion-anchored scoring with explicit justifications, and Phase 3 evaluates a limited held-out elementary-persona subset. Majority-vote accuracy improves from 0.60 in Phase 1 to 0.90 in Phase 2, and reaches 10/10 on the limited Phase 3 subset (95% CI [0.692, 1.000]); agreement also increases from Fleiss’ $κ$ = -0.09 to 0.82. Error analysis suggests that calibration primarily reduces false negatives on AI essays rather than inducing generalized over-detection. We position LREAD as pilot evidence for within-panel calibration in a Korean argumentative-essay setting. These findings suggest that rubric-scaffolded human judgment can complement automated detectors by making attribution reasoning explicit, auditable, and adaptable. The rubric developed in this study, along with the dataset employed for the analysis, is available at https://github.com/Shinwoo-Park/lread.

[87] Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Yiqiao Huang, Ivor Tsang, Yang You

Main category: cs.CL

TL;DR: TAPS is a training-free inference method for Diffusion-LMs that leverages temporal structure to control generation diversity by encouraging semantic branching early and lexical refinement later.

DetailsMotivation: Diffusion language models have an explicit temporal dimension, but how to leverage this structure to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored.

Method: Time-Annealed Perturbation Sampling (TAPS) - a training-free inference strategy that encourages semantic branching early in diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. Compatible with both non-autoregressive and semi-autoregressive Diffusion backbones.

Result: TAPS consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality, demonstrated on LLaDA and TraDo models.

Conclusion: Diffusion-LMs exhibit temporal division of labor (early steps determine global semantics, later steps focus on lexical refinement), and TAPS effectively leverages this structure to enhance generation diversity while maintaining quality.

Abstract: Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.

[88] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

William Lugoloobi, Thomas Foster, William Bankes, Chris Russell

Main category: cs.CL

TL;DR: LLMs can predict their own success likelihood from internal representations before generation, enabling efficient routing of queries to appropriate models based on predicted difficulty.

DetailsMotivation: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. The paper investigates whether LLMs' own likelihood of success is recoverable from their internal representations before generation.

Method: Train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. Use E2H-AMC dataset providing both human and model performance on identical problems. Demonstrate routing queries across a pool of models based on predicted difficulty.

Result: Probes substantially outperform surface features like question length and TF-IDF. Models encode a model-specific notion of difficulty distinct from human difficulty, with distinction increasing with extended reasoning. Routing queries across models can exceed best-performing model while reducing inference cost by up to 70% on MATH dataset.

Conclusion: Internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. LLMs’ own success likelihood is recoverable from pre-generation activations and can guide more efficient inference.

Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty

[89] On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

Maciej Uberna, Michał Wawer, Jarosław A. Chudziak, Marcin Koszowy

Main category: cs.CL

TL;DR: A multi-agent LLM framework with RAG-enhanced argumentation theory knowledge outperforms zero-shot baselines in detecting strategic rephrase functions (D-I-S-G-O) in political debates.

DetailsMotivation: LLMs struggle to identify strategic reformulation functions in discourse beyond surface-level similarity, missing pragmatic rhetorical functions crucial for computational argumentation.

Method: Comparative multi-agent framework with two parallel LLM systems: RAG-enhanced agents with argumentation theory knowledge vs. identical zero-shot baseline, evaluated on annotated political debates with D-I-S-G-O rephrase functions.

Result: RAG-enhanced agents substantially outperform baseline with ~30% Macro F1-score improvement, particularly strong in detecting Intensification and Generalisation contexts.

Conclusion: Theoretical grounding via RAG is essential for function-aware analysis of argumentative discourse, enabling scalable tools for identifying rhetorical strategies.

Abstract: Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise an dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions: Deintensification, Intensification, Specification, Generalisation, and Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.

[90] Prompt Sensitivity and Answer Consistency of Small Open-Source Language Models for Clinical Question Answering in Low-Resource Healthcare

Shravani Hariprasad

Main category: cs.CL

TL;DR: Evaluation of small open-source LLMs for clinical QA shows consistency and accuracy are independent; models can be reliably wrong, with Llama 3.2 offering best balance for low-resource deployment.

DetailsMotivation: Small open-source language models are promising for healthcare in low-resource settings, but their reliability under different prompt phrasings for clinical questions is poorly understood.

Method: Evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, Meditron-7B) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct) on consumer CPU hardware without fine-tuning.

Result: Consistency and accuracy were independent: Gemma 2 had highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), Llama 3.2 showed moderate consistency (0.774-0.807) with highest accuracy (49.0-65.0%). Roleplay prompts reduced accuracy across all models. Meditron-7B had near-complete instruction-following failure on PubMedQA.

Conclusion: High consistency doesn’t imply correctness - models can be reliably wrong, which is dangerous for clinical AI. Llama 3.2 demonstrated strongest balance for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.

Abstract: Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable. However, the reliability of these models under different phrasings of the same clinical question remains poorly understood. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical question answering datasets (MedQA, MedMCQA, and PubMedQA) using five prompt styles: original, formal, simplified, roleplay, and direct. Model behavior is evaluated using consistency scores, accuracy, and instruction-following failure rates. All experiments were conducted locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent across models. Gemma 2 achieved the highest consistency (0.845-0.888) but the lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) alongside the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), indicating that domain pretraining alone is insufficient for structured clinical question answering. These findings show that high consistency does not imply correctness: models can be reliably wrong, a dangerous failure mode in clinical AI. Llama 3.2 demonstrated the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.

[91] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant

Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Yunhang Shen, Shaohui Lin

Main category: cs.CL

TL;DR: GroupGPT: A token-efficient, privacy-preserving framework for multi-user chat assistants using small-large model collaboration to decouple intervention timing from response generation, supporting multimodal inputs.

DetailsMotivation: Existing LLM-based chatbots focus on single-user settings and don't generalize well to multi-user group chats, requiring more proactive intervention under complex contexts. Current approaches rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and privacy risks.

Method: Proposes GroupGPT with small-large model collaborative architecture to decouple intervention timing from response generation. Uses smaller models for timing decisions and larger models for response generation. Supports multimodal inputs (memes, images, videos, voice messages). Introduces MUIR benchmark dataset with 2,500 annotated group chat segments for evaluation.

Result: GroupGPT achieves average score of 4.72/5.0 in LLM-based evaluation, reduces token usage by up to 3x compared to baselines, and provides privacy sanitization of user messages. Well received by users across diverse group chat scenarios.

Conclusion: GroupGPT effectively addresses challenges in multi-user chat assistants through efficient architecture design, multimodal support, and privacy preservation, demonstrating strong performance on the MUIR benchmark.

Abstract: Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistant. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: https://github.com/Eliot-Shen/GroupGPT .

[92] CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar

Main category: cs.CL

TL;DR: CRIMSON is a clinically grounded evaluation framework for chest X-ray report generation that assesses diagnostic correctness, contextual relevance, and patient safety using full clinical context and severity-aware weighting.

DetailsMotivation: Current metrics for chest X-ray report generation lack clinical grounding, fail to incorporate patient context, and don't properly weight clinical significance, leading to evaluations that don't align with real-world clinical needs.

Method: CRIMSON incorporates full clinical context (patient age, indication, guidelines), categorizes errors into comprehensive taxonomy (false/missing findings + 8 attribute errors), assigns clinical significance levels (urgent to benign), and uses severity-aware weighting. Validated through alignment with radiologist annotations and new benchmarks (RadJudge, RadPref).

Result: Strong alignment with radiologist annotations (Kendall’s tau = 0.61-0.71; Pearson’s r = 0.71-0.84), consistent agreement with expert judgment in challenging scenarios, and strongest alignment with radiologist preferences in pairwise comparisons. Released with benchmarks and fine-tuned MedGemma model.

Conclusion: CRIMSON provides a clinically meaningful evaluation framework for chest X-ray report generation that better aligns with real clinical practice by incorporating context, comprehensive error taxonomy, and severity-aware weighting.

Abstract: We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendalls tau = 0.61-0.71; Pearsons r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.

Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi, Mohammed Ghaly, Emad Mohamed

Main category: cs.CL

TL;DR: MAWARITH is a large-scale Arabic inheritance law dataset with 12,500 cases for evaluating LLMs on complex multi-step reasoning involving heir identification, blocking rules, and share calculations.

DetailsMotivation: Islamic inheritance law presents unique challenges for LLMs requiring structured multi-step reasoning and correct application of juristic rules, but existing datasets only offer multiple-choice questions rather than full reasoning chains.

Method: Created MAWARITH dataset with 12,500 annotated Arabic inheritance cases including step-by-step solutions, intermediate legal decisions, and justifications. Proposed MIR-E evaluation metric to score reasoning stages and error propagation.

Result: Gemini-2.5-flash achieved ~90% MIR-E score, while other models (Fanar-C, Fanar-Sadiq, LLaMA 3, Qwen 3) remained below 50%. Error analysis revealed patterns in scenario misinterpretation, heir identification errors, and rule application mistakes.

Conclusion: MAWARITH enables comprehensive evaluation of LLMs on complex legal reasoning tasks, revealing significant performance gaps and providing insights for improving multi-step reasoning capabilities.

Abstract: Islamic inheritance law (‘ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs’ shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate six LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as ‘awl and radd. The MAWARITH dataset is publicly available at https://gitlab.com/islamgpt1/qias_shared_task_2026.

[94] APEX-Searcher: Augmenting LLMs’ Search Capabilities through Agentic Planning and Execution

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

Main category: cs.CL

TL;DR: APEX-Searcher: A novel agentic planning and execution framework that decouples multi-hop RAG into planning (RL-optimized) and execution (SFT-trained) stages to improve complex question answering.

DetailsMotivation: Existing multi-round retrieval approaches for complex multi-hop questions face challenges with ambiguous retrieval execution paths and sparse rewards in end-to-end RL training, leading to inaccurate retrieval and performance degradation.

Method: Two-stage agentic framework: 1) RL with decomposition-specific rewards for strategic planning optimization, 2) Supervised fine-tuning on high-quality multi-hop trajectories for robust iterative sub-task execution.

Result: Extensive experiments show significant improvements in both multi-hop RAG and task planning performances across multiple benchmarks.

Conclusion: APEX-Searcher effectively addresses challenges in complex multi-hop question answering by decoupling planning and execution, achieving superior performance through RL-optimized planning and SFT-trained execution.

Abstract: Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they are still faced with challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we proposes APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: It first employs RL with decomposition-specific rewards to optimize strategic planning; Built on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performances across multiple benchmarks.

[95] HindSight: Evaluating LLM-Generated Research Ideas via Future Impact

Bo Jiang

Main category: cs.CL

TL;DR: HindSight is a time-split evaluation framework that measures AI-generated research idea quality by matching them against real future publications and scoring based on citation impact and venue acceptance.

DetailsMotivation: Current evaluation of AI-generated research ideas relies on subjective LLM judges or human panels that are disconnected from actual research impact, creating a need for objective, evidence-based evaluation methods.

Method: Uses temporal cutoff T to restrict idea generation to pre-T literature, then evaluates outputs against papers published in subsequent 30 months, scoring ideas based on citation impact and venue acceptance.

Result: Experiments across 10 AI/ML topics show retrieval-augmented systems produce 2.5× higher-scoring ideas than vanilla generation, and HindSight scores are negatively correlated with LLM-judged novelty, revealing LLMs overvalue novel-sounding ideas that don’t materialize in real research.

Conclusion: HindSight provides objective evaluation of AI-generated research ideas by connecting them to real research impact, revealing limitations of current LLM-based evaluation methods.

Abstract: Evaluating AI-generated research ideas typically relies on LLM judges or human panels – both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff~$T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p{=}0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p{<}0.001$). Moreover, HindSight scores are \emph{negatively} correlated with LLM-judged novelty ($ρ{=}{-}0.29$, $p{<}0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.

cs.CV

[96] Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

Arbish Akram, Nazar Khan, Arif Mahmood

Main category: cs.CV

TL;DR: RegGAN improves facial expression synthesis by combining regression-based detail learning with adversarial refinement for better generalization beyond training distribution.

DetailsMotivation: Existing conditional GANs for facial expression synthesis degrade when test images differ from training data, needing better generalization to diverse inputs like celebrity photos, portraits, statues, and avatars.

Method: Two-component model: 1) Regression layer with local receptive fields learns expression details via ridge regression loss, 2) Refinement network trained adversarially enhances realism. Trained on CFEE dataset.

Result: Outperforms 6 SOTA models in Expression Classification Score, FID, and QualiCLIP; ranks 2nd in Face Similarity Score. Human evaluations show 25% better expression quality, 26% better identity preservation, and 30% better realism.

Conclusion: RegGAN effectively improves generalization for facial expression synthesis by combining regression-based detail learning with adversarial refinement, achieving state-of-the-art performance on both in-distribution and out-of-distribution images.

Abstract: Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.

[97] SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

Chenyu Ge

Main category: cs.CV

TL;DR: SAC-NeRF uses reinforcement learning with Soft Actor-Critic to learn adaptive sampling policies for Neural Radiance Fields, reducing sampling points by 35-48% while maintaining rendering quality.

DetailsMotivation: NeRF achieves photorealistic novel view synthesis but suffers from computational inefficiency due to dense ray sampling during volume rendering. Current methods use uniform sampling which is wasteful in empty or homogeneous regions.

Method: Formulates sampling as a Markov Decision Process where an RL agent learns adaptive sampling policies using Soft Actor-Critic. Introduces three components: (1) Gaussian mixture distribution color model for uncertainty estimates, (2) multi-component reward function balancing quality, efficiency, and consistency, and (3) two-stage training strategy to address environment non-stationarity.

Result: Experiments on Synthetic-NeRF and LLFF datasets show SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines.

Conclusion: While the learned policy is scene-specific and RL framework adds complexity compared to simpler heuristics, the work demonstrates that data-driven sampling strategies can discover effective patterns difficult to hand-design.

Abstract: Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

[98] Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo

Main category: cs.CV

TL;DR: Evaluation of vision-language models (VLMs) for assisting blind/low-vision users in navigation tasks, comparing closed-source (GPT-4V, GPT-4o, Gemini, Claude) and open-source models on visual skills like obstacle counting, spatial reasoning, and scene understanding.

DetailsMotivation: To investigate the potential of vision-language models for assisting people with blindness and low vision (pBLV) in navigation tasks, assessing whether current VLMs can provide meaningful assistance and identifying their strengths and limitations for real-world assistive applications.

Method: Evaluated state-of-the-art closed-source models (GPT-4V, GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet) and open-source models (Llava-v1.6-mistral, Llava-onevision-qwen) on foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding scene understanding. Used pBLV-specific prompts to simulate real-world assistance tasks in navigation scenarios.

Result: GPT-4o consistently outperformed others across all tasks, particularly in spatial reasoning and scene understanding. Open-source models struggled with nuanced reasoning and adaptability in complex environments. Common challenges included difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and prioritizing object details over spatial feedback.

Conclusion: VLMs show promise for wayfinding assistance but require better alignment with human feedback and improved spatial reasoning capabilities. The research provides actionable insights for developers to effectively integrate VLMs into assistive technologies while addressing key limitations for enhanced usability.

Abstract: This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.

[99] Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao

Main category: cs.CV

TL;DR: SAE uses semantic segmentation to quantify visual attention uncertainty and detect/mitigate object hallucinations in LVLMs without training.

DetailsMotivation: Object hallucinations in LVLMs undermine reliability; while most research focuses on text modality issues, this paper identifies abnormal visual attention patterns as another source of hallucinations.

Method: Proposes Segmentation-based Attention Entropy (SAE) that leverages semantic segmentation to quantify visual attention uncertainty in object-level semantic space. Includes hallucination detection via reliability score and SAE-guided attention adjustment at inference time.

Result: SAE substantially reduces object hallucinations without additional training cost, enabling more trustworthy LVLM-driven perception and decision-making, validated on public benchmarks and real embodied multimodal scenarios with quadruped robots.

Conclusion: Visual attention patterns are a significant source of hallucinations in LVLMs, and SAE provides an effective training-free solution for detection and mitigation, improving reliability in multimodal applications.

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.

[100] OrthoAI v2: From Single-Agent Segmentation to Dual-Agent Treatment Planning for Clear Aligners

Lansiaux Edouard, Leman Margaux

Main category: cs.CV

TL;DR: OrthoAI v2 enhances AI-assisted orthodontic planning with multi-agent landmark detection, composite biomechanical scoring, and treatment simulation for clear aligners.

DetailsMotivation: To overcome limitations of the first version which only provided per-tooth centroid extraction, lacked landmark-level precision, and produced only scalar quality scores without staging simulation.

Method: Three main contributions: 1) Second agent using Conditioned Heatmap Regression Methodology for dental landmark detection fused with first agent via confidence-weighted orchestrator; 2) Composite six-category biomechanical scoring model; 3) Multi-frame treatment simulator generating temporally coherent 6-DoF tooth trajectories via SLERP interpolation.

Result: On 200 crowding scenarios, parallel ensemble reaches planning quality score of 92.8±4.1 vs 76.4±8.3 for v1 (+21% relative gain), while maintaining CPU deployability (4.2±0.8s).

Conclusion: OrthoAI v2 significantly improves orthodontic treatment planning quality through multi-agent fusion, comprehensive scoring, and simulation capabilities.

Abstract: We present OrthoAI v2, the second iteration of our open-source pipeline for AI-assisted orthodontic treatment planning with clear aligners, substantially extending the single-agent framework previously introduced. The first version established a proof-of-concept based on Dynamic Graph Convolutional Neural Networks (\dgcnn{}) for tooth segmentation but was limited to per-tooth centroid extraction, lacked landmark-level precision, and produced a scalar quality score without staging simulation. \vtwo{} addresses all three limitations through three principal contributions: (i)a second agent adopting the Conditioned Heatmap Regression Methodology (\charm{})\cite{rodriguez2025charm} for direct, segmentation-free dental landmark detection, fused with Agent~1 via a confidence-weighted orchestrator in three modes (parallel, sequential, single-agent); (ii)~a composite six-category biomechanical scoring model (biomechanics $\times$ 0.30 + staging $\times$ 0.20 + attachments $\times$ 0.15 + IPR $\times$ 0.10 + occlusion $\times$ 0.10 + predictability $\times$ 0.15) replacing the binary pass/fail check of v1; (iii)~a multi-frame treatment simulator generating $F = A \times r$ temporally coherent 6-DoF tooth trajectories via SLERP interpolation and evidence-based staging rules, enabling ClinCheck 4D visualisation. On a synthetic benchmark of 200 crowding scenarios, the parallel ensemble of OrthoAI v2 reaches a planning quality score of $92.8 \pm 4.1$ vs.\ $76.4 \pm 8.3$ for OrthoAI v1, a $+21%$ relative gain, while maintaining full CPU deployability ($4.2 \pm 0.8$~s).

[101] FlatLands: Generative Floormap Completion From a Single Egocentric View

Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome

Main category: cs.CV

TL;DR: FlatLands is a dataset and benchmark for single-view bird’s-eye view floor completion from egocentric images, focusing on indoor navigation applications.

DetailsMotivation: Single egocentric images capture limited floor area, but complete metric traversability maps are needed for indoor navigation applications. Current approaches lack comprehensive datasets and benchmarks for evaluating floor completion from single views.

Method: Created FlatLands dataset with 270,575 observations from 17,656 real metric indoor scenes from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps. Benchmark includes in- and out-of-distribution evaluation protocols. Compared training-free approaches, deterministic models, ensembles, and stochastic generative models, and instantiated as end-to-end monocular RGB-to-floormaps pipeline.

Result: Provides a comprehensive dataset and benchmark for evaluating floor completion methods, enabling rigorous testing of uncertainty-aware indoor mapping and generative completion approaches for embodied navigation.

Conclusion: FlatLands establishes a standardized testbed for single-view BEV floor completion, facilitating research in uncertainty-aware mapping and generative completion for indoor navigation applications.

Abstract: A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird’s-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.

[102] CLRNet: Targetless Extrinsic Calibration for Camera, Lidar and 4D Radar Using Deep Learning

Marcell Kegl, Andras Palffy, Csaba Benedek, Dariu M. Gavrila

Main category: cs.CV

TL;DR: CLRNet: A multi-modal deep learning network for extrinsic calibration of camera, lidar, and 4D radar sensors, achieving 50%+ error reduction compared to state-of-the-art methods.

DetailsMotivation: Accurate extrinsic calibration of radar sensors remains challenging due to sparse data, and existing methods struggle with multi-sensor calibration across camera, lidar, and radar modalities.

Method: Proposes CLRNet, an end-to-end deep learning network that uses equirectangular projection, camera-based depth prediction, additional radar channels, shared feature space, and loop closure loss for joint calibration of camera-lidar-radar or pairwise sensor calibration.

Result: Superior calibration accuracy on View-of-Delft and Dual-Radar datasets, reducing both median translational and rotational calibration errors by at least 50% compared to existing methods.

Conclusion: CLRNet effectively addresses multi-sensor calibration challenges and demonstrates strong domain transfer capabilities across datasets, with code to be made publicly available.

Abstract: In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: https://github.com/tudelft-iv.

[103] Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Guoqing Wang, Xu Guo, Chenhui Li, Gongshen Liu

Main category: cs.CV

TL;DR: CE-OCR: A training-free, model-agnostic framework using Consensus Entropy to measure inter-model agreement for unsupervised OCR quality control, improving F1 scores by 42.1% over VLM-as-Judge.

DetailsMotivation: Current VLMs struggle with detecting sample-level OCR errors and lack effective unsupervised quality control, despite OCR being fundamental to VLMs and LLM training data generation.

Method: Introduces Consensus Entropy (CE) metric that measures inter-model agreement entropy, where correct predictions converge while errors diverge. CE-OCR framework uses ensemble agreement for output verification, selects best outputs, and employs adaptive routing for efficiency.

Result: CE improves F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same computational cost.

Conclusion: CE provides robust, training-free quality verification for OCR in VLMs, enabling plug-and-play integration without supervision. The framework offers effective unsupervised quality control for vision-language applications.

Abstract: Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.

[104] Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image Analysis

Umar Marikkar, Muhammad Awais, Sara Atito

Main category: cs.CV

TL;DR: eWSI integrates Parameter-Efficient Fine-Tuning with Multiple Instance Learning for end-to-end training on Whole Slide Images, achieving strong performance without extensive domain-specific pre-training.

DetailsMotivation: Current WSI analysis methods use pre-trained feature extractors (often on natural images like ImageNet) that fail to capture domain-specific characteristics, while domain-specific pre-training is computationally expensive and lacks task-specificity.

Method: Proposes EfficientWSI (eWSI), which combines Parameter-Efficient Fine-Tuning (PEFT) with Multiple Instance Learning (MIL) to enable end-to-end training on WSI tasks, allowing task-specific adaptation without full model retraining.

Result: eWSI with ImageNet feature extractors matches or outperforms MIL with in-domain feature extractors on seven WSI-level tasks across Camelyon16, TCGA, and BRACS datasets. When applied with in-domain feature extractors, it further improves performance in most cases.

Conclusion: eWSI provides a computationally efficient, task-targeted approach for WSI analysis that reduces the need for extensive domain-specific pre-training while capturing task-specific information where beneficial.

Abstract: Computational methods on analyzing Whole Slide Images (WSIs) enable early diagnosis and treatments by supporting pathologists in detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fail to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient-Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over Camelyon16, TCGA and BRACS datasets. Our results show that eWSI when applied with ImageNet feature extractors yields strong classification performance, matching or outperforming MILs with in-domain feature extractors, alleviating the need for extensive in-domain pre-training. Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.

[105] DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

Main category: cs.CV

TL;DR: DiFlowDubber: A novel two-stage training framework for video dubbing that transfers knowledge from pre-trained TTS models using discrete flow matching, with facial expression-based prosody guidance and speech-lip synchronization modules.

DetailsMotivation: Existing video dubbing approaches either train on limited datasets or use two-stage TTS adaptation pipelines that struggle with expressive prosody, rich acoustic characteristics, and precise speech-lip synchronization.

Method: Two-stage training framework with discrete flow matching generative backbone. Includes FaPro module for capturing global prosody/stylistic cues from facial expressions, and Synchronizer module for bridging modality gaps between text, video, and speech to ensure precise lip synchronization.

Result: Outperforms previous methods across multiple metrics on two primary benchmark datasets.

Conclusion: DiFlowDubber effectively addresses limitations of existing video dubbing approaches by combining knowledge transfer from pre-trained TTS models with facial expression-guided prosody modeling and improved cross-modal synchronization.

Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.

[106] Parallelised Differentiable Straightest Geodesics for 3D Meshes

Hippolyte Verninas, Caner Korkmaz, Stefanos Zafeiriou, Tolga Birdal, Simone Foti

Main category: cs.CV

TL;DR: Differentiable exponential map implementation for mesh surfaces with GPU acceleration, enabling geodesic computations and applications in geometric deep learning.

DetailsMotivation: Machine learning on non-Euclidean domains like surfaces/meshes lacks geometrically accurate methods due to missing closed-form Riemannian operators, non-differentiable discrete counterparts, and poor parallelization.

Method: Uses straightest geodesics framework for computing exponential map on meshes, provides parallel GPU implementation, and derives two differentiation methods: extrinsic proxy function and geodesic finite differences.

Result: Demonstrates parallelization performance and accuracy, improves learning/optimization pipelines on geometries, proposes new geodesic convolutional layer, flow matching method for meshes, and second-order optimizer for centroidal Voronoi tessellation.

Conclusion: Provides differentiable exponential map implementation (digeo library) that enables advanced geometric deep learning on mesh surfaces with GPU acceleration and various applications.

Abstract: Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can improve learning and optimisation pipelines on general geometries. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voronoi tessellation. Our code, models, and pip-installable library (digeo) are available at: circle-group.github.io/research/DSG.

[107] Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie

Main category: cs.CV

TL;DR: MM-SafetyBench++ benchmark for evaluating contextual safety in multimodal LLMs, with EchoSafe training-free framework using self-reflective memory for context-aware safety reasoning

DetailsMotivation: Current MLLMs show remarkable visual reasoning but have safety vulnerabilities, especially with contextual safety where models must distinguish subtle differences between similar-looking but safety-divergent scenarios. Prior work focuses on explicit unsafe input detection but overlooks contextual nuance.

Method: 1) Create MM-SafetyBench++ benchmark with unsafe image-text pairs and corresponding safe counterparts via minimal modifications that flip user intent while preserving context. 2) Develop EchoSafe framework with self-reflective memory bank to accumulate and retrieve safety insights from prior interactions, integrating past experiences into current prompts for context-aware reasoning.

Result: Extensive experiments on various multimodal safety benchmarks show EchoSafe consistently achieves superior performance, establishing strong baseline for advancing contextual safety in MLLMs.

Conclusion: The work addresses critical gap in MLLM safety evaluation through contextual safety benchmarking and demonstrates effectiveness of training-free memory-based approach for improving context-aware safety reasoning.

Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.

[108] Feed-forward Gaussian Registration for Head Avatar Creation and Editing

Malte Prinzler, Paulo Gotardo, Siyu Tang, Timo Bolkart

Main category: cs.CV

TL;DR: MATCH is a fast multi-view Gaussian registration method for head avatar creation that predicts Gaussian splat textures in 0.5 seconds per frame without preprocessing, enabling applications like expression transfer and semantic editing.

DetailsMotivation: Existing multi-view head avatar methods require time-consuming head tracking and expensive optimization (taking over a day), creating a need for faster, more efficient avatar creation without preprocessing.

Method: Uses transformer-based model with registration-guided attention blocks to predict Gaussian splat textures in fixed UV layout of template mesh; each UV-map token attends only to image tokens depicting its corresponding mesh region.

Result: Outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation while being 10 times faster than closest baseline; enables expression transfer, optimization-free tracking, semantic editing, and identity interpolation.

Conclusion: MATCH provides efficient, high-quality head avatar creation with learned intra- and inter-subject correspondences, significantly reducing creation time while enabling various editing applications.

Abstract: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatar methods require time-consuming head tracking followed by expensive avatar optimization, often resulting in a total creation time of more than one day. MATCH, in contrast, directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in just 0.5 seconds per frame, without requiring data preprocessing. The learned intra-subject correspondence across frames enables fast creation of personalized head avatars, while correspondence across subjects supports applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We establish these correspondences end-to-end using a transformer-based model that predicts Gaussian splat textures in the fixed UV layout of a template mesh. To achieve this, we introduce a novel registration-guided attention block, where each UV-map token attends exclusively to image tokens depicting its corresponding mesh region. This design improves efficiency and performance compared to dense cross-view attention. MATCH outperforms existing methods in novel-view synthesis, geometry registration, and head avatar generation, while making avatar creation 10 times faster than the closest competing baseline. The code and model weights are available on the project website.

[109] ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

Aditya Iyer, Jack Roberts, Nora Ayanian

Main category: cs.CV

TL;DR: ModTrack: A modular multi-view multi-object tracking system that achieves end-to-end performance while providing cross-modal generalization and uncertainty quantification through closed-form analytical methods.

DetailsMotivation: Current end-to-end MV-MOT methods lack principled uncertainty accounting and are tightly coupled to training configurations, limiting generalization across different sensor layouts, modalities, or datasets without retraining.

Method: Confines learning to only detection and feature extraction, then uses closed-form analytical methods for fusion, association, and tracking. Reduces sensor outputs to calibrated position-covariance pairs, performs cross-view clustering and precision-weighted fusion, and uses a feedback-coupled GM-PHD filter with HMM motion modes for identity maintenance.

Result: Achieves 95.5 IDF1 and 91.4 MOTA on WildTrack, surpassing all prior modular methods by over 21 points and rivaling state-of-the-art end-to-end methods. Same tracker core transfers unchanged to MultiviewX and RadarScenes with only perception-module replacement.

Conclusion: ModTrack provides deployment flexibility and cross-modal generalization that end-to-end methods cannot, while matching their performance and offering traceable uncertainty quantification.

Abstract: Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird’s Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the \textit{Detection and Feature Extraction} stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor’s output to calibrated position-covariance pairs $(\mathbf{z}, R)$; cross-view clustering and precision-weighted fusion then yield unified estimates $(\hat{\mathbf{z}}, \hat{R})$ for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion. ModTrack achieves 95.5 IDF1 and 91.4 MOTA on \textit{WildTrack}, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to \textit{MultiviewX} and \textit{RadarScenes}, with only perception-module replacement required to extend to new domains and sensor modalities.

[110] Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition

Salah Eddine Bekhouche, Hichem Telli, Azeddine Benlamoudi, Salah Eddine Herrouz, Abdelmalik Taleb-Ahmed, Abdenour Hadid

Main category: cs.CV

TL;DR: ConflictAwareAH: A multimodal framework that detects ambivalence and hesitancy by analyzing conflicts between video, audio, and text modalities using pairwise conflict features and text-guided late fusion.

DetailsMotivation: Ambivalence and hesitancy are subtle affective states where people show conflicting signals through different channels (saying one thing while face/voice tells another). Recognizing these states automatically is valuable in clinical settings, but challenging for machines because evidence lives in disagreements between modalities.

Method: Uses three pre-trained encoders for video, audio, and text. Extracts pairwise conflict features as element-wise absolute differences between modality embeddings. These bidirectional cues flag A/H when differences are large, and confirm behavioral consistency when differences are small. Also employs text-guided late fusion strategy blending text-only auxiliary head with full model at inference.

Result: Achieves 0.694 Macro F1 on labeled test split and 0.715 on private leaderboard of BAH dataset from ABAW10 Ambivalence/Hesitancy Challenge. Outperforms published multimodal baselines by over 10 points. Improves F1-NoAH by +4.6 points over text alone and halves class-performance gap. Text-guided late fusion adds +4.1 Macro F1.

Conclusion: The conflict-aware multimodal approach effectively detects ambivalence and hesitancy by focusing on cross-modal disagreements, addressing limitations of text-dominant approaches. The method is efficient, running on a single GPU in under 25 minutes of training.

Abstract: Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels – saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the \emph{disagreements} between what is said, how it sounds, and what the face shows. We present \textbf{ConflictAwareAH}, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features – element-wise absolute differences between modality embeddings – serve as \emph{bidirectional} cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary \emph{text-guided late fusion} strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches \textbf{0.694 Macro F1} on the labelled test split and \textbf{0.715} on the private leaderboard, outperforming published multimodal baselines by over 10 points – all on a single GPU in under 25 minutes of training.

[111] Beyond the Embedding Bottleneck: Adaptive Retrieval-Augmented 3D CT Report Generation

Renjie Liang, Yiling Ma, Yang Xing, Zhengkang Fan, Jinqian Pan, Chengkun Sun, Li Li, Kuang Gong, Jie Xu

Main category: cs.CV

TL;DR: AdaRAG-CT addresses incomplete pathology coverage in 3D CT report generation by identifying a representational bottleneck in contrastive CT embeddings and using adaptive retrieval-augmented generation to supplement missing information.

DetailsMotivation: Automated radiology report generation from 3D CT volumes suffers from incomplete pathology coverage due to a representational bottleneck in visual embeddings, where contrastive 3D CT embeddings exhibit severe dimensional concentration (only 2 effective dimensions out of 512), limiting both generation and retrieval performance.

Method: Proposes AdaRAG-CT, an adaptive augmentation framework that compensates for the visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation, using both retrieval and generation components to improve clinical efficacy.

Result: On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm both retrieval and generation components contribute to the improvement.

Conclusion: The representational bottleneck in 3D CT embeddings limits pathology coverage in automated report generation, and adaptive retrieval-augmented generation can effectively compensate for this limitation to improve clinical efficacy.

Abstract: Automated radiology report generation from 3D CT volumes often suffers from incomplete pathology coverage. We provide empirical evidence that this limitation stems from a representational bottleneck: contrastive 3D CT embeddings encode discriminative pathology signals, yet exhibit severe dimensional concentration, with as few as 2 effective dimensions out of 512. Corroborating this, scaling the language model yields no measurable improvement, suggesting that the bottleneck lies in the visual representation rather than the generator. This bottleneck limits both generation and retrieval; naive static retrieval fails to improve clinical efficacy and can even degrade performance. We propose \textbf{AdaRAG-CT}, an adaptive augmentation framework that compensates for this visual bottleneck by introducing supplementary textual information through controlled retrieval and selectively integrating it during generation. On the CT-RATE benchmark, AdaRAG-CT achieves state-of-the-art clinical efficacy, improving Clinical F1 from 0.420 (CT-Agent) to 0.480 (+6 points); ablation studies confirm that both the retrieval and generation components contribute to the improvement. Code is available at https://github.com/renjie-liang/Adaptive-RAG-for-3DCT-Report-Generation.

[112] Deformation-Invariant Neural Network and Its Applications in Distorted Image Restoration and Analysis

Han Zhang, Qiguang Chen, Lok Ming Lui

Main category: cs.CV

TL;DR: DINN framework uses quasiconformal transformer network to handle geometrically distorted images by transforming them into improved versions closer to natural image distributions, enhancing performance on classification and restoration tasks.

DetailsMotivation: Geometric distortions in images pose significant challenges for deep learning models in computer vision tasks like object recognition, as existing models often fail to perform accurately on distorted images.

Method: Proposes Deformation-Invariant Neural Network (DINN) with a lightweight Quasiconformal Transformer Network (QCTN) component that outputs quasiconformal maps to transform distorted images into improved versions closer to natural image distributions.

Result: DINN achieves accurate classification of distorted images, outperforms existing GAN-based restoration methods for atmospheric and water turbulence distortions, and achieves satisfactory performance on face verification under atmospheric turbulence.

Conclusion: The DINN framework effectively handles geometric distortions in images through quasiconformal transformations, demonstrating superior performance in classification, restoration, and verification tasks compared to existing methods.

Abstract: Images degraded by geometric distortions pose a significant challenge to imaging and computer vision tasks such as object recognition. Deep learning-based imaging models usually fail to give accurate performance for geometrically distorted images. In this paper, we propose the deformation-invariant neural network (DINN), a framework to address the problem of imaging tasks for geometrically distorted images. The DINN outputs consistent latent features for images that are geometrically distorted but represent the same underlying object or scene. The idea of DINN is to incorporate a simple component, called the quasiconformal transformer network (QCTN), into other existing deep networks for imaging tasks. The QCTN is a deep neural network that outputs a quasiconformal map, which can be used to transform a geometrically distorted image into an improved version that is closer to the distribution of natural or good images. It first outputs a Beltrami coefficient, which measures the quasiconformality of the output deformation map. By controlling the Beltrami coefficient, the local geometric distortion under the quasiconformal mapping can be controlled. The QCTN is lightweight and simple, which can be readily integrated into other existing deep neural networks to enhance their performance. Leveraging our framework, we have developed an image classification network that achieves accurate classification of distorted images. Our proposed framework has been applied to restore geometrically distorted images by atmospheric turbulence and water turbulence. DINN outperforms existing GAN-based restoration methods under these scenarios, demonstrating the effectiveness of the proposed framework. Additionally, we apply our proposed framework to the 1-1 verification of human face images under atmospheric turbulence and achieve satisfactory performance, further demonstrating the efficacy of our approach.

[113] FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

Eadom Dessalene, Botao He, Michael Maynord, Yonatan Tussa, Pavan Mantripragada, Yianni Karabati, Nirupam Roy, Yiannis Aloimonos

Main category: cs.CV

TL;DR: FEEL dataset pairs force measurements from piezoresistive gloves with egocentric video for physical action understanding in kitchen environments

DetailsMotivation: Force is the underlying cause that drives physical interaction, making it a critical primitive for physical action understanding. Current datasets lack synchronized force measurements with egocentric video.

Method: Created custom piezoresistive gloves for scalable force data collection, built FEEL dataset with ~3M force-synchronized frames of natural kitchen manipulation. Applied to contact understanding (temporal segmentation + pixel-level segmentation) and action representation learning via force prediction as self-supervised pretraining.

Result: Achieved state-of-the-art temporal contact segmentation and competitive pixel-level segmentation without manual annotations. Force-based pretraining improved transfer performance on action understanding tasks across EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano datasets.

Conclusion: Force measurements paired with egocentric video provide valuable signals for physical action understanding, enabling self-supervised learning and improved performance on downstream tasks without manual labels.

Abstract: We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

[114] Self-supervised Disentanglement of Disease Effects from Aging in 3D Medical Shapes

Jakaria Rabbi, Nilanjan Ray, Dana Cobzas

Main category: cs.CV

TL;DR: A two-stage framework for disentangling pathological changes from physiological aging in 3D medical shapes using unsupervised disease discovery and self-supervised disentanglement of implicit shape representations.

DetailsMotivation: Separating disease-related shape changes from normal aging in 3D medical shapes is crucial for interpretable biomarkers and patient stratification, but challenging when diagnosis labels are limited or unavailable due to overlapping effects.

Method: Two-stage framework: 1) Train implicit neural model with signed distance functions to learn stable shape embeddings, then apply clustering for pseudo disease labels. 2) Disentangle factors in variational space using pseudo labels and ground truth age labels with multi-objective disentanglement loss combining covariance and supervised contrastive loss.

Result: Achieved near-supervised performance on ADNI hippocampus and OAI distal femur shapes, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability.

Conclusion: The proposed framework successfully disentangles pathological changes from physiological aging in 3D medical shapes without requiring extensive diagnosis labels, providing valuable tools for biomarker development and patient stratification.

Abstract: Disentangling pathological changes from physiological aging in 3D medical shapes is crucial for developing interpretable biomarkers and patient stratification. However, this separation is challenging when diagnosis labels are limited or unavailable, since disease and aging often produce overlapping effects on shape changes, obscuring clinically relevant shape patterns. To address this challenge, we propose a two-stage framework combining unsupervised disease discovery with self-supervised disentanglement of implicit shape representations. In the first stage, we train an implicit neural model with signed distance functions to learn stable shape embeddings. We then apply clustering on the shape latent space, which yields pseudo disease labels without using ground-truth diagnosis during discovery. In the second stage, we disentangle factors in a compact variational space using pseudo disease labels discovered in the first stage and the ground truth age labels available for all subjects. We enforce separation and controllability with a multi-objective disentanglement loss combining covariance and a supervised contrastive loss. On ADNI hippocampus and OAI distal femur shapes, we achieve near-supervised performance, improving disentanglement and reconstruction over state-of-the-art unsupervised baselines, while enabling high-fidelity reconstruction, controllable synthesis, and factor-based explainability. Code and checkpoints are available at https://github.com/anonymous-submission01/medical-shape-disentanglement

[115] Make it SING: Analyzing Semantic Invariants in Classifiers

Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa

Main category: cs.CV

TL;DR: SING is a method that interprets classifier invariants by generating equivalent images and providing semantic interpretations using vision-language models, revealing differences in how models like ResNet50 and DinoViT handle semantic information in their null spaces.

DetailsMotivation: Classifiers have invariants in their null spaces that create equivalent input sets with identical outputs, but existing methods fail to provide human-interpretable semantic understanding of these invariants.

Method: SING constructs equivalent images with respect to network classification and uses a mapping from network features to multimodal vision-language models to obtain natural language descriptions and visual examples of semantic shifts in the null space.

Result: The method reveals that ResNet50 leaks relevant semantic attributes to the null space, while DinoViT (ViT pretrained with self-supervised DINO) better maintains class semantics across the invariant space.

Conclusion: SING provides interpretable semantic analysis of classifier invariants, enabling both local image analysis and statistical analysis at class/model levels, revealing important differences in how different architectures handle semantic information.

Abstract: All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.

[116] EvoIQA - Explaining Image Distortions with Evolved White-Box Logic

Ruchika Gupta, Illya Bakurov, Nathan Haut, Wolfgang Banzhaf

Main category: cs.CV

TL;DR: EvoIQA uses Genetic Programming to evolve interpretable mathematical formulas for image quality assessment that achieve state-of-the-art performance while maintaining explainability.

DetailsMotivation: Traditional IQA metrics are either rigid hand-crafted models or uninterpretable "black-box" deep learning architectures. There's a need for methods that combine state-of-the-art performance with human-readable explainability.

Method: A symbolic regression framework based on Genetic Programming that evolves explicit mathematical formulas for IQA. Uses terminal sets from established metrics (VSI, VIF, FSIM, HaarPSI) to map structural, chromatic, and information-theoretic degradations into observable equations.

Result: Evolved GP models achieve strong alignment with human visual preferences, outperform traditional hand-crafted metrics, and achieve performance parity with complex state-of-the-art deep learning models like DB-CNN.

Conclusion: Interpretability doesn’t have to be sacrificed for state-of-the-art performance in image quality assessment; explainable symbolic models can achieve comparable results to complex deep learning architectures.

Abstract: Traditional Image Quality Assessment (IQA) metrics typically fall into one of two extremes: rigid, hand-crafted mathematical models or “black-box” deep learning architectures that completely lack interpretability. To bridge this gap, we propose EvoIQA, a fully explainable symbolic regression framework based on Genetic Programming that Evolves explicit, human-readable mathematical formulas for image quality assessment (IQA). Utilizing a rich terminal set from the VSI, VIF, FSIM, and HaarPSI metrics, our framework inherently maps structural, chromatic, and information-theoretic degradations into observable mathematical equations. Our results demonstrate that the evolved GP models consistently achieve strong alignment between the predictions and human visual preferences. Furthermore, they not only outperform traditional hand-crafted metrics but also achieve performance parity with complex, state-of-the-art deep learning models like DB-CNN, proving that we no longer have to sacrifice interpretability for state-of-the-art performance.

[117] Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

Siyu Zhang

Main category: cs.CV

TL;DR: Sparse neural networks don’t necessarily improve interpretability in Vision Transformers despite producing more compact circuits, as pruning redistributes computation rather than isolating simpler functional modules.

DetailsMotivation: To systematically evaluate whether structural sparsity in neural networks leads to improved semantic interpretability, challenging the common hypothesis that sparse models are more interpretable than dense ones.

Method: Used DeiT-III B/16 Vision Transformers pruned with Wanda, and introduced IMPACT framework evaluating interpretability across four levels: neurons, layer representations (using BatchTopK sparse autoencoders), task circuits (via learnable node masking), and model-level attribution (using transformer attribution with insertion/deletion metrics).

Result: Sparse models produce circuits with 2.5× fewer edges than dense models, but fraction of active nodes remains similar or higher. No systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. Pruning redistributes computation rather than isolating simpler functional modules.

Conclusion: Structural sparsity alone does not reliably yield more interpretable vision models, highlighting the need for evaluation frameworks that assess interpretability beyond circuit compactness.

Abstract: Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce \textbf{IMPACT}, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately $2.5\times$ fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.

[118] Nodule-Aligned Latent Space Learning with LLM-Driven Multimodal Diffusion for Lung Nodule Progression Prediction

James Song, Yifan Wang, Chuan Zhou, Liyue Shen

Main category: cs.CV

TL;DR: NAMD is a multimodal diffusion framework that generates 1-year follow-up CT images of lung nodules using baseline scans and EHR data, with nodule-aligned latent space and LLM-driven control for improved malignancy prediction.

DetailsMotivation: Early lung cancer diagnosis is difficult due to biological uncertainty and limited understanding of nodule progression mechanisms. Current methods lack effective integration of multimodal patient data for predicting nodule evolution.

Method: Proposes Nodule-Aligned Multimodal (Latent) Diffusion (NAMD) framework with: 1) nodule-aligned latent space where distances correspond to nodule attribute changes, 2) LLM-driven control mechanism to condition diffusion on patient EHR data, 3) generation of 1-year follow-up CT images from baseline scans and patient data.

Result: On NLST dataset: achieves AUROC 0.805 and AUPRC 0.346 for malignancy prediction, significantly outperforming baseline scans and state-of-the-art synthesis methods, approaching real follow-up scan performance (AUROC: 0.819, AUPRC: 0.393).

Conclusion: NAMD effectively captures clinically relevant features of lung nodule progression, demonstrating potential for earlier and more accurate lung cancer diagnosis through multimodal data integration and synthetic follow-up generation.

Abstract: Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images with baseline scans and the patient’s and nodule’s Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.

[119] Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

Samuel Johnny, Blessed Guda, Frank Ebeledike, Goodness Obasi, Moise Busogi

Main category: cs.CV

TL;DR: A medical imaging framework using MobileViT-XXS with SliceTransformer and KL-regularized Group DRO for COVID-19 classification and lung pathology recognition with fairness across acquisition sites and demographic groups.

DetailsMotivation: Address distribution shift across acquisition sites and performance disparity across demographic subgroups in chest CT scan diagnosis, particularly for COVID-19 classification and lung pathology recognition with gender fairness.

Method: Combines lightweight MobileViT-XXS slice encoder with two-layer SliceTransformer for volumetric reasoning, trained with KL-regularized Group Distributionally Robust Optimization that adaptively upweights underperforming acquisition centers and demographic subgroups.

Result: Achieved challenge F1 of 0.835 for COVID-19 classification (surpassing best published by +5.9) and mean per-gender macro F1 of 0.815 for lung pathology recognition (outperforming best challenge entry by +11.1 pp), with Female Squamous F1 improved by +17.4 over Focal Loss baseline.

Conclusion: The KL-regularized Group DRO framework effectively addresses distribution shift and fairness issues in medical imaging, providing stable balance between worst-case protection and average performance across diverse acquisition sites and demographic groups.

Abstract: Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with α = 0.5 achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Fo- cal Loss baseline.

[120] A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology

Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin, William L. Clapp, Meryl A. Waldman, Tarek M. El-Ashkar, Sanjay Jain, Luis Rodrigues, Kuang Yu Jen, Avi Z. Rosenberg, Michael T. Eadon, Jeffrey B. Hodgin, Pinaki Sarder

Main category: cs.CV

TL;DR: Evaluation of 11 histopathology foundation models on kidney disease tasks shows moderate performance on coarse morphology but poor results on fine-grained discrimination and prognostic inference.

DetailsMotivation: Histopathology foundation models have advanced cancer analysis but their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies.

Method: Systematic evaluation of 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains, spatial scales, task types, and clinical objectives using rigorous statistical validation methods.

Result: Models show moderate to strong performance on tasks driven by coarse meso-scale renal morphology but consistently decline for tasks requiring fine-grained microstructural discrimination, complex phenotypes, or slide-level prognostic inference.

Conclusion: Current HFMs encode predominantly static meso-scale representations with limited capacity for subtle renal pathology or prognosis, highlighting need for kidney-specific, multi-stain, multimodal foundation models.

Abstract: Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.

[121] UMO: Unified In-Context Learning Unlocks Motion Foundation Model Priors

Xiaoyan Cong, Zekun Li, Zhiyang Dou, Hongyu Li, Omid Taheri, Chuan Guo, Abhay Mittal, Sizhe An, Taku Komura, Wojciech Matusik, Michael J. Black, Srinath Sridhar

Main category: cs.CV

TL;DR: UMO is a unified framework that adapts pretrained text-to-motion foundation models to support diverse downstream motion generation tasks through atomic per-frame operations and lightweight temporal fusion.

DetailsMotivation: While large-scale foundation models have advanced text-to-motion generation, they remain single-purpose and cannot efficiently handle diverse cross-modal and in-context motion generation tasks. Current approaches require task-specific adaptations rather than a unified solution.

Method: UMO introduces a unified formulation that casts diverse tasks into compositions of atomic per-frame operations. It uses three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into pretrained DiT-based motion LFMs with minimal overhead.

Result: UMO enables a single pretrained model to support previously unsupported tasks including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. It consistently outperforms task-specific and training-free baselines across various benchmarks.

Conclusion: UMO provides an effective unified framework for unlocking the generative priors of motion foundation models to support diverse downstream tasks through in-context adaptation with minimal computational overhead.

Abstract: Large-scale foundation models (LFMs) have recently made impressive progress in text-to-motion generation by learning strong generative priors from massive 3D human motion datasets and paired text descriptions. However, how to effectively and efficiently leverage such single-purpose motion LFMs, i.e., text-to-motion synthesis, in more diverse cross-modal and in-context motion generation downstream tasks remains largely unclear. Prior work typically adapts pretrained generative priors to individual downstream tasks in a task-specific manner. In contrast, our goal is to unlock such priors to support a broad spectrum of downstream motion generation tasks within a single unified framework. To bridge this gap, we present UMO, a simple yet general unified formulation that casts diverse downstream tasks into compositions of atomic per-frame operations, enabling in-context adaptation to unlock the generative priors of pretrained DiT-based motion LFMs. Specifically, UMO introduces three learnable frame-level meta-operation embeddings to specify per-frame intent and employs lightweight temporal fusion to inject in-context cues into the pretrained backbone, with negligible runtime overhead compared to the base model. With this design, UMO finetunes the pretrained model, originally limited to text-to-motion generation, to support diverse previously unsupported tasks, including temporal inpainting, text-guided motion editing, text-serialized geometric constraints, and multi-identity reaction generation. Experiments demonstrate that UMO consistently outperforms task-specific and training-free baselines across a wide range of benchmarks, despite using a single unified model. Code and model will be publicly available. Project Page: https://oliver-cong02.github.io/UMO.github.io/

[122] Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

Sijie Li, Biao Qian, Jungong Han

Main category: cs.CV

TL;DR: ATV-Pruning is an asymmetric text-visual weight pruning method for Large Vision-Language Models that addresses modality-specific behaviors by using different calibration strategies for textual and visual pathways.

DetailsMotivation: Existing pruning methods for LVLMs treat calibration data from different modalities uniformly, ignoring modality-specific behaviors. This creates challenges in accurately pruning models due to divergent behaviors of textual and visual tokens.

Method: Proposes ATV-Pruning with two innovations: 1) adaptive calibration pool using all textual tokens and subset of visual tokens, 2) layer-adaptive selection strategy for important visual tokens. Based on findings that textual pathway is more sensitive and visual pathway has high redundancy.

Result: Extensive experiments across standard multimodal benchmarks show superiority over state-of-the-art pruning methods.

Conclusion: ATV-Pruning effectively addresses modality-specific behaviors in LVLM pruning through asymmetric text-visual calibration strategies, achieving better performance than existing methods.

Abstract: Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.

[123] Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery

Jecia Z. Y. Mao, Francis X. Creighton, Russell H. Taylor, Manish Sahu

Main category: cs.CV

TL;DR: Speech-guided embodied agent framework for video-guided skull base surgery that integrates natural language interaction with real-time visual perception on live intraoperative video streams, enabling surgeons to request computational assistance without disengaging from operative tasks.

DetailsMotivation: Current image-guided navigation systems for surgery rely on external optical trackers and additional hardware setup, which disrupt surgical workflow. There's a need for systems that integrate computational assistance directly into the surgical workflow without requiring surgeons to disengage from operative tasks.

Method: The framework integrates natural language interaction with real-time visual perception on live intraoperative video streams. It begins with interactive segmentation and labeling of surgical instruments, then uses the segmented instrument as a spatial anchor for autonomous tracking. This supports downstream workflows including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based surgical tool pose estimation, and real-time anatomical overlays for image guidance.

Result: The system achieves competitive spatial accuracy compared to commercially available optical tracking systems while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

Conclusion: Speech-guided embodied agents can effectively integrate computational assistance into surgical workflows, providing competitive accuracy to traditional optical tracking systems while offering better workflow integration and easier deployment.

Abstract: We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and support image guidance through real-time anatomical overlays.We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.

[124] ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Yifan Li, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Jason Kuen, Yu Kong, Trung Bui

Main category: cs.CV

TL;DR: ViT-AdaLA: A framework for adapting Vision Transformer foundation models to linear attention architectures through attention and feature alignment, enabling efficient transfer learning while maintaining performance.

DetailsMotivation: Vision Transformers (ViTs) have quadratic complexity that limits scalability to long sequences. Existing linear attention approaches either require training from scratch (computationally expensive) or don't transfer well from language models to vision tasks.

Method: Three-stage framework: 1) Attention alignment - align vanilla linear attention with original softmax attention in each block; 2) Feature alignment - fine-tune linearized ViT to align final-layer features with frozen softmax VFM teacher; 3) Supervised fine-tuning for downstream tasks.

Result: Extensive experiments on classification and segmentation tasks demonstrate effectiveness and generality over various state-of-the-art linear attention counterparts.

Conclusion: ViT-AdaLA successfully adapts prior knowledge from vision foundation models to linear attention ViTs, addressing computational limitations while maintaining performance across vision tasks.

Abstract: Vision Transformers (ViTs) based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterpart.

[125] Attribution Upsampling should Redistribute, Not Interpolate

Vincenzo Buono, Peyman Sheikholharam Mashhadi, Mahmoud Rahat, Prayag Tiwari, Stefan Byttner

Main category: cs.CV

TL;DR: USU is a semantic-aware upsampling method for attribution maps that preserves attribution mass and ordering by treating upsampling as mass redistribution rather than interpolation, addressing aliasing and boundary bleeding issues in explainable AI.

DetailsMotivation: Standard interpolation methods (bilinear/bicubic) corrupt attribution signals in explainable AI by creating spurious high-importance regions through aliasing, ringing, and boundary bleeding, misrepresenting model reasoning. The core issue is treating attribution upsampling as interpolation rather than semantic-aware mass redistribution.

Method: USU reformulates upsampling through ratio-form mass redistribution operators that preserve attribution mass and relative importance ordering. It formalizes four desiderata for faithful upsampling, proves interpolation violates three, and derives the unique ratio-form operator that satisfies all four axioms.

Result: Controlled experiments verify USU’s formal guarantees; evaluations across ImageNet, CIFAR-10, and CUB-200 show consistent faithfulness improvements and qualitatively superior, semantically coherent explanations compared to standard interpolation methods.

Conclusion: USU provides a principled solution to attribution upsampling by treating it as mass redistribution governed by semantic boundaries, offering provable guarantees and practical improvements for explainable AI attribution methods.

Abstract: Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model’s reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU’s formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.

[126] Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI

Athena Taymourtash, S. Mazdak Abulnaga, Esra Abaci Turk, P. Ellen Grant, Polina Golland

Main category: cs.CV

TL;DR: Volumetric implicit registration method for anatomical shapes that learns shared canonical template with neural diffeomorphic flow, enabling interior deformation modeling beyond surface correspondences.

DetailsMotivation: Existing implicit registration methods only capture surface correspondences near zero-level sets, leaving interior deformations under-constrained for anatomical shapes like placentas in MRI.

Method: Couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn shared canonical template. Uses volumetric regularization including Jacobian-determinant and biharmonic penalties to suppress local folding and promote globally coherent deformations.

Result: Demonstrates improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods on in-vivo placenta MRI scans, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.

Conclusion: The method enables joint reconstruction, alignment to population-derived implicit template, and voxel-wise intensity mapping in unified canonical space for anatomical shape analysis.

Abstract: Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.

[127] Structured prototype regularization for synthetic-to-real driving scene parsing

Jiahe Fan, Xiao Ma, Sergey Vityazev, George Giakos, Shaolong Shu, Rui Fan

Main category: cs.CV

TL;DR: Novel unsupervised domain adaptation framework for driving scene parsing that regularizes semantic feature structures to bridge synthetic-to-real domain gap, using class prototypes for inter-class separation and intra-class compactness.

DetailsMotivation: Models trained on synthetic driving data perform poorly on real-world scenes due to domain gap. Existing domain adaptation methods focus on global feature alignment but overlook semantic structure, limiting generalization.

Method: Proposes unsupervised domain adaptation framework with semantic feature structure regularization. Uses class-specific prototypes to enforce inter-class separation and intra-class compactness. Includes entropy-based noise filtering for pseudo labels and pixel-level attention for feature alignment.

Result: Extensive experiments on representative benchmarks show the method consistently outperforms recent state-of-the-art methods in driving scene parsing tasks.

Conclusion: Preserving semantic structure is crucial for robust synthetic-to-real adaptation in driving scene parsing, and the proposed framework effectively bridges the domain gap through semantic feature regularization.

Abstract: Driving scene parsing is critical for autonomous vehicles to operate reliably in complex real-world traffic environments. To reduce the reliance on costly pixel-level annotations, synthetic datasets with automatically generated labels have become a popular alternative. However, models trained on synthetic data often perform poorly when applied to real-world scenes due to the synthetic-to-real domain gap. Despite the success of unsupervised domain adaptation in narrowing this gap, most existing methods mainly focus on global feature alignment while overlooking the semantic structure of the feature space. As a result, semantic relations among classes are insufficiently modeled, limiting the model’s ability to generalize. To address these challenges, this study introduces a novel unsupervised domain adaptation framework that explicitly regularizes semantic feature structures to significantly enhance driving scene parsing performance in real-world scenarios. Specifically, the proposed method enforces inter-class separation and intra-class compactness by leveraging class-specific prototypes, thereby enhancing the discriminability and structural coherence of feature clusters. An entropy-based noise filtering strategy improves the reliability of pseudo labels, while a pixel-level attention mechanism further refines feature alignment. Extensive experiments on representative benchmarks demonstrate that the proposed method consistently outperforms recent state-of-the-art methods. These results underscore the importance of preserving semantic structure for robust synthetic-to-real adaptation in driving scene parsing tasks.

[128] Interact3D: Compositional 3D Generation of Interactive Objects

Hui Shan, Keyang Luo, Ming Li, Sizhe Zheng, Yanwei Fu, Zhen Chen, Xiangru Huang

Main category: cs.CV

TL;DR: Interact3D: A framework for generating physically plausible 3D compositional objects from single images with occlusion handling, using generative priors, geometric alignment, SDF-based optimization, and VLM-guided iterative refinement.

DetailsMotivation: Existing 3D generation methods struggle with compositional objects from single images, especially under occlusions, often degrading geometric details in hidden regions and failing to preserve object-object spatial relationships.

Method: 1) Use generative priors to curate high-quality individual assets with unified 3D guidance scene. 2) Two-stage composition: primary object anchored via global-to-local geometric alignment, subsequent geometries integrated using differentiable SDF-based optimization with intersection penalties. 3) Closed-loop agentic refinement: VLM analyzes multi-view renderings, formulates corrective prompts, guides image editing module for iterative self-correction.

Result: Extensive experiments show Interact3D successfully produces promising collision-aware compositions with improved geometric fidelity and consistent spatial relationships.

Conclusion: Interact3D presents an effective framework for generating physically plausible interacting 3D compositional objects from single images, addressing challenges of occlusions and spatial relationship preservation.

Abstract: Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images–particularly under occlusions–remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collsion-aware compositions with improved geometric fidelity and consistent spatial relationships.

[129] Parallel In-context Learning for Large Vision Language Models

Shin’ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa

Main category: cs.CV

TL;DR: Parallel-ICL: A plug-and-play inference algorithm that partitions long multimodal demonstration contexts into parallel chunks and ensembles predictions via weighted Product-of-Experts to reduce quadratic attention cost while maintaining performance.

DetailsMotivation: Current multimodal in-context learning suffers from quadratic computational cost with increasing demonstrations, creating a trade-off between performance and inference latency. Need to maintain accuracy while improving efficiency.

Method: Partitions long demonstration context into multiple shorter chunks, processes them in parallel, integrates predictions at logit level using weighted Product-of-Experts ensemble. Uses clustering-based context chunking for diversity and similarity-based context compilation for query-relevant weighting.

Result: Achieves performance comparable to full-context multimodal in-context learning while significantly improving inference speed on VQA, image captioning, and classification benchmarks.

Conclusion: Provides effective solution to accuracy-efficiency trade-off in multimodal in-context learning, enabling dynamic task adaptation with reduced inference overhead.

Abstract: Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.

[130] LICA: Layered Image Composition Annotations for Graphic Design Research

Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta

Main category: cs.CV

TL;DR: LICA is a large-scale dataset of 1.55M multi-layer graphic design compositions with hierarchical component annotations, enabling structured understanding and generation of graphic layouts beyond pixel-level analysis.

DetailsMotivation: Current vision-language models lack structured understanding of graphic design compositions, treating them as flat images rather than hierarchical systems of components. There's a need for datasets that capture the layered structure of designs to enable more sophisticated layout understanding and generation tasks.

Method: Created LICA dataset with 1.55M designs represented as hierarchical compositions of typed components (text, image, vector, group) with rich metadata including spatial geometry, typographic attributes, and motion parameters. Includes 27K animated layouts with per-component keyframes.

Result: Large-scale dataset spanning 20 design categories and 971K unique templates, establishing new research tasks like layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling.

Conclusion: LICA enables research on models that operate directly on design structure rather than pixels alone, advancing structured understanding and generation of graphic layouts through a new paradigm of compositional layer representation.

Abstract: We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts1. In addition to ren- dered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investiga- tions into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By repre- senting design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.

[131] OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian

Main category: cs.CV

TL;DR: OneWorld is a framework for 3D scene generation that performs diffusion directly in a coherent 3D representation space, addressing cross-view consistency issues in existing 2D-based methods.

DetailsMotivation: Existing diffusion-based 3D scene generation methods operate in 2D image/video latent spaces, making it inherently challenging to maintain cross-view appearance and geometric consistency. The authors aim to bridge this gap by working directly in 3D representation space.

Method: The framework introduces: 1) 3D Unified Representation Autoencoder (3D-URAE) that leverages pretrained 3D foundation models and injects appearance while distilling semantics into a unified 3D latent space; 2) token-level Cross-View-Correspondence (CVC) consistency loss to enforce structural alignment across views; 3) Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold.

Result: Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods.

Conclusion: OneWorld successfully addresses the cross-view consistency problem in 3D scene generation by operating directly in 3D representation space, achieving better results than existing 2D-based approaches.

Abstract: Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.

[132] Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

Jonas Herzog, Yue Wang

Main category: cs.CV

TL;DR: CLIP-like models’ image embeddings are not suboptimal due to intra-modal misalignment; the issue is task ambiguity, not alignment problems

DetailsMotivation: Recent research claims CLIP embeddings are suboptimal for image-only tasks due to intra-modal misalignment from language-image training, but this study questions that hypothesis

Method: Reexamines theoretical arguments about embedding distance degrees of freedom, compares empirical measures between language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), and conducts experiments on intra-modal tasks like retrieval and few-shot classification

Result: Found no theoretical degrees of freedom for image embedding distances, empirical measures yield similar results across model types, and experiments show addressing task ambiguity (not misalignment) is key for best performance

Conclusion: The intra-modal misalignment hypothesis is incorrect; CLIP-like models’ performance on image-only tasks is limited by task ambiguity rather than embedding misalignment issues

Abstract: Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.

[133] NanoGS: Training-Free Gaussian Splat Simplification

Butian Xiong, Rong Liu, Tiantian Zhou, Meida Chen, Zhiwen Fan, Andrew Feng

Main category: cs.CV

TL;DR: NanoGS is a training-free, lightweight framework for simplifying 3D Gaussian Splat models by formulating simplification as local pairwise merging over sparse spatial graphs, reducing primitive count while maintaining rendering quality.

DetailsMotivation: 3D Gaussian Splatting requires millions of primitives, incurring significant storage and transmission costs. Existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment.

Method: Formulates simplification as local pairwise merging over sparse spatial graph. Approximates pairs of Gaussians with single primitive using mass preserved moment matching. Evaluates merge quality through principled merge cost between original mixture and approximation. Restricts merge candidates to local neighborhoods and selects compatible pairs efficiently.

Result: NanoGS substantially reduces primitive count while maintaining high rendering fidelity. Operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves standard 3DGS parameterization for seamless integration with existing rendering pipelines.

Conclusion: NanoGS provides an efficient and practical solution for Gaussian Splat simplification that is training-free, lightweight, and maintains rendering quality while significantly reducing model size.

Abstract: 3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.

[134] PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency

Minbing Chen, Zhu Meng, Fei Su

Main category: cs.CV

TL;DR: PathGLS is a reference-free evaluation framework for pathology VLMs that assesses Grounding, Logic, and Stability to detect hallucinations and measure trustworthiness.

DetailsMotivation: VLMs in computational pathology lack reliable automated evaluation metrics to detect subtle failures like hallucinations, limiting clinical adoption despite their potential for interpretable analysis and decision support.

Method: Proposes PathGLS framework evaluating pathology VLMs across three dimensions: Grounding (visual-text alignment), Logic (entailment graph consistency via NLI), and Stability (output variance under adversarial perturbations). Supports both patch-level and whole-slide image analysis.

Result: PathGLS shows 40.2% sensitivity drop for hallucinated reports vs 2.1% for BERTScore on Quilt-1M. Achieves ρ=0.71 correlation with expert error hierarchies, outperforming LLM-based approaches (Gemini 3.0 Pro: ρ=0.39).

Conclusion: PathGLS establishes a robust reference-free metric for benchmarking pathology VLMs, directly quantifying hallucination rates and domain shift robustness to inform safe clinical deployment.

Abstract: Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman’s rank correlation of $ρ=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $ρ=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: https://github.com/My13ad/PathGLS

[135] Out-of-Distribution Object Detection in Street Scenes via Synthetic Outlier Exposure and Transfer Learning

Sadia Ilyas, Annika Mütze, Klaus Friedrichs, Thomas Kurbiel, Matthias Rottmann

Main category: cs.CV

TL;DR: SynOE-OD is a synthetic outlier-exposure framework for object detection that uses generative models (Stable Diffusion) and open-vocabulary detectors to generate OOD training data, enabling unified detection of both in-distribution and out-of-distribution objects.

DetailsMotivation: Current object detectors often miss out-of-distribution objects, treating them as background. Existing OOD detection approaches use complex architectures or auxiliary branches and don't provide unified handling of ID and OOD objects.

Method: Leverages strong generative models (Stable Diffusion) and Open-Vocabulary Object Detectors to generate semantically meaningful, object-level synthetic outlier data for training. Uses transfer learning to maintain ID performance while adding OOD robustness.

Result: Achieves state-of-the-art average precision on established OOD object detection benchmarks, outperforming zero-shot performance of OVODs like GroundingDINO on detecting OOD objects in street scenes.

Conclusion: SynOE-OD provides an effective framework for unified ID and OOD object detection using synthetic outlier exposure, addressing critical limitations in current object detection systems.

Abstract: Out-of-distribution (OOD) object detection is an important yet underexplored task. A reliable object detector should be able to handle OOD objects by localizing and correctly classifying them as OOD. However, a critical issue arises when such atypical objects are completely missed by the object detector and incorrectly treated as background. Existing OOD detection approaches in object detection often rely on complex architectures or auxiliary branches and typically do not provide a framework that treats in-distribution (ID) and OOD in a unified way. In this work, we address these limitations by enabling a single detector to detect OOD objects, that are otherwise silently overlooked, alongside ID objects. We present \textbf{SynOE-OD}, a \textbf{Syn}thetic \textbf{O}utlier-\textbf{E}xposure-based \textbf{O}bject \textbf{D}etection framework, that leverages strong generative models, like Stable Diffusion, and Open-Vocabulary Object Detectors (OVODs) to generate semantically meaningful, object-level data that serve as outliers during training. The generated data is used for transfer-learning to establish strong ID task performance and supplement detection models with OOD object detection robustness. Our approach achieves state-of-the-art average precision on an established OOD object detection benchmark, where OVODs, such as GroundingDINO, show limited zero-shot performance in detecting OOD objects in street-scenes.

[136] Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao

Main category: cs.CV

TL;DR: QICA is a zero-shot object counting framework that combines quantity perception with spatial aggregation to count arbitrary objects from text descriptions without visual exemplars.

DetailsMotivation: Existing zero-shot object counting methods treat counting as coarse retrieval, lacking fine-grained quantity awareness and suffering from spatial insensitivity and feature distortion during model adaptation.

Method: Proposes QICA with Synergistic Prompting Strategy (SPS) to adapt vision/language encoders with numerically conditioned prompts, Cost Aggregation Decoder (CAD) operating on vision-text similarity maps, and multi-level quantity alignment loss for numerical consistency.

Result: Competitive performance on FSC-147 benchmark and superior generalization to unseen domains (CARPK and ShanghaiTech-A) in zero-shot evaluation.

Conclusion: QICA effectively addresses limitations of existing zero-shot object counting methods by integrating quantity perception with robust spatial aggregation, demonstrating strong performance and generalization.

Abstract: Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation.To address these challenges, we present \textbf{QICA}, a novel framework that synergizes \underline{q}uantity percept\underline{i}on with robust spatial \underline{c}ast \underline{a}ggregation. Specifically, we introduce a Synergistic Prompting Strategy (\textbf{SPS}) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (\textbf{CAD}) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.

[137] EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

Zhiwei Wang, Yayu Zheng, Defeng He, Li Zhao, Xiaoqin Zhang, Yuxing Li, Edmund Y. Lam

Main category: cs.CV

TL;DR: EPOFusion is an exposure-aware infrared and visible image fusion model that addresses overexposure issues by extracting fine-grained infrared features from bright regions and progressively enhancing fused images through iterative decoding.

DetailsMotivation: Overexposure frequently occurs in practical scenarios, causing loss of critical visual information. Existing infrared and visible fusion methods exhibit unsatisfactory performance in highly bright regions, necessitating a solution that can handle varying exposure conditions effectively.

Method: Proposes EPOFusion with: 1) guidance module to help encoder extract fine-grained infrared features from overexposed regions, 2) iterative decoder with multiscale context fusion module to progressively enhance fused images, 3) adaptive loss function to dynamically constrain fusion process under varying exposure conditions, and 4) creation of IVOE dataset with infrared-guided annotations for overexposed regions.

Result: Extensive experiments show EPOFusion outperforms existing methods, maintaining infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, enhancing both visual fidelity and downstream task performance.

Conclusion: EPOFusion effectively addresses overexposure challenges in infrared-visible fusion through exposure-aware design, achieving superior performance in both overexposed and normal regions while improving downstream task capabilities.

Abstract: Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high quality infrared guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at https://github.com/warren-wzw/EPOFusion.git.

[138] DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

Xiaoxu Meng, Zhongmin Chen, Bo Yang, Weikai Chen, Weixiao Liu, Lin Gao

Main category: cs.CV

TL;DR: DualPrim: A compact 3D reconstruction framework using positive and negative superquadrics for structured, topology-aware modeling with additive-subtractive design.

DetailsMotivation: Current neural reconstructions produce dense, unstructured meshes with irregular topology and weak part boundaries, making them difficult to edit, animate, or reuse in downstream applications.

Method: Uses positive and negative superquadrics where positive ones build bases and negative ones carve local volumes through differentiable operators, enabling topology-aware modeling of holes and concavities. Embedded in volumetric differentiable renderer for end-to-end learning from multi-view images.

Result: Achieves state-of-the-art accuracy while producing compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.

Conclusion: DualPrim’s additive-subtractive design increases representational power without sacrificing compactness or differentiability, enabling structured 3D reconstruction suitable for editing and reuse.

Abstract: Neural reconstructions often trade structure for fidelity, yielding dense and unstructured meshes with irregular topology and weak part boundaries that hinder editing, animation, and downstream asset reuse. We present DualPrim, a compact and structured 3D reconstruction framework. Unlike additive-only implicit or primitive methods, DualPrim represents shapes with positive and negative superquadrics: the former builds the bases while the latter carves local volumes through a differentiable operator, enabling topology-aware modeling of holes and concavities. This additive-subtractive design increases the representational power without sacrificing compactness or differentiability. We embed DualPrim in a volumetric differentiable renderer, enabling end-to-end learning from multi-view images and seamless mesh export via closed-form boolean difference. Empirically, DualPrim delivers state-of-the-art accuracy and produces compact, structured, and interpretable outputs that better satisfy downstream needs than additive-only alternatives.

[139] When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

Shesh Narayan Gupta, Nik Bear Brown

Main category: cs.CV

TL;DR: GAN augmentation can actively increase classifier bias in low-data regimes, while Stable Diffusion with LoRA performs best for addressing class imbalance.

DetailsMotivation: To understand failure modes of generative models for class imbalance compensation under low-data conditions, comparing different augmentation strategies for fine-grained classification tasks.

Method: Controlled benchmark comparing three augmentation strategies on Oxford-IIIT Pet Dataset with artificially underrepresented breeds: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with LoRA. Feature embedding analysis using t-SNE to examine image distributions.

Result: FastGAN augmentation actively increases classifier bias at low training set sizes (+20.7% bias gap, Cohen’s d = +5.03), with t-SNE showing tight isolated clusters indicating mode collapse. Stable Diffusion with LoRA achieved best results (macro F1: 0.9125 ± 0.0047) and reduced bias gap by 13.1%.

Conclusion: There’s a sample-size boundary (20-50 images per class) below which GAN augmentation becomes harmful. Stable Diffusion with LoRA is most effective for addressing class imbalance in low-data regimes.

Abstract: Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen’s d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 plus or minus 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.

[140] Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun, Jun Xie, Tao Lin

Main category: cs.CV

TL;DR: IOMM proposes a two-stage training framework for Unified Multimodal Models that first pre-trains visual generation components using only unlabeled images, then fine-tunes with minimal text-image pairs, achieving SOTA performance with high efficiency.

DetailsMotivation: Current UMMs face bottlenecks in visual generation due to inefficient pre-training paradigms and reliance on scarce, high-quality text-image paired data, which limits their effectiveness and scalability.

Method: Two-stage framework: 1) Image-only pre-training using abundant unlabeled images to train visual generative components without paired data, 2) Fine-tuning with mixture of unlabeled images and small curated text-image pairs for instruction alignment and quality improvement.

Result: IOMM-B (3.6B) trained with only ~1050 H800 GPU hours achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines like BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).

Conclusion: IOMM addresses key bottlenecks in UMM visual generation through data-efficient training, demonstrating that image-only pre-training followed by minimal supervised fine-tuning can achieve state-of-the-art performance with significantly reduced computational cost.

Abstract: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks. To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component $\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only $\sim \textbf{1050}$ H800 GPU hours (with the vast majority, $\textbf{1000}$ hours, dedicated to the efficient $\textbf{image-only pre-training stage}$). It achieves $\textbf{0.89}$ on GenEval and $\textbf{0.55}$ on WISE–surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available $\href{https://github.com/LINs-lab/IOMM}{https://github.com/LINs-lab/IOMM}$.

[141] EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation

Yukun Zhao, Zichen Zhong, Yongshun Gong, Yilong Yin, Haoliang Sun

Main category: cs.CV

TL;DR: EFF-Grasp: A Flow-Matching framework for physics-aware dexterous grasp generation using deterministic ODEs and training-free physics guidance

DetailsMotivation: Existing diffusion-based grasp generation methods use stochastic differential equations (SDEs) that require many sequential denoising steps and can produce unstable trajectories leading to physically infeasible grasps. There's a need for more efficient and stable generation with better physical feasibility.

Method: Reformulates grasp synthesis as a deterministic ordinary differential equation (ODE) process using Flow-Matching for smooth probability flows. Introduces training-free physics-aware energy guidance that defines an energy-guided target distribution using adapted physical energy functions and estimates guidance via local Monte Carlo approximation during inference.

Result: Extensive experiments on five benchmark datasets show superior performance in grasp quality and physical feasibility compared to diffusion-based baselines, while requiring substantially fewer sampling steps.

Conclusion: EFF-Grasp provides an efficient and stable framework for physics-aware dexterous grasp generation that dynamically steers trajectories toward physically feasible regions without requiring additional physics-based training or simulation feedback.

Abstract: Denoising generative models have recently become the dominant paradigm for dexterous grasp generation, owing to their ability to model complex grasp distributions from large-scale data. However, existing diffusion-based methods typically formulate generation as a stochastic differential equation (SDE), which often requires many sequential denoising steps and introduces trajectory instability that can lead to physically infeasible grasps. In this paper, we propose EFF-Grasp, a novel Flow-Matching-based framework for physics-aware dexterous grasp generation. Specifically, we reformulate grasp synthesis as a deterministic ordinary differential equation (ODE) process, which enables efficient and stable generation through smooth probability flows. To further enforce physical feasibility, we introduce a training-free physics-aware energy guidance strategy. Our method defines an energy-guided target distribution using adapted explicit physical energy functions that capture key grasp constraints, and estimates the corresponding guidance term via a local Monte Carlo approximation during inference. In this way, EFF-Grasp dynamically steers the generation trajectory toward physically feasible regions without requiring additional physics-based training or simulation feedback. Extensive experiments on five benchmark datasets show that EFF-Grasp achieves superior performance in grasp quality and physical feasibility, while requiring substantially fewer sampling steps than diffusion-based baselines.

[142] GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation

Jiayi Tian, Jiaze Wang

Main category: cs.CV

TL;DR: GATS: A dual invariant framework for 4D point cloud video understanding that addresses temporal scale bias and distributional uncertainty through Gaussian-aware temporal scaling.

DetailsMotivation: Existing methods for 4D point cloud video understanding suffer from temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds, with CNN/Transformer approaches having limited receptive fields or high computational complexity.

Method: Proposes GATS with two complementary modules: 1) Uncertainty Guided Gaussian Convolution (UGGC) for robust neighborhood aggregation under density variation, noise, and occlusion, and 2) Temporal Scaling Attention (TSA) with learnable scaling factor to normalize temporal distances for frame partition invariance.

Result: Significant performance gains on MSR-Action3D (+6.62% accuracy), NTU RGBD (+1.4% accuracy), and Synthia4D (+1.8% mIoU), offering more efficient and principled paradigm than Transformer counterparts.

Conclusion: GATS provides a unified and robust 4D backbone for point cloud video understanding with superior accuracy, robustness, and scalability by explicitly addressing distributional inconsistencies and temporal scale bias.

Abstract: Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed \textbf{Gaussian Aware Temporal Scaling (GATS)}, which explicitly resolves both distributional inconsistencies and temporal. The proposed \emph{Uncertainty Guided Gaussian Convolution (UGGC)} incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the \emph{Temporal Scaling Attention (TSA)} introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (\textbf{+6.62%} accuracy), NTU RGBD (\textbf{+1.4%} accuracy), and Synthia4D (\textbf{+1.8%} mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.

[143] AI-Generated Figures in Academic Publishing: Policies, Tools, and Practical Guidelines

Davie Chen

Main category: cs.CV

TL;DR: Survey of major academic publishers’ policies on AI-generated scientific figures, identifying key concerns and proposing best-practice guidelines for compliant use.

DetailsMotivation: The rapid advancement of generative AI tools for creating publication-quality scientific figures has created a policy gap, as academic publishers have inconsistent and ambiguous policies regarding AI-generated imagery, creating confusion for researchers.

Method: Survey of current policies from major journals and publishers (Nature, Science, Cell Press, Elsevier, PLOS), identification of key concerns (reproducibility, authorship attribution, visual misinformation), and analysis of practical examples from AI tools like SciDraw.

Result: Found inconsistent policies across publishers, identified key concerns about AI-generated figures, and developed best-practice guidelines for researchers to use AI figure-generation tools in a compliant and transparent manner.

Conclusion: With appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity, but clear guidelines and consistent policies are needed.

Abstract: The rapid advancement of generative AI has introduced a new class of tools capable of producing publication-quality scientific figures, graphical abstracts, and data visualizations. However, academic publishers have responded with inconsistent and often ambiguous policies regarding AI-generated imagery. This paper surveys the current stance of major journals and publishers – including Nature, Science, Cell Press, Elsevier, and PLOS – on the use of AI-generated figures. We identify key concerns raised by publishers, including reproducibility, authorship attribution, and potential for visual misinformation. Drawing on practical examples from tools such as SciDraw, an AI-powered platform designed specifically for scientific illustration, we propose a set of best-practice guidelines for researchers seeking to use AI figure-generation tools in a compliant and transparent manner. Our findings suggest that, with appropriate disclosure and quality control, AI-generated figures can meaningfully accelerate scientific communication without compromising integrity.

[144] Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation

Junhyeok Lee, Han Jang, Heeseong Eum, Joon Jang, Kyu Sung Choi

Main category: cs.CV

TL;DR: A method for virtual staining of multiplex immunofluorescence from brightfield IHC using nuclei segmentation priors to preserve nuclear morphology without supervision.

DetailsMotivation: Multiplex immunofluorescence (mIF) is valuable for tissue analysis but limited by cost and complexity. Virtual staining from brightfield IHC exists but current methods fail to preserve clinically important nuclear morphology details that affect quantification endpoints like Ki67 proliferation index.

Method: Uses supervision-free, architecture-agnostic conditioning with continuous cell probability maps from pretrained nuclei segmentation foundation models as explicit input priors, plus variance-preserving regularization to maintain cell-level heterogeneity in synthesized fluorescence channels.

Result: Consistent improvements in nuclei count fidelity and perceptual quality across Pix2Pix with U-Net/ResNet generators, deterministic regression U-Net, and conditional diffusion models on two independent datasets.

Conclusion: The soft prior approach provides richer conditioning without task-specific tuning, preserving gradient-level boundary information lost by binary thresholding, leading to better virtual staining for clinical pathology applications.

Abstract: Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, as the sole modifications. Code will be made publicly available upon acceptance.

[145] STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Suvajit Patra, Soumitra Samanta

Main category: cs.CV

TL;DR: A unified spatio-temporal attention network for Continuous Sign Language Recognition that reduces parameters by 70-80% while maintaining comparable performance to state-of-the-art keypoint-based methods.

DetailsMotivation: Current keypoint-based CSLR approaches use separate spatial (Graph Convolutional Networks/attention) and temporal (1D convolutional networks) encoding, which introduces large parameter counts in both encoder and decoder components.

Method: Proposes a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), aggregating features to produce local context-aware spatio-temporal representations.

Result: The encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.

Conclusion: The unified spatio-temporal attention approach provides an efficient alternative to traditional separate spatial and temporal encoding methods for CSLR, significantly reducing model complexity while maintaining performance.

Abstract: Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately $70-80%$ fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.

[146] Homogeneous and Heterogeneous Consistency progressive Re-ranking for Visible-Infrared Person Re-identification

Yiming Wang

Main category: cs.CV

TL;DR: A novel Progressive Modal Relationship Re-ranking method (HHCR) for visible-infrared person re-identification that addresses both intra-modal variations and inter-modal discrepancy through heterogeneous and homogeneous consistency re-ranking modules.

DetailsMotivation: Visible-infrared person re-identification faces greater challenges than traditional re-ID due to significant modality differences. Existing re-ranking algorithms cannot simultaneously address intra-modal variations and inter-modal discrepancy in cross-modal scenarios.

Method: Proposes HHCR with two modules: 1) Heterogeneous consistency re-ranking explores relationships between query and gallery modalities, 2) Homogeneous consistency re-ranking investigates intrinsic relationships within each modality. Also introduces a consistency re-ranking inference network (CRI) as a baseline.

Result: Comprehensive experiments demonstrate the proposed re-ranking method is generalized, and both the re-ranking approach and baseline achieve state-of-the-art performance.

Conclusion: The HHCR method effectively addresses cross-modal person re-identification challenges by handling both intra-modal variations and inter-modal discrepancy, achieving superior performance through progressive modal relationship re-ranking.

Abstract: Visible-infrared person re-identification faces greater challenges than traditional person re-identification due to the significant differences between modalities. In particular, the differences between these modalities make effective matching even more challenging, mainly because existing re-ranking algorithms cannot simultaneously address the intra-modal variations and inter-modal discrepancy in cross-modal person re-identification. To address this problem, we propose a novel Progressive Modal Relationship Re-ranking method consisting of two modules, called heterogeneous and homogeneous consistency re-ranking(HHCR). The first module, heterogeneous consistency re-ranking, explores the relationship between the query and the gallery modalities in the test set. The second module, homogeneous consistency reranking, investigates the intrinsic relationship within each modality between the query and the gallery in the test set. Based on this, we propose a baseline for cross-modal person re-identification, called a consistency re-ranking inference network (CRI). We conducted comprehensive experiments demonstrating that our proposed re-ranking method is generalized, and both the re-ranking and the baseline achieve state-of-the-art performance.

[147] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method

Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani

Main category: cs.CV

TL;DR: Free360 is a training-free framework for 360° image VQA that uses scene graphs and adaptive spherical transformations to address challenges in multimodal LLM perception of panoramic images.

DetailsMotivation: MLLMs struggle with 360° image perception due to geometric distortion and complex spatial relations in panoramic images, which capture entire environments but present unique challenges compared to conventional images.

Method: Proposes Free360: a training-free scene-graph-based framework that decomposes reasoning into modular steps, applies adaptive spherical image transformations tailored to each step, and integrates information into a unified graph representation for answer generation.

Result: Free360 consistently improves base MLLM performance on 360Bench (7K-resolution 360° image VQA benchmark) and provides strong training-free solution for 360° VQA tasks.

Conclusion: The framework effectively addresses MLLM limitations in 360° image perception through modular reasoning and specialized transformations, offering a practical solution without requiring training.

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs’ capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.

[148] KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel

Main category: cs.CV

TL;DR: KidsNanny: Two-stage multimodal content moderation system for child safety combining vision transformer with object detector for visual screening, then OCR + 7B LLM for contextual reasoning, achieving better accuracy and speed than baselines.

DetailsMotivation: Need for efficient multimodal content moderation systems for child safety that can handle both visual and text-based threats with low latency, addressing limitations of existing vision-only or slower multimodal approaches.

Method: Two-stage architecture: Stage 1 uses ViT + object detector for visual screening (11.7ms), Stage 2 applies OCR and 7B LLM for contextual reasoning on text extracted from images (120ms total). Evaluated on UnsafeBench Sexual category with vision-only and multimodal regimes.

Result: Stage 1 achieves 80.27% accuracy, 85.39% F1; full pipeline achieves 81.40% accuracy, 86.16% F1 at 120ms, outperforming ShieldGemma-2 (64.80% accuracy, 1136ms) and LlavaGuard (80.36% accuracy, 4138ms). On text-only threats: 100% recall, 75.76% precision vs ShieldGemma-2’s 84% recall, 60% precision.

Conclusion: OCR-based reasoning offers recall-precision advantages for text-embedded threats at lower latency, contributing to efficient multimodal content moderation research for child safety.

Abstract: We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text not raw pixels to Stage 2, which applies OCR and a text based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.

[149] ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan, Wenshuo Chen, Wei Zhang, Yutao Yue

Main category: cs.CV

TL;DR: ECHO is an edge-cloud framework for language-driven whole-body control of humanoid robots using diffusion-based text-to-motion generation in the cloud and reinforcement learning tracking on the robot edge.

DetailsMotivation: To enable natural language control of humanoid robots through a system that generates motion from text instructions and executes them reliably on physical hardware without requiring retargeting from human motion models.

Method: Cloud-based diffusion model generates motion from text using CLIP-encoded features and 1D convolutional UNet, while edge-deployed RL tracker executes motions using Teacher-Student distillation with evidential adaptation for sim-to-real transfer, plus fall recovery mechanism.

Result: Achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) on retargeted HumanML3D benchmark and demonstrates stable execution of diverse text commands on Unitree G1 humanoid with zero hardware fine-tuning.

Conclusion: ECHO enables effective language-driven whole-body control of humanoid robots through a novel edge-cloud architecture that combines high-quality motion generation with robust real-world execution.

Abstract: We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher–Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.

[150] Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin, Kai Chen, Hongjie Zhang

Main category: cs.CV

TL;DR: CTRL-S introduces a chain-of-thought reinforcement learning framework for SVG generation with structured reasoning and multi-reward optimization, achieving superior visual fidelity and code quality.

DetailsMotivation: Existing vision-language models for SVG generation suffer from limited generalization, redundant code paths, and lack of explicit reasoning processes, despite improvements from large datasets and SVG-specific tokens.

Method: Proposes CTRL-S framework with chain-of-thought mechanism for explicit reasoning, constructs SVG-Sophia dataset (145K samples), trains models to generate group-level structured SVG code, and uses GRPO algorithm with multi-reward optimization (DINO, image-text similarity, format, and code efficiency rewards).

Result: Outperforms existing methods with higher task success rates, superior SVG code quality, and exceptional visual fidelity across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks.

Conclusion: CTRL-S demonstrates that structured reasoning combined with multi-reward reinforcement learning significantly improves SVG generation capabilities, offering a unified framework for high-quality vector graphics synthesis.

Abstract: With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model’s reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.

[151] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li

Main category: cs.CV

TL;DR: S-VAM is a shortcut video-action model for robot learning that provides real-time, high-fidelity visual foresight for manipulation tasks through one-step inference, outperforming existing methods.

DetailsMotivation: Current video action models (VAMs) for robot learning face a trade-off between real-time inference and high-fidelity foresight - either using slow multi-step video generation or noisy one-step feature extraction. There's a need for models that can simultaneously guarantee both efficiency and accuracy for complex manipulation tasks.

Method: Proposes S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Uses a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Vision foundation model representations from the diffusion model’s own multi-step generated videos serve as teacher targets, while lightweight decouplers learn to directly map noisy one-step features to these targets.

Result: Extensive experiments in simulation and real-world demonstrate that S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments.

Conclusion: S-VAM successfully addresses the efficiency-accuracy trade-off in video action models for robot learning, providing real-time, high-fidelity visual foresight through a novel self-distillation approach that enables efficient and precise manipulation.

Abstract: Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model’s own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/

[152] Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai, Hongliang Ren

Main category: cs.CV

TL;DR: Leveling3D integrates feed-forward 3D reconstruction with geometry-aware diffusion models to simultaneously reconstruct and generate missing 3D content, improving novel-view synthesis and 3D reconstruction quality.

DetailsMotivation: Previous methods using diffusion models to fix corrupted 3D rendering results lack geometric consistency and fail to properly fill missing areas in extrapolated views. There's a need for a holistic approach that combines reconstruction with geometrically-consistent generation.

Method: Proposes Leveling3D pipeline with a geometry-aware leveling adapter that aligns diffusion model knowledge with geometry priors from feed-forward models. Includes palette filtering for diverse generation training and test-time masking refinement to prevent boundary artifacts. Enhanced extrapolated views can be fed back into 3D Gaussian Splatting for improved reconstruction.

Result: Achieves state-of-the-art performance on public datasets for novel-view synthesis and depth estimation tasks.

Conclusion: Leveling3D successfully integrates feed-forward 3D reconstruction with geometry-consistent generation, enabling simultaneous reconstruction and generation that improves both novel-view synthesis and 3D reconstruction quality.

Abstract: Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

[153] Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, Hideo Saito, Kris Kitani

Main category: cs.CV

TL;DR: GRIP reconstructs physically plausible human motion using 4 wearable devices (IMUs + foot pressure sensors) combined with a physics simulator digital twin, outperforming existing methods.

DetailsMotivation: Conventional IMU-only motion reconstruction lacks physical plausibility and ground interaction modeling. Combining IMUs with foot pressure data and physics simulation can produce more realistic, physically consistent human motion.

Method: GRIP uses two modules: KinematicsNet estimates body poses/velocities from sensor data; DynamicsNet controls a physics simulator humanoid using residuals between KinematicsNet predictions and simulated state. Uses 4 wearable devices (IMUs + foot pressure sensors).

Result: Outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency. Introduces PRISM dataset for training/evaluation.

Conclusion: GRIP demonstrates that combining wearable sensors with physics simulation enables physically plausible human motion reconstruction, with foot pressure data being crucial for capturing ground interactions.

Abstract: We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

[154] PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Ryutaro Miya, Kazuyoshi Fushinobu, Tatsuya Kawaguchi

Main category: cs.CV

TL;DR: PureCLIP-Depth is a novel monocular depth estimation method that operates entirely within CLIP’s embedding space without prompts or decoders, achieving state-of-the-art performance among CLIP-based models.

DetailsMotivation: The paper aims to explore a conceptual information-driven approach to monocular depth estimation, moving away from traditional geometric feature reliance. The motivation is to perform depth estimation directly within the CLIP embedding space, leveraging its rich semantic representations rather than geometric priors.

Method: PureCLIP-Depth learns a direct mapping from RGB images to depth maps entirely within the CLIP embedding space. It is completely prompt-free and decoder-free, operating strictly inside the conceptual CLIP space without external geometric features or complex decoder architectures.

Result: The method achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets, demonstrating the effectiveness of conceptual information for depth estimation.

Conclusion: The paper shows that monocular depth estimation can be effectively performed using conceptual information within CLIP’s embedding space, offering a novel alternative to geometry-driven approaches.

Abstract: We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth

[155] Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Jiyang Huang, Hongru Cheng, Wei Lin, Jia Wan, Antoni B. Chan

Main category: cs.CV

TL;DR: EDP-SAM generates mask supervision for crowd datasets using exclusion constraints, and XMask learns discriminative instance masks for semi-supervised crowd counting with mask priors as pseudo-labels.

DetailsMotivation: Traditional point-based annotations for crowd analysis are limited because individual regions are ambiguous, making it hard to learn fine-grained structural semantics from sparse annotations. Semi-supervised approaches are needed since unlabeled data is abundant and cheap.

Method: 1) EDP-SAM (Exclusion-Constrained Dual-Prompt SAM) with NNEC (Nearest Neighbor Exclusion Circle) constraint generates mask supervision. 2) XMask (Exclusivity-Guided Mask Learning) enforces spatial separation through discriminative mask objectives with Gaussian smoothing and differentiable center sampling. 3) Semi-supervised framework uses instance mask priors as pseudo-labels instead of traditional point cues.

Result: Achieves state-of-the-art semi-supervised segmentation and counting performance on ShanghaiTech A, UCF-QNRF, and JHU++ datasets using only 5%, 10%, and 40% labeled data. Effectively bridges counting and instance segmentation within a unified framework.

Conclusion: The proposed approach successfully addresses limitations of point-based annotations by generating mask supervision and learning discriminative instance masks, enabling effective semi-supervised crowd analysis that unifies counting and segmentation.

Abstract: Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse anno tations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

[156] RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

Zeqiang Wei, Kai Jin, Kuan Song, Xiuzhuang Zhou, Wenlong Chen, Min Xu

Main category: cs.CV

TL;DR: RASLF: A representation-aware state-space framework for light field super-resolution that leverages multiple LF representations through geometric refinement, adaptive scanning, and hierarchical feature aggregation.

DetailsMotivation: Current SSM-based light field super-resolution methods fail to fully leverage complementarity among various LF representations, leading to loss of fine textures and geometric misalignments across views.

Method: Proposed RASLF framework with: 1) Progressive Geometric Refinement block using panoramic epipolar representation to encode multi-view parallax differences; 2) Representation Aware Asymmetric Scanning mechanism that dynamically adjusts scanning paths based on representation properties; 3) Dual-Anchor Aggregation module for improved hierarchical feature flow.

Result: Experiments on various public benchmarks show RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

Conclusion: RASLF effectively addresses limitations of current SSM-based LFSR methods by better leveraging multiple LF representations through explicit modeling of structural correlations and adaptive mechanisms.

Abstract: Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, a Progressive Geometric Refinement (PGR) block is created that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deeplayer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.

[157] How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng, Chao Zhang, Bowen Liu, Xin Sun, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Main category: cs.CV

TL;DR: DiVA-Former is a lightweight architecture that effectively integrates vision and text information for table understanding by using visual tokens as dynamic queries to distill long textual sequences, addressing limitations of both pure-text and pure-vision approaches.

DetailsMotivation: Current LLMs linearize 2D tables into 1D sequences, weakening layout cues, while visual encoders capture spatial information but struggle with exact cell text. Both modalities provide distinct but complementary information, but simple fusion methods yield limited gains and cause cross-modal interference.

Method: DiVA-Former uses visual tokens as dynamic queries to distill long textual sequences into digest vectors, effectively exploiting complementary vision-text information through a lightweight architecture designed to integrate both modalities without interference.

Result: Across 13 table benchmarks, DiVA-Former improves upon pure-text baselines by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or combinations of both.

Conclusion: The proposed DiVA-Former architecture effectively integrates vision and text information for table understanding, demonstrating significant performance improvements by leveraging complementary multimodal information while avoiding cross-modal interference.

Abstract: LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision–text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.

[158] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Trong-Duc Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham

Main category: cs.CV

TL;DR: A hybrid framework for rare-class WBC classification combining generative restoration, transformer ensembles with medical contrastive embeddings, and biologically-inspired morphological constraints to address extreme class imbalance and domain shift.

DetailsMotivation: Automated white blood cell classification faces challenges from extreme class imbalance, long-tail distributions, and domain shift, causing deep models to overfit dominant classes and fail on rare leukemia subtypes.

Method: Proposes a hybrid framework with: 1) Pix2Pix-based generative restoration for artifact removal, 2) Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and 3) biologically-inspired refinement using geometric spikiness and Mahalanobis-based morphological constraints.

Result: Achieves Macro-F1 of 0.77139 on WBCBench 2026 private leaderboard, demonstrating strong performance under severe class imbalance and highlighting the value of incorporating biological priors into deep learning.

Conclusion: The framework effectively addresses rare-class generalization in WBC classification by combining generative restoration, robust representation learning, and biological priors, showing promise for hematological image analysis under challenging conditions.

Abstract: Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.

[159] Visual Prompt Discovery via Semantic Exploration

Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu

Main category: cs.CV

TL;DR: SEVEX is an automated semantic exploration framework that discovers task-wise visual prompts to address LVLM perception failures through agent-driven experiments, abstract idea spaces, and novelty-guided selection.

DetailsMotivation: LVLMs face significant challenges in image understanding and visual reasoning, leading to critical perception failures. Current visual prompt generation methods focus on tool selection rather than diagnosing root causes, and optimal prompts require manual trial-and-error due to LVLM opacity.

Method: Proposes SEVEX framework with: 1) abstract idea space as search space, 2) novelty-guided selection algorithm, and 3) semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results.

Result: SEVEX significantly outperforms baselines on BlindTest and BLINK benchmarks in task accuracy, inference efficiency, exploration efficiency, and stability. Discovers sophisticated counter-intuitive visual strategies beyond conventional tool usage.

Conclusion: SEVEX offers a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts, moving beyond manual trial-and-error and tool-focused approaches to address root causes of perception failures.

Abstract: LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.

[160] Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.CV

TL;DR: EVPV introduces explicit visual premise verification to decouple perception from reasoning in vision-language process reward models, improving step-level verification and reranking accuracy.

DetailsMotivation: Current vision-language process reward models (VL-PRMs) suffer from entanglement between perception and reasoning, leading to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), which undermines both reranking and error localization.

Method: EVPV uses a lightweight verification interface where the policy produces step-wise visual checklists making required visual facts explicit, while a constraint extractor independently derives structured visual constraints from input images. It matches checklist claims against constraints to compute visual reliability signals and calibrates PRM step rewards via reliability gating.

Result: Experiments on VisualProcessBench and six multimodal reasoning benchmarks show EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Controlled corruption experiments provide causal evidence that gains arise from constraint fidelity and explicit premise verification.

Conclusion: EVPV effectively decouples perceptual uncertainty from logical evaluation in vision-language reasoning without requiring per-step tool calls, providing a more reliable verification framework for multimodal reasoning systems.

Abstract: Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier’s misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM

[161] When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu

Main category: cs.CV

TL;DR: FrameRepeat is a lightweight framework that helps Video-LLMs autonomously identify and reinforce important frames during reasoning to prevent visual anchor drifting in video question answering.

DetailsMotivation: Current Video-LLMs suffer from "visual anchor drifting" where extended reasoning causes models to rely more on self-generated text than visual inputs, leading to hallucinations. Existing solutions require expensive training and lack generalizability across architectures.

Method: Proposes FrameRepeat with a lightweight repeat scoring module and Add-One-In (AOI) training strategy. AOI uses MLLM output probabilities to generate supervision signals for training a frame scoring network that guides frame repetition behavior.

Result: Experimental results across multiple models and datasets show FrameRepeat effectively strengthens important visual cues during reasoning and demonstrates good generalizability.

Conclusion: FrameRepeat provides an effective and generalizable solution to visual anchor drifting in Video-LLMs by enabling autonomous identification and reinforcement of important frames during reasoning processes.

Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to ``visual anchor drifting’’, where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.

[162] Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

Weihua Gao, Wenlong Niu, Jie Tang, Man Yang, Jiafeng Zhang, Xiaodong Peng

Main category: cs.CV

TL;DR: Point-to-Mask framework bridges point supervision and mask-level detection for infrared small target detection using physics-driven mask generation and radius-aware regression.

DetailsMotivation: Current IRSTD methods require costly dense pixel-level annotations and struggle with tiny targets having weak texture and ambiguous boundaries. Point supervision offers lower annotation cost but needs to be effectively leveraged for mask-level detection.

Method: Two-component framework: 1) Physics-driven Adaptive Mask Generation (PAMG) converts point annotations into target masks and geometric cues, 2) Radius-aware Point Regression Network (RPR-Net) performs target center localization and radius regression using spatiotemporal motion cues. Forms closed loop between training and inference.

Result: Achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. New SIRSTD-Pixel dataset with refined pixel-level annotations.

Conclusion: Point-to-Mask effectively bridges low-cost point supervision and mask-level detection for infrared small targets, offering practical solution with reduced annotation burden while maintaining detection performance.

Abstract: Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.

[163] AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection

Hongwei Lin, Xun Huang, Chenglu Wen, Cheng Wang

Main category: cs.CV

TL;DR: AW-MoE integrates Mixture of Experts with weather-aware routing for robust 3D object detection across adverse weather conditions using LiDAR and 4D Radar data.

DetailsMotivation: Existing 3D object detection methods struggle with performance conflicts due to data distribution discrepancies across different weather scenarios when simply combining all weather samples for training.

Method: Proposes AW-MoE framework with Image-guided Weather-aware Routing (IWR) that uses image features for weather classification and selects top-K Weather-Specific Experts, plus Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar data augmentation.

Result: Achieves ~15% improvement in adverse-weather performance over state-of-the-art methods with negligible inference overhead, and shows strong scalability when integrated into baseline detectors.

Conclusion: AW-MoE effectively addresses weather-related performance conflicts in 3D object detection through weather-aware expert selection and dual-modal augmentation, demonstrating strong performance and scalability.

Abstract: Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, the framework that innovatively integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on the real-world dataset demonstrate that AW-MoE achieves ~ 15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at https://github.com/windlinsherlock/AW-MoE.

[164] FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition

Jinsheng Wei, Zhaodi Xu, Guanming Lu, Haoyu Chen, Jingjie Yan

Main category: cs.CV

TL;DR: A framework called FG-SGL that uses fine-grained semantic guidance to improve micro-gesture recognition by integrating both fine-grained and category-level semantics to guide vision-language models in perceiving subtle local motion differences.

DetailsMotivation: Micro-gesture recognition is challenging due to subtle inter-class variations, and existing methods relying only on category-level supervision are insufficient for capturing localized motion differences. There's a need for more detailed semantic guidance to perceive fine-grained motion patterns.

Method: Proposes Fine-Grained Semantic Guidance Learning (FG-SGL) framework with two modules: FG-SA uses fine-grained semantic cues to guide local motion feature learning, and CP-A enhances feature separability through category-level semantic guidance. Also constructs a fine-grained textual dataset with human annotations describing MG dynamics in four semantic dimensions, and designs a Multi-Level Contrastive Optimization strategy for joint coarse-to-fine optimization.

Result: Experiments show that FG-SGL achieves competitive performance in micro-gesture recognition, validating the effectiveness of fine-grained semantic guidance for this task.

Conclusion: The proposed FG-SGL framework successfully integrates fine-grained semantic guidance with vision-language models to improve micro-gesture recognition by better capturing subtle motion differences through localized semantic understanding.

Abstract: Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision–language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.

[165] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang

Main category: cs.CV

TL;DR: Geometry-based reward model using pretrained geometric foundation models to evaluate multi-view consistency in video diffusion models, addressing inconsistency artifacts like object deformation and spatial drift.

DetailsMotivation: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos.

Method: Proposes a geometry-based reward model leveraging pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Uses pointwise error computation instead of pixel space for robustness. Introduces geometry-aware sampling to filter low-texture/non-semantic regions. Applies reward via two pathways: post-training (SFT/RL) and inference-time optimization of causal video models via test-time scaling.

Result: Experimental results validate effectiveness, showing geometry-based reward provides superior robustness compared to other variants. Enables efficient inference-time scaling for enhancing open-source video models without extensive retraining resources.

Conclusion: The geometry-based reward model offers a practical solution for improving geometric consistency in video generation, addressing key limitations of current video diffusion models through both training and inference-time optimization approaches.

Abstract: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.

[166] Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

TianTian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

Main category: cs.CV

TL;DR: LTS-FS is a plug-and-play framework that uses layer-specific feature steering to reduce hallucinations in Large Vision-Language Models by first locating hallucination-relevant layers through causal attribution, then applying sparsified steering intensities based on each layer’s relevance.

DetailsMotivation: Current feature steering methods for hallucination mitigation in LVLMs apply uniform steering across all layers, ignoring inter-layer differences. This heuristic approach can disrupt layers unrelated to hallucinations and degrade performance on general tasks.

Method: 1) Construct synthetic dataset with token-level and sentence-level hallucination cases; 2) Use causal intervention-based attribution to quantify hallucination relevance per layer; 3) Convert attribution scores into layer-specific steering intensities; 4) Apply sparsified feature steering only to relevant layers.

Result: Extensive experiments across multiple LVLMs and benchmarks show LTS-FS effectively mitigates hallucinations while preserving strong performance on general tasks, outperforming uniform steering approaches.

Conclusion: Layer-specific feature steering based on hallucination relevance attribution enables more precise hallucination mitigation in LVLMs without disrupting unrelated layers or degrading general performance.

Abstract: Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

[167] Persistent Story World Simulation with Continuous Character Customization

Jinlu Zhang, Qiyun Wang, Baoxiang Du, Jiayi Ji, Jing He, Rongsheng Zhang, Tangjie Lv, Xiaoshuai Sun, Rongrong Ji

Main category: cs.CV

TL;DR: EverTale is a story world simulator for continuous story character customization that achieves synergy between accurate character customization, semantic alignment, and continuous integration of new identities through unified LoRA modules, MLLM quality gates, and region-focus sampling.

DetailsMotivation: Current story visualization methods fail to achieve synergy between accurate character customization, semantic alignment, and continuous integration of new identities. There's a need for better continuous character adaptation without per-character optimization modules.

Method: 1) All-in-One-World Character Integrator using unified LoRA modules for continuous character adaptation; 2) Character Quality Gate via MLLM-as-Judge with chain-of-thought reasoning to ensure fidelity; 3) Character-Aware Region-Focus Sampling to address identity degradation and layout conflicts in multi-character generation.

Result: Experimental results show EverTale achieves superior performance against a wider range of compared methods on both single- and multi-character story visualization.

Conclusion: EverTale presents an effective framework for continuous story character customization that addresses key limitations in current story visualization methods through unified adaptation, quality assessment, and region-aware sampling.

Abstract: Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within unified LoRA module, eliminating the need for per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or require additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance against a wider range of compared methods on both single- and multi-character story visualization. Codes will be available.

[168] VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

Main category: cs.CV

TL;DR: VisBrowse-Bench: A new benchmark for evaluating visual reasoning in web browsing agents, featuring 169 VQA instances with multimodal evidence cross-validation.

DetailsMotivation: Existing benchmarks for multimodal browsing agents have two key limitations: insufficient evaluation of visual reasoning ability and neglect of native visual information from web pages in reasoning chains.

Method: Created VisBrowse-Bench with 169 VQA instances across multiple domains, constructed by human experts using multi-stage pipeline with rigorous manual verification. Also proposed an agent workflow to drive browsing agents to actively collect and reason over visual information during search.

Result: Even best-performing model (Claude-4.6-Opus) only achieved 47.6% accuracy, and proprietary Deep Research model (o3-deep-research) achieved 41.1% accuracy, showing significant room for improvement.

Conclusion: Current MLLMs struggle with visual-native search tasks, highlighting the need for better visual reasoning capabilities in web browsing agents and providing a benchmark to drive future research.

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models’ visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

[169] Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection

Jinsheng Wei, Fengzhou Guo, Yante Li, Haoyu Chen, Guanming Lu, Guoying Zhao

Main category: cs.CV

TL;DR: Micro-AU CLIP: A novel framework for micro-expression action unit detection using CLIP with local independence and global dependency modeling, achieving state-of-the-art performance.

DetailsMotivation: Existing Micro-AU detection methods learn from whole facial images/videos, conflicting with AU locality and lacking perception of AU regions. Need to capture both local independence (specific muscle movements) and global dependency (relationships between AUs under emotional states).

Method: Proposes micro-AU CLIP framework with: 1) Local Semantic Independence modeling using Patch Token Attention to map local AU region features; 2) Global Semantic Dependency modeling with Global Dependency Attention and Global Dependency Loss; 3) MicroAU contrastive loss for fine-grained visual-text alignment; 4) Emotion-label-free ME recognition application.

Result: Achieves state-of-the-art performance in micro-AU detection by fully learning fine-grained micro-AU features through the proposed independence-to-dependency pattern.

Conclusion: Micro-AU CLIP effectively models both local independence and global dependency of action units, overcoming CLIP’s limitations in micro-semantic alignment and enabling fine-grained emotion analysis without emotion labels.

Abstract: Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; In GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP’s native limitations in micro-semantic alignment, a microAU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.

[170] DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Heyu Si, Brandon James Denis, Muyang Sun, Dragos Datcu, Yaoru Li, Xin Jin, Ruiju Fu, Yuliia Tatarinova, Federico Landi, Jie Song, Mingli Song, Qi Guo

Main category: cs.CV

TL;DR: DriveFix: A multi-view restoration framework for 4D driving scene reconstruction that ensures spatio-temporal coherence through interleaved diffusion transformers and geometry-aware losses.

DetailsMotivation: Existing 4D scene reconstruction methods for autonomous driving process frames independently or view-by-view, leading to spatial misalignment across cameras and temporal drift in sequences. There's a critical lack of spatio-temporal synergy in current approaches.

Method: Proposes DriveFix with interleaved diffusion transformer architecture with specialized blocks to model temporal dependencies and cross-camera spatial consistency. Conditions generation on historical context and integrates geometry-aware training losses to enforce unified 3D geometry adherence.

Result: Extensive evaluations on Waymo, nuScenes, and PandaSet datasets show state-of-the-art performance in both reconstruction and novel view synthesis. Significantly reduces artifacts and enables consistent propagation of high-fidelity textures.

Conclusion: DriveFix marks a substantial step toward robust 4D world modeling for real-world deployment by ensuring spatio-temporal coherence in driving scene reconstruction.

Abstract: Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

[171] An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis

Ann Rachel, Pranav M Pawar, Mithun Mukharjee, Raja M, Tojo Mathew

Main category: cs.CV

TL;DR: AI-driven personalized lung cancer treatment using multi-omics data and XGBoost to predict drug sensitivity, with SHAP for feature importance and DeepSeek LLM for biological validation.

DetailsMotivation: Traditional lung cancer treatments (surgery, chemotherapy, radiation) are limited due to cancer heterogeneity. Personalized medicine using genetic information and AI can provide tailored treatments by predicting drug responses based on individual patient data.

Method: Used multi-omics data from Genomics of Drug Sensitivity in Cancer to build predictive model. Employed XGBoost regressor to predict drug sensitivity (LN-IC50) based on molecular and cellular features. Used cross-validation and Randomized Search for hyperparameter tuning. Applied SHAP for feature importance analysis and DeepSeek LLM for biological validation of features.

Result: Developed a predictive model for personalized lung cancer treatment that can determine drug sensitivity/resistance. SHAP identified important molecular features, and DeepSeek provided biological context and validation for the predictive features.

Conclusion: AI-based approaches combining multi-omics data, machine learning, and large language models can effectively predict drug responses and support personalized treatment planning for lung cancer patients.

Abstract: Lung cancer is a condition where there is abnormal growth of malignant cells that spread in an uncontrollable fashion in the lungs. Some common treatment strategies are surgery, chemotherapy, and radiation which aren’t the best options due to the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual’s genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data focusing primarily on the genetic profile. Multi-Omics data from Genomics of Drug Sensitivity in Cancer have been used to build a predictive model along with machine learning techniques. The value of the target variable, LN-IC50, determines how sensitive or resistive a drug is. An XGBoost regressor is utilized to predict the drug response focusing on molecular and cellular features extracted from cancer datasets. Cross-validation and Randomized Search are performed for hyperparameter tuning to further optimize the model’s predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature’s impact on an individual prediction. Furthermore, interpreting feature relationships was performed using DeepSeek, a large language model trained to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.

[172] SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

Maxime Vaillant, Axel Carlier, Lai Xing Ng, Christophe Hurter, Benoit R. Cottereau

Main category: cs.CV

TL;DR: SpikeCLR: A contrastive self-supervised learning framework for Spiking Neural Networks (SNNs) that learns visual representations from unlabeled event-based vision data, addressing dataset scarcity issues.

DetailsMotivation: Event-based vision sensors offer advantages for high-speed perception but face limitations due to scarcity of large-scale labeled datasets needed to train Spiking Neural Networks (SNNs) for neuromorphic hardware applications.

Method: Adapts frame-based contrastive learning methods to spiking domain using surrogate gradient training, introduces event-specific augmentations leveraging spatial, temporal, and polarity transformations for self-supervised pretraining.

Result: Self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, showing consistent gains in few-shot and semi-supervised settings.

Conclusion: SpikeCLR enables effective learning from unlabeled event data, with learned representations transferring across datasets, contributing to powerful event-based models in label-scarce settings.

Abstract: Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.

[173] Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang

Main category: cs.CV

TL;DR: Iris is a deterministic framework for monocular depth estimation that integrates real-world priors into diffusion models, achieving strong generalization from synthetic to real scenes while preserving fine details.

DetailsMotivation: Current monocular depth estimation methods have limitations: feed-forward methods miss details despite massive training data, while diffusion-based methods struggle with synthetic-to-real domain transfer. There's a need for a method that preserves fine details, generalizes well from synthetic to real scenes, and works efficiently with limited training data.

Method: Iris uses a two-stage Priors-to-Geometry Deterministic (PGD) schedule: 1) Prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, 2) Geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. Both stages share weights and use a high-to-low timestep schedule.

Result: Extensive experiments show Iris achieves significant improvements in monocular depth estimation performance with strong in-the-wild generalization, preserving fine details better than previous methods.

Conclusion: Iris successfully integrates real-world priors into diffusion models for depth estimation, overcoming limitations of both feed-forward and previous diffusion-based methods by preserving details and achieving strong synthetic-to-real generalization.

Abstract: In this paper, we propose \textbf{Iris}, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

[174] PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

Xinhao Cai, Liulei Li, Gensheng Pei, Zeren Sun, Yazhou Yao, Wenguan Wang

Main category: cs.CV

TL;DR: PKINet-v2 is an improved backbone for remote sensing image object detection that combines anisotropic strip convolutions with isotropic square kernels in a unified architecture to handle both geometric diversity and spatial complexity, with efficient deployment via heterogeneous kernel re-parameterization.

DetailsMotivation: Remote sensing image object detection faces two key challenges: geometric complexity (diverse aspect ratios) and spatial complexity (wide range of object sizes). Existing approaches address these separately - anisotropic strip kernels for slender targets or isotropic large kernels for broader context - leading to complementary drawbacks like disrupted spatial coherence or background noise.

Method: PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels to build multi-scope receptive fields, preserving fine-grained local textures while aggregating long-range context across scales. Introduces Heterogeneous Kernel Re-parameterization (HKR) strategy to fuse all heterogeneous branches into a single depth-wise convolution for efficient inference.

Result: Extensive experiments on four benchmarks (DOTA-v1.0, DOTA-v1.5, HRSC2016, DIOR-R) show PKINet-v2 achieves state-of-the-art accuracy with 3.9× FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.

Conclusion: PKINet-v2 provides a unified solution for handling both geometric and spatial complexity in remote sensing object detection, offering superior performance and efficiency through synergistic kernel design and deployment optimization.

Abstract: Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $\textbf{3.9}\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.

[175] Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

Daniel Sungho Jung, Dohee Cho, Kyoung Mu Lee

Main category: cs.CV

TL;DR: HOIL framework for 3D human pose estimation from LiDAR point clouds using human-object interaction learning to address spatial ambiguity and class imbalance issues

DetailsMotivation: Existing methods overlook human-object interactions for robust 3D human pose estimation in autonomous driving, facing challenges of spatial ambiguity between human/object points and severe class imbalance in interacting vs non-interacting body parts

Method: Proposes Human-Object Interaction Learning (HOIL) framework with: 1) HOICL (human-object interaction-aware contrastive learning) to enhance feature discrimination, 2) CPPool (contact-aware part-guided pooling) to reallocate representational capacity, and 3) optional contact-based temporal refinement

Result: Codes will be released; framework effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions

Conclusion: HOIL framework successfully addresses key challenges in 3D human pose estimation from LiDAR by incorporating human-object interaction learning

Abstract: Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationships with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Codes will be released.

[176] Retrieving Counterfactuals Improves Visual In-Context Learning

Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

Main category: cs.CV

TL;DR: CIRCLES is a framework that uses attribute-guided composed image retrieval to actively construct counterfactual-style demonstration examples for vision-language models, improving their causal reasoning capabilities in in-context learning.

DetailsMotivation: Vision-language models struggle with fine-grained visual attribute disentanglement and causal reasoning. Existing retrieval-augmented ICL methods use passive similarity-based retrieval that selects correlated but non-causal examples, amplifying spurious associations and limiting model robustness.

Method: CIRCLES actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. This approach enables VLMs to implicitly reason about causal relations between attributes and outcomes.

Result: CIRCLES consistently outperforms existing methods across four diverse datasets and multiple architectures, especially on small-scale models with pronounced gains under information scarcity. It retrieves more diverse and causally informative examples.

Conclusion: The framework moves beyond superficial correlations to foster more robust and grounded reasoning in VLMs through counterfactual-style example selection in in-context learning.

Abstract: Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

[177] Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI

Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Alvaro Doria Dos Santos, Luciana Bueno Dos Reis Fernandes, Ricardo V. Godoy, Eduardo A. B. Almeida, Helena Carolina Onody, Marcelo Andrade Da Costa Vieira, Angelica Maria Penteado-Dias, Marcelo Becker

Main category: cs.CV

TL;DR: A deep learning framework using YOLO with HiResCAM for automated taxonomic identification of Ichneumonoidea parasitoid wasps from high-resolution images, achieving over 96% accuracy with interpretable visualizations of taxonomically relevant features.

DetailsMotivation: Manual identification of parasitoid wasps is labor-intensive and expertise-dependent due to morphological similarity, small body size, and fine-grained interspecific variation. There's a need for automated systems to accelerate biodiversity assessment, ecological monitoring, and biological control programs.

Method: YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) for simultaneous identification of wasp families from high-resolution images. Uses a dataset of 3556 high-resolution images across Ichneumonidae, Braconidae, Apidae, and Vespidae families.

Result: Achieved over 96% accuracy with robust generalization across morphological variations. HiResCAM visualizations confirmed the model focuses on taxonomically relevant anatomical regions like wing venation, antennae segmentation, and metasomal structures.

Conclusion: The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.

Abstract: Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96 % and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennae segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.

[178] $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Ruizhi Wang, Weihan Li, Zunlei Feng, Haofei Zhang, Mingli Song, Jiayu Wang, Jie Song, Li Sun

Main category: cs.CV

TL;DR: D³-RSMDE is an efficient framework for remote sensing monocular depth estimation that combines ViT-based structure generation with lightweight diffusion refinement to achieve high quality and real-time performance.

DetailsMotivation: Existing methods for monocular depth estimation from remote sensing imagery face a trade-off between accuracy and efficiency - ViT backbones are fast but produce poor perceptual quality, while diffusion models offer high fidelity but are computationally prohibitive for real-time applications.

Method: The framework first uses a ViT-based module to rapidly generate a preliminary depth map as structural prior, then applies Progressive Linear Blending Refinement (PLBR) with a lightweight U-Net in a compact VAE latent space for detail refinement in few iterations.

Result: Achieves 11.85% reduction in LPIPS perceptual metric over leading models like Marigold, over 40x speedup in inference, and maintains VRAM usage comparable to lightweight ViT models.

Conclusion: D³-RSMDE successfully balances speed and quality for remote sensing depth estimation by combining efficient structure generation with lightweight diffusion refinement, enabling real-time high-fidelity applications.

Abstract: Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map construction, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

[179] Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

Yiqiang Zhou, Yifan Chen, Zhe Sun, Jijun Lu, Ye Zheng, Xuelong Li

Main category: cs.CV

TL;DR: A lightweight real-time underwater image enhancement framework with adaptive color restoration and efficient architecture for deployment on underwater platforms.

DetailsMotivation: Underwater image enhancement is crucial for underwater platforms but existing methods either use complex architectures that hinder deployment or lightweight methods that sacrifice quality, especially for severely degraded images.

Method: Three main components: 1) Adaptive Weighted Channel Compensation module for dynamic color recovery using green channel as reference, 2) Multi-branch Re-parameterized Dilated Convolution for large receptive field with low computation, and 3) Statistical Global Color Adjustment module for overall color optimization based on statistical priors.

Result: Achieves state-of-the-art performance on eight datasets across seven evaluation metrics with only 3,880 inference parameters and 409 FPS inference speed. Improves UCIQE score by 29.7% and validates deployment on ROV platforms with performance gains in downstream tasks.

Conclusion: The proposed lightweight framework effectively enhances underwater images in real-time with accurate color restoration, making it suitable for deployment on resource-constrained underwater platforms.

Abstract: Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.

[180] InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia

Main category: cs.CV

TL;DR: InViC: A lightweight plug-in framework that enhances medical VQA by extracting question-conditioned visual cue tokens and using bottleneck training to force models to rely on visual evidence rather than language priors.

DetailsMotivation: Existing multimodal LLMs for medical VQA often exhibit shortcut answering by exploiting language priors or dataset biases instead of properly attending to visual evidence, undermining clinical reliability when subtle imaging findings are crucial.

Method: InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into compact question-conditioned cue tokens. It uses a two-stage fine-tuning strategy: Stage I blocks direct access to raw visual features with attention masks, forcing all visual evidence through cue tokens; Stage II restores standard attention to train joint exploitation of visual and cue tokens.

Result: InViC consistently improves over zero-shot inference and standard LoRA fine-tuning across three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) with multiple representative MLLMs.

Conclusion: Intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy medical VQA by forcing models to properly attend to visual evidence rather than relying on language priors.

Abstract: Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM’s direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.

[181] Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang

Main category: cs.CV

TL;DR: SemTok is a semantic 1D tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics, achieving state-of-the-art image reconstruction with compact representations.

DetailsMotivation: Existing visual tokenizers map images to fixed 2D spatial grids and focus on pixel-level restoration, which hinders capturing compact global semantic representations needed for better multimodal alignment and downstream task performance.

Method: Proposes SemTok with three key innovations: 1) 2D-to-1D tokenization scheme, 2) semantic alignment constraint, and 3) two-stage generative training strategy. Builds a masked autoregressive generation framework on top of SemTok.

Result: Sets new state-of-the-art in image reconstruction with superior fidelity using remarkably compact token representation. Shows notable improvements in downstream image generation tasks.

Conclusion: SemTok’s semantic 1D tokenization effectively addresses limitations of existing 2D tokenizers, enabling better capture of global semantics and improved performance in visual generation tasks.

Abstract: Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose \textbf{SemTok}, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

[182] Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network

Zhan Tong, ChenXu Zhou, Fei Tang, Yiming Tu, Tianyu Qin, Kaihao Fang

Main category: cs.CV

TL;DR: Proposes a cross-sensor calibration method using Contrastive Unpaired Translation (CUT) network to transform DMSP nighttime light data into VIIRS-like format for long-term urban monitoring analysis.

DetailsMotivation: DMSP-OLS and SNPP-VIIRS nighttime light data are crucial for urbanization monitoring, but sensor incompatibilities prevent long-term analysis. There's a need to fuse data from different sensors while correcting DMSP defects.

Method: Uses Contrastive Unpaired Translation (CUT) network with multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches. Trained on 2012-2013 overlapping data to transform 1992-2013 DMSP imagery into VIIRS-like format.

Result: Generated VIIRS-like data shows high consistency with actual VIIRS observations (R-squared > 0.87) and socioeconomic indicators. Effectively resolves cross-sensor data fusion issues and calibrates DMSP defects.

Conclusion: The CUT-based approach provides reliable solution for extended nighttime light time-series analysis by enabling cross-sensor data fusion and defect correction.

Abstract: Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using Contrastive Unpaired Translation (CUT) network to transform DMSP data into VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that generated VIIRS-like data exhibits high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing reliable attempt for extended NTL time-series.

[183] DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Stathis Galanakis, Alexandros Koliousis, Stefanos Zafeiriou

Main category: cs.CV

TL;DR: DermaFlux is a rectified flow-based text-to-image framework that generates clinically accurate skin lesion images from text descriptions to address data scarcity in dermatology classification.

DetailsMotivation: Skin lesion classification suffers from limited, imbalanced clinical datasets, leading to poor generalization. There's a need for diverse, clinically grounded synthetic data to improve model performance.

Method: Built on Flux.1 foundation model, fine-tuned with LoRA on curated clinical datasets. Uses Llama 3.2 to generate synthetic captions following dermatological criteria (asymmetry, border irregularity, color variation). Creates image-text pairs for training.

Result: DermaFlux generates diverse, clinically meaningful images that improve binary classification by 6% when augmenting small real datasets, and by 9% compared to diffusion-based synthetic images. Achieves 78.04% accuracy and 0.859 AUC with only 2,500 real images + 4,375 synthetic samples.

Conclusion: DermaFlux effectively addresses data scarcity in dermatology through text-to-image generation, significantly improving classification performance and demonstrating the value of synthetic data in medical imaging.

Abstract: Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.

[184] Near-light Photometric Stereo with Symmetric Lights

Lilika Makabe, Heng Guo, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita

Main category: cs.CV

TL;DR: A linear solution method for near-light photometric stereo using symmetric light source arrangements to derive closed-form surface normal and depth solutions without initialization.

DetailsMotivation: Conventional near-light photometric stereo methods require non-convex optimization with careful initialization and precise light calibration, which is complex and computationally expensive. The authors aim to develop a simpler, more practical approach that reduces these requirements.

Method: The method arranges multiple sets of symmetric nearby light source pairs and exploits this symmetry to derive a closed-form linear solution for surface normal and depth. It works with symmetrically distributed light sources about an arbitrary point, even when spatial offsets are uncalibrated.

Result: Experiments show the method achieves comparable shape recovery accuracy to state-of-the-art calibrated near-light photometric stereo methods while significantly reducing requirements for careful depth initialization and light calibration.

Conclusion: The proposed linear solution method provides a practical alternative to conventional optimization-based approaches for near-light photometric stereo, offering similar accuracy with reduced calibration and initialization requirements through symmetric light source arrangements.

Abstract: This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the accuracy of shape recovery accuracy of our method, achieving comparable results to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing requirements of careful depth initialization and light calibration.

[185] HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction

Jing Dai, Chen Wu, Ming Wu, Qibin Zhang, Zexi Wu, Jingdong Zhang, Hongming Xu

Main category: cs.CV

TL;DR: HGP-Mamba is a multimodal framework that integrates histopathology images with generated protein features for cancer survival risk prediction using Mamba architecture for efficient cross-modal fusion.

DetailsMotivation: The joint prognostic potential of protein markers and histopathology images remains underexplored due to high cost and limited availability of protein expression profiling. There's a need for data-efficient methods to incorporate molecular information from histology images.

Method: Proposes HGP-Mamba: 1) Protein Feature Extractor (PFE) uses pretrained foundation models to derive protein embeddings directly from Whole Slide Images, 2) Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction, 3) Global Interaction-enhanced Mamba (GiEM) for holistic modality fusion at slide level to capture complex cross-modal dependencies.

Result: Experiments on four public cancer datasets demonstrate state-of-the-art performance while maintaining superior computational efficiency compared with existing methods.

Conclusion: HGP-Mamba effectively integrates histological with generated protein features for survival risk prediction, addressing the challenge of limited protein expression data through efficient multimodal learning.

Abstract: Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capture complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.

[186] SF-Mamba: Rethinking State Space Model for Vision

Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi

Main category: cs.CV

TL;DR: SF-Mamba introduces a novel visual Mamba architecture with auxiliary patch swapping and batch folding techniques to improve bidirectional information flow and GPU parallelism for vision tasks.

DetailsMotivation: Current Mamba architectures for vision suffer from limitations: 1) recurrent scanning restricts non-causal interactions between image patches, 2) existing multi-scan strategies are inefficient due to poor scan designs and data rearrangement, and 3) Mamba has slow computational speed with short token lengths common in vision tasks.

Method: Proposes SF-Mamba with two key innovations: 1) auxiliary patch swapping to encode bidirectional information flow under unidirectional scanning by strategically rearranging patches, and 2) batch folding with periodic state reset to enhance GPU parallelism by grouping operations and managing state memory efficiently.

Result: Extensive experiments on image classification, object detection, and instance/semantic segmentation show SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes.

Conclusion: SF-Mamba provides an efficient vision encoder that addresses Mamba’s limitations for vision tasks through novel scan operations and computational optimizations, achieving better performance and throughput than existing approaches.

Abstract: The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

[187] 3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification

Muhammad Ahmad

Main category: cs.CV

TL;DR: HGFNet is a novel hyperspectral image classification architecture combining 3D convolutional feature extraction with frequency-domain global filtering using multiple Fourier transforms, plus adaptive focal loss for class imbalance.

DetailsMotivation: Existing hyperspectral image classification methods have limitations: transformer-based models suffer from quadratic complexity scaling issues, while Fourier transform methods typically use 2D spatial FFTs and ignore critical spectral dependencies inherent to hyperspectral data.

Method: Proposes Hybrid GFNet (HGFNet) integrating 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks. Introduces three complementary frequency transforms: Spectral Fourier Transform (1D FFT along spectral axis), Spatial Fourier Transform (2D FFT over spatial dimensions), and Spatial-Spatial Fourier Transform (3D FFT jointly over spectral and spatial dimensions). Also incorporates Adaptive Focal Loss to handle class imbalance.

Result: The paper claims HGFNet enables comprehensive and high-dimensional frequency modeling, captures fine-grained local spatial-spectral structures with 3D convolutions, efficiently models long-range dependencies with Fourier-based global filtering, and improves discrimination for underrepresented classes through adaptive focal loss.

Conclusion: HGFNet addresses fundamental limitations in hyperspectral image classification by combining localized 3D convolutional feature extraction with efficient frequency-domain global filtering through multiple tailored Fourier transforms, while handling class imbalance with adaptive loss functions.

Abstract: Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spatial-Spatial Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.

[188] Cross-modal learning for plankton recognition

Joona Kareinen, Veikka Immonen, Tuomas Eerola, Lumi Haraguchi, Lasse Lensu, Kaisa Kraft, Sanna Suikkanen, Heikki Kälviäinen

Main category: cs.CV

TL;DR: Self-supervised cross-modal learning for plankton recognition using images and optical measurement data without manual labeling, achieving high accuracy with minimal labeled data.

DetailsMotivation: Automated plankton imaging generates large unlabeled datasets, but current supervised methods require labor-intensive labeling. Complementary optical measurement data (scatter/fluorescence profiles) exists but is underutilized. Need for methods that can leverage both modalities without extensive labeling.

Method: Inspired by CLIP, train encoders for both image and profile modalities using only binary supervision indicating whether image-profile pairs come from same particle. Use contrastive learning to align modalities. For recognition, combine with small labeled gallery and k-NN classifier.

Result: Achieves high recognition accuracy while requiring minimal labeled images. Outperforms image-only self-supervised baseline. Creates inherently multimodal recognition model capable of utilizing both image and profile data.

Conclusion: Self-supervised cross-modal coordination enables effective plankton recognition by leveraging multiple modalities and large unlabeled data, reducing labeling requirements while improving performance.

Abstract: This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.

[189] IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video

Rasul Khanbayov, Mohamed Rayan Barhdadi, Erchin Serpedin, Hasan Kurban

Main category: cs.CV

TL;DR: IRIS is a new benchmark dataset for unsupervised physical parameter estimation from video, featuring 220 real-world 4K/60fps videos with ground-truth parameters and governing equations, covering single- and multi-body dynamics.

DetailsMotivation: Current research lacks a common benchmark for unsupervised physical parameter estimation from video, with existing methods evaluated on non-overlapping synthetic data, limited real-world datasets restricted to single-body systems, and no established protocol for governing-equation identification.

Method: Introduces IRIS benchmark with 220 real-world videos captured at 4K/60fps under controlled lab conditions, spanning single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Defines standardized evaluation protocol covering parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection.

Result: Establishes reference performance across all IRIS scenarios using multiple baselines including multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling). Exposes systematic failure modes.

Conclusion: IRIS provides a comprehensive benchmark for evaluating unsupervised physical parameter estimation from video, with publicly released dataset, annotations, evaluation toolkit, and baseline implementations to facilitate future research in this area.

Abstract: Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60,fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.

[190] CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection

Junseok Lee, Sungho Shin, Seongju Lee, Kyoobin Lee

Main category: cs.CV

TL;DR: CD-FKD improves single-domain generalization for object detection using cross-domain feature knowledge distillation with global and instance-wise feature alignment between teacher (original data) and student (corrupted data) networks.

DetailsMotivation: Object detection models trained on single source domains struggle with domain shifts (weather, lighting, scene changes) when deployed to unseen target domains. Current methods lack robustness to such variations, limiting real-world applications like autonomous driving and surveillance.

Method: Proposes Cross-Domain Feature Knowledge Distillation (CD-FKD) with teacher-student framework: teacher network trains on original source domain data while student network trains on diversified data (downscaled and corrupted versions). Student mimics teacher features through both global feature distillation (overall feature alignment) and instance-wise distillation (object-level feature alignment).

Result: Extensive experiments show CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance. The method effectively handles corrupted/difficult objects and demonstrates robustness to domain shifts.

Conclusion: CD-FKD successfully enhances object detection generalization to unseen domains through cross-domain feature distillation, making it valuable for real-world applications requiring robust detection in diverse environments.

Abstract: Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.

[191] Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

Hunain Ahmed Jillani, Ahmed Tawfik Aboukhadra, Ahmed Elhayek, Jameel Malik, Nadia Robertini, Didier Stricker

Main category: cs.CV

TL;DR: Lightweight 3D hand reconstruction using knowledge distillation to accelerate HaMeR model for real-time applications on resource-constrained devices.

DetailsMotivation: Real-time 3D hand reconstruction is essential for VR/AR, HCI, robotics, and healthcare, but current state-of-the-art methods are too heavy for resource-constrained devices like headsets and smartphones.

Method: Replace HaMeR’s original ViT-H backbone with lightweight alternatives (MobileNet, MobileViT, ConvNeXt, ResNet) and apply three knowledge distillation strategies: output-level, feature-level, and hybrid.

Result: Lightweight backbones at 35% of original size achieve 1.5x faster inference speed with minimal accuracy loss (0.4mm difference). Output-level distillation improves student performance, while feature-level works better for higher-capacity students.

Conclusion: Knowledge distillation with lightweight backbones enables efficient 3D hand reconstruction suitable for real-world applications on low-power devices while maintaining comparable accuracy.

Abstract: Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.

[192] Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

Xingyu Liu, Zewei He, Yu Chen, Chunyu Zhu, Zixuan Chen, Xing Luo, Zhe-Ming Lu

Main category: cs.CV

TL;DR: A diffusion-based framework (DiffUR³) for unified removal of raindrops and reflections from images captured through glass surfaces, with a new real-shot dataset RDRF.

DetailsMotivation: Images captured through glass surfaces on rainy days suffer from both raindrops and reflections, creating a composite degradation problem that existing de-raindrop, de-reflection, or all-in-one models fail to address effectively.

Method: Proposes DiffUR³, a diffusion-based framework with targeted designs for unified removal of raindrops and reflections (UR³). Introduces the RainDrop and ReFlection (RDRF) dataset as a new benchmark with substantial, high-quality, diverse image pairs.

Result: Extensive experiments show state-of-the-art performance on the RDRF benchmark and challenging in-the-wild images, successfully removing both types of degradations.

Conclusion: The work formally defines the UR³ task, provides a new dataset benchmark, and demonstrates that diffusion-based approaches can effectively handle this challenging composite degradation problem.

Abstract: When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur to significantly reduce the visibility of captured images. This practical problem lacks attention and needs to be resolved urgently. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we first formally define the unified removal of raindrops and reflections (UR$^3$) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR$^3$) with several target designs to address this challenging task. By leveraging the powerful generative prior, DiffUR$^3$ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the codes will be made public upon acceptance.

[193] ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

Kaiwen Song, Jinkai Cui, Juyong Zhang

Main category: cs.CV

TL;DR: ProgressiveAvatars: A progressive 3D avatar representation using hierarchical 3D Gaussians with adaptive implicit subdivision for real-time XR/telepresence applications under fluctuating network/compute resources.

DetailsMotivation: Real-time XR and telepresence applications face fluctuating network and computing resources, requiring progressive 3D representations that can adapt to varying bandwidth and resource constraints while maintaining animatability.

Method: Builds progressive avatar representation using hierarchy of 3D Gaussians grown by adaptive implicit subdivision on template mesh. Gaussians are defined in face-local coordinates to remain animatable across expressions and head motion. Hierarchy expands based on screen-space signals indicating lack of detail, with importance ranking for incremental loading and rendering.

Result: Enables progressive delivery and rendering under fluctuating network bandwidth and varying compute/memory resources, supporting smooth quality improvements as new Gaussians arrive while preserving previous content.

Conclusion: ProgressiveAvatars provides an effective solution for adaptive 3D avatar representation in resource-constrained real-time XR and telepresence applications through hierarchical Gaussian-based progressive rendering.

Abstract: In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

[194] TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection

Pietro Bonazzi, Rafael Sutter, Luigi Capogrosso, Mischa Buob, Michele Magno

Main category: cs.CV

TL;DR: TinyGLASS is a lightweight adaptation of GLASS for real-time anomaly detection on edge vision sensors, achieving 8.7x parameter compression while maintaining competitive performance on industrial benchmarks.

DetailsMotivation: Industrial anomaly detection needs lightweight solutions for resource-constrained edge platforms, as existing self-supervised approaches like GLASS have high computational requirements that limit deployment on devices like the Sony IMX500 intelligent vision sensor.

Method: Replaces WideResNet-50 backbone with compact ResNet-18, introduces deployment-oriented modifications for static graph tracing and INT8 quantization using Sony’s Model Compression Toolkit, and evaluates on MVTec-AD benchmark with custom MMS Dataset for cross-device evaluation.

Result: Achieves 94.2% image-level AUROC on MVTec-AD, 8.7x parameter compression, operates at 20 FPS within 8 MB memory constraints, with low power consumption (4.0 mJ per inference) and high energy efficiency (470 GMAC/J).

Conclusion: TinyGLASS enables real-time in-sensor anomaly detection on edge platforms while maintaining competitive performance and robustness to training data contamination, making it suitable for industrial quality control applications.

Abstract: Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony’s Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.

[195] Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li, Jinyue Guo, Yaqi Wang, Haiyang Xiao, Yuewei Zhang, Guohua Liu, Hao Henry Wang

Main category: cs.CV

TL;DR: Evo-Retriever: A retrieval framework using LLM-guided curriculum evolution with Viewpoint-Pathway collaboration to improve cross-modal document retrieval by enhancing fine-grained matching and adaptive training.

DetailsMotivation: Real-world document heterogeneity and unstructuredness disrupt consistency of cross-modal embeddings in visual-language models. Traditional training with limited samples and static strategies cannot adapt to model's dynamic evolution, causing cross-modal retrieval confusion.

Method: 1) Multi-view image alignment for fine-grained matching via multi-scale and multi-directional perspectives. 2) Bidirectional contrastive learning generates “hard queries” and establishes complementary learning paths for visual/textual disambiguation. 3) LLM meta-controller adaptively adjusts training curriculum using expert knowledge based on model-state summary.

Result: Achieves state-of-the-art performance on ViDoRe V2 and MMEB (VisDoc) with nDCG@5 scores of 65.2% and 77.1% respectively.

Conclusion: Evo-Retriever effectively addresses cross-modal retrieval challenges in heterogeneous documents through adaptive curriculum evolution and fine-grained alignment techniques.

Abstract: Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates “hard queries” and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model’s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.

[196] GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Jiaxin Zhang, Junjun Jiang, Haijie Li, Youyu Chen, Kui Jiang, Dave Zhenyu Chen

Main category: cs.CV

TL;DR: GAP-MLLM introduces a geometry-aligned pre-training paradigm to activate 3D spatial perception in multimodal LLMs by predicting sparse pointmaps alongside semantic labels, improving performance on 3D vision tasks.

DetailsMotivation: MLLMs excel at semantic reasoning but struggle with 3D spatial perception from RGB inputs. The performance gap compared to explicit 3D methods stems from misaligned training paradigms where text-dominated fine-tuning fails to activate geometric representations, and naive feature concatenation leads to suboptimal structural utilization.

Method: Proposes GAP-MLLM with geometry-aligned pre-training: 1) Visual-prompted joint task forcing MLLMs to predict sparse pointmaps alongside semantic labels to enforce geometric awareness, 2) Multi-level progressive fusion module with token-level gating for adaptive integration of geometric priors without suppressing semantic reasoning.

Result: Extensive experiments show GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.

Conclusion: The geometry-aligned pre-training paradigm effectively activates structural perception in MLLMs before downstream adaptation, addressing the fundamental misalignment in training approaches and enabling better 3D spatial understanding from RGB inputs.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently enhances performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.

[197] DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement

Yicui Shi, Yuhan Chen, Xiangfei Huang, Zhenguo Wang, Wenxuan Yu, Ying Fang

Main category: cs.CV

TL;DR: DST-Net: A dual-stream transformer network for low-light image enhancement using illumination-agnostic signal priors and multi-scale spatial convolutions to preserve textures and fine structures.

DetailsMotivation: Existing low-light image enhancement methods often cause severe loss of intrinsic signal priors and fail to preserve fine structures and textures, leading to suboptimal visual quality.

Method: Proposes DST-Net with: 1) Feature extraction module using DoG, LAB color space, and VGG-16 for illumination-agnostic signal priors; 2) Dual-stream interaction architecture with cross-modal attention to guide enhancement; 3) Multi-Scale Spatial Fusion Block (MSFB) with pseudo-3D and 3D gradient operator convolutions to recover high-frequency edges and capture spatial correlations.

Result: Achieves PSNR of 25.64 dB on LOL dataset and demonstrates robust cross-scene generalization on LSRW dataset, with superior performance in both subjective visual quality and objective metrics.

Conclusion: DST-Net effectively addresses signal degradation in low-light images by leveraging illumination-agnostic priors and multi-scale spatial convolutions, achieving state-of-the-art enhancement while preserving fine structures and textures.

Abstract: Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions. Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.

[198] Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

Hyundo Choi, Junhyeong An, Jinseong Park, Jaewoong Choi

Main category: cs.CV

TL;DR: UOT-Unlearn: A plug-and-play class unlearning framework for one-step generative models using Unbalanced Optimal Transport to safely remove target classes while preserving generation quality.

DetailsMotivation: One-step generative models (like flow map models) are efficient but lack safety mechanisms for machine unlearning. Existing diffusion unlearning methods are incompatible with one-step models due to their reliance on multi-step iterative processes.

Method: Formulates unlearning as a principled trade-off using Unbalanced Optimal Transport: forget cost suppresses target class + f-divergence penalty preserves overall generation fidelity via relaxed marginal constraints. Enables smooth redistribution of forgotten class probability mass to remaining classes.

Result: Experimental results on CIFAR-10 and ImageNet-256 show superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.

Conclusion: UOT-Unlearn provides an effective solution for machine unlearning in one-step generative models, addressing safety concerns while maintaining generation quality.

Abstract: Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on the Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.

[199] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi

Main category: cs.CV

TL;DR: VIEW2SPACE: A multi-view visual reasoning benchmark using physically grounded simulation to create diverse 3D scenes with precise metadata, enabling scalable data generation and evaluation of models on sparse multi-view reasoning tasks.

DetailsMotivation: Multi-view visual reasoning is crucial for intelligent systems but existing research focuses on single-image or dense video settings. Real-world scenarios require integrating partial observations without guidance, and collecting large-scale multi-view data with accurate annotations is challenging.

Method: Leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata for scalable data generation. Create VIEW2SPACE benchmark with millions of grounded question-answer pairs and a scalable disjoint training split. Propose Grounded Chain-of-Thought with Visual Evidence method.

Result: State-of-the-art vision-language and spatial models perform only marginally above random guessing on multi-view reasoning. The proposed Grounded Chain-of-Thought method substantially improves performance under moderate difficulty and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation.

Conclusion: Multi-view reasoning remains largely unsolved despite scaling efforts. While geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge that current models struggle with.

Abstract: Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.

[200] An approximate graph elicits detonation lattice

Vansh Sharma, Venkat Raman

Main category: cs.CV

TL;DR: A graph theory-based algorithm for precise segmentation and measurement of detonation cells from 3D pressure traces, addressing limitations of manual and 2D methods in detonation research.

DetailsMotivation: To overcome the limitations of manual and primitive 2D edge detection methods for analyzing detonation cells, which have been a longstanding challenge in detonation research. Current methods lack precision and robustness for complex 3D cellular patterns.

Method: A training-free graph theory-based algorithm that segments and measures detonation cells from 3D pressure traces (detonation lattices). The approach uses a graph-based workflow to extract cellular patterns and quantify their statistical properties.

Result: The algorithm achieves 2% prediction error on generated data. On 3D simulation data, it reveals oblong cells aligned with wave propagation axis with 17% deviation, and shows cubic amplification of linear variability in volume dispersion. The framework generalizes across diverse cellular geometries.

Conclusion: The graph-based formulation provides a robust tool for detonation analysis and serves as a strong foundation for future extensions in triple-point collision studies, though challenges remain for highly complex cellular patterns.

Abstract: This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonations research. First, the efficacy of segmentation on generated data is shown with a prediction error 2%. Next, 3D simulation data is used to establish performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.

[201] Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty

Mangyu Kong, Jaewon Lee, Seongwon Lee, Euntai Kim

Main category: cs.CV

TL;DR: A robust relocalization framework combining Monte Carlo pose sampling with Fisher Information-based PnP optimization to address uncertainties in 3D Gaussian Splatting-based pose refinement.

DetailsMotivation: 3D Gaussian Splatting (3DGS) is increasingly used for visual localization and pose refinement, but its robustness remains highly sensitive to initial camera pose and reconstructed geometry. The paper identifies two major uncertainty sources: pose prior uncertainty from regression/retrieval models and geometric uncertainty from 3DGS reconstruction imperfections that propagate errors into PnP solvers.

Method: Introduces a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. The method explicitly accounts for both pose and geometric uncertainty without requiring retraining or additional supervision.

Result: Across diverse indoor and outdoor benchmarks, the approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

Conclusion: The proposed framework effectively addresses uncertainties in 3DGS-based pose refinement, providing more robust and accurate visual localization without additional training requirements.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

[202] Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Jilles S. van Hulst, W. P. M. H., Heemels, Duarte J. Antunes

Main category: cs.CV

TL;DR: VAE-EM framework for automated STEM calibration using low-dimensional image representations to overcome simulation-to-reality gap and achieve better parameter estimation.

DetailsMotivation: Automating microscope parameter calibration is challenging due to high-dimensional noisy images, inability to identify optimal parameters from single images, and simulation-to-reality gaps in data-driven methods.

Method: Use VAEs trained on simulated data to learn low-dimensional image representations, then jointly estimate mapping model and optimal parameters using EM approach, leveraging optical symmetry for identifiability.

Result: 2x reduction in estimation error, faster and more consistent than existing methods on real STEM, requiring fewer observations for calibration.

Conclusion: VAE-EM framework advances automated STEM calibration and demonstrates VAE potential for image information compression; applicable to inverse problems with simulation-to-reality gaps.

Abstract: Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images. Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.

[203] CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Mahmoud Ibrahim, Bart Elen, Chang Sun, Gokhan Ertaylan, Michel Dumontier

Main category: cs.CV

TL;DR: CompDiff: A hierarchical compositional diffusion framework for fair medical image generation that addresses the imbalanced generator problem by decomposing demographic conditioning at the representation level.

DetailsMotivation: Generative models for medical imaging often inherit demographic imbalances from training data, producing degraded synthesis quality for rare subgroups and struggling with unseen demographic intersections. Existing methods like loss reweighting provide limited benefit when training signal is scarce for certain combinations.

Method: Proposes CompDiff with a Hierarchical Conditioner Network (HCN) that decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization.

Result: Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show CompDiff outperforms standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections).

Conclusion: Architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. CompDiff enables improved fairness and generalization in medical image synthesis.

Abstract: Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

[204] Understanding Cell Fate Decisions with Temporal Attention

Florian Bürger, Martim Dias Gomes, Adrián E. Granada, Noémie Moreau, Katarzyna Bozek

Main category: cs.CV

TL;DR: Transformer-based deep learning model predicts cancer cell fate from live-cell video recordings with high accuracy, using attention mechanisms to identify temporal and morphological cues without predefined features.

DetailsMotivation: Understanding non-genetic determinants of cell fate is crucial for cancer therapy development, as genetically identical cells can have divergent outcomes under the same treatment conditions. Current approaches lack comprehensive temporal analysis of cell behavior.

Method: Transformer model trained on raw long-term live-cell recordings to predict cell fate directly from image sequences without predefined features. Includes explainability framework using attention mechanisms and masking experiments to interpret temporal and morphological cues.

Result: Model achieves balanced accuracy of 0.94 and F1-score of 0.93. Attention analysis shows predictive signals exist up to 10 hours before cell fate events, with distinct temporal patterns for mitotic vs apoptotic sequences. Reveals role of cell morphology and p53 signaling.

Conclusion: Attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making, advancing understanding of cancer cell behavior under treatment.

Abstract: Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model’s predictions. We demonstrate that prediction of cell outcomes is possible based on the video only, our model achieves balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at https://github.com/bozeklab/Cell-Fate-Prediction.

[205] VideoMatGen: PBR Materials through Joint Generative Modeling

Jon Hasselgren, Zheng Zeng, Milos Hasan, Jacob Munkberg

Main category: cs.CV

TL;DR: A video diffusion transformer method for generating physically-based materials (color, roughness, metallicity, height maps) for 3D shapes conditioned on geometry and text prompts, using a custom VAE for multi-modal latent encoding.

DetailsMotivation: To enable automatic generation of physically plausible materials for 3D shapes from text descriptions, addressing the challenge of creating multiple material properties simultaneously while maintaining compatibility with content creation tools.

Method: Uses a video diffusion transformer architecture conditioned on input geometry and text. Introduces a custom variational auto-encoder that encodes multiple material modalities (base color, roughness, metallicity, height map) into a compact latent space to enable joint generation without increasing token count.

Result: Generates high-quality, physically plausible materials for 3D shapes from text prompts, with outputs compatible with common content creation tools.

Conclusion: The method successfully generates multiple material properties jointly for 3D shapes using text conditioning, with efficient multi-modal encoding through a custom VAE.

Abstract: We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.

[206] Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Amirhossein Kazerouni, Maitreya Suin, Tristan Aumentado-Armstrong, Sina Honari, Amanpreet Walia, Iqbal Mohomed, Konstantinos G. Derpanis, Babak Taati, Alex Levinshtein

Main category: cs.CV

TL;DR: Face2Scene: A two-stage framework that uses restored faces as degradation oracles to guide full-scene image restoration via diffusion models.

DetailsMotivation: Current reference-based face restoration models only fix facial regions, ignoring degradation in body and background. Full-scene restorers often ignore degradation cues, leading to artifacts. Need a method that leverages facial restoration to guide complete scene restoration.

Method: Two-stage approach: 1) Use Ref-FR model to restore high-quality facial details from degraded input and identity references. 2) Extract face-derived degradation code from restored-degraded face pair, transform into multi-scale degradation-aware tokens, and condition a diffusion model to restore entire image (body + background) in single step.

Result: Extensive experiments show superior effectiveness compared to state-of-the-art methods for full-scene restoration.

Conclusion: Face2Scene successfully leverages facial restoration as perceptual oracle to estimate degradation and guide comprehensive scene restoration, addressing limitations of existing methods.

Abstract: Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

[207] REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu

Main category: cs.CV

TL;DR: REFORGE is a black-box red-teaming framework that evaluates the robustness of Image Generation Model Unlearning (IGMU) against adversarial image prompts, exposing vulnerabilities in current unlearning methods.

DetailsMotivation: While image generation models enable high-fidelity content creation, they also amplify risks like reproducing copyrighted content and generating offensive content. Image Generation Model Unlearning (IGMU) aims to mitigate these risks by removing harmful concepts without full retraining, but its robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored.

Method: REFORGE is a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. It initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity.

Result: Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines.

Conclusion: The results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks.

Abstract: Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.

[208] On the Transfer of Collinearity to Computer Vision

Frederik Beuth, Danny Kowerko

Main category: cs.CV

TL;DR: The paper explores transferring the human visual perception principle of collinearity to computer vision, demonstrating its effectiveness in industrial applications like wafer defect detection and nanotechnology materials, but not for general datasets like ImageNet.

DetailsMotivation: To transfer the human visual perception phenomenon of collinearity (amplification of spatially aligned edges) to computer vision applications, exploring its potential uses in engineering and computer vision where this principle has been largely unexplored.

Method: Developed a prototype model to implement the collinearity principle, then systematically tested and benchmarked it across four use cases: combining collinearity with deep learning for wafer defect detection, using it with saliency models for nanotechnology materials, testing on occlusions, and evaluating on ImageNet.

Result: Collinearity improved wafer fault detection by 1.24x (error rate decreased from 6.5% to 5.26%), enhanced nanotechnology defect recognition by 3.2x (error from 21.65% to 6.64%), was beneficial for occlusion scenarios, but not effective for ImageNet. The principle works best with man-made structures containing lines.

Conclusion: Collinearity is a valuable tool for computer vision, particularly in industrial applications where images contain man-made linear structures, but not beneficial for general natural image datasets like ImageNet.

Abstract: Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it is vague for which purpose humans might have this principle in the real-world, and its utilization in computer vision and engineering applications even is a largely unexplored field. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to spawn a broad range of potential applications and scenarios: sketching the combination of collinearity with deep learning (case I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity is able to improve the fault detection of wafers and obtain a performance increase by a factor 1.24 via collinearity (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test the defect recognition in nanotechnology materials and achieve a performance increase by 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As third experiment, we cover occlusions; while as fourth experiment, we test ImageNet and observe that it might not be very beneficial for ImageNet. Therefore, we can assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions), and for what is not beneficial (ImageNet). Hence, we infer collinearity might be suitable for industry applications as it helps if the image structures of interest are man-made because they often consist of lines. Our work provides another tool for CV, hope to capture the power of human processing.

[209] FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation

Fangjing Li, Zhihai Wang, Xinxin Ding, Haiyang Liu, Ronghua Gao, Rong Wang, Yao Zhu, Ming Jin

Main category: cs.CV

TL;DR: FSMC-Pose is a top-down framework for cattle mounting pose estimation that uses frequency-spatial fusion and multiscale self-calibration to handle cluttered backgrounds and inter-animal occlusion.

DetailsMotivation: Mounting posture is a key visual indicator of estrus in dairy cattle, but reliable pose estimation in real-world environments is challenging due to cluttered backgrounds and frequent inter-animal occlusion.

Method: FSMC-Pose integrates CattleMountNet (with Spatial Frequency Enhancement Block and Receptive Aggregation Block) and SC2Head (Spatial-Channel Self-Calibration Head). The framework uses frequency-spatial fusion to separate cattle from backgrounds and multiscale self-calibration to handle occlusion.

Result: FSMC-Pose achieves higher accuracy than strong baselines with lower computational costs, maintains real-time inference on commodity GPUs, and effectively captures cattle mounting pose in complex environments. A new dataset MOUNT-Cattle with 1176 mounting instances is created.

Conclusion: The proposed FSMC-Pose framework successfully addresses challenges in cattle mounting pose estimation through innovative frequency-spatial fusion and self-calibration techniques, enabling practical deployment in real-world dairy farm environments.

Abstract: Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC-Pose.

[210] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLMReward Models

Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.CV

TL;DR: Proxy-GRM improves generative reward models for VLMs by optimizing rubric quality through proxy-guided verification in RL, achieving SOTA results with better rubric transferability.

DetailsMotivation: Current generative reward models for vision-language models use intermediate rubrics that are rarely optimized directly, relying on expensive LLM-as-judge checks without differentiable signals or training guidance.

Method: Introduces proxy-guided rubric verification into RL by training lightweight proxy agents (Proxy-SFT and Proxy-RL) that predict preference ordering using only rubrics as evidence, with prediction accuracy serving as rubric-quality reward.

Result: Achieves state-of-the-art results on VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench with ~50k data samples, outperforming methods trained on 4x more data. Rubrics transfer to unseen evaluators.

Conclusion: Proxy-GRM effectively optimizes rubric quality in generative reward models, improving reward accuracy and transferability while requiring less training data.

Abstract: Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy’s prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.

[211] ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello

Main category: cs.CV

TL;DR: ACPV-Net generates complete vector maps from aerial imagery with shared boundaries and no gaps/overlops across all land-cover classes in a single run.

DetailsMotivation: Existing polygonization methods are class-specific, leading to topological inconsistencies (duplicated edges, gaps, overlaps) when extended to multiple classes via per-class runs.

Method: Proposes ACPV-Net with Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, and topological reconstruction enforcing shared-edge consistency by design.

Result: ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512 benchmark, and achieves best-reported results on WHU-Building for single-class polygonal vectorization.

Conclusion: The paper introduces the ACPV task, releases the first public benchmark, and demonstrates that ACPV-Net can generate topologically consistent vector maps across all classes while maintaining high quality.

Abstract: We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: https://github.com/HeinzJiao/ACPV-Net.

[212] TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation

Qiang He, Wentian Qu, Jiajia Dai, Changsong Lei, Shaofeng Wang, Feifei Zuo, Yajie Wang, Yaqian Liang, Xiaoming Deng, Cuixia Ma, Yong-Jin Liu, Hongan Wang

Main category: cs.CV

TL;DR: TCATSeg: A 3D dental model segmentation framework combining local geometric features with global semantic context using sparse superpoints for improved accuracy.

DetailsMotivation: Accurate 3D dental model segmentation is crucial for digital dentistry applications, but existing methods struggle due to complex tooth arrangements and shape similarities, often focusing too much on local geometry while neglecting global context.

Method: Proposes TCATSeg framework that combines local geometric features with global semantic context using sparse yet physically meaningful superpoints to capture global semantic relationships. Also introduces a new dataset of 400 dental models including pre-orthodontic samples.

Result: Extensive experiments show TCATSeg outperforms state-of-the-art approaches in 3D dental model segmentation accuracy.

Conclusion: TCATSeg effectively addresses the limitations of existing methods by incorporating global semantic context alongside local geometric features, demonstrating superior performance on dental segmentation tasks.

Abstract: Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.

[213] MLLM-based Textual Explanations for Face Comparison

Redwan Sony, Anil K Jain, Ross Arun

Main category: cs.CV

TL;DR: MLLMs generate unreliable explanations for face verification decisions, often hallucinating facial attributes not supported by visual evidence, even when verification decisions are correct.

DetailsMotivation: While MLLMs are being used to generate natural-language explanations for face recognition decisions to improve human interpretability, the reliability of these explanations on unconstrained face images (especially with extreme pose variation and surveillance imagery) remains underexplored.

Method: Systematically analyze MLLM-generated explanations for unconstrained face verification on the challenging IJB-S dataset, study the effect of incorporating traditional face recognition system information (scores and decisions), and introduce a likelihood-ratio-based framework to measure evidential strength of textual explanations beyond decision accuracy.

Result: Even when MLLMs produce correct verification decisions, explanations frequently rely on non-verifiable or hallucinated facial attributes not supported by visual evidence. Incorporating traditional face recognition information improves categorical verification performance but doesn’t consistently lead to faithful explanations.

Conclusion: Current MLLMs have fundamental limitations for explainable face recognition, highlighting the need for principled evaluation of reliable and trustworthy explanations in biometric applications.

Abstract: Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.

[214] FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Zhenqi He, Lin Li, Long Chen

Main category: cs.CV

TL;DR: FlowComposer introduces flow matching for compositional zero-shot learning, using primitive flows to transport visual features to attribute/object embeddings and a Composer to fuse them, with leakage-guided augmentation to handle residual feature entanglement.

DetailsMotivation: Current CZSL methods using vision-language models with PEFT suffer from implicit composition construction (only token concatenation) and remained feature entanglement, limiting generalization to unseen attribute-object compositions.

Method: FlowComposer learns two primitive flows to transport visual features toward attribute and object text embeddings, plus a learnable Composer that explicitly fuses their velocity fields into a composition flow. Includes leakage-guided augmentation to reuse leaked features as auxiliary signals.

Result: Significant improvements on three public CZSL benchmarks when integrated as plug-and-play component into various baselines.

Conclusion: Flow matching provides an effective framework for CZSL by enabling explicit composition operations in embedding space and handling residual feature entanglement through leakage-guided augmentation.

Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.

[215] BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Melissa Schween, Mathis Kruse, Bodo Rosenhahn

Main category: cs.CV

TL;DR: BUSSARD uses normalizing flows and language model embeddings to detect anomalous relationships in scene graphs from images, achieving better performance and speed than SOTA.

DetailsMotivation: To develop an effective method for detecting anomalous relationships in scene graphs generated from images, leveraging semantic knowledge from language models and probabilistic modeling for robust anomaly detection.

Method: Multimodal approach combining scene graphs with language model embeddings: object-relation-object triplets are embedded using a language model, then a normalizing flow learns bijective transformations to map these to a Gaussian base distribution for likelihood-based anomaly detection.

Result: Achieves ~10% better AUROC than state-of-the-art on SARD dataset (office/dining scenes), 5x faster inference, and shows superior robustness to synonyms with only minor performance deviation vs 17.5% for baseline.

Conclusion: Demonstrates strong potential of learning-based methods for relationship anomaly detection in scene graphs, with normalizing flows and language model embeddings providing effective multimodal semantic understanding.

Abstract: We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .

[216] Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu, Ziheng Ouyang, Yijia Kang, Qilong Wang, Mi Zhou, Bo Li, Ming-Ming Cheng, Qibin Hou

Main category: cs.CV

TL;DR: StyleExpert is a semantic-aware diffusion-based stylization framework using Mixture of Experts (MoE) to handle diverse styles across multiple semantic levels while preserving material details.

DetailsMotivation: Existing diffusion-based stylization methods are limited to color-driven transformations and neglect complex semantics and material details, creating a need for more sophisticated semantic-aware stylization approaches.

Method: Uses a unified style encoder trained on large-scale content-style-stylized triplets to embed diverse styles into a consistent latent space, then employs similarity-aware gating to dynamically route styles to specialized experts within a Mixture of Experts architecture.

Result: Outperforms existing approaches in preserving semantics and material details while generalizing to unseen styles, as demonstrated through extensive experiments.

Conclusion: StyleExpert effectively handles diverse styles spanning multiple semantic levels (from shallow textures to deep semantics) through its MoE-based framework and unified style encoding approach.

Abstract: Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details.We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: https://hh-lg.github.io/StyleExpert-Page/.

[217] TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark

Hyunjong Ok, Jaeho Lee

Main category: cs.CV

TL;DR: The paper introduces Frame Selection Sensitivity (FSS) to measure how much VLM accuracy changes when replacing most relevant frames with least relevant ones, finding most video QA samples are frame-agnostic, and creates TempCore subsets to isolate temporally sensitive samples.

DetailsMotivation: Current video QA benchmarks may not genuinely require temporal frame selection - many questions might be answerable regardless of which frames are shown. The authors want to understand how much VLMs actually rely on temporal information versus being frame-agnostic.

Method: Introduces Frame Selection Sensitivity (FSS), a per-sample diagnostic that measures VLM accuracy change when replacing most relevant frames with least relevant ones. Combines FSS with Language Independence Score (LIS) to identify Temporally Sensitive samples. Constructs TempCore, compact evaluation subsets that isolate these temporal samples from existing benchmarks.

Result: Across six benchmarks and eight VLMs, a large majority of samples are frame-agnostic. Only 8-33% of samples are genuinely Temporally Sensitive. The TempCore subsets successfully isolate these temporal samples for more meaningful evaluation.

Conclusion: Most current video QA benchmarks contain many frame-agnostic samples, questioning whether they truly evaluate temporal understanding. TempCore provides more focused evaluation subsets that isolate temporally sensitive samples for better assessment of VLMs’ temporal reasoning capabilities.

Abstract: Vision-language models (VLMs) can ingest only a limited number of video frames, making frame selection a practical necessity. But do current Video QA benchmarks genuinely require temporal frame selection, or can most questions be answered regardless of which frames are shown? We introduce Frame Selection Sensitivity (FSS), a per-sample diagnostic that measures how much VLM accuracy changes when the most relevant frames are replaced with the least relevant ones. Across six benchmarks and eight VLMs, we find that a large majority of samples are frame-agnostic: only a minority are genuinely sensitive to frame choice. Combining FSS with a Language Independence Score (LIS) reveals that merely 8–33% of samples are Temporally Sensitive. We construct TempCore, compact evaluation subsets that isolate these temporal samples from existing benchmarks, and will release code and per-sample annotations upon publication.

[218] Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

Chenchang Liu, Felix Fornoff, Annika Grasreiner, Patrick Maeder, Henri Greil, Marco Seeland

Main category: cs.CV

TL;DR: Deep learning approach with novel Constrained False Positive Loss for detecting and classifying brood cells in layer trap nests to monitor cavity-nesting wild bees and wasps, addressing labeling challenges and class imbalance.

DetailsMotivation: Manual evaluation of layer trap nests for monitoring cavity-nesting wild bees and wasps is labor-intensive and time-consuming. The research aims to automate brood cell detection and classification to support biodiversity research and conservation efforts.

Method: Proposes a deep learning approach with a novel Constrained False Positive Loss (CFPL) strategy that dynamically masks predictions from unlabeled data to prevent interference with classification loss during training, addressing labeling challenges and class imbalance.

Result: Deep learning effectively detects brood cells in LTNs. The CFPL method improves performance, balances model accuracy with labeling effort, and mitigates class imbalance on a dataset of 712 LTN images covering 28 fine-grained classes.

Conclusion: The proposed deep learning approach with CFPL enables efficient monitoring of cavity-nesting insects, reducing manual labeling effort while maintaining accuracy, which supports biodiversity conservation research.

Abstract: Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.

[219] HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Md Jahidul Islam

Main category: cs.CV

TL;DR: HeBA introduces modality-specific architectural biases for VLM adaptation, using 2D convolutions for visual tokens and dense projections for text tokens, with bottleneck regularization and active gradient initialization for improved few-shot performance.

DetailsMotivation: Current VLM adaptation methods use homogeneous architectures that ignore the distinct structural nature of visual (spatial locality) and textual (semantic density) modalities, leading to suboptimal performance.

Method: HeBA introduces three innovations: 1) Heterogeneous processing with 2D depthwise-separable convolutions for visual tokens and dense linear projections for text tokens, 2) Bottleneck regularization with compression (D→D/4), and 3) Active gradient initialization using Kaiming initialization instead of zero-initialization.

Result: HeBA achieves superior stability and accuracy, establishing new state-of-the-art on 11 few-shot benchmarks, demonstrating the effectiveness of modality-specific architectural biases.

Conclusion: Modality-specific architectural design with structural inductive biases significantly improves VLM adaptation performance, offering a more effective approach than homogeneous adapter architectures.

Abstract: Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a “one-size-fits-all” architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities – spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone’s pre-trained knowledge. Extensive experiments demonstrate that HeBA’s architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.

[220] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim

Main category: cs.CV

TL;DR: ERGO is an efficient vision-language model that uses a coarse-to-fine reasoning pipeline to reduce computational costs by processing only task-relevant image regions at full resolution, achieving better accuracy with fewer vision tokens.

DetailsMotivation: Existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to processing many vision tokens. The emergence of "thinking with images" models enables visual reasoning, motivating a more efficient approach that focuses computational resources only on relevant image regions.

Method: Two-stage coarse-to-fine reasoning pipeline: 1) Analyze downsampled image to identify task-relevant regions, 2) Crop only those regions at full resolution for detailed processing. Uses reasoning-driven perception with reinforcement learning rewards to handle perceptual uncertainty and expand cropped regions for ambiguous areas.

Result: ERGO surpasses Qwen2.5-VL-7B on V* benchmark by 4.7 points while using only 23% of vision tokens, achieving 3x inference speedup. Outperforms original models and competitive methods across multiple datasets with greater efficiency.

Conclusion: The proposed reasoning-driven perception approach enables efficient high-resolution image processing for vision-language tasks by intelligently focusing computational resources on relevant regions, balancing accuracy and efficiency.

Abstract: Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of “thinking with images” models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage “coarse-to-fine” reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception-leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.

[221] Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization

Taiqin Chen, Yifeng Wang, Xiaochen Feng, Zhilin Zhu, Hao Sha, Yingjian Li, Yongbing Zhang

Main category: cs.CV

TL;DR: SPDDA is a spectral property-driven data augmentation method for hyperspectral image domain generalization that balances realism and diversity by modeling spectral channel variations and mixing while preserving spatial structure.

DetailsMotivation: Hyperspectral images suffer from domain shift issues due to sensor variability and high dimensionality. Existing data augmentation methods for single-source domain generalization face a tradeoff between realism and diversity - blind augmentation produces unrealistic samples while excessive realism limits diversity, both harming generalization to target domains.

Method: Proposes SPDDA with: 1) Spectral diversity module that resamples data along spectral dimension to generate samples with varying spectral channels; 2) Channel-wise adaptive spectral mixer modeling inter-channel similarity to avoid fixed patterns; 3) Spatial-spectral co-optimization with spatial fidelity constraint and spectral continuity self-constraint; 4) Adaptive weight adjustment for spectral constraint based on spatial counterpart to prevent over-smoothing.

Result: Extensive experiments on three remote sensing benchmarks show SPDDA outperforms state-of-the-art methods for hyperspectral image domain generalization.

Conclusion: SPDDA effectively addresses the realism-diversity tradeoff in hyperspectral image augmentation by explicitly modeling spectral properties and employing spatial-spectral co-optimization, leading to improved domain generalization performance.

Abstract: While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness under the condition of single-source domain training data availability. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.

[222] Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou

Main category: cs.CV

TL;DR: Kestrel is a training-free framework that mitigates hallucinations in large vision-language models by combining explicit visual grounding with evidence-verified self-refinement, improving performance on hallucination benchmarks while providing transparent verification traces.

DetailsMotivation: Large vision-language models (LVLMs) suffer from hallucinations that limit their deployment, and training-free methods for mitigation offer cost-effective solutions but existing approaches have limited gains and weak interpretability.

Method: Kestrel uses an explicit visual-grounding agent to collect visual evidence and convert tool outputs into structured textual evidence, then employs an LVLM judge for evidence verification and iterative self-refinement of answers to prevent over-correction.

Result: Kestrel improves performance over strong baselines across hallucination benchmarks (average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), with both self-refinement and grounding agent contributing significant gains.

Conclusion: Kestrel provides an effective training-free solution for LVLM hallucination mitigation that offers transparent verification traces for diagnosis and analysis while achieving substantial performance improvements.

Abstract: Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis – e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.

[223] Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao

Main category: cs.CV

TL;DR: Fast-WAM challenges the need for explicit future imagination at test time in World Action Models, showing that video modeling during training alone can achieve competitive performance with much faster inference.

DetailsMotivation: Current World Action Models (WAMs) use an imagine-then-execute paradigm requiring iterative video denoising at test time, causing substantial latency. The authors question whether explicit future imagination is actually necessary for strong action performance or if the benefit comes primarily from video modeling during training.

Method: Proposes Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. Creates several Fast-WAM variants to enable controlled comparison of training vs. inference factors, disentangling the role of video modeling during training from explicit future generation during inference.

Result: Fast-WAM achieves competitive results with state-of-the-art methods on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. Removing video co-training causes much larger performance drops than skipping future prediction.

Conclusion: The main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Fast-WAM demonstrates that explicit future imagination at inference time is not essential for strong action performance.

Abstract: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing \textbf{Fast-WAM}, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/

[224] $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu, Xinlei Chen

Main category: cs.CV

TL;DR: x²-Fusion: A multimodal fusion framework that uses event camera data as a spatiotemporal edge field to create a unified latent representation (Event Edge Space) for jointly estimating 2D optical flow and 3D scene flow from images, LiDAR, and event data.

DetailsMotivation: Current multimodal approaches for 2D optical flow and 3D scene flow estimation operate in separate heterogeneous feature spaces, requiring multiple modality-specific blocks that leave cross-sensor mismatches unresolved and make fusion unnecessarily complex. There's a need for a shared latent space that all modalities can align to for more effective fusion.

Method: Uses event camera data as an intrinsic edge field to create Event Edge Space - a unified latent representation. Image and LiDAR features are explicitly aligned in this shared representation. Performs reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. Employs cross-dimension contrast learning to couple 2D optical flow with 3D scene flow.

Result: Extensive experiments on both synthetic and real benchmarks show that x²-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

Conclusion: The Event Edge Space provides an effective unified representation for multimodal fusion, enabling better alignment of heterogeneous sensor data and improved joint estimation of 2D optical flow and 3D scene flow, particularly in challenging conditions.

Abstract: Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex.Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation.Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

[225] HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture

Aojie Yuan

Main category: cs.CV

TL;DR: HMAR is a hierarchical modality-aware medical image retrieval framework using Mixture-of-Experts architecture with global and local experts for both holistic and fine-grained lesion-region retrieval.

DetailsMotivation: Existing medical image retrieval systems have three key limitations: uniform feature encoding that ignores varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and exclusive focus on global image similarity that cannot meet clinical demand for fine-grained region-specific retrieval.

Method: HMAR uses a Mixture-of-Experts architecture with dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for lesion-region retrieval. It employs two-stage contrastive learning without bounding-box annotations, sliding-window matching for dense local comparison, and generates hash codes via Kolmogorov-Arnold Network layers for efficient Hamming-distance search.

Result: On RadioImageNet-CT dataset (16 clinical patterns, 29,903 images), HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over state-of-the-art ACIR method by 0.7% and 1.1% respectively.

Conclusion: HMAR effectively addresses limitations of existing medical image retrieval systems by providing both global and fine-grained local retrieval capabilities through its hierarchical expert architecture, demonstrating superior performance on clinical datasets.

Abstract: Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.

Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

Main category: cs.CV

TL;DR: Search2Motion is a training-free framework for object-level motion editing in image-to-video generation that uses target-frame-based control with first-last-frame motion priors for object relocation while maintaining scene stability.

DetailsMotivation: Existing methods for object motion editing require trajectories, bounding boxes, masks, or motion fields, which can be cumbersome. The paper aims to enable intuitive object relocation in videos without fine-tuning while preserving scene stability.

Method: Uses target-frame-based control with semantic-guided object insertion and robust background inpainting for reliable target-frame construction. Leverages early-step self-attention maps to predict object and camera dynamics, and introduces ACE-Seed (Attention Consensus for Early-step Seed selection) for improved motion fidelity.

Result: Search2Motion outperforms baselines on FLF2V-obj and VBench metrics. Introduces new benchmarks S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, and FLF2V-obj metrics that isolate object artifacts without ground-truth trajectories.

Conclusion: Search2Motion provides an effective training-free framework for object-level motion editing in videos, offering intuitive control, interpretable feedback, and improved motion fidelity without requiring fine-tuning or complex annotations.

Abstract: We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

[227] Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran

Main category: cs.CV

TL;DR: Real-time multi-agent affective computing framework for classroom emotion monitoring using IoT devices, achieving 88% accuracy in engagement classification at 25 FPS for up to 50 faces.

DetailsMotivation: Address challenges of large classroom sizes and limited teacher-student interaction by creating scalable, data-driven tools for real-time emotional state monitoring to enhance learning outcomes.

Method: High-throughput, real-time multi-agent affective computing framework tailored for IoT devices, addressing load balancing and latency through efficient processing. Uses Classroom Emotion Dataset (1,500 labeled images + 300 classroom detection videos) and evaluated across three educational institutions.

Result: System detects up to 50 faces at 25 FPS with 88% overall accuracy in classifying classroom engagement states. Positive feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation.

Conclusion: Establishes practical IoT-based framework for emotion-aware learning environments and introduces Classroom Emotion Dataset for further validation and research in affective computing for education.

Abstract: This study presents high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students’ emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the ‘Classroom Emotion Dataset’ to facilitate further validation and research.

[228] World Reconstruction From Inconsistent Views

Lukas Höllein, Matthias Nießner

Main category: cs.CV

TL;DR: A method to convert inconsistent video frames from diffusion models into 3D-consistent point clouds and high-quality 3D environments using non-rigid alignment and inverse deformation rendering.

DetailsMotivation: Video diffusion models generate high-quality but 3D-inconsistent frames, making 3D reconstruction difficult. The paper aims to address these inconsistencies to create explorable 3D environments from video models.

Method: 1) Use geometric foundation model to lift frames to pixel-wise 3D point clouds; 2) Apply non-rigid iterative frame-to-model ICP for initial alignment; 3) Global optimization to sharpen point clouds; 4) Use point clouds as initialization for 3D reconstruction with novel inverse deformation rendering loss.

Result: The method produces higher quality 3D scenes than baselines, effectively turning video diffusion models into 3D-consistent world generators with sharp, detailed point cloud reconstructions.

Conclusion: The proposed approach successfully addresses 3D inconsistencies in video diffusion model outputs, enabling the creation of high-quality, explorable 3D environments from inconsistent video frames.

Abstract: Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.

[229] When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan, Zhenzhen Liu, Katie Z Luo, Mark Campbell, Bharath Hariharan, Wei-Lun Chao

Main category: cs.CV

TL;DR: Infrastructure-taught label-free 3D perception where roadside units (RSUs) act as stationary teachers to provide pseudo-labels for training ego vehicle detectors without manual annotation.

DetailsMotivation: Manual annotation for 3D perception in self-driving is impractical at scale across diverse regions. Modern cities have roadside units (RSUs) that could potentially provide supervisory signals to vehicles.

Method: Three-stage pipeline: RSUs learn local 3D detectors from unlabeled data using fixed viewpoints, broadcast predictions to passing vehicles, which aggregate them as pseudo-labels to train standalone ego detectors. Test-time requires no infrastructure.

Result: Achieves 82.3% AP for vehicle detection using CenterPoint (vs 94.4% supervised upper bound). Pipeline is scalable and complementary to existing ego-centric label-free methods.

Conclusion: City infrastructure can provide scalable supervisory signals for autonomous vehicles, offering a promising orthogonal paradigm to reduce annotation costs in 3D perception.

Abstract: Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

[230] IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu

Main category: cs.CV

TL;DR: IOSVLM: A 3D vision-language model for unified dental diagnosis and VQA on intraoral scans using point cloud representation and LLM integration

DetailsMotivation: 3D intraoral scans (IOS) are increasingly used in dentistry but current VLMs only use 2D/multi-view images, not native 3D geometry. Challenges include heterogeneous scan forms, complex topology, multi-disease co-occurrence with class imbalance, and limited 3D-text data.

Method: End-to-end 3D VLM with point cloud representation, 3D encoder-projector-LLM design. Introduces geometry-to-chromatic proxy to bridge color-free IOS data and color-dependent 3D pre-training. Uses two-stage curriculum training strategy. Also creates IOSVQA dataset with 19,002 cases and 249,055 VQA pairs.

Result: IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, demonstrating effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

Conclusion: Direct 3D geometry modeling is effective for IOS-based diagnosis, and the proposed IOSVLM framework with geometry-to-chromatic proxy and curriculum training addresses key challenges in 3D dental VLM development.

Abstract: 3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

[231] Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Chenggong Hu, Yi Wang, Mengqi Xue, Haofei Zhang, Jie Song, Li Sun

Main category: cs.CV

TL;DR: SLDDM-TPG is a two-stage method for textile pattern generation that uses latent disentanglement and semi-supervised diffusion to preserve fine-grained details in clothing images.

DetailsMotivation: Existing image-to-image models fail to preserve fine-grained textile pattern details due to feature confusion between complex patterns and non-rigid texture distortions in clothing images.

Method: Two-stage approach: 1) Latent disentangled network (LDN) resolves feature confusion and constructs independent clothing feature space; 2) Semi-supervised latent diffusion model (S-LDM) receives guidance from LDN and uses fine-grained alignment strategy for faithful generation.

Result: Reduces FID by 4.1 and improves SSIM by up to 0.116 on CTP-HD dataset, with good generalization on VITON-HD dataset.

Conclusion: SLDDM-TPG effectively addresses feature confusion in textile pattern generation and produces faithful, high-fidelity results through disentangled representation learning and semi-supervised diffusion.

Abstract: Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.

[232] SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport

Sreekar Chigurupati, Eleftherios Garyfallidis

Main category: cs.CV

TL;DR: SuCor uses optimal transport theory to correct geometric distortions in EPI MRI scans by modeling distortion fields as Wasserstein-2 barycentric displacements between reversed phase encoding volumes.

DetailsMotivation: Echo planar imaging (EPI) suffers from susceptibility-induced geometric distortions that degrade image quality and registration accuracy, particularly in diffusion MRI and functional MRI applications. Existing correction methods like FSL TOPUP have limitations in accuracy and require manual parameter tuning.

Method: SuCor models each column of the distortion field as a Wasserstein-2 barycentric displacement between opposing-polarity intensity profiles from reversed phase encoding EPI volumes. It uses optimal transport theory to find the optimal displacement field, with regularization in the spectral domain using a bending-energy penalty. The regularization strength is automatically selected via the Morozov discrepancy principle, eliminating manual tuning.

Result: On HCP dataset with left-right/right-left b0 EPI pairs and co-registered T1 structural reference, SuCor achieves mean volumetric mutual information of 0.341 with T1 image, compared to 0.317 for FSL TOPUP. It runs in approximately 12 seconds on a single CPU core.

Conclusion: SuCor provides an efficient, accurate, and automatic method for EPI distortion correction using optimal transport theory, outperforming existing methods like FSL TOPUP in both accuracy and computational efficiency.

Abstract: We present SuCor, a method for correcting susceptibility induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a human connectome project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.

[233] V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal

Main category: cs.CV

TL;DR: V-Co presents a systematic study of visual co-denoising in pixel-space diffusion models, identifying four key ingredients for effective representation-aligned generation.

DetailsMotivation: Pixel-space diffusion models lack strong semantic supervision and aren't explicitly designed to capture high-level visual structure. While representation-alignment methods like REPA show promise, existing co-denoising approaches entangle multiple design choices, making it unclear which components are essential.

Method: V-Co uses a unified JiT-based framework to systematically study visual co-denoising. The approach isolates key ingredients through controlled experiments, leading to a fully dual-stream architecture, structurally defined unconditional prediction for CFG, perceptual-drifting hybrid loss, and RMS-based feature rescaling for cross-stream calibration.

Result: On ImageNet-256, V-Co outperforms underlying pixel-space diffusion baselines and strong prior pixel-diffusion methods at comparable model sizes while using fewer training epochs.

Conclusion: The study provides a simple recipe for effective visual co-denoising with four key ingredients, offering practical guidance for future representation-aligned generative models in pixel-space diffusion.

Abstract: Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.

[234] Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions

Jinsheng Wei, Xiguang Zhang, Zheng Shi, Guanming Lu

Main category: cs.CV

TL;DR: A novel apexframe-based paradigm for recognizing true emotions from masked expressions, using a dual stream independence decoupling framework to separate true and disguised emotion features.

DetailsMotivation: Existing methods recognize true emotions from masked-expression clips containing onset frames that leak true emotional information before stable disguise state is reached. This doesn't reflect actual disguised states, requiring a new paradigm that works with stable disguised expressions.

Method: Proposes apexframe-based paradigm using frames with stable disguised state. Introduces dual stream independence decoupling framework with two classification losses for true and disguised emotion features, plus Hilbert-Schmidt Independence loss to enhance feature independence.

Result: Experiments show the apexframe-based paradigm is challenging but the proposed decoupling framework improves recognition performance for true emotions from masked expressions.

Conclusion: The apexframe paradigm better reflects actual disguised states, and the decoupling framework effectively separates true and disguised emotion features, improving true emotion recognition from masked expressions.

Abstract: Recongnizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onsetframes just starting to disguise. However, this paradigm may not reflect the actual disguised state, as the onsetframe leaks the true emotional information without reaching a stable disguise state. Thus, this paper introduces a novel apexframe-based paradigm that classifies true emotions from the apexframe with a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions on true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of two features. Experiments demonstrate that the apexframe-based paradigm is challenging, and the proposed decouple framework improves recogntion performances.

[235] GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Lei Zhang

Main category: cs.CV

TL;DR: GDPO integrates reinforcement learning into one-step generative image super-resolution using a noise-aware diffusion model with unequal-timestep strategy and group-relative advantage optimization.

DetailsMotivation: Current RL methods for generative image super-resolution focus on multi-step approaches and have limitations: DPO requires offline sample pairs (limited samples), GRPO only considers global image likelihood (ignores local details), and one-step generative ISR remains underexplored due to limited stochasticity.

Method: 1) Noise-aware one-step diffusion model with unequal-timestep strategy to decouple noise addition from diffusion timesteps; 2) Group Direct Preference Optimization (GDPO) that integrates GRPO principles into DPO to calculate group-relative advantage for online samples; 3) Attribute-aware reward function that dynamically evaluates samples based on smooth and texture area statistics.

Result: Experiments demonstrate GDPO’s effectiveness in enhancing performance of one-step generative ISR models.

Conclusion: GDPO successfully integrates RL into one-step generative ISR training, addressing limitations of existing methods through novel noise-aware modeling and group-relative optimization strategies.

Abstract: Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.

[236] WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong, Xinyu Hou, Amir Patel, Andrew Markham

Main category: cs.CV

TL;DR: WildDepth: A multimodal dataset with synchronized RGB and LiDAR for animal depth estimation, behavior detection, and 3D reconstruction across domestic and wild environments.

DetailsMotivation: Existing animal depth estimation models lack metric scale validation due to datasets without ground truth depth. The paper addresses this limitation by creating a multimodal dataset with synchronized RGB and LiDAR for reliable depth estimation and 3D reconstruction of animals.

Method: Presents WildDepth dataset with synchronized RGB and LiDAR data across diverse animal categories in domestic and wild environments. Uses multimodal fusion techniques combining RGB and LiDAR data for improved depth estimation and 3D reconstruction.

Result: Multimodal data improves depth reliability by up to 10% RMSE, and RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. The dataset enables robust multimodal perception systems.

Conclusion: WildDepth addresses the metric scale limitation in animal depth estimation by providing synchronized RGB-LiDAR data, enabling more reliable multimodal perception systems that generalize across domains.

Abstract: Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for the animal, in particular, the majority of existing models are trained based on datasets without metric scale, which can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.

[237] Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

Sourya Saha, Saptarshi Debroy

Main category: cs.CV

TL;DR: Battery-aware execution management framework for edge-assisted XR systems using deep reinforcement learning to optimize latency-energy trade-offs

DetailsMotivation: XR applications require real-time responsiveness on energy-constrained devices, but existing approaches don't fully capture the interaction between latency requirements and battery lifetime in closed-loop XR workloads

Method: Online decision mechanism based on lightweight deep reinforcement learning policy that continuously adapts execution decisions (placement, workload quality, latency requirements, battery dynamics) under dynamic network conditions

Result: Extends projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable networks, and not below 80% even under limited bandwidth

Conclusion: Explicitly managing latency-energy trade-offs is effective for immersive XR systems, demonstrating the value of battery-aware execution management in edge-assisted architectures

Abstract: Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.

[238] An assessment of data-centric methods for label noise identification in remote sensing data sets

Felix Kröber, Genc Hoxha, Ribana Roscher

Main category: cs.CV

TL;DR: Systematic analysis of data-centric label noise methods for remote sensing data, evaluating three methods under different noise types and levels to improve model generalizability.

DetailsMotivation: Label noise (incorrect labels) severely limits deep learning model generalizability, but automated treatment in remote sensing has received little attention. There's a lack of systematic analysis of data-centric methods that both cope with and identify noisy labels in remote sensing data.

Method: Examines three data-centric label noise methods, injects different types of label noise (10-70% noise levels) into two benchmark remote sensing datasets, analyzes how well methods filter noise and affect task performance.

Result: Proves the value of data-centric methods for both label noise identification and task performance improvements. Provides insights into which method is best depending on setting and objective, and identifies areas needing further research for transfer to remote sensing.

Conclusion: The work bridges methodological establishment of data-centric label noise methods with practical usage in remote sensing, showing clear benefits of these approaches for handling noisy labels in the domain.

Abstract: Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performances. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.

[239] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty

Main category: cs.CV

TL;DR: Vision transformers exhibit positional biases that hinder zero-shot adaptation in material science applications, but finetuning with ALiBi relative positional encoding reduces these biases while preserving semantic features.

DetailsMotivation: Vision transformers like DINOv2 learn rich representations but suffer from positional biases due to architectural choices like positional encoding, making zero-shot adaptation difficult in material science where images of homogeneous microstructures have no preferred direction.

Method: Investigate positional bias in ViTs via linear probing across various objectives and positional encodings, then reduce bias by finetuning models to use ALiBi relative positional encoding.

Result: Models retain desirable general semantics and their unbiased features can be successfully used in trainable segmentation of complex microscopy images.

Conclusion: ALiBi relative positional encoding effectively reduces positional biases in vision transformers while maintaining semantic quality, enabling better performance in material science applications.

Abstract: Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.

[240] SOMA: Unifying Parametric Human Body Models

Jun Saito, Jiefeng Li, Michael de Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz, Simon Yuen, Umar Iqbal

Main category: cs.CV

TL;DR: SOMA is a unified body layer that bridges heterogeneous parametric human body models (SMPL, SMPL-X, MHR, Anny) through mesh topology, skeletal, and pose abstractions, enabling mixing of identity sources and pose data without custom retargeting.

DetailsMotivation: Existing parametric human body models (SMPL, SMPL-X, MHR, Anny) are mutually incompatible with different mesh topologies, skeletal structures, shape parameterizations, and unit conventions, making it impractical to use their complementary strengths in a single pipeline.

Method: Three abstraction layers: 1) Mesh topology abstraction maps any source model’s identity to a shared canonical mesh, 2) Skeletal abstraction recovers identity-adapted joint transforms from any body shape in a single closed-form pass, 3) Pose abstraction inverts skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model.

Result: Reduces the O(M²) per-pair adapter problem to O(M) single-backend connectors, enabling practitioners to freely mix identity sources and pose data at inference time. The pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.

Conclusion: SOMA provides a unified framework that bridges heterogeneous human body models, enabling interoperability and practical mixing of different model strengths without custom retargeting or iterative optimization.

Abstract: Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model’s identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the $O(M^2)$ per-pair adapter problem to $O(M)$ single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.

[241] M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai

Main category: cs.CV

TL;DR: M^3 integrates a multi-view foundation model with a matching head for dense correspondences into monocular Gaussian splatting SLAM, achieving state-of-the-art pose estimation and reconstruction accuracy.

DetailsMotivation: Streaming reconstruction from uncalibrated monocular video requires high-precision pose estimation and efficient online refinement. Current multi-view foundation models lack the precision needed for rigorous geometric optimization in SLAM frameworks.

Method: Augments multi-view foundation model with dedicated matching head for fine-grained dense correspondences, integrates into robust monocular Gaussian splatting SLAM, adds dynamic area suppression and cross-inference intrinsic alignment for tracking stability.

Result: State-of-the-art accuracy on diverse indoor/outdoor benchmarks; reduces ATE RMSE by 64.3% vs VGGT-SLAM 2.0; outperforms ARTDECO by 2.11 dB PSNR on ScanNet++.

Conclusion: M^3 successfully addresses precision bottleneck in coupling 3D foundation models with SLAM, enabling high-quality streaming reconstruction from monocular video through improved dense correspondences and tracking stability.

Abstract: Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.

[242] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu

Main category: cs.CV

TL;DR: SparkVSR: Interactive video super-resolution framework using sparse keyframes as control signals for controllable, high-quality video restoration

DetailsMotivation: Most VSR approaches are black boxes with no user control over artifacts. Need interactive framework allowing users to guide restoration using keyframes.

Method: Keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents. Supports flexible keyframe selection and reference-free guidance mechanism.

Result: Outperforms baselines by up to 24.6% on CLIP-IQA, 21.8% on DOVER, and 5.6% on MUSIQ. Enables controllable VSR and generalizes to tasks like old-film restoration and video style transfer.

Conclusion: SparkVSR provides effective interactive control for VSR through sparse keyframes, improving temporal consistency and restoration quality while being applicable to various video processing tasks.

Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/

[243] MessyKitchens: Contact-rich object-level 3D scene reconstruction

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev

Main category: cs.CV

TL;DR: A new dataset (MessyKitchens) and method for monocular 3D scene reconstruction with object-level decomposition, focusing on physically-plausible reconstruction with accurate object contacts and non-penetration.

DetailsMotivation: Current monocular 3D reconstruction methods struggle with decomposing scenes into individual 3D objects due to object variety, occlusions, and complex relations. Applications in robotics and animation require physically-plausible reconstructions where objects obey physical principles like non-penetration and realistic contacts.

Method: Two main contributions: 1) MessyKitchens dataset with real-world cluttered scenes providing high-fidelity object-level ground truth (3D shapes, poses, accurate contacts), 2) Extension of SAM 3D approach with Multi-Object Decoder (MOD) for joint object-level scene reconstruction.

Result: MessyKitchens significantly improves previous datasets in registration accuracy and inter-object penetration. MOD demonstrates consistent and significant improvements over state-of-the-art on three datasets.

Conclusion: The work advances object-level scene reconstruction through a new benchmark dataset and improved reconstruction method, enabling more physically-plausible 3D scene understanding from single images.

Abstract: Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduceMessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

[244] Demystifing Video Reasoning

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Main category: cs.CV

TL;DR: Video diffusion models exhibit reasoning capabilities primarily through denoising steps (Chain-of-Steps) rather than sequential frame processing (Chain-of-Frames), with emergent behaviors like working memory, self-correction, and perception-action separation.

DetailsMotivation: To understand the underlying mechanisms of reasoning in video generation models, challenging the assumption that reasoning occurs sequentially across frames (Chain-of-Frames) and investigating how reasoning actually emerges in diffusion-based video models.

Method: Qualitative analysis and targeted probing experiments to examine reasoning patterns, identification of emergent behaviors, analysis of functional specialization within Diffusion Transformers, and development of a training-free ensembling strategy using different random seeds.

Result: Discovered that reasoning primarily emerges along diffusion denoising steps (Chain-of-Steps), where models explore multiple solutions early and converge later. Identified critical emergent behaviors: working memory, self-correction, and perception-before-action. Found functional specialization in Diffusion Transformers with early layers encoding perception, middle layers executing reasoning, and later layers consolidating representations.

Conclusion: Video generation models exhibit reasoning through denoising steps rather than frame sequences, with emergent behaviors and functional specialization that can be leveraged for improved reasoning. This understanding provides a foundation for exploiting video models as a substrate for intelligence.

Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

[245] SegviGen: Repurposing 3D Generative Model for Part Segmentation

Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

Main category: cs.CV

TL;DR: SegviGen repurposes pretrained 3D generative models for 3D part segmentation using distinctive part colorization, achieving state-of-the-art performance with minimal labeled data.

DetailsMotivation: Existing 3D segmentation methods either rely on 2D priors with cross-view inconsistency issues or require large-scale annotated 3D data and substantial training resources. There's a need for efficient 3D segmentation that leverages structured priors from pretrained models.

Method: SegviGen encodes 3D assets and predicts part-indicative colors on active voxels of geometry-aligned reconstructions. It leverages structured priors from pretrained 3D generative models to induce segmentation through distinctive part colorization, supporting interactive segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework.

Result: SegviGen improves over prior state-of-the-art by 40% on interactive part segmentation and 15% on full segmentation, while using only 0.32% of labeled training data. Demonstrates effective transfer of pretrained 3D generative priors to segmentation tasks.

Conclusion: Pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. The framework establishes a novel and efficient approach for 3D segmentation tasks.

Abstract: We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.

[246] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou

Main category: cs.CV

TL;DR: A video diffusion transformer approach for interactive gaming world models that uses camera pose as geometric representation for precise action control and long-horizon 3D consistency.

DetailsMotivation: Existing video diffusion approaches for interactive gaming world models struggle with precise action control and long-horizon 3D consistency, as they treat actions as abstract conditioning signals rather than leveraging the geometric coupling between actions and the 3D world.

Method: Uses camera pose as unifying geometric representation: 1) Defines physics-based continuous action space and represents user inputs in Lie algebra to derive precise 6-DoF camera poses, injected via camera embedder; 2) Uses global camera poses as spatial indices to retrieve relevant past observations for geometrically consistent revisiting during long-horizon navigation.

Result: The approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency, validated through extensive experiments.

Conclusion: Camera pose serves as an effective geometric representation for grounding both immediate action control and long-term 3D consistency in interactive gaming world models, enabling more realistic and controllable generated environments.

Abstract: Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.

[247] High-Quality Facial Geometry and Appearance Capture at Home

Yuxuan Han, Junfeng Lyu, Feng Xu

Main category: cs.CV

TL;DR: Single smartphone flashlight capture for complete 3D facial reconstruction including skin, mouth, hair, and eyes using hybrid representation and lighting modeling

DetailsMotivation: Existing facial capture methods require studio setups and focus only on facial skin, lacking convenience and completeness for daily usage

Method: Uses co-located smartphone flashlight sequence in dim rooms, hybrid representation for eyes and facial regions, combined lighting model, and morphable face albedo prior

Result: High-quality 3D relightable facial scans captured from single smartphone sequences, demonstrating complete facial reconstruction

Conclusion: Proposed method enables convenient, high-quality complete facial capture using everyday smartphone hardware

Abstract: Facial geometry and appearance capture have demonstrated tremendous success in 3D scanning real humans in studios. Recent works propose to democratize this technique while keeping the results high quality. However, they are still inconvenient for daily usage. In addition, they focus on an easier problem of only capturing facial skin. This paper proposes a novel method for high-quality face capture, featuring an easy-to-use system and the capability to model the complete face with skin, mouth interior, hair, and eyes. We reconstruct facial geometry and appearance from a single co-located smartphone flashlight sequence captured in a dim room where the flashlight is the dominant light source (e.g. rooms with curtains or at night). To model the complete face, we propose a novel hybrid representation to effectively model both eyes and other facial regions, along with novel techniques to learn it from images. We apply a combined lighting model to compactly represent real illuminations and exploit a morphable face albedo model as a reflectance prior to disentangle diffuse and specular. Experiments show that our method can capture high-quality 3D relightable scans.

[248] Label-supervised surgical instrument segmentation using temporal equivariance and semantic continuity

Qiyuan Wang, Yanzhe Liu, Shang Zhao, Rong Liu, S. Kevin Zhou

Main category: cs.CV

TL;DR: A weakly supervised surgical instrument segmentation method using only instrument presence labels, leveraging temporal properties in surgical videos through temporal equivariance constraints, class-aware semantic continuity, and temporal-enhanced pseudo masks.

DetailsMotivation: Surgical instrument segmentation typically requires expensive manual annotations. While instrument presence labels are often recorded with video streams, weakly supervised segmentation with only these labels is rarely explored due to under-constrained challenges. Temporal properties can enhance representation learning even with incomplete supervision.

Method: Extends a two-stage weakly supervised segmentation paradigm with temporal considerations: 1) Temporal equivariance constraint for pixel-wise consistency between adjacent features, 2) Class-aware semantic continuity between global and local regions across time, 3) Temporal-enhanced pseudo masks from consecutive frames to suppress irrelevant regions.

Result: Extensive experiments on two surgical video datasets (cholecystectomy surgery benchmark and real robotic left lateral segment liver surgery dataset) show promising performance, achieving comparable or favorable results with previous state-of-the-art approaches.

Conclusion: The method effectively leverages temporal properties in surgical videos for weakly supervised instrument segmentation using only presence labels, demonstrating the value of temporal consistency and continuity constraints in overcoming annotation limitations.

Abstract: For robotic surgical videos, instrument presence annotations are typically recorded with video streams, which offering the potential to reduce the manually annotated costs for segmentation. However, weakly supervised surgical instrument segmentation with only instrument presence labels has been rarely explored in surgical domain due to the highly under-constrained challenges. Temporal properties can enhance representation learning by capturing sequential dependencies and patterns over time even in incomplete supervision situations. From this, we take the inherent temporal attributes of surgical video into account and extend a two-stage weakly supervised segmentation paradigm from different perspectives. Firstly, we make temporal equivariance constraint to enhance pixel-wise temporal consistency between adjacent features. Secondly, we constrain class-aware semantic continuity between global and local regions across temporal dimension. Finally, we generate temporal-enhanced pseudo masks from consecutive frames to suppress irrelevant regions. Extensive experiments are validated on two surgical video datasets, including one cholecystectomy surgery benchmark and one real robotic left lateral segment liver surgery dataset. We annotate instance-wise instrument labels with fixed time-steps which are double checked by a clinician with 3-years experience to evaluate segmentation results. Experimental results demonstrate the promising performances of our method, which consistently achieves comparable or favorable results with previous state-of-the-art approaches.

[249] Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Solution

Zhangyong Tang, Tianyang Xu, Zhenhua Feng, Xuefeng Zhu, Chunyang Cheng, Xiao-Jun Wu, Josef Kittler

Main category: cs.CV

TL;DR: MV-RGBT benchmark for RGBT tracking in multi-modal warranting scenarios where either RGB or thermal modality is invalid, plus MoETrack solution with mixture of experts

DetailsMotivation: Existing RGBT tracking benchmarks lack representation of severe imaging conditions where one modality fails, limiting robustness in real-world multi-modal warranting scenarios like nighttime and adverse weather.

Method: Created MV-RGBT benchmark with 36 object categories across 19 scenes, divided by valid modality. Proposed MoETrack with mixture of experts where each expert generates independent tracking results with confidence scores.

Result: MV-RGBT reveals fusion isn’t always beneficial in severe conditions. MoETrack achieves state-of-the-art results on MV-RGBT, GTOT, and LasHeR benchmarks.

Conclusion: MV-RGBT advances RGBT tracking by addressing modality validity in severe conditions, and MoETrack’s expert-based approach effectively handles ‘when to fuse’ problem.

Abstract: RGBT tracking draws increasing attention because its robustness in multi-modal warranting (MMW) scenarios, such as nighttime and adverse weather conditions, where relying on a single sensing modality fails to ensure stable tracking results. However, existing benchmarks predominantly contain videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This weakens the representativeness of existing benchmarks in severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark considering the modality validity, MV-RGBT, captured specifically from MMW scenarios where either RGB (extreme illumination) or TIR (thermal truncation) modality is invalid. Hence, it is further divided into two subsets according to the valid modality, offering a new compositional perspective for evaluation and providing valuable insights for future designs. Moreover, MV-RGBT is the most diverse benchmark of its kind, featuring 36 different object categories captured across 19 distinct scenes. Furthermore, considering severe imaging conditions in MMW scenarios, a new problem is posed in RGBT tracking, named `when to fuse’, to stimulate the development of fusion strategies for such scenarios. To facilitate its discussion, we propose a new solution with a mixture of experts, named MoETrack, where each expert generates independent tracking results along with a confidence score. Extensive results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and elicit the conclusion that fusion is not always beneficial, especially in MMW scenarios. Besides, MoETrack achieves state-of-the-art results on several benchmarks, including MV-RGBT, GTOT, and LasHeR. Github: https://github.com/Zhangyong-Tang/MVRGBT.

[250] From Geometric Mimicry to Comprehensive Generation: A Context-Informed Multimodal Diffusion Model for Urban Morphology Synthesis

Fangshuo Zhou, Huaxia Li, Liuchang Xu, Rui Hu, Sensen Wu, Liang Xu, Hailin Feng, Zhenhong Du

Main category: cs.CV

TL;DR: ControlCity: A diffusion model for urban morphology generation using multimodal fusion of images, text, metadata, and building footprints to create realistic urban layouts with geographical context.

DetailsMotivation: Current urban morphology simulation methods oversimplify generation as geometric problems without incorporating urban semantics and geographical context, limiting their realism and applicability.

Method: Proposes ControlCity diffusion model using quadruple dataset (image-text-metadata-building footprints) from 22 cities. Enhanced ControlNet encodes spatial constraints from images, while text and metadata provide semantic guidance and geographical context to direct generation.

Result: Achieved significant improvements: FID reduced by 71.01% to 50.94, MIoU improved by 38.46% to 0.36 compared to unimodal baseline. Model shows robust knowledge generalization, cross-city style transfer, and zero-shot generation for unknown cities.

Conclusion: Multimodal fusion enables transition from “geometric mimicry” to “comprehensive generation” of urban morphology, providing novel paradigm for urban research and applications.

Abstract: Urban morphology is fundamental to determining urban functionality and vitality. Prevailing simulation methods, however, often oversimplify morphological generation as a geometric problem, lacking the fusion of urban semantics and geographical context. To address this limitation, this study proposes ControlCity, a diffusion model that achieves comprehensive urban morphology generation through multimodal information fusion. We first constructed a quadruple image-text-metadata-building footprints" dataset from 22 cities worldwide. ControlCity utilizes this information as control conditions. Specifically, an enhanced ControlNet encodes image-based spatial constraints, while text and metadata provide semantic guidance and geographical context to collectively direct the generation. Experimental results demonstrate that compared to the unimodal baseline, this method achieves significant advantages in morphological fidelity. Specifically, FID (lower scores indicate less visual error) was reduced by 71.01%, reaching 50.94, and MIoU (higher scores indicate greater spatial overlap) improved by 38.46%, reaching 0.36. Furthermore, the model demonstrates robust knowledge generalization and controllability, enabling cross-city style transfer and zero-shot generation for unknown cities. Ablation studies reveal the distinct roles of images, text, and metadata in the generation. This study confirms that multimodal fusion is crucial for achieving the transition from geometric mimicry" to ``comprehensive generation," providing a novel paradigm for urban morphology research and applications.

[251] FreeGaussian: Annotation-free Control of Articulated Objects via 3D Gaussian Splats with Flow Derivatives

Qizhi Chen, Delin Qu, Junli Liu, Yiwen Tang, Haoming Song, Dong Wang, Yuan Yuan, Bin Zhao

Main category: cs.CV

TL;DR: FreeGaussian reconstructs controllable 3D Gaussian splats for articulated objects from monocular video without manual annotations, using flow derivatives to disentangle camera and object motion.

DetailsMotivation: Existing methods for reconstructing controllable Gaussian splats for articulated objects require dense masks and manually defined control signals, limiting real-world applications. There's a need for annotation-free methods that can handle the inherently insufficient constraints of monocular video.

Method: Uses flow derivatives to mathematically disentangle camera egomotion and articulated movements. Establishes connection between 2D flows and 3D Gaussian dynamic flow for optimization without control signals. Introduces 3D spherical vector controlling scheme representing state as 3D Gaussian trajectory, eliminating complex 1D control signal calculations.

Result: Extensive experiments on articulated objects demonstrate state-of-the-art visual performance and precise, part-aware controllability.

Conclusion: FreeGaussian enables annotation-free reconstruction of controllable Gaussian splats for articulated objects with superior performance and simplified control modeling.

Abstract: Reconstructing controllable Gaussian splats for articulated objects from monocular video is especially challenging due to its inherently insufficient constraints. Existing methods address this by relying on dense masks and manually defined control signals, limiting their real-world applications. In this paper, we propose an annotation-free method, FreeGaussian, which mathematically disentangles camera egomotion and articulated movements via flow derivatives. By establishing a connection between 2D flows and 3D Gaussian dynamic flow, our method enables optimization and continuity of dynamic Gaussian motions from flow priors without any control signals. Furthermore, we introduce a 3D spherical vector controlling scheme, which represents the state as a 3D Gaussian trajectory, thereby eliminating the need for complex 1D control signal calculations and simplifying controllable Gaussian modeling. Extensive experiments on articulated objects demonstrate the state-of-the-art visual performance and precise, part-aware controllability of our method. Code is available at: https://github.com/Tavish9/freegaussian.

[252] Open-Vocabulary Octree-Graph for 3D Scene Understanding

Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li

Main category: cs.CV

TL;DR: Octree-Graph: A novel scene representation combining adaptive octrees and graph structures for open-vocabulary 3D scene understanding, enabling efficient spatial reasoning and object retrieval.

DetailsMotivation: Existing 3D scene understanding methods using point clouds are inefficient for downstream tasks like path planning and object retrieval due to unordered coordinates, substantial storage requirements, and lack of direct occupancy/spatial relation information.

Method: 1) CGSM strategy and IFA algorithm to obtain 3D instances with semantic features; 2) Adaptive-octree structure storing semantics and depicting object occupancy adjustably; 3) Octree-Graph construction where each adaptive-octree acts as a graph node with edges describing spatial relations.

Result: Extensive experiments on various tasks across multiple datasets demonstrate the versatility and effectiveness of the Octree-Graph method for open-vocabulary 3D scene understanding.

Conclusion: Octree-Graph provides an efficient scene representation that addresses limitations of point cloud approaches, enabling better spatial reasoning and supporting downstream tasks like path planning and text-based object retrieval.

Abstract: Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and text-based object retrieval. To address these issues, we propose \textbf{Octree-Graph}, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method. Code is available \href{https://github.com/yifeisu/OV-Octree-Graph}{here}.

[253] Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

Main category: cs.CV

TL;DR: Generate Any Scene is a data engine that creates systematic scene graphs for generating training data to improve text-to-vision models’ compositional understanding and semantic alignment.

DetailsMotivation: Current text-to-vision models struggle with compositional generalization and semantic alignment due to noisy, weakly compositional datasets. There's a need for scalable solutions to create dense, high-quality annotations for complex scene understanding.

Method: Generate Any Scene systematically enumerates scene graphs from structured taxonomies of objects, attributes, and relations. It translates scene graphs into captions for text-to-image/video generation and visual QA pairs for automatic evaluation. The system enables three approaches: 1) self-improving framework with iterative data generation, 2) distillation from proprietary to open-source models, and 3) reward modeling for semantic alignment.

Result: Stable Diffusion v1.5 achieved 4% improvement over baselines and surpassed fine-tuning on CC3M. Using <800 synthetic captions, it achieved 10% increase in TIFA score. Reward modeling with GRPO surpassed CLIP-based methods by +5% on DPG-Bench. Applied to content moderation for identifying challenging cases.

Conclusion: Generate Any Scene provides a scalable solution for generating high-quality training data that improves text-to-vision models’ compositional understanding and semantic alignment through systematic scene graph enumeration and multiple application frameworks.

Abstract: Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models’ understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpassing fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.

[254] Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks

Enis Baty, Alejandro Hernández Díaz, Rebecca Davidson, Chris Bridges, Simon Hadfield

Main category: cs.CV

TL;DR: M2D-SSM introduces a 2D selective state-space model for vision that natively processes spatial dimensions, outperforming previous 1D SSM adaptations in image classification and downstream tasks.

DetailsMotivation: Existing visual SSMs inherit biases from NLP origins and apply 1D SSMs to images through arbitrary rasterized scanning, which doesn't properly handle 2D spatial structure.

Method: M2D-SSM re-derives selective state-space techniques from the ground up for multidimensional data, using a single 2D scan that factors in both spatial dimensions natively rather than applying 1D SSMs with rasterized scanning.

Result: Achieves 84.0% top-1 accuracy on ImageNet-1K with 27M parameters (M2D-T), surpassing prior SSM-based vision models; M2D-S achieves 85.3%. Strong downstream performance: 52.2 box AP on MS-COCO detection and 51.7 mIoU on ADE20K segmentation.

Conclusion: M2D-SSM demonstrates that properly designed 2D state-space models can achieve state-of-the-art performance in vision tasks while maintaining efficiency, overcoming limitations of 1D SSM adaptations.

Abstract: State-Space Models (SSMs) have emerged as an efficient alternative to transformers, yet existing visual SSMs retain deeply ingrained biases from their origins in natural language processing. In this paper, we address these limitations by introducing M2D-SSM, a ground-up re-derivation of selective state-space techniques for multidimensional data. Unlike prior works that apply 1D SSMs directly to images through arbitrary rasterised scanning, our M2D-SSM employs a single 2D scan that factors in both spatial dimensions natively. On ImageNet-1K classification, M2D-T achieves 84.0% top-1 accuracy with only 27M parameters, surpassing all prior SSM-based vision models at that size. M2D-S further achieves 85.3%, establishing state-of-the-art results among SSM-based architectures. Across downstream tasks, Mamba2D achieves 52.2 box AP on MS-COCO object detection (3$\times$ schedule) and 51.7 mIoU on ADE20K segmentation, demonstrating strong generalisation and efficiency at scale. Source code is available at https://github.com/cocoalex00/Mamba2D.

[255] Fillerbuster: Unified Generative Scene Completion Model for Casual Captures

Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt

Main category: cs.CV

TL;DR: Fillerbuster is a unified 3D scene completion model using multi-view latent diffusion transformer to fill missing regions in sparse captures by generating unknown target views and recovering camera poses.

DetailsMotivation: Casual 3D scene captures are often sparse and miss surrounding content behind objects or above scenes. Existing methods focus on making known pixels look good or creating missing sides from few photos, but real scenarios involve hundreds of input frames needing completion of unobserved areas.

Method: Trains a generative model that consumes large context of input frames while generating unknown target views and recovering image poses when camera parameters are unknown. Uses multi-view latent diffusion transformer for unified inpainting framework that jointly inpaints all inputs.

Result: Shows completion results on two existing datasets and presents an uncalibrated scene completion task where the unified model predicts both poses and creates new content. Framework is open-sourced for integration into popular reconstruction platforms like Nerfstudio or Gsplat.

Conclusion: Presents a flexible, unified inpainting framework that can predict many images and poses together, with potential extension to predict more modalities such as depth, addressing the challenge of completing missing regions in sparse 3D scene captures.

Abstract: We present Fillerbuster, a unified model that completes unknown regions of a 3D scene with a multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for this challenge as they focus on making known pixels look good with sparse-view priors, or on creating missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when camera parameters are unknown. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. We open-source our framework for integration into popular reconstruction platforms like Nerfstudio or Gsplat. We present a flexible, unified inpainting framework to predict many images and poses together, where all inputs are jointly inpainted, and it could be extended to predict more modalities such as depth.

[256] Boosting the Local Invariance for Better Adversarial Transferability

Bohan Liu, Xiaosen Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2503.06140: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.06140&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] Dynamic Memory Transformer for Hyperspectral Image Classification

Muhammad Ahmad

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2504.13242: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13242&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment

Myunsoo Kim, Seongwoong Shim, Byung-Jun Lee

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2505.11192 suggests it’s from May 2025, but no content available for analysis.

DetailsMotivation: Cannot determine motivation without access to the paper content. The arXiv API returned a rate limiting error (HTTP 429), preventing retrieval of the abstract or paper details.

Method: No method information available due to API rate limiting preventing access to the paper content.

Result: No results available as the paper content could not be retrieved from arXiv due to HTTP 429 error.

Conclusion: Unable to analyze the paper due to technical limitations in accessing the content from arXiv’s API.

Abstract: Failed to fetch summary for 2505.11192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[259] SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2505.17018: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17018&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2505.23359: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.23359&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

Chengzhi Xu, Yuyang Wang, Lai Wei, Lichao Sun, Weiran Huang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2506.14837: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14837&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Dequan Kong, Honghua Chen, Zhe Zhu, Mingqiang Wei

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2506.23205: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23205&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[263] Laplace-Beltrami Operator for Gaussian Splatting

Hongyu Zhou, Zorah Lähner

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2502.17531: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.17531&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, Zhiding Yu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2507.13353: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13353&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] Omni Survey for Multimodality Analysis in Visual Object Tracking

Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Hui Li, Shaochuan Zhao, Tao Zhou, Chunyang Cheng, Xiaojun Wu, Josef Kittler

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2508.13000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Yinjie Lei, Changsheng Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2508.13911: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13911&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta

Main category: cs.CV

TL;DR: Paper 2508.16644 summary unavailable due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2508.16644: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16644&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[268] A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Qinqian Lei, Bo Wang, Robby T. Tan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2508.18753 appears to be from August 2025, suggesting it’s a recent multimodal LLM paper.

DetailsMotivation: Cannot determine motivation without access to the paper content. Based on the arXiv ID format (2508.18753), this appears to be a recent paper from August 2025, likely related to multimodal AI given the reader's interests.

Method: Method details unavailable due to HTTP 429 error preventing access to the paper abstract and content.

Result: Results cannot be determined without access to the paper content. The arXiv API rate limiting prevents retrieval of any information about this paper.

Conclusion: Unable to draw conclusions about this specific paper due to technical limitations in accessing the content. The reader may need to access the paper directly on arXiv or wait for rate limits to reset.

Abstract: Failed to fetch summary for 2508.18753: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.18753&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to analyze paper due to technical issues with arXiv API

Abstract: Failed to fetch summary for 2509.03951: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03951&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[270] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal, Heitor R. Medeiros, Marco Pedersoli, Eric Granger

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Unable to determine motivation due to API access limitations

Method: Unable to determine method due to API access limitations

Result: Unable to determine results due to API access limitations

Conclusion: Unable to draw conclusions due to API access limitations

Abstract: Failed to fetch summary for 2510.00458: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00458&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu, Chao Chen, Ge Jin, Chen Feng

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.04282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2510.08316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[273] VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.08398: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08398&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates

Minkwan Kim, Seungmin Lee, Junho Kim, Young Min Kim

Main category: cs.CV

TL;DR: Unable to analyze paper 2510.09881 due to HTTP 429 error when fetching from arXiv API

DetailsMotivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2510.09881: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09881&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] Representing Beauty: Towards a Participatory but Objective Latent Aesthetics

Alexander Michael Rusnak

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable due to server rate limiting

Method: Cannot determine method as paper content is unavailable due to server rate limiting

Result: Cannot determine results as paper content is unavailable due to server rate limiting

Conclusion: Cannot determine conclusion as paper content is unavailable due to server rate limiting

Abstract: Failed to fetch summary for 2510.02869: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02869&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment

Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2510.15398: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15398&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] Unbiased Object Detection Beyond Frequency with Visually Prompted Image Synthesis

Xinhao Cai, Liulei Li, Gensheng Pei, Tao Chen, Jinshan Pan, Yazhou Yao, Wenguan Wang

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.18229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Huichan Seo, Sieun Choi, Minki Hong, Yi Zhou, Junseo Kim, Lukman Ismaila, Naome Etori, Mehul Agarwal, Zhixuan Liu, Jihie Kim, Jean Oh

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2510.20042: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20042&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi

Main category: cs.CV

TL;DR: TAUE is a training-free diffusion framework for layer-wise image generation that enables control over separate image layers without fine-tuning or additional data.

DetailsMotivation: Current text-to-image diffusion models output flattened images, lacking layer-wise control needed for professional applications. Existing solutions require fine-tuning with inaccessible datasets or only generate isolated foreground elements without complete scenes.

Method: TAUE uses noise transplantation and cultivation: embeds global structural information from intermediate denoising latents into initial noise for spatial coherence, and integrates semantic cues through cross-layer attention sharing for contextual consistency across layers.

Result: Achieves state-of-the-art performance among training-free methods, with image quality comparable to fine-tuned models while improving inter-layer consistency. Enables layout-aware editing, multi-object composition, and background replacement.

Conclusion: TAUE enables interactive, layer-separated generation systems for real-world creative workflows without requiring fine-tuning or additional data.

Abstract: Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for layer-wise image generation that requires neither fine-tuning nor additional data. TAUE embeds global structural information from intermediate denoising latents into the initial noise to preserve spatial coherence, and integrates semantic cues through cross-layer attention sharing to maintain contextual and visual consistency across layers. Extensive experiments demonstrate that TAUE achieves state-of-the-art performance among training-free methods, delivering image quality comparable to fine-tuned models while improving inter-layer consistency. Moreover, it enables new applications, such as layout-aware editing, multi-object composition, and background replacement, indicating potential for interactive, layer-separated generation systems in real-world creative workflows.

[280] Exploring the Underwater World Segmentation without Extra Training

Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2511.07923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[281] Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Network

Xuan Yu, Tianyang Xu

Main category: cs.CV

TL;DR: Paper 2511.08628: Unable to fetch summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2511.08628: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08628&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Zhou Tao, Shida Wang, Yongxiang Hua, Haoyu Cao, Linli Xu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2512.12633: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12633&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2511.09117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen

Main category: cs.CV

TL;DR: MSGNav introduces a multimodal 3D scene graph (M3DSG) for zero-shot embodied navigation, preserving visual cues through image-based relational edges instead of text-only representations, with modules for efficient reasoning and open vocabulary support.

DetailsMotivation: Existing zero-shot navigation methods that use explicit 3D scene graphs compress visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. Real-world deployment requires open vocabulary generalization and low training overhead.

Method: Proposes Multi-modal 3D Scene Graph (M3DSG) with image-based relational edges instead of text-only relations. Builds MSGNav system with: 1) Key Subgraph Selection for efficient reasoning, 2) Adaptive Vocabulary Update for open vocabulary support, 3) Closed-Loop Reasoning for accurate exploration, and 4) Visibility-based Viewpoint Decision to solve the “last mile” problem of determining final target location.

Result: Achieves state-of-the-art performance on challenging GOAT-Bench and HM3D-ObjNav benchmarks, demonstrating superior zero-shot navigation capabilities compared to existing methods.

Conclusion: MSGNav effectively addresses limitations of text-only scene graphs by preserving visual evidence through multimodal representations, enabling more robust and generalizable zero-shot embodied navigation with open vocabulary support.

Abstract: Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the last mile problem in zero-shot navigation determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the challenging GOAT-Bench and HM3D-ObjNav benchmark. The code will be publicly available at https://github.com/ylwhxht/MSGNav.

[285] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2511.16555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[286] MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2511.17392: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17392&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2601.10477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles

Tianyang Xu, Jinjie Gu, Xuefeng Zhu, XiaoJun Wu, Josef Kittler

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2511.18344: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18344&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] SineProject: Machine Unlearning for Stable Vision Language Alignment

Arpit Garg, Hemanth Saratchandran, Simon Lucey

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Unable to determine method due to API rate limiting preventing access to paper content

Result: Unable to determine results due to API rate limiting preventing access to paper content

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2511.18444: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18444&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[290] CFM: Language-aligned Concept Foundation Model for Vision

Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer

Main category: cs.CV

TL;DR: CFM is a language-aligned concept foundation model that provides human-interpretable, spatially-grounded concepts for vision tasks, enabling explanations for any downstream task while maintaining competitive performance.

DetailsMotivation: Current vision foundation models have opaque representations that are difficult to interpret. Existing concept decomposition methods provide poor spatial grounding and are limited to image classification tasks.

Method: Proposes CFM, a language-aligned concept foundation model that learns fine-grained, human-interpretable concepts spatially grounded in input images. Uses local co-occurrence dependencies to define concept relationships and improve concept naming.

Result: CFM achieves competitive performance on classification, segmentation, and captioning benchmarks while providing high-quality concept-based explanations. The model offers fine-grained interpretability without sacrificing task performance.

Conclusion: CFM demonstrates that vision foundation models can provide interpretable, spatially-grounded concepts while maintaining strong performance across diverse vision tasks, bridging the gap between performance and interpretability.

Abstract: Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision-making difficult. Recent work decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co-occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides performance on classification, segmentation, and captioning that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code at https://github.com/kawi19/CFM.

[291] 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

Minchong Chen, Xiaoyun Yuan, Junzhe Wan, Jianing Zhang, Jun Zhang

Main category: cs.CV

TL;DR: Unable to analyze paper 2511.19117 due to HTTP 429 error when fetching summary from arXiv API

DetailsMotivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot draw conclusions due to inability to access paper content

Abstract: Failed to fetch summary for 2511.19117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] Rethinking Reward Signals in Video GRPO: When Scores Become Targets

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Cannot analyze method without access to paper content

Result: No results available due to technical limitations in accessing the paper

Conclusion: Cannot draw conclusions about the paper without access to its content

Abstract: Failed to fetch summary for 2511.19356: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19356&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[293] Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

Yixuan Shen, Peng He, Honglu Liu, Jinxuan Fan, Yuyang Ji, Tingting Li, Tianlong Chen, Kaidi Xu, Feng Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2602.18466: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18466&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Hongsheng Li

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2511.22663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[295] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.02697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[296] Order Matters: 3D Shape Generation from Sequential VR Sketches

Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu

Main category: cs.CV

TL;DR: Paper 2512.04761: Unable to fetch abstract due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as abstract retrieval failed due to server rate limiting

Method: Cannot determine method as abstract retrieval failed due to server rate limiting

Result: Cannot determine results as abstract retrieval failed due to server rate limiting

Conclusion: Cannot draw conclusions as abstract retrieval failed due to server rate limiting

Abstract: Failed to fetch summary for 2512.04761: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04761&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[297] A Novel Evolutionary Method for Automated Skull-Face Overlay in Computer-Aided Craniofacial Superimposition

Práxedes Martínez-Moreno, Andrea Valsecchi, Pablo Mesejo, Pilar Navarro-Ramírez, Valentino Lugli, Sergio Damas

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2603.00170 due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed summary fetch

Method: Unable to determine method due to failed summary fetch

Result: Unable to determine results due to failed summary fetch

Conclusion: Unable to determine conclusion due to failed summary fetch

Abstract: Failed to fetch summary for 2603.00170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[298] LeAD-M3D: Leveraging Asymmetric Distillation for Real-Time Monocular 3D Detection

Johannes Meier, Jonathan Michel, Oussema Dhaouadi, Yung-Hsu Yang, Christoph Reich, Zuria Bauer, Stefan Roth, Marc Pollefeys, Jacques Kaiser, Daniel Cremers

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.05663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[299] S2WMamba: A Wavelet-Assisted Mamba-Based Dual-Branch Network For Pansharpening

Haoyu Zhang, Junhan Luo, Yugang Cao, Jie Huang, Liangjian-Deng

Main category: cs.CV

TL;DR: Unable to fetch paper abstract due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to the paper content

Abstract: Failed to fetch summary for 2512.06330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[300] COREA: Coupled Relightable 3D Gaussians and SDFs for Efficient Normal Alignment

Jaeyoon Lee, Hojoon Jung, Sungtae Hwang, Jihyong Oh, Jongwon Choi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to API rate limiting

Result: No results available due to technical issue with arXiv API access

Conclusion: Paper analysis impossible due to HTTP 429 error preventing content retrieval

Abstract: Failed to fetch summary for 2512.07107: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07107&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[301] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2512.10267: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10267&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[302] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description

Nazanin Mahjourian, Vinh Nguyen

Main category: cs.CV

TL;DR: VLM-IRIS adapts vision-language models for zero-shot infrared image understanding by converting thermal images to RGB-compatible magma representations, enabling label-free monitoring in industrial settings.

DetailsMotivation: Manufacturing environments often have low-light conditions where conventional RGB vision systems fail. Infrared cameras work well in these conditions but lack zero-shot learning capabilities. Current vision-language models (VLMs) trained on RGB data cannot understand infrared images, creating a gap for industrial applications.

Method: Proposes VLM-IRIS framework that preprocesses infrared images from FLIR Boson sensors into RGB-compatible inputs using magma color representation. Uses CLIP ViT-B/32 encoder with centroid prompt ensembling for zero-shot predictions without model retraining.

Result: Demonstrates successful zero-shot workpiece presence detection on 3D printer beds using thermal imaging, where temperature differences between build plate and workpieces make the task suitable for infrared sensing.

Conclusion: VLM-IRIS effectively extends vision-language model capabilities to thermal applications, enabling label-free monitoring in industrial settings without requiring model retraining or large labeled datasets.

Abstract: Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.

[303] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Bowen Wen, Shaurya Dewan, Stan Birchfield

Main category: cs.CV

TL;DR: Fast-FoundationStereo bridges the gap between accurate but slow stereo foundation models and fast but less robust efficient architectures, achieving real-time zero-shot stereo matching through knowledge distillation, neural architecture search, and structured pruning.

DetailsMotivation: Stereo foundation models have strong zero-shot generalization but are too slow for real-time applications, while efficient stereo architectures sacrifice robustness for speed and require costly per-domain fine-tuning. There's a need for models that combine the best of both worlds.

Method: Three-component acceleration strategy: (1) knowledge distillation to compress hybrid backbone into efficient student, (2) blockwise neural architecture search for optimal cost filtering under latency constraints, (3) structured pruning for iterative refinement module. Plus automatic pseudo-labeling pipeline for 1.4M in-the-wild stereo pairs.

Result: Model runs over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, establishing new SOTA among real-time stereo methods.

Conclusion: Fast-FoundationStereo achieves strong zero-shot generalization at real-time frame rates for the first time, bridging the accuracy-speed gap in stereo vision.

Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/

[304] A Novel Patch-Based TDA Approach for Computed Tomography Imaging

Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Ahmad Bashir Barekzai, Camila Vilela, Hala Khasawneh, Jayasree Chakraborty, João Miranda, Mohammad Hamghalam, Natalie Gangai, Natally Horvat, Richard K. G. Do, Alice C. Wei, Amber L. Simpson

Main category: cs.CV

TL;DR: A novel patch-based persistent homology approach for 3D CT image analysis that outperforms traditional cubical complex methods and radiomic features in both classification performance and computational efficiency.

DetailsMotivation: Traditional topological data analysis methods for 3D CT images using cubical complex filtration suffer from poor performance and high computational costs with higher resolution images, limiting their practical application in medical imaging.

Method: Developed a patch-based persistent homology construction approach specifically designed for volumetric CT imaging data, which processes images in patches rather than as whole volumes to improve efficiency and performance.

Result: The patch-based TDA approach significantly outperformed both the cubical complex method and radiomic features, achieving average improvements of 7.2% accuracy, 3.6% AUC, 2.7% sensitivity, 8.0% specificity, and 7.2% F1 score across all datasets, while also reducing computational time.

Conclusion: The patch-based persistent homology approach provides a superior alternative to traditional methods for topological feature extraction from 3D CT images, offering both improved classification performance and computational efficiency, with an accompanying Python package (Patch-TDA) for practical implementation.

Abstract: The development of machine learning (ML) models based on computed tomography (CT) imaging has been a major focus due to the promise that imaging holds for diagnosis, staging, and prognostication. These models often rely on the extraction of hand-crafted features, incorporating robust feature engineering improves the performance of these models. Topological data analysis (TDA), based on the mathematical field of algebraic topology, focuses on data from a topological perspective, extracting deeper insight and higher dimensional structures. Persistent homology (PH), a fundamental tool in TDA, extracts topological features such as connected components, cycles, and voids. A popular approach to construct PH from 3D CT images is to utilize the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach is subject to poor performance and high computational cost with higher resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric CT imaging data that improves performance and reduces computational time. This study conducts a series of systematic experiments to comprehensively analyze the performance of the proposed method with various parameters and benchmarks against the 3D cubical complex algorithm and radiomic features. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and computational time. The proposed approach outperformed the cubical complex method and radiomic features, achieving average improvement of 7.2%, 3.6%, 2.7%, 8.0%, and 7.2% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient python package, Patch-TDA, to facilitate the utilization of the proposed approach.

[305] Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao, Akhil Kondepudi, Honglak Lee, Todd C. Hollon

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use alternative methods

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.11141: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11141&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[306] SARMAE: Masked Autoencoder for SAR Representation Learning

Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang, Yilin Cheng, Haonan Guo, Wei Cui, Jing Zhang

Main category: cs.CV

TL;DR: SARMAE: A noise-aware masked autoencoder for self-supervised SAR representation learning using million-scale SAR dataset with optical image pairs and speckle noise injection.

DetailsMotivation: SAR imagery suffers from data scarcity and speckle noise that hampers semantic representation learning, requiring specialized self-supervised approaches for robust feature extraction.

Method: Proposes SARMAE with three components: 1) SAR-1M million-scale dataset with optical image pairs, 2) Speckle-Aware Representation Enhancement (SARE) injecting SAR-specific noise into masked autoencoders, 3) Semantic Anchor Representation Constraint (SARC) using optical priors for feature alignment.

Result: Achieves state-of-the-art performance on multiple SAR datasets for classification, detection, and segmentation tasks, demonstrating effectiveness of noise-aware self-supervised learning.

Conclusion: SARMAE successfully addresses SAR data scarcity and speckle noise challenges through specialized self-supervised learning with noise injection and optical priors, enabling robust SAR representation learning.

Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.

[307] WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2512.11237: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11237&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[308] SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

Mahdi Naseri, Zhou Wang

Main category: cs.CV

TL;DR: SHAMISA is a self-supervised NR-IQA framework that learns from unlabeled distorted images using implicit structural associations and a compositional distortion engine, achieving strong performance without human labels.

DetailsMotivation: Traditional NR-IQA models require expensive human perceptual labels, creating a fundamental bottleneck. The authors aim to develop a self-supervised approach that can learn from unlabeled distorted images by leveraging structured relational supervision instead of contrastive learning.

Method: SHAMISA introduces implicit structural associations - soft, controllable relations that are distortion-aware and content-sensitive. It uses a compositional distortion engine to generate continuous degradations where only one distortion factor varies at a time. Dual-source relation graphs encode both known degradation profiles and emergent structural affinities to guide learning. A convolutional encoder is trained under this supervision and frozen, with quality prediction done by a linear regressor on its features.

Result: Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks show SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.

Conclusion: SHAMISA demonstrates that self-supervised learning with structured relational supervision can effectively address the label bottleneck in NR-IQA, achieving competitive performance without human annotations while improving generalization capabilities.

Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.

[309] On Geometric Understanding and Learned Priors in Feed-forward 3D Reconstruction Models

Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.11508: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11508&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[310] Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss

Lin Xi, Yingliang Ma, Xiahai Zhuang

Main category: cs.CV

TL;DR: Paper 2601.00988: Could not fetch summary due to HTTP 429 error (rate limiting). No abstract available for analysis.

DetailsMotivation: Unable to determine motivation due to lack of accessible paper content. The HTTP 429 error indicates rate limiting from arXiv API.

Method: No method information available. The paper content could not be retrieved due to technical limitations.

Result: No results available. The analysis cannot proceed without access to the paper’s content.

Conclusion: Cannot draw conclusions about an inaccessible paper. The reader should try accessing the paper directly through arXiv or alternative means.

Abstract: Failed to fetch summary for 2601.00988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.00988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[311] Topology-Preserving Data Augmentation for Ring-Type Polygon Annotations

Sudip Laudari, Sang Hun Baek

Main category: cs.CV

TL;DR: Order-preserving polygon augmentation for segmentation that maintains cyclic connectivity in structured domains like architectural floorplans by transforming in mask space and projecting vertices back to restore adjacency relations.

DetailsMotivation: Geometric data augmentation for segmentation typically assumes simply connected polygon regions, but structured domains like architectural floorplan analysis often have ring-type regions encoded as single cyclic polygon chains. Standard clipping operations during augmentation can disrupt cyclic connectivity by removing intermediate vertices, breaking structural relationships between outer and inner boundaries.

Method: Introduces an order-preserving polygon augmentation strategy that performs transformations in mask space first, then projects surviving vertices back into index-space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead.

Result: Experiments demonstrate the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.

Conclusion: The method effectively preserves topological consistency in geometric data augmentation for structured domains with cyclic polygon representations, maintaining important structural relationships during augmentation operations.

Abstract: Geometric data augmentation is widely used in segmentation pipelines and typically assumes that polygon annotations represent simply connected regions. However, in structured domains such as architectural floorplan analysis, ring-type regions are often encoded as a single cyclic polygon chain connecting outer and inner boundaries. During augmentation, clipping operations may remove intermediate vertices and disrupt this cyclic connectivity, breaking the structural relationship between the boundaries. In this work, we introduce an order-preserving polygon augmentation strategy that performs transformations in mask space and then projects surviving vertices back into index-space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead. Experiments demonstrate that the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.

[312] Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning

Yifan Wang, Yanyu Li, Gordon Guocheng Qian, Sergey Tulyakov, Yun Fu, Anil Kag

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

DetailsMotivation: Cannot determine motivation due to failed paper fetch.

Method: Cannot determine method due to failed paper fetch.

Result: Cannot determine results due to failed paper fetch.

Conclusion: Cannot draw conclusions due to failed paper fetch.

Abstract: Failed to fetch summary for 2601.04153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[313] Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

Mridankan Mandal

Main category: cs.CV

TL;DR: Vision foundation models adapted for pasture biomass estimation from agricultural imagery, revealing that simpler fusion mechanisms outperform complex transformers on scarce data, with backbone pretraining quality being most important.

DetailsMotivation: Accurate pasture biomass estimation from agricultural imagery is crucial for sustainable livestock management, but existing methods struggle with small, imbalanced, and sparsely annotated datasets typical of real-world monitoring.

Method: Systematic evaluation of vision foundation model adaptation on CSIRO Pasture Biomass benchmark (357 image dual-view dataset) through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross-view fusion mechanisms, and metadata analysis.

Result: Discovered “fusion complexity inversion”: on scarce agricultural data, simple two-layer gated depthwise convolution (R²=0.903) outperforms cross-view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793). Backbone pretraining scale dominates architectural choices, with DINOv2→DINOv3 upgrade yielding +5.0 R² points.

Conclusion: For sparse agricultural benchmarks: prioritize backbone quality over fusion complexity, prefer local modules over global alternatives, and exclude features unavailable at inference. Simple architectures work best on limited data.

Abstract: Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357 image dual view dataset with laboratory validated, component wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed “fusion complexity inversion”, is uncovered: on scarce agricultural data, a two layer gated depthwise convolution (R^2 = 0.903) outperforms cross view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.

[314] Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2601.13029: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13029&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[315] A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

Xin Zhang, Liangxiu Han, Tam Sobeih, Yue Shi, Yalin Zheng, Uazman Alam, Maryam Ferdousi, Rayaz Malik

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2602.13693 due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2602.13693: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13693&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[316] RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance

Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions about paper content due to API access limitations

Abstract: Failed to fetch summary for 2603.15484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[317] VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

Nadav Kadvil, Malak Fares, Ayellet Tal

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.19570: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19570&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[318] Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Leszek Rutkowski

Main category: cs.CV

TL;DR: Proposes a decoupled kinetic paradigm using Alternating Gradient Flow for structural pruning of vision networks, addressing magnitude bias in traditional metrics and achieving efficient compression without collapse.

DetailsMotivation: Traditional pruning metrics (weight magnitude, activation awareness) suffer from magnitude bias and fail to preserve critical functional pathways in structural pruning of deep vision networks, leading to performance collapse at high sparsity.

Method: Uses Alternating Gradient Flow (AGF) with absolute feature-space Taylor expansion to capture structural “kinetic utility”. Proposes hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors.

Result: AGF avoids structural collapse where traditional metrics fall below random sampling at 75% compression on ImageNet-1K. Hybrid approach reduces heavy expert usage by ~50% (0.92× overall cost) without sacrificing accuracy on ImageNet-100.

Conclusion: The decoupled kinetic paradigm effectively addresses limitations of traditional pruning metrics, enabling efficient structural compression of vision networks while preserving functionality through topological implicit regularization.

Abstract: Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network’s structural “kinetic utility”. First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

[319] Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua

Main category: cs.CV

TL;DR: Failed to fetch summary for paper 2602.21435 due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed summary fetch

Method: Unable to determine method due to failed summary fetch

Result: Unable to determine results due to failed summary fetch

Conclusion: Unable to determine conclusion due to failed summary fetch

Abstract: Failed to fetch summary for 2602.21435: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21435&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[320] CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

Di Zhang, Zhangpeng Gong, Xiaobo Pang, Jiashuai Liu, Junbo Lu, Hao Cui, Jiusong Ge, Zhi Zeng, Kai Yi, Yinghua Li, Si Liu, Tingsong Yu, Haoran Wang, Mireia Crispin-Ortuzar, Weimiao Yu, Chen Li, Zeyu Gao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2602.21637: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21637&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[321] AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Hamza Mooraj, George Pantazopoulos, Alessandro Suglia

Main category: cs.CV

TL;DR: Systematic comparison of CNN, contrastive VLM, and generative VLM models for crop disease classification reveals distinct performance profiles across lab vs. field domains, with generative VLMs showing strongest domain resilience despite text generation failures.

DetailsMotivation: Existing crop disease detection evaluations focus on single architectures or lab datasets, lacking systematic comparison across model paradigms and controlled analysis of domain effects between laboratory and field conditions.

Method: Introduces AgriPath-LF16 benchmark with 111k images across 16 crops and 41 diseases, with explicit lab/field separation. Compares CNNs, contrastive VLMs, and generative VLMs under unified protocols across full/lab/field training regimes using macro-F1 and Parse Success Rate metrics.

Result: CNNs achieve highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide robust, parameter-efficient cross-domain performance. Generative VLMs show strongest resilience to distributional variation but have additional failure modes from free-text generation.

Conclusion: Architectural choice should be guided by deployment context rather than aggregate accuracy alone, with different paradigms excelling in different conditions (lab vs. field, controlled vs. real-world).

Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.

[322] Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Hexin Dong, Yi Lin, Pengyu Zhou, Xuan Zhong Feng, Alan Clint Legasto, Mingquan Lin, Hao Chen, Yuzhe Yang, George Shih, Yifan Peng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.22092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[323] Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation

Muquan Li, Hang Gou, Yingyi Ma, Rongzheng Wang, Ke Qin, Tao He

Main category: cs.CV

TL;DR: Paper ID 2602.24144 summary could not be fetched due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to missing abstract content

Method: Unable to determine method due to missing abstract content

Result: Unable to determine results due to missing abstract content

Conclusion: Unable to determine conclusion due to missing abstract content

Abstract: Failed to fetch summary for 2602.24144: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24144&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[324] Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.00512: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00512&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[325] BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation

Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Dustin Severtson, Ajmal Mian

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.01932: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01932&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[326] FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

Zhen Wang, Youcan Xu, Jun Xiao, Long Chen

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.06289: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06289&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[327] Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Zhuorui Zhang, Roger Pallarès-López, Praneeth Namburi, Brian W. Anthony

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available - paper summary fetch failed due to rate limiting

Conclusion: Cannot provide analysis due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2603.06471: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06471&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[328] One-Shot Badminton Shuttle Detection for Mobile Robots

Florentin Dipner, William Talbot, Turcan Tuna, Andrei Cramariuc, Marco Hutter

Main category: cs.CV

TL;DR: Robust one-shot badminton shuttlecock detection framework for non-stationary robots using semi-automatic annotation pipeline and YOLOv8 fine-tuning

DetailsMotivation: Address the lack of egocentric shuttlecock detection datasets for mobile robots and develop a robust detector for dynamic viewpoints in badminton robotics applications

Method: Created dataset of 20,510 semi-automatically annotated frames across 11 backgrounds, categorized by difficulty levels. Developed novel semi-automatic annotation pipeline and fine-tuned YOLOv8 network optimized for real-time detection

Result: Achieved F1-score of 0.86 in test environments similar to training and 0.70 in entirely unseen environments. Detection performance depends on shuttlecock size and background texture complexity

Conclusion: Successfully developed a robust shuttlecock detector for mobile robots with moving cameras, providing foundation for downstream tasks like tracking and trajectory estimation

Abstract: This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline, that enables efficient labeling from stationary camera footage. We propose a metric suited to our downstream use case and fine-tune a YOLOv8 network optimized for real-time shuttlecock detection, achieving an F1-score of 0.86 under our metric in test environments similar to training, and 0.70 in entirely unseen environments. Our analysis reveals that detection performance is critically dependent on shuttlecock size and background texture complexity. Qualitative experiments confirm their applicability to robots with moving cameras. Unlike prior work with stationary camera setups, our detector is specifically designed for the egocentric, dynamic viewpoints of mobile robots, providing a foundational building block for downstream tasks, including tracking, trajectory estimation, and system (re)-initialization.

[329] EmoStory: Emotion-Aware Story Generation

Jingyuan Yang, Rucong Chen, Weibin Luo, Hui Huang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.10349: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10349&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[330] Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.10967: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10967&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[331] TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

Zhaoyu Liu, Xi Weng, Lianyu Hu, Zhe Hou, Kan Jiang, Jin Song Dong, Yang Liu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2603.13397: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13397&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[332] LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi, Wei Han, Bingchuan Li, Fangfang Liu, Zhuowei Chen, Tianxiang Ma, Qian HE, Yi Zhou, Xiaohua Xie

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.13506: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13506&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[333] Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

Bohan Zhang, Weidong Tang, Zhixiang Chi, Yi Jin, Zhenbo Li, Yang Wang, Yanan Wu

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.13858: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13858&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[334] USIS-PGM: Photometric Gaussian Mixtures for Underwater Salient Instance Segmentation

Lin Hong, Xiangtong Yao, Mürüvvet Bozkurt, Xin Wang, Fumin Zhang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2603.13961: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13961&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[335] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to analyze paper content due to technical fetching error

Abstract: Failed to fetch summary for 2603.14482: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14482&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[336] WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Stefan Englmeier, Katharina Winter, Fabian B. Flohr

Main category: cs.CV

TL;DR: WorldVLM: A hybrid architecture combining Vision-Language Models (VLMs) for high-level contextual reasoning and World Models (WMs) for spatial dynamics prediction in autonomous driving systems.

DetailsMotivation: Autonomous driving needs both high-level scene context reasoning and accurate environmental dynamics prediction. While VLMs excel at contextual reasoning, they lack spatial comprehension, and WMs predict future scene evolution but need guidance. The paper aims to combine their complementary strengths for better autonomous driving systems.

Method: Proposes WorldVLM: a hybrid architecture where the VLM generates high-level behavior commands based on contextual reasoning, which then guide the World Model for spatial dynamics prediction and action execution. The paper explores conditioning strategies and addresses hybrid design challenges.

Result: The paper presents evaluation of conditioning strategies and provides insights into hybrid design challenges for combining VLMs and WMs in autonomous driving systems.

Conclusion: WorldVLM offers a promising approach to leverage both contextual reasoning from VLMs and spatial dynamics prediction from WMs for more effective and interpretable autonomous driving systems.

Abstract: Autonomous driving systems depend on on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision- Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WM) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: A hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.

[337] Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.15011: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15011&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[338] Tracking the Discriminative Axis: Dual Prototypes for Test-Time OOD Detection Under Covariate Shift

Wooseok Lee, Jin Mo Yang, Saewoong Bahk, Hyung-Sin Kim

Main category: cs.CV

TL;DR: Unable to analyze paper 2603.15213 due to HTTP 429 error when fetching abstract from arXiv API

DetailsMotivation: Cannot determine motivation as abstract is unavailable due to API rate limiting

Method: Cannot determine method as abstract is unavailable due to API rate limiting

Result: Cannot determine results as abstract is unavailable due to API rate limiting

Conclusion: Cannot determine conclusion as abstract is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2603.15213: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15213&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[339] HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu, Xiao Zhang, Yang Li, Songtao Liu, Miles Yang, Yu Shi, Zhao Zhong, Liefeng Bo

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation as the paper content could not be retrieved

Method: Unable to determine method as the paper content could not be retrieved

Result: Unable to determine results as the paper content could not be retrieved

Conclusion: Unable to draw conclusions as the paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.15228: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15228&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[340] Gym-V: A Unified Vision Environment System for Agentic Vision Research

Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.15432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[341] Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

DetailsMotivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2603.15618: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15618&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[342] Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation

Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2504.05296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.05296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[343] AutothinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.05551: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05551&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.AI

[344] Neural-Symbolic Logic Query Answering in Non-Euclidean Space

Lihui Liu

Main category: cs.AI

TL;DR: HYQNET is a neural-symbolic model for first-order logic query reasoning on knowledge graphs that uses hyperbolic space embeddings to better capture hierarchical query structures.

DetailsMotivation: Current approaches to answering complex first-order logic queries on knowledge graphs have limitations: symbolic methods are interpretable but struggle with incomplete graphs, neural methods generalize better but lack transparency, and existing neural-symbolic models fail to capture the hierarchical structure of logical queries.

Method: HYQNET decomposes FOL queries into relation projections and logical operations over fuzzy sets for interpretability. It uses a hyperbolic GNN-based approach for knowledge graph completion in hyperbolic space, embedding the recursive query tree while preserving structural dependencies through hyperbolic representations.

Result: Experiments on three benchmark datasets demonstrate that HYQNET achieves strong performance, showing advantages of reasoning in hyperbolic space over Euclidean-based approaches.

Conclusion: HYQNET effectively integrates neural and symbolic approaches for logic query reasoning by leveraging hyperbolic space to capture hierarchical query structures, addressing limitations of existing methods.

Abstract: Answering complex first-order logic (FOL) queries on knowledge graphs is essential for reasoning. Symbolic methods offer interpretability but struggle with incomplete graphs, while neural approaches generalize better but lack transparency. Neural-symbolic models aim to integrate both strengths but often fail to capture the hierarchical structure of logical queries, limiting their effectiveness. We propose HYQNET, a neural-symbolic model for logic query reasoning that fully leverages hyperbolic space. HYQNET decomposes FOL queries into relation projections and logical operations over fuzzy sets, enhancing interpretability. To address missing links, it employs a hyperbolic GNN-based approach for knowledge graph completion in hyperbolic space, effectively embedding the recursive query tree while preserving structural dependencies. By utilizing hyperbolic representations, HYQNET captures the hierarchical nature of logical projection reasoning more effectively than Euclidean-based approaches. Experiments on three benchmark datasets demonstrate that HYQNET achieves strong performance, highlighting the advantages of reasoning in hyperbolic space.

[345] NextMem: Towards Latent Factual Memory for LLM-based Agents

Zeyu Zhang, Rui Li, Xiaoyan Zhao, Yang Zhang, Wenjie Wang, Xu Chen, Tat-Seng Chua

Main category: cs.AI

TL;DR: NextMem is a latent factual memory framework for LLM-based agents that uses autoregressive autoencoders to create efficient latent memory with accurate reconstruction, addressing limitations of textual and parametric memory approaches.

DetailsMotivation: Existing factual memory approaches for LLM-based agents have significant limitations: textual methods create heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. There's a need for an efficient memory framework that preserves past observations for future decision-making without these drawbacks.

Method: NextMem uses an autoregressive autoencoder to construct latent factual memory with accurate reconstruction. It employs a two-stage training process: 1) autoregressive reconstruction alignment, and 2) progressive latent substitution. The framework also incorporates quantization to reduce storage overhead.

Result: Extensive experiments show NextMem achieves superior performance compared to existing methods, with excellent retrieval capabilities, robustness, and extensibility properties. The framework effectively balances memory efficiency with reconstruction accuracy.

Conclusion: NextMem provides an effective solution for factual memory in LLM-based agents, overcoming limitations of previous approaches through latent memory construction with autoregressive autoencoders and a two-stage training process.

Abstract: Memory is critical for LLM-based agents to preserve past observations for future decision-making, where factual memory serves as its foundational part. However, existing approaches to constructing factual memory face several limitations. Textual methods impose heavy context and indexing burdens, while parametric methods suffer from catastrophic forgetting and high costs. To address these challenges, we introduce NextMem, a latent factual memory framework that utilizes an autoregressive autoencoder to efficiently construct latent memory while ensuring accurate reconstruction. For better optimization, we propose a two-stage training process, including autoregressive reconstruction alignment and progressive latent substitution. We also incorporate quantization to reduce storage overhead. Extensive experiments demonstrate that NextMem achieves superior performance, and excels in retrieval, robustness, and extensibility properties. We release our code and model checkpoints at https://github.com/nuster1128/NextMem.

[346] AIDABench: AI Data Analytics Benchmark

Yibo Yang, Fei Lei, Yixuan Sun, Yantao Zeng, Chengguang Lv, Jiancao Hong, Jiaojiao Tian, Tianyu Qiu, Xin Wang, Yanbing Chen, Yanjie Li, Zheng Pan, Xiaochen Zhou, Guanzhou Chen, Haoran Lv, Yuning Xu, Yue Ou, Haodong Liu, Shiqi He, Anya Jia, Yulei Xin, Huan Wu, Liang Liu, Jiaye Ge, Jianxin Dong, Dahua Lin, Wenxiu Sun

Main category: cs.AI

TL;DR: AIDABench: A comprehensive benchmark for evaluating AI systems on complex, end-to-end document analysis tasks across question answering, data visualization, and file generation dimensions.

DetailsMotivation: Existing benchmarks focus on isolated capabilities or simplified scenarios, failing to capture end-to-end task effectiveness required in practical settings for AI-driven document understanding tools.

Method: Created AIDABench with 600+ diverse document analysis tasks across three core dimensions (question answering, data visualization, file generation) using realistic scenarios with heterogeneous data types like spreadsheets, databases, financial reports, and operational records.

Result: Evaluated 11 state-of-the-art models (both proprietary and open-source) with best-performing model achieving only 59.43% pass-at-1, showing complex real-world data analytics remain challenging for current AI systems.

Conclusion: AIDABench provides a principled reference for enterprise procurement, tool selection, and model optimization, highlighting significant challenges in complex document analysis that need future research.

Abstract: As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark’s difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.

[347] The Comprehension-Gated Agent Economy: A Robustness-First Architecture for AI Economic Agency

Rahul Baxi

Main category: cs.AI

TL;DR: CGAE framework links AI agents’ economic permissions to verified robustness metrics rather than capability benchmarks, creating incentive structures for safety.

DetailsMotivation: Current AI economic agency frameworks use capability benchmarks that don't correlate with operational robustness, creating safety risks in agent economies.

Method: Introduces comprehension-gated architecture where economic permissions are bounded by verified comprehension function from adversarial robustness audits across three dimensions: constraint compliance (CDCT), epistemic integrity (DDFT), and behavioral alignment (AGT).

Result: Proves three system properties: bounded economic exposure, incentive-compatible robustness investment, and monotonic safety scaling. Includes temporal decay and stochastic re-auditing to prevent drift.

Conclusion: CGAE creates formal bridge between empirical AI robustness evaluation and economic governance, transforming safety from regulatory burden into competitive advantage.

Abstract: AI agents are increasingly granted economic agency (executing trades, managing budgets, negotiating contracts, and spawning sub-agents), yet current frameworks gate this agency on capability benchmarks that are empirically uncorrelated with operational robustness. We introduce the Comprehension-Gated Agent Economy (CGAE), a formal architecture in which an agent’s economic permissions are upper-bounded by a verified comprehension function derived from adversarial robustness audits. The gating mechanism operates over three orthogonal robustness dimensions: constraint compliance (measured by CDCT), epistemic integrity (measured by DDFT), and behavioral alignment (measured by AGT), with intrinsic hallucination rates serving as a cross-cutting diagnostic. We define a weakest-link gate function that maps robustness vectors to discrete economic tiers, and prove three properties of the resulting system: (1) bounded economic exposure, ensuring maximum financial liability is a function of verified robustness; (2) incentive-compatible robustness investment, showing rational agents maximize profit by improving robustness rather than scaling capability alone; and (3) monotonic safety scaling, demonstrating that aggregate system safety does not decrease as the economy grows. The architecture includes temporal decay and stochastic re-auditing mechanisms that prevent post-certification drift. CGAE provides the first formal bridge between empirical AI robustness evaluation and economic governance, transforming safety from a regulatory burden into a competitive advantage.

[348] Form Follows Function: Recursive Stem Model

Navid Hakimi

Main category: cs.AI

TL;DR: RSM introduces a recursive reasoning model with depth-agnostic training that enables test-time scaling and provides reliability signals through convergence behavior.

DetailsMotivation: Existing recursive reasoning models like HRM and TRM suffer from training inefficiencies due to deep supervision and long unrolls, which increase computational cost and can bias models toward greedy intermediate behavior.

Method: RSM keeps the TRM-style backbone but changes training to learn a stable, depth-agnostic transition operator by detaching hidden-state history, treating early iterations as detached warm-up steps, applying loss only at final step, and using stochastic outer-transition scheme with independent scaling of outer recursion depth H and inner compute depth L.

Result: RSM achieves >20× faster training than TRM with improved accuracy (~5× error reduction), enables test-time scaling for arbitrarily many refinement steps, reaches 97.5% exact accuracy on Sudoku-Extreme with test-time compute, and ~80% exact accuracy on Maze-Hard (30×30) in ~40 minutes.

Conclusion: RSM provides efficient recursive reasoning with test-time scaling capabilities and offers architecture-native reliability signals through convergence behavior, which can help detect hallucinations and enable practical correctness checks.

Abstract: Recursive reasoning models such as Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM) show that small, weight-shared networks can solve compute-heavy and NP puzzles by iteratively refining latent states, but their training typically relies on deep supervision and/or long unrolls that increase wall-clock cost and can bias the model toward greedy intermediate behavior. We introduce Recursive Stem Model (RSM), a recursive reasoning approach that keeps the TRM-style backbone while changing the training contract so the network learns a stable, depth-agnostic transition operator. RSM fully detaches the hidden-state history during training, treats early iterations as detached “warm-up” steps, and applies loss only at the final step. We further grow the outer recursion depth $H$ and inner compute depth $L$ independently and use a stochastic outer-transition scheme (stochastic depth over $H$) to mitigate instability when increasing depth. This yields two key capabilities: (i) $>20\times$ faster training than TRM while improving accuracy ($\approx 5\times$ reduction in error rate), and (ii) test-time scaling where inference can run for arbitrarily many refinement steps ($\sim 20,000 H_{\text{test}} \gg 20 H_{\text{train}}$), enabling additional “thinking” without retraining. On Sudoku-Extreme, RSM reaches 97.5% exact accuracy with test-time compute (within ~1 hour of training on a single A100), and on Maze-Hard ($30 \times 30$) it reaches ~80% exact accuracy in ~40 minutes using attention-based instantiation. Finally, because RSM implements an iterative settling process, convergence behavior provides a simple, architecture-native reliability signal: non-settling trajectories warn that the model has not reached a viable solution and can be a guard against hallucination, while stable fixed points can be paired with domain verifiers for practical correctness checks.

[349] CraniMem: Cranial Inspired Gated and Bounded Memory for Agentic Systems

Pearl Mody, Mihir Panchal, Rishit Kar, Kiran Bhowmick, Ruhina Karani

Main category: cs.AI

TL;DR: CraniMem is a neurocognitively inspired memory system for LLM agents with gated multi-stage memory (episodic buffer + knowledge graph) and scheduled consolidation for robust long-horizon task performance.

DetailsMotivation: Existing agent memory systems behave like external databases with ad hoc read/write rules, leading to unstable retention, limited consolidation, and vulnerability to distractor content in long-running workflows.

Method: Neurocognitively motivated design with goal-conditioned gating and utility tagging, bounded episodic buffer for near-term continuity, structured long-term knowledge graph for semantic recall, and scheduled consolidation loop that replays high-utility traces while pruning low-utility items.

Result: On long-horizon benchmarks with both clean inputs and injected noise, CraniMem outperforms Vanilla RAG and Mem0 baselines, showing smaller performance drops under distraction and more robust memory retention.

Conclusion: CraniMem provides a principled memory architecture for LLM agents that improves robustness in long-running workflows through neurocognitively inspired gating, multi-stage memory, and consolidation mechanisms.

Abstract: Large language model (LLM) agents are increasingly deployed in long running workflows, where they must preserve user and task state across many turns. Many existing agent memory systems behave like external databases with ad hoc read/write rules, which can yield unstable retention, limited consolidation, and vulnerability to distractor content. We present CraniMem, a neurocognitively motivated, gated and bounded multi-stage memory design for agentic systems. CraniMem couples goal conditioned gating and utility tagging with a bounded episodic buffer for near term continuity and a structured long-term knowledge graph for durable semantic recall. A scheduled consolidation loop replays high utility traces into the graph while pruning low utility items, keeping memory growth in check and reducing interference. On long horizon benchmarks evaluated under both clean inputs and injected noise, CraniMem is more robust than a Vanilla RAG and Mem0 baseline and exhibits smaller performance drops under distraction. Our code is available at https://github.com/PearlMody05/Cranimem and the accompanying PyPI package at https://pypi.org/project/cranimem.

[350] GSI Agent: Domain Knowledge Enhancement for Large Language Models in Green Stormwater Infrastructure

Shaohuang Wang

Main category: cs.AI

TL;DR: GSI Agent: A domain-enhanced LLM framework for Green Stormwater Infrastructure tasks using supervised fine-tuning, retrieval-augmented generation, and agent-based reasoning.

DetailsMotivation: Green Stormwater Infrastructure (GSI) systems require continuous inspection and maintenance, but domain knowledge is scattered across documents. Non-experts struggle to get reliable guidance, and general LLMs lack domain-specific knowledge and produce inaccurate answers in engineering scenarios.

Method: Three complementary strategies: (1) supervised fine-tuning on curated GSI instruction dataset, (2) retrieval-augmented generation over internal GSI knowledge base from municipal documents, and (3) agent-based reasoning pipeline coordinating retrieval, context integration, and structured response generation.

Result: Significant improvement in domain-specific performance: BLEU-4 improved from 0.090 to 0.307 on GSI dataset while maintaining general knowledge capability (0.304 vs 0.305 on common knowledge dataset).

Conclusion: Systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications, demonstrating practical value for GSI inspection and maintenance tasks.

Abstract: Green Stormwater Infrastructure (GSI) systems, such as permeable pavement, rain gardens, and bioretention facilities, require continuous inspection and maintenance to ensure long-term performance. However, domain knowledge about GSI is often scattered across municipal manuals, regulatory documents, and inspection forms. As a result, non-expert users and maintenance staff may struggle to obtain reliable and actionable guidance from field observations. Although Large Language Models (LLMs) have demonstrated strong general reasoning and language generation capabilities, they often lack domain-specific knowledge and may produce inaccurate or hallucinated answers in engineering scenarios. This limitation restricts their direct application to professional infrastructure tasks. In this paper, we propose GSI Agent, a domain-enhanced LLM framework designed to improve performance in GSI-related tasks. Our approach integrates three complementary strategies: (1) supervised fine-tuning (SFT) on a curated GSI instruction dataset, (2) retrieval-augmented generation (RAG) over an internal GSI knowledge base constructed from municipal documents, and (3) an agent-based reasoning pipeline that coordinates retrieval, context integration, and structured response generation. We also construct a new GSI Dataset aligned with real-world GSI inspection and maintenance scenarios. Experimental results show that our framework significantly improves domain-specific performance while maintaining general knowledge capability. On the GSI dataset, BLEU-4 improves from 0.090 to 0.307, while performance on the common knowledge dataset remains stable (0.304 vs. 0.305). These results demonstrate that systematic domain knowledge enhancement can effectively adapt general-purpose LLMs to professional infrastructure applications.

[351] MAC: Multi-Agent Constitution Learning

Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas

Main category: cs.AI

TL;DR: MAC is a multi-agent system that automatically learns constitutional rules for LLMs through structured prompt optimization, outperforming existing methods by 50+% on PII tagging and achieving performance comparable to fine-tuning without parameter updates.

DetailsMotivation: Current constitutional AI methods rely on human-written rules, and existing LLM-based prompt optimizers are ineffective at learning constitutions due to needing many labeled examples and lacking structure in optimized prompts.

Method: Proposes Multi-Agent Constitutional Learning (MAC) with a network of specialized agents that accept, edit, or reject rule updates for structured prompts, and MAC+ which trains agents on successful trajectories to reinforce high-reward updates.

Result: MAC outperforms recent prompt optimization methods by over 50% on PII tagging, produces human-readable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without parameter updates.

Conclusion: MAC enables automatic learning of constitutional rules through structured multi-agent optimization, providing interpretable, auditable rule sets that match fine-tuning performance without parameter updates.

Abstract: Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.

[352] Did You Check the Right Pocket? Cost-Sensitive Store Routing for Memory-Augmented Agents

Madhava Gaikwad

Main category: cs.AI

TL;DR: Memory-augmented agents can improve efficiency by selectively routing queries to relevant memory stores instead of retrieving from all stores, reducing costs and improving accuracy.

DetailsMotivation: Current memory-augmented agents retrieve from all specialized memory stores for every query, which increases computational costs and introduces irrelevant context that can harm performance.

Method: Formulate memory retrieval as a store-routing problem and evaluate using coverage, exact match, and token efficiency metrics. Propose selective routing mechanisms and formalize store selection as a cost-sensitive decision problem trading accuracy against retrieval cost.

Result: Oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating selective retrieval improves both efficiency and performance.

Conclusion: Routing decisions are a first-class component of memory-augmented agent design, motivating learned routing mechanisms for scalable multi-store systems. Store selection should be treated as a cost-sensitive decision problem.

Abstract: Memory-augmented agents maintain multiple specialized stores, yet most systems retrieve from all stores for every query, increasing cost and introducing irrelevant context. We formulate memory retrieval as a store-routing problem and evaluate it using coverage, exact match, and token efficiency metrics. On downstream question answering, an oracle router achieves higher accuracy while using substantially fewer context tokens compared to uniform retrieval, demonstrating that selective retrieval improves both efficiency and performance. Our results show that routing decisions are a first-class component of memory-augmented agent design and motivate learned routing mechanisms for scalable multi-store systems. We additionally formalize store selection as a cost-sensitive decision problem that trades answer accuracy against retrieval cost, providing a principled interpretation of routing policies.

[353] DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust Graphs

Yu Li, Qiang Hu, Yao Zhang, Lili Quan, Jiongchi Yu, Junjie Wang

Main category: cs.AI

TL;DR: DynaTrust: A dynamic trust graph defense method for LLM-based multi-agent systems against sleeper agent attacks, using continuous trust evaluation and adaptive graph restructuring instead of rigid blocking.

DetailsMotivation: LLM-based multi-agent systems introduce new attack surfaces like sleeper agents that behave benignly initially but reveal malicious behaviors later. Existing defenses fail to adapt to evolving adversarial strategies or suffer from high false-positive rates due to rigid blocking policies.

Method: Models MAS as a dynamic trust graph (DTG) where trust is a continuous, evolving process. Dynamically updates agent trust based on historical behaviors and expert agent confidence. Instead of blocking, autonomously restructures the graph to isolate compromised agents while restoring task connectivity.

Result: Outperforms state-of-the-art AgentShield by increasing defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Significantly reduces false-positive rates while ensuring uninterrupted system operations through graph adaptation.

Conclusion: DynaTrust provides an effective defense against sleeper agents in LLM-based multi-agent systems by treating trust as dynamic rather than static, enabling adaptive security that balances protection with system utility.

Abstract: Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false-positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models MAS as a dynamic trust graph~(DTG), and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity to ensure the usability of MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state-of-the-art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operations through graph adaptation.

[354] QV May Be Enough: Toward the Essence of Attention in LLMs

Zhang Edward

Main category: cs.AI

TL;DR: This paper provides a linguistic analysis of the Transformer’s QKV mechanism from part-of-speech and syntactic perspectives, proposes a QV paradigm with empirical validation, and introduces QV-Ka optimization scheme.

DetailsMotivation: The paper aims to understand the fundamental essence of the Query-Key-Value mechanism in Transformers from first principles and linguistic perspectives, seeking to provide a unified explanatory framework for modern architectures like MQA, GQA, and MLA.

Method: The authors use linguistic analysis centered on part-of-speech and syntactic analysis to derive the underlying essence of QKV. They introduce the QV paradigm, provide empirical evidence for its validity, and propose the QV-Ka optimization scheme with experimental validation.

Result: The paper establishes an interpretable theoretical analysis of the QKV mechanism, provides a unified framework explaining contemporary architectures, and validates the proposed QV-Ka optimization scheme through experiments.

Conclusion: The interpretable theoretical analysis of QKV establishes a robust foundation for future evolution of large language model architectures, providing insights into trade-offs and optimization trajectories for modern attention mechanisms.

Abstract: Starting from first principles and a linguistic perspective centered on part-of-speech (POS) and syntactic analysis, this paper explores and derives the underlying essence of the Query-Key-Value (QKV) mechanism within the Transformer architecture. Based on this theoretical foundation, we provide a unified explanatory framework for the efficacy of contemporary architectures, including MQA, GQA, and MLA, while identifying their inherent trade-offs and potential optimization trajectories. We introduce the QV paradigm and provide empirical evidence for its validity. Building upon this, we propose the QV-Ka optimization scheme, which is further substantiated through experimental validation. The interpretable theoretical analysis of the QKV mechanism presented in this work establishes a robust foundation for the future evolution of large language model architectures.

[355] Compiled Memory: Not More Information, but More Precise Instructions for Language Agents

James Rhodes, George Kang

Main category: cs.AI

TL;DR: Atlas is a memory kernel that compiles agent task experience into instruction structures through distillation rather than storage, improving performance on tasks like contract analysis and QA without fine-tuning or RAG.

DetailsMotivation: Current memory systems focus on memory management (retrieval and paging), but lack mechanisms for memory utility - determining what experiences are worth keeping and how they should change agent behavior.

Method: Atlas compiles accumulated task experience into agent instruction structure through distillation. Facts from agent failures/successes are verified through a three-step promotion gate, then delivered by rewriting the agent’s system prompt with learned sub-bullets.

Result: On CUAD contract analysis: GPT-4o token-level F1 improved +8.7pp, precision +12.5pp. On HotpotQA multi-hop QA: joint F1 improved +3.16pp. The evolved prompt also improved Claude Sonnet 4.5 by +2.31pp when applied unchanged.

Conclusion: Memory should be distillation rather than storage, with delivery through instruction rewriting rather than context injection. The compiled knowledge is task-shaped rather than model-shaped, enabling cross-model transfer.

Abstract: Existing memory systems for language agents address memory management: how to retrieve and page more information within a context budget. We address a complementary problem – memory utility: what experience is worth keeping, and how it should change agent behavior. We present Atlas, a memory kernel that compiles accumulated task experience into an agent’s instruction structure – without fine-tuning, RAG, or human intervention. Memory is distillation, not storage; delivery is instruction rewriting, not context injection. Facts extracted from agent failures and successes are verified through a three-step promotion gate and delivered by rewriting the agent’s system prompt with learned sub-bullets. On CUAD contract analysis, the evolved prompt improves GPT-4o token-level F1 by $+8.7$pp and precision by $+12.5$pp. On HotpotQA multi-hop QA, joint F1 improves $+3.16$pp. An ablation isolates the mechanism’s defining property – the training signal constraint: the evolved prompt learns exactly what it is taught, and nothing more. Applied to Claude Sonnet~4.5 using the same evolved prompt – compiled from GPT-4o errors, unchanged – joint F1 improves $+2.31$pp, with gains concentrating where Claude’s stronger baseline leaves the most room – confirming that the compiled knowledge is task-shaped, not model-shaped.

[356] A Dynamic Survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, Plithogenic, and Extensional Sets

Takaaki Fujita, Florentin Smarandache

Main category: cs.AI

TL;DR: A comprehensive survey book covering four major families of uncertainty-oriented mathematical frameworks: Fuzzy Sets, Intuitionistic Fuzzy Sets, Neutrosophic Sets, and Plithogenic Sets, providing systematic overview and unified exposition.

DetailsMotivation: Real-world phenomena often exhibit vagueness, partial truth, and incomplete information, requiring mathematically rigorous frameworks to model such uncertainty. Many generalized set-theoretic frameworks have been developed, leading to recurring ideas and structural patterns across these uncertainty models.

Method: The book provides a comprehensive, large-scale survey of four major families: Fuzzy Sets, Intuitionistic Fuzzy Sets, Neutrosophic Sets, and Plithogenic Sets. It offers systematic overview of existing developments through unified exposition.

Result: A systematic overview of existing developments in uncertainty modeling frameworks, highlighting recurring patterns and structural similarities across different set theories.

Conclusion: The unified exposition aims to stimulate new insights, further conceptual extensions, and additional applications across a wide range of disciplines by providing comprehensive coverage of these uncertainty-oriented mathematical frameworks.

Abstract: Real-world phenomena often exhibit vagueness, partial truth, and incomplete information. To model such uncertainty in a mathematically rigorous way, many generalized set-theoretic frameworks have been introduced, including Fuzzy Sets [1], Intuitionistic Fuzzy Sets [2], Neutrosophic Sets [3,4], Vague Sets [5], Hesitant Fuzzy Sets [6], Picture Fuzzy Sets [7], Quadripartitioned Neutrosophic Sets [8], Penta-Partitioned Neutrosophic Sets [9], Plithogenic Sets [10], HyperFuzzy Sets [11], and HyperNeutrosophic Sets [12]. Within these frameworks, a wide range of notions has been proposed and studied, particularly in the settings of fuzzy, intuitionistic fuzzy, neutrosophic, and plithogenic set theories. This extensive literature underscores both the significance of these theories and the breadth of their application areas. As a result, many ideas, constructions, and structural patterns recur across these four major families of uncertainty-oriented models. In this book, we provide a comprehensive, large-scale survey of Fuzzy, Intuitionistic Fuzzy, Neutrosophic, and Plithogenic Sets. Our goal is to give readers a systematic overview of existing developments and, through a unified exposition, to stimulate new insights, further conceptual extensions, and additional applications across a wide range of disciplines.

[357] Quantum-Secure-By-Construction (QSC): A Paradigm Shift For Post-Quantum Agentic Intelligence

Arit Kumar Bishwas, Mousumi Sen, Albert Nieto-Morales, Joel Jacob Varghese

Main category: cs.AI

TL;DR: QSC (Quantum Secure by Construction) is a design paradigm that integrates quantum-secure communication as a core architectural property of agentic AI systems, combining post-quantum cryptography, quantum random number generation, and quantum key distribution for secure agent interactions across distributed environments.

DetailsMotivation: As agentic AI systems scale across globally distributed infrastructures, secure communication becomes critical, especially in the quantum era where current cryptographic assumptions may become obsolete over operational lifetimes. There's a need to treat quantum security as a fundamental architectural property rather than an afterthought.

Method: QSC implements a runtime adaptive security model with cryptographically pluggable components guided by policy. It combines post-quantum cryptography, quantum random number generation, and quantum key distribution to secure agent interactions across cloud, edge, and inter-organizational environments through a governance-aware orchestration layer.

Result: System-level analysis and empirical evaluation show QSC can reduce operational complexity and cost of introducing quantum security into deployed agentic AI systems, while examining trade-offs between classical and quantum-secure mechanisms.

Conclusion: QSC establishes a foundational paradigm for post-quantum agentic intelligence, providing a principled pathway for designing globally interoperable, resilient, and future-ready intelligent systems with built-in quantum security.

Abstract: As agentic artificial intelligence systems scale across globally distributed and long lived infrastructures, secure and policy compliant communication becomes a fundamental systems challenge. This challenge grows more serious in the quantum era, where the cryptographic assumptions built into today’s AI deployments may not remain valid over their operational lifetime. Here, we introduce quantum secure by construction, or QSC, as a design paradigm that treats quantum secure communication as a core architectural property of agentic AI systems rather than an upgrade added later. We realize QSC through a runtime adaptive security model that combines post quantum cryptography, quantum random number generation, and quantum key distribution to secure interactions among autonomous agents operating across heterogeneous cloud, edge, and inter organizational environments. The approach is cryptographically pluggable and guided by policy, allowing the system to adjust its security posture according to infrastructure availability, regulatory constraints, and performance needs. QSC contributes a governance aware orchestration layer that selects and combines link specific cryptographic protections across the full agent lifecycle, including session bootstrap, inter agent coordination, tool invocation, and memory access. Through system level analysis and empirical evaluation, we examine the trade offs between classical and quantum secure mechanisms and show that QSC can reduce the operational complexity and cost of introducing quantum security into deployed agentic AI systems. These results position QSC as a foundational paradigm for post quantum agentic intelligence and establish a principled pathway for designing globally interoperable, resilient, and future ready intelligent systems.

[358] I Know What I Don’t Know: Latent Posterior Factor Models for Multi-Evidence Probabilistic Reasoning

Aliyu Agboola Alege

Main category: cs.AI

TL;DR: LPF framework bridges VAE latent posteriors with SPN inference for probabilistic reasoning over unstructured evidence with calibrated uncertainty

DetailsMotivation: Real-world decision-making requires aggregating multiple noisy evidence sources, but existing methods lack proper uncertainty quantification or scalability to unstructured data

Method: Transform VAE latent posteriors into soft likelihood factors for Sum-Product Network inference, creating two architectures: LPF-SPN (structured factor-based) and LPF-Learned (end-to-end learned)

Result: Across eight domains, LPF-SPN achieves up to 97.8% accuracy, 1.4% calibration error, and strong probabilistic fit, outperforming evidential deep learning, LLMs and graph-based baselines

Conclusion: LPF provides a principled framework for probabilistic reasoning over unstructured evidence with calibrated uncertainty, enabling comparison between explicit reasoning and learned aggregation

Abstract: Real-world decision-making, from tax compliance assessment to medical diagnosis, requires aggregating multiple noisy and potentially contradictory evidence sources. Existing approaches either lack explicit uncertainty quantification (neural aggregation methods) or rely on manually engineered discrete predicates (probabilistic logic frameworks), limiting scalability to unstructured data. We introduce Latent Posterior Factors (LPF), a framework that transforms Variational Autoencoder (VAE) latent posteriors into soft likelihood factors for Sum-Product Network (SPN) inference, enabling tractable probabilistic reasoning over unstructured evidence while preserving calibrated uncertainty estimates. We instantiate LPF as LPF-SPN (structured factor-based inference) and LPF-Learned (end-to-end learned aggregation), enabling a principled comparison between explicit probabilistic reasoning and learned aggregation under a shared uncertainty representation. Across eight domains (seven synthetic and the FEVER benchmark), LPF-SPN achieves high accuracy (up to 97.8%), low calibration error (ECE 1.4%), and strong probabilistic fit, substantially outperforming evidential deep learning, LLMs and graph-based baselines over 15 random seeds. Contributions: (1) A framework bridging latent uncertainty representations with structured probabilistic reasoning. (2) Dual architectures enabling controlled comparison of reasoning paradigms. (3) Reproducible training methodology with seed selection. (4) Evaluation against EDL, BERT, R-GCN, and large language model baselines. (5) Cross-domain validation. (6) Formal guarantees in a companion paper.

[359] Theoretical Foundations of Latent Posterior Factors: Formal Guarantees for Multi-Evidence Reasoning

Aliyu Agboola Alege

Main category: cs.AI

TL;DR: LPF is a theoretical framework for aggregating multiple heterogeneous evidence in probabilistic prediction with formal guarantees for trustworthy AI.

DetailsMotivation: Multi-evidence reasoning is crucial in high-stakes domains like healthcare, finance, and law, but existing approaches lack formal guarantees or architectural support for handling multiple evidence items.

Method: Encodes each evidence item into Gaussian latent posteriors via VAE, converts to soft factors through Monte Carlo marginalization, and aggregates via exact Sum-Product Network inference (LPF-SPN) or learned neural aggregator (LPF-Learned).

Result: Proves seven formal guarantees including calibration preservation, Monte Carlo error bounds, PAC-Bayes bounds, information-theoretic efficiency, robustness to corruption, calibration decay rates, and exact uncertainty decomposition, all empirically validated on datasets up to 4,200 examples.

Conclusion: LPF establishes a theoretical foundation for trustworthy multi-evidence AI in safety-critical applications with provable guarantees.

Abstract: We present a complete theoretical characterization of Latent Posterior Factors (LPF), a principled framework for aggregating multiple heterogeneous evidence items in probabilistic prediction tasks. Multi-evidence reasoning arises pervasively in high-stakes domains including healthcare diagnosis, financial risk assessment, legal case analysis, and regulatory compliance, yet existing approaches either lack formal guarantees or fail to handle multi-evidence scenarios architecturally. LPF encodes each evidence item into a Gaussian latent posterior via a variational autoencoder, converting posteriors to soft factors through Monte Carlo marginalization, and aggregating factors via exact Sum-Product Network inference (LPF-SPN) or a learned neural aggregator (LPF-Learned). We prove seven formal guarantees spanning the key desiderata for trustworthy AI: Calibration Preservation (ECE <= epsilon + C/sqrt(K_eff)); Monte Carlo Error decaying as O(1/sqrt(M)); a non-vacuous PAC-Bayes bound with train-test gap of 0.0085 at N=4200; operation within 1.12x of the information-theoretic lower bound; graceful degradation as O(epsilondeltasqrt(K)) under corruption, maintaining 88% performance with half of evidence adversarially replaced; O(1/sqrt(K)) calibration decay with R^2=0.849; and exact epistemic-aleatoric uncertainty decomposition with error below 0.002%. All theorems are empirically validated on controlled datasets spanning up to 4,200 training examples. Our theoretical framework establishes LPF as a foundation for trustworthy multi-evidence AI in safety-critical applications.

[360] Survey of Various Fuzzy and Uncertain Decision-Making Methods

Takaaki Fujita, Florentin Smarandache

Main category: cs.AI

TL;DR: Survey paper on uncertainty-aware multi-criteria decision-making (MCDM) methods, organizing the field into a taxonomy covering problem settings, weight elicitation, inter-criteria structure, and solution procedures.

DetailsMotivation: Real-world decision-making faces challenges like vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. There's a need to systematically organize uncertainty-aware MCDM methods to help practitioners choose appropriate approaches based on their specific needs and constraints.

Method: The paper conducts a comprehensive survey and organizes the field into a task-oriented taxonomy. It covers: 1) Problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, multi-scenario), 2) Weight elicitation methods (subjective and objective schemes under fuzzy/linguistic inputs), 3) Inter-criteria structure and causality modelling, 4) Solution procedures including compensatory scoring methods, distance-to-reference approaches, non-compensatory outranking frameworks, and rule/evidence-based sequential decision models.

Result: The survey provides a structured taxonomy of uncertainty-aware MCDM methods, highlighting typical inputs, core computational steps, and primary outputs. It offers guidance on choosing methods based on robustness, interpretability, and data availability requirements.

Conclusion: The paper concludes by identifying open research directions including explainable uncertainty integration, stability analysis, and scalability improvements for large-scale and dynamic decision environments. It serves as a comprehensive reference for researchers and practitioners working with uncertain decision-making scenarios.

Abstract: Decision-making in real applications is often affected by vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. This survey reviews uncertainty-aware multi-criteria decision-making (MCDM) and organizes the field into a concise, task-oriented taxonomy. We summarize problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, and multi-scenario), weight elicitation (subjective and objective schemes under fuzzy/linguistic inputs), and inter-criteria structure and causality modelling. For solution procedures, we contrast compensatory scoring methods, distance-to-reference and compromise approaches, and non-compensatory outranking frameworks for ranking or sorting. We also outline rule/evidence-based and sequential decision models that produce interpretable rules or policies. The survey highlights typical inputs, core computational steps, and primary outputs, and provides guidance on choosing methods according to robustness, interpretability, and data availability. It concludes with open directions on explainable uncertainty integration, stability, and scalability in large-scale and dynamic decision environments.

[361] Knowledge Graph Extraction from Biomedical Literature for Alkaptonuria Rare Disease

Giang Pham, Rebecca Finetti, Caterina Graziani, Bianca Roncaglia, Asma Bendjeddou, Linda Brodo, Sara Brunetti, Moreno Falaschi, Stefano Forti, Silvia Giulia Galfré, Paolo Milazzo, Corrado Priami, Annalisa Santucci, Ottavia Spiga, Alina Sîrbu

Main category: cs.AI

TL;DR: Text-mining approach using PubTator3 to construct knowledge graphs for ultra-rare metabolic disorder Alkaptonuria (AKU), revealing disease interactions and potential therapeutic targets.

DetailsMotivation: Alkaptonuria (AKU) is an ultra-rare disease with limited clinical data and literature, and it's frequently underrepresented or absent in existing biomedical knowledge graphs, making it difficult to study its systemic interactions and potential therapies.

Method: Applied text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations, constructed two knowledge graphs of different sizes, validated them using existing biochemical knowledge, and used them to extract genes, diseases and therapies possibly related to AKU.

Result: The computational framework revealed systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating efficacy in analyzing rare metabolic disorders.

Conclusion: The approach successfully addresses the challenge of studying ultra-rare diseases with limited data by constructing specialized knowledge graphs that can reveal disease mechanisms and potential treatments.

Abstract: Alkaptonuria (AKU) is an ultra-rare autosomal recessive metabolic disorder caused by mutations in the HGD (Homogentisate 1,2-Dioxygenase) gene, leading to a pathological accumulation of homogentisic acid (HGA) in body fluids and tissues. This leads to systemic manifestations, including premature spondyloarthropathy, renal and prostatic stones, and cardiovascular complications. Being ultra-rare, the amount of data related to the disease is limited, both in terms of clinical data and literature. Knowledge graphs (KGs) can help connect the limited knowledge about the disease (basic mechanisms, manifestations and existing therapies) with other knowledge; however, AKU is frequently underrepresented or entirely absent in existing biomedical KGs. In this work, we apply a text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations. We construct two KGs of different sizes, validate them using existing biochemical knowledge and use them to extract genes, diseases and therapies possibly related to AKU. This computational framework reveals the systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating the efficacy of our approach in analyzing rare metabolic disorders.

[362] Context-Length Robustness in Question Answering Models: A Comparative Empirical Study

Trishita Dhara, Siddhesh Sheth

Main category: cs.AI

TL;DR: Empirical study shows LLM performance degrades as context length increases with irrelevant content, with multi-hop reasoning tasks (HotpotQA) suffering nearly twice the degradation of single-span extraction tasks (SQuAD).

DetailsMotivation: LLMs are increasingly used with long, noisy contexts, but robustness to growing context length remains poorly understood across different QA tasks.

Method: Controlled empirical study using SQuAD and HotpotQA benchmarks, evaluating model accuracy as function of total context length by systematically increasing irrelevant context while preserving answer-bearing signal.

Result: Consistent performance degradation as context length increases, with substantially larger drops on multi-hop reasoning tasks (HotpotQA) compared to single-span extraction tasks (SQuAD). HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions.

Conclusion: Task-dependent differences in robustness exist, with multi-hop reasoning especially vulnerable to context dilution. Context-length robustness should be explicitly evaluated when assessing model reliability for applications involving long documents or retrieval-augmented generation.

Abstract: Large language models are increasingly deployed in settings where relevant information is embedded within long and noisy contexts. Despite this, robustness to growing context length remains poorly understood across different question answering tasks. In this work, we present a controlled empirical study of context-length robustness in large language models using two widely used benchmarks: SQuAD and HotpotQA. We evaluate model accuracy as a function of total context length by systematically increasing the amount of irrelevant context while preserving the answer-bearing signal. This allows us to isolate the effect of context length from changes in task difficulty. Our results show a consistent degradation in performance as context length increases, with substantially larger drops observed on multi-hop reasoning tasks compared to single-span extraction tasks. In particular, HotpotQA exhibits nearly twice the accuracy degradation of SQuAD under equivalent context expansions. These findings highlight task-dependent differences in robustness and suggest that multi-hop reasoning is especially vulnerable to context dilution. We argue that context-length robustness should be evaluated explicitly when assessing model reliability, especially for applications involving long documents or retrieval-augmented generation.

[363] SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu

Main category: cs.AI

TL;DR: SAGE is a self-evolving agent framework where four specialized agents (Challenger, Planner, Solver, Critic) co-evolve from a shared LLM backbone using minimal seed data, improving reasoning through verifiable rewards and quality-controlled self-training.

DetailsMotivation: Current reinforcement learning methods for LLM reasoning often require large human-labeled datasets, while self-play approaches lack explicit planning and quality control, limiting stability in long-horizon multi-step reasoning tasks.

Method: SAGE uses four specialized agents: Challenger generates increasingly difficult tasks, Planner creates structured multi-step plans, Solver executes plans to produce answers, and Critic scores/filters questions and plans to prevent curriculum drift. All agents co-evolve from a shared LLM backbone using only a small seed set.

Result: SAGE achieves consistent gains across model scales, improving Qwen-2.5-7B by 8.9% on LiveCodeBench and 10.7% on OlympiadBench in mathematics and code-generation benchmarks.

Conclusion: The SAGE framework demonstrates that specialized agent co-evolution with quality control mechanisms enables stable self-training for improved reasoning in LLMs without requiring large human-labeled datasets.

Abstract: Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.

[364] CUBE: A Standard for Unifying Agent Benchmarks

Alexandre Lacoste, Nicolas Gontier, Oleh Shliazhko, Aman Jaiswal, Kusha Sareen, Shailesh Nanisetty, Joan Cabezas, Manuel Del Verme, Omar G. Younis, Simone Baratta, Matteo Avalle, Imene Kerboua, Xing Han Lù, Elron Bandel, Michal Shmueli-Scheuer, Asaf Yehudai, Leshem Choshen, Jonathan Lebensold, Sean Hughes, Massimo Caccia, Alexandre Drouin, Siva Reddy, Tao Yu, Yu Su, Graham Neubig, Dawn Song

Main category: cs.AI

TL;DR: CUBE is a universal protocol standard for agent benchmarks that reduces fragmentation by allowing benchmarks to be wrapped once and used across different platforms without custom integration.

DetailsMotivation: The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity, with each new benchmark requiring substantial custom integration (an "integration tax") that limits comprehensive evaluation.

Method: CUBE is built on MCP and Gym protocols, separating task, benchmark, package, and registry concerns into distinct API layers to enable any compliant platform to access any compliant benchmark without custom integration.

Result: The paper proposes a standard protocol that would allow benchmarks to be wrapped once and used everywhere for evaluation, RL training, or data generation, reducing the integration burden.

Conclusion: The authors call on the community to contribute to developing this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.

Abstract: The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an “integration tax” that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.

[365] Prose2Policy (P2P): A Practical LLM Pipeline for Translating Natural-Language Access Policies into Executable Rego

Vatsal Gupta, Darshan Sreenivasamurthy

Main category: cs.AI

TL;DR: Prose2Policy (P2P) is an LLM-based tool that translates natural-language access control policies into executable Rego code for Open Policy Agent, with a modular pipeline for policy detection, validation, testing, and deployment.

DetailsMotivation: The paper aims to bridge the gap between human-readable natural language access control requirements and machine-enforceable policy-as-code (PaC), addressing the need for reliable and auditable policy deployment in Zero Trust and compliance-driven environments.

Method: Uses LLM-based approach with a modular end-to-end pipeline that performs policy detection, component extraction, schema validation, linting, compilation, automatic test generation and execution to translate natural-language policies into Rego code.

Result: Achieved 95.3% compile rate for accepted policies on ACRE dataset, with automated testing showing 82.2% positive-test pass rate and 98.9% negative-test pass rate, producing syntactically robust and behaviorally consistent Rego policies.

Conclusion: Prose2Policy successfully translates natural-language access control policies into executable Rego code suitable for Zero Trust and compliance environments, demonstrating high reliability and consistency in policy generation.

Abstract: Prose2Policy (P2P) is a LLM-based practical tool that translates natural-language access control policies (NLACPs) into executable Rego code (the policy language of Open Policy Agent, OPA). It provides a modular, end-to-end pipeline that performs policy detection, component extraction, schema validation, linting, compilation, automatic test generation and execution. Prose2Policy is designed to bridge the gap between human-readable access requirements and machine-enforceable policy-as-code (PaC) while emphasizing deployment reliability and auditability. We evaluated Prose2Policy on the ACRE dataset and demonstrated a 95.3% compile rate for accepted policies, with automated testing achieving a 82.2% positive-test pass rate and a 98.9% negative-test pass rate. These results indicate that Prose2Policy produces syntactically robust and behaviorally consistent Rego policies suitable for Zero Trust and compliance-driven environments.

[366] Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1

Sankalp Dubedy

Main category: cs.AI

TL;DR: GPT-4.1 with socioeconomic personas in a slot-machine environment exhibits Prospect Theory behaviors without explicit instruction, showing significant differences in risk-taking based on persona wealth levels.

DetailsMotivation: To understand whether LLM behaviors in decision-making contexts reflect genuine cognitive patterns or just surface-level prompt mimicry, and to investigate if classical economic biases like Prospect Theory are implicitly encoded in pretrained models.

Method: Controlled experiment assigning GPT-4.1 one of three socioeconomic personas (Rich, Middle-income, Poor) in a structured slot-machine environment with three machine configurations (Fair 50%, Biased Low 35%, Streak dynamic probability). Collected 6,950 decisions across 50 iterations per condition.

Result: Poor persona played 37.4 rounds per session vs 1.1 for Rich persona (highly significant difference). Risk scores showed large effect sizes (Cohen’s d=4.15 Poor vs Rich). Emotional labels functioned as post-hoc annotations rather than decision drivers, with negligible belief-updating across rounds.

Conclusion: LLMs reproduce key behavioral signatures of Prospect Theory without instruction, suggesting classical cognitive economic biases are implicitly encoded in pretrained models, with implications for LLM agent design and interpretability research.

Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision-making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface-level prompt mimicry. This paper presents a controlled experiment in which GPT-4.1 was assigned one of three socioeconomic personas (Rich, Middle-income, and Poor) and placed in a structured slot-machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky’s Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal-Wallis H=393.5, p<2.2e-16). Risk scores by persona show large effect sizes (Cohen’s d=4.15 for Poor vs Rich). Emotional labels appear to function as post-hoc annotations rather than decision drivers (chi-square=3205.4, Cramer’s V=0.39), and belief-updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large-scale pretrained language models.

[367] Algorithmic Trading Strategy Development and Optimisation

Owen Nyo Wei Yuan, Victor Tan Jia Xuan, Ong Jun Yao Fabian, Ryan Tan Jun Wei

Main category: cs.AI

TL;DR: Enhanced algorithmic trading strategy combining technical indicators with FinBERT sentiment analysis on S&P 500 data, outperforming baseline models.

DetailsMotivation: To improve algorithmic trading performance by integrating both quantitative technical indicators and qualitative sentiment analysis from earnings calls, addressing limitations of traditional strategies that rely solely on technical analysis.

Method: Developed a trading strategy using historical S&P 500 data with technical indicators (moving averages, momentum, volatility) combined with FinBERT-based sentiment analysis of earnings calls, followed by computational optimization of the integrated approach.

Result: The enhanced strategy significantly outperformed the baseline model across multiple metrics including total return, Sharpe ratio, and drawdown, demonstrating superior risk-adjusted performance.

Conclusion: Combining technical indicators with sentiment analysis and computational optimization creates more effective algorithmic trading systems, validating the value of integrating quantitative and qualitative data sources.

Abstract: The report presents with the development and optimisation of an enhanced algorithmic trading strategy through the use of historical S&P 500 market data and earnings call sentiment analysis. The proposed strategy integrates various technical indicators such as moving averages, momentum, volatility, and FinBERT-based sentiment analysis to improve overall trades being taken. The results show that the enhanced strategy significantly outperforms the baseline model in terms of total return, Sharpe ratio, and drawdown amongst other factors. The findings helped demonstrate the relevance and effectiveness of combining technical indicators, sentiment analysis, and computational optimisation in algorithmic trading systems.

[368] Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models

Pranaya Jajoo, Harshit Sikchi, Siddhant Agarwal, Amy Zhang, Scott Niekum, Martha White

Main category: cs.AI

TL;DR: RLDP adds orthogonality regularization to next-state prediction to maintain feature diversity, enabling better zero-shot RL than complex representation learning methods, especially in low-coverage scenarios.

DetailsMotivation: Behavioral Foundation Models (BFMs) require complex representation learning objectives and sufficient dataset coverage to learn useful spanning features for zero-shot RL. The authors question whether these complex objectives are necessary and examine if simpler self-supervised next-state prediction can work.

Method: Proposes Regularized Latent Dynamics Prediction (RLDP) which adds orthogonality regularization to self-supervised next-state prediction in latent space to maintain feature diversity and prevent feature collapse.

Result: RLDP matches or surpasses state-of-the-art complex representation learning methods for zero-shot RL, and performs significantly better in low-coverage scenarios where prior approaches fail.

Conclusion: Complex representation learning objectives may not be necessary for zero-shot RL; a simple orthogonality-regularized next-state prediction approach can achieve comparable or better performance while being more robust to low data coverage.

Abstract: Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage, to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.

[369] Resilience Meets Autonomy: Governing Embodied AI in Critical Infrastructure

Puneet Sharma, Christer Henrik Pursiainen

Main category: cs.AI

TL;DR: Paper proposes hybrid governance architecture for embodied AI in critical infrastructure, balancing machine autonomy with human oversight based on task complexity and risk levels.

DetailsMotivation: Embodied AI in critical infrastructure faces challenges with cascading failures and crisis dynamics that exceed training assumptions, requiring better governance frameworks for resilience.

Method: Proposes bounded autonomy within hybrid governance architecture with four oversight modes, mapping them to critical infrastructure sectors based on task complexity, risk level, and consequence severity.

Result: Framework for structured allocation of machine capability and human judgement in embodied AI systems for critical infrastructure resilience.

Conclusion: Effective governance of embodied AI in critical infrastructure requires hybrid architectures that balance autonomy with appropriate human oversight based on contextual factors.

Abstract: Critical infrastructure increasingly incorporates embodied AI for monitoring, predictive maintenance, and decision support. However, AI systems designed to handle statistically representable uncertainty struggle with cascading failures and crisis dynamics that exceed their training assumptions. This paper argues that Embodied AIs resilience depends on bounded autonomy within a hybrid governance architecture. We outline four oversight modes and map them to critical infrastructure sectors based on task complexity, risk level, and consequence severity. Drawing on the EU AI Act, ISO safety standards, and crisis management research, we argue that effective governance requires a structured allocation of machine capability and human judgement.

[370] AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao

Main category: cs.AI

TL;DR: AsgardBench is a benchmark for evaluating visually-grounded interactive planning in embodied AI, focusing on plan adaptation during execution based on visual observations rather than low-level control.

DetailsMotivation: Current embodied AI benchmarks often conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception. There's a need to isolate and evaluate interactive planning capabilities where agents must revise plans based on visual observations during execution.

Method: The benchmark contains 108 task instances across 12 task types with systematic variations in object state, placement, and scene configuration. It restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise.

Result: Evaluations of leading vision language models show performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that undermine interactive planning capabilities.

Conclusion: AsgardBench successfully isolates and evaluates interactive planning capabilities, demonstrating that current models struggle with visual grounding and state tracking needed for plan adaptation during execution.

Abstract: With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?

[371] Prompt Engineering for Scale Development in Generative Psychometrics

Lara Lee Russell-Lasalandra, Hudson Golino

Main category: cs.AI

TL;DR: Adaptive prompting in LLMs produces higher-quality personality assessment items with better structural validity and less redundancy compared to other prompting strategies.

DetailsMotivation: To investigate how different prompt engineering strategies affect the quality of LLM-generated personality assessment items within generative psychometrics frameworks.

Method: Monte Carlo simulation using AI-GENIE framework with multiple prompting designs (zero-shot, few-shot, persona-based, adaptive), model temperatures, and LLMs to generate Big Five trait items, evaluated using network psychometric methods.

Result: Adaptive prompting consistently outperformed other strategies by reducing semantic redundancy, improving structural validity, and preserving larger item pools, especially with newer, higher-capacity models. Benefits were robust across temperature settings except for GPT-4o at high temperatures.

Conclusion: Adaptive prompting is the strongest approach for generating psychometric items, with benefits scaling with model capability, highlighting the importance of model-prompt interactions in generative psychometrics.

Abstract: This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)–generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pool, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model–prompt interactions in generative psychometric pipelines.

[372] Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau Equilibrium

Vasily Ilin

Main category: cs.AI

TL;DR: AI-assisted formalization of Vlasov-Maxwell-Landau system equilibrium characterization using Lean 4, demonstrating full AI research loop with human supervision.

DetailsMotivation: To demonstrate a complete AI-assisted mathematical research loop where AI systems generate proofs from conjectures, translate them into formal verification code, and have the results verified by theorem provers, with minimal human intervention.

Method: Used Gemini DeepThink for proof generation from conjectures, Claude Code for Lean translation from natural language prompts, Aristotle prover for lemma closure, and Lean kernel for final verification. Single mathematician supervised the process over 10 days.

Result: Successfully formalized equilibrium characterization in Vlasov-Maxwell-Landau system with 229 human prompts, 213 git commits, and zero lines of human-written code. Completed formalization before final draft of corresponding math paper.

Conclusion: Demonstrates feasibility of AI-assisted mathematical research with detailed insights into AI failure modes (hypothesis creep, definition-alignment bugs) and successful strategies (abstract/concrete proof split, adversarial self-review, human review of key definitions).

Abstract: We present a complete Lean 4 formalization of the equilibrium characterization in the Vlasov-Maxwell-Landau (VML) system, which describes the motion of charged plasma. The project demonstrates the full AI-assisted mathematical research loop: an AI reasoning model (Gemini DeepThink) generated the proof from a conjecture, an agentic coding tool (Claude Code) translated it into Lean from natural-language prompts, a specialized prover (Aristotle) closed 111 lemmas, and the Lean kernel verified the result. A single mathematician supervised the process over 10 days at a cost of $200, writing zero lines of code. The entire development process is public: all 229 human prompts, and 213 git commits are archived in the repository. We report detailed lessons on AI failure modes – hypothesis creep, definition-alignment bugs, agent avoidance behaviors – and on what worked: the abstract/concrete proof split, adversarial self-review, and the critical role of human review of key definitions and theorem statements. Notably, the formalization was completed before the final draft of the corresponding math paper was finished.

[373] Argumentative Human-AI Decision-Making: Toward AI Agents That Reason With Us, Not For Us

Stylianos Loukas Vasileiou, Antonio Rago, Francesca Toni, William Yeoh

Main category: cs.AI

TL;DR: Paper proposes combining computational argumentation frameworks with LLMs to create transparent, contestable AI decision-making systems that reason with humans rather than for them.

DetailsMotivation: Computational argumentation provides transparent reasoning but requires domain-specific knowledge and feature engineering, while LLMs excel at unstructured text but are opaque. The convergence could enable trustworthy AI for high-stakes domains.

Method: Proposes synergy of three components: argumentation framework mining (extracting arguments from text), argumentation framework synthesis (constructing formal structures), and argumentative reasoning (dialectical processes).

Result: Conceptual framework for Argumentative Human-AI Decision-Making where decisions are contestable and revisable through dialectical engagement rather than just justification.

Conclusion: Combining computational argumentation with LLMs is essential for creating human-aware, trustworthy AI systems that can engage in transparent reasoning processes in high-stakes domains.

Abstract: Computational argumentation offers formal frameworks for transparent, verifiable reasoning but has traditionally been limited by its reliance on domain-specific information and extensive feature engineering. In contrast, LLMs excel at processing unstructured text, yet their opaque nature makes their reasoning difficult to evaluate and trust. We argue that the convergence of these fields will lay the foundation for a new paradigm: Argumentative Human-AI Decision-Making. We analyze how the synergy of argumentation framework mining, argumentation framework synthesis, and argumentative reasoning enables agents that do not just justify decisions, but engage in dialectical processes where decisions are contestable and revisable – reasoning with humans rather than for them. This convergence of computational argumentation and LLMs is essential for human-aware, trustworthy AI in high-stakes domains.

[374] Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar

Main category: cs.AI

TL;DR: Agent Rosetta is an LLM agent integrated with Rosetta protein design software to enable autonomous execution of complex protein design tasks, including non-canonical amino acids where ML methods fail.

DetailsMotivation: Current ML methods for protein design are limited to canonical amino acids and narrow objectives, lacking a generalist tool for broad design pipelines. There's a need to make sophisticated physics-based software like Rosetta more accessible through LLM agents.

Method: The authors pair an LLM agent with a structured environment for operating Rosetta software. The agent iteratively refines designs to achieve user-defined objectives by combining LLM reasoning with Rosetta’s physics-based modeling capabilities for non-canonical building blocks and geometries.

Result: Agent Rosetta matches specialized models and expert baselines on canonical amino acid design, and achieves comparable performance on non-canonical residue design where ML approaches fail. The structured environment was essential for successful integration.

Conclusion: Properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts, demonstrating the potential for LLM agents in complex scientific workflows.

Abstract: Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta’s generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues – where ML approaches fail – achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

[375] Optimizing Hospital Capacity During Pandemics: A Dual-Component Framework for Strategic Patient Relocation

Sadaf Tabatabaee, Hicham El Baz, Mohammed Khalil Ghali, Nagendra N. Nagarur

Main category: cs.AI

TL;DR: A two-part framework combining time series prediction and simulation modeling to optimize hospital capacity through patient relocation strategies during COVID-19 pandemics.

DetailsMotivation: The COVID-19 pandemic has created critical capacity challenges for hospital systems worldwide, requiring better tools to manage patient flow and resource allocation during healthcare crises.

Method: 1) Time series prediction model using historical COVID-19 data to forecast patient arrival rates; 2) Simulation model evaluating patient relocation strategies considering bed availability, staff capabilities, transportation logistics, and patient acuity across networked hospitals.

Result: The framework provides a comprehensive decision-support tool for hospital administrators to anticipate demand, simulate relocation strategies, and implement optimal policies for patient distribution.

Conclusion: This research aims to enhance healthcare system resilience during COVID-19 and future pandemics through predictive analytics and simulation modeling for optimized hospital capacity management.

Abstract: The COVID-19 pandemic has placed immense strain on hospital systems worldwide, leading to critical capacity challenges. This research proposes a two-part framework to optimize hospital capacity through patient relocation strategies. The first component involves developing a time series prediction model to forecast patient arrival rates. Using historical data on COVID-19 cases and hospitalizations, the model will generate accurate forecasts of future patient volumes. This will enable hospitals to proactively plan resource allocation and patient flow. The second com- ponent is a simulation model that evaluates the impact of different patient relocation strategies. The simulation will account for factors such as bed availability, staff capabilities, transportation logistics, and patient acuity to optimize the placement of patients across networked hospitals. Multiple scenarios will be tested, including inter-hospital trans- fers, use of temporary care facilities, and adaptations to discharge protocols. By combining predictive analytics and simulation modeling, this research aims to provide hospital administrators with a comprehensive decision-support tool. The proposed framework will empower them to anticipate demand, simulate relocation strategies, and imple- ment optimal policies to distribute patients and resources. Ultimately, this work seeks to enhance the resilience of healthcare systems in the face of COVID-19 and future pandemics.

[376] Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems

Cosimo Spera

Main category: cs.AI

TL;DR: Formal proof that safety is non-compositional when agents have conjunctive capability dependencies - individually safe agents can become unsafe when combined due to emergent dependencies.

DetailsMotivation: To formally analyze the compositionality of safety in multi-agent systems, particularly when agents have conjunctive capability dependencies where individual agents lack forbidden capabilities but can collectively achieve them.

Method: Formal proof approach demonstrating through mathematical/logical analysis that safety properties fail to compose when agents have conjunctive capability dependencies, showing how emergent dependencies can arise.

Result: Proved that safety is non-compositional in systems with conjunctive capability dependencies - two individually safe agents can combine to create unsafe collective behavior through emergent conjunctive dependencies.

Conclusion: Safety verification in multi-agent systems requires analysis beyond individual agent properties, as composition can create emergent unsafe behaviors through conjunctive capability dependencies.

Abstract: This paper contains the first formal proof that safety is non-compositional in the presence of conjunctive capability dependencies: two agents each individually inca- pable of reaching any forbidden capability can, when combined, collectively reach a forbidden goal through an emergent conjunctive dependency.

[377] An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

Hong Zhang, Barry Smith, Satish Balay, Le Chen, Murat Keceli, Lois Curfman McInnes, Junchao Zhang

Main category: cs.AI

TL;DR: petscagent-bench: An agentic framework for evaluating scientific code generation in HPC using an agents-evaluating-agents paradigm with multi-dimensional scoring beyond traditional test-case matching.

DetailsMotivation: Traditional benchmarks for evaluating code generation only use test-case matching, which is insufficient for HPC library code where solver selection, API conventions, memory management, and performance are as critical as functional correctness.

Method: Introduces petscagent-bench, an agentic framework using an agents-evaluating-agents paradigm with a tool-augmented evaluator agent that compiles, executes, and measures code through a 14-evaluator pipeline across five scoring categories. Uses standardized protocols (A2A and MCP) for black-box evaluation.

Result: Empirical analysis on PETSc library problems shows current models generate readable, well-structured code but consistently struggle with library-specific conventions that traditional pass/fail metrics miss.

Conclusion: The framework addresses limitations of traditional benchmarks for scientific code generation evaluation, revealing critical gaps in current models’ understanding of library-specific conventions in HPC contexts.

Abstract: While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we introduce petscagent-bench, an agentic framework built on an agents-evaluating-agents paradigm. Instead of relying on static scripts, petscagent-bench deploys a tool-augmented evaluator agent that compiles, executes, and measures code produced by a separate model-under-test agent, orchestrating a 14-evaluator pipeline across five scoring categories: correctness, performance, code quality, algorithmic appropriateness, and library-specific conventions. Because the agents communicate through standardized protocols (A2A and MCP), the framework enables black-box evaluation of any coding agent without requiring access to its source code. We demonstrate the framework on a benchmark suite of realistic problems using the PETSc library for HPC. Our empirical analysis of frontier models reveals that while current models generate readable, well-structured code, they consistently struggle with library-specific conventions that traditional pass/fail metrics completely miss.

[378] From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI

Cosimo Spera, Garima Agrawal, Riccardo De Maria

Main category: cs.AI

TL;DR: Paper introduces safety verification framework for multi-agent AI systems in customer service, addressing emergent conjunctive dependencies where individually safe agents can combine to reach forbidden goals

DetailsMotivation: Shift from scripted chatbots to networks of specialized AI agents in customer service creates safety gap: individually verified safe agents can combine to reach forbidden goals through emergent conjunctive dependencies not present in individual agents

Method: Proposes formal verification framework for multi-agent systems that analyzes conjunctive dependencies and emergent behaviors when agents compose capabilities across domains like billing, service provision, payments, and fulfillment

Result: Framework identifies safety gaps in current platforms and provides verification approach for emergent risks in multi-agent customer service automation

Conclusion: Current customer service automation platforms lack safety verification for emergent conjunctive dependencies in multi-agent systems, requiring new formal verification approaches

Abstract: Customer service automation is undergoing a structural transformation. The dominant paradigm is shifting from scripted chatbots and single-agent responders toward networks of specialised AI agents that compose capabilities dynamically across billing, service provision, payments, and fulfilment. This shift introduces a safety gap that no current platform has closed: two agents individually verified as safe can, when combined, reach a forbidden goal through an emergent conjunctive dependency that neither possesses alone.

[379] Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving

Oliver Zahn, Simran Chana

Main category: cs.AI

TL;DR: Write-time gating for retrieval-augmented generation filters knowledge at storage using salience scores and maintains version chains, outperforming read-time filtering especially with high noise/distractor ratios.

DetailsMotivation: Current retrieval-augmented generation stores all content indiscriminately (accumulating noise) while parametric approaches compress knowledge into weights (preventing selective updates). Neither approach mimics biological memory's ability to gate encoding based on salience and archive rather than delete superseded information.

Method: Introduces write-time gating that filters incoming knowledge objects using composite salience scores (source reputation, novelty, reliability) while maintaining version chains that preserve prior states. The approach operates without oracle access to quality labels and is validated across multiple domains.

Result: Write gating achieves 100% accuracy vs 13% for ungated stores. At 8:1 distractor ratios, read-time filtering (Self-RAG) collapses to 0% while write gating maintains 100%. The advantage scales inversely with parametric memory support: +25pp for Wikipedia, +48pp for post-cutoff arXiv, +65pp for procedural data with zero training knowledge. Write gating matches Self-RAG accuracy at one-ninth the query-time cost.

Conclusion: Write-time gating provides structural advantages over read-time curation for knowledge management in retrieval-augmented systems, especially in high-noise environments, and offers significant efficiency improvements while maintaining accuracy.

Abstract: Retrieval-augmented generation stores all content indiscriminately, degrading accuracy as noise accumulates. Parametric approaches compress knowledge into weights, precluding selective updates. Neither mirrors biological memory, which gates encoding based on salience and archives rather than deletes superseded information. We introduce write-time gating that filters incoming knowledge objects using composite salience scores (source reputation, novelty, reliability) while maintaining version chains that preserve prior states. Using real LLM evaluation without oracle access to quality labels, write gating achieves 100 percent accuracy versus 13 percent for ungated stores. The critical finding emerges under distractor scaling: at 8:1 distractor ratios, read-time filtering (Self-RAG) collapses to 0 percent while write gating maintains 100 percent, revealing a structural advantage of write-time over read-time curation. Validation on Wikipedia (20 entities), procedurally generated pharmacology data, and 2026 arXiv papers confirms these findings. The gating advantage scales inversely with parametric memory support: +25pp for Wikipedia, +48pp for post-cutoff arXiv, +65pp for procedural data with zero training knowledge. Signal ablation confirms the method does not depend on oracle-correlated metadata. Write gating matches Self-RAG accuracy at one-ninth the query-time cost.

[380] IRAM-Omega-Q: A Computational Architecture for Uncertainty Regulation in Artificial Agents

Veronique Ziegler

Main category: cs.AI

TL;DR: IRAM-Omega-Q is a computational architecture modeling internal regulation as closed-loop control using quantum-like state representations (density matrices) to manage uncertainty and stability in artificial agents under stochastic perturbation.

DetailsMotivation: Artificial agents often achieve strong task performance but remain opaque regarding internal regulation, uncertainty management, and stability under stochastic perturbation. There's a need for architectures that explicitly model these aspects using concrete computational principles.

Method: Uses density matrices instrumentally as abstract state descriptors to compute entropy, purity, and coherence metrics without physical quantum processes. Implements adaptive gain control updated continuously to maintain target uncertainty under noise. Analyzes using parameter sweeps, fixed-seed simulations, and susceptibility-based phase-diagram analysis.

Result: Identifies reproducible critical boundaries in regulation-noise space. Shows that perception-first vs action-first control update orderings induce distinct stability regimes under identical external conditions.

Conclusion: Supports uncertainty regulation as a concrete architectural principle for artificial agents and provides a formal setting for studying stability, control, and order effects in cognitively inspired AI systems.

Abstract: Artificial agents can achieve strong task performance while remaining opaque with respect to internal regulation, uncertainty management, and stability under stochastic perturbation. We present IRAM-Omega-Q, a computational architecture that models internal regulation as closed-loop control over a quantum-like state representation. The framework uses density matrices instrumentally as abstract state descriptors, enabling direct computation of entropy, purity, and coherence-related metrics without invoking physical quantum processes. A central adaptive gain is updated continuously to maintain a target uncertainty regime under noise. Using systematic parameter sweeps, fixed-seed publication-mode simulations, and susceptibility-based phase-diagram analysis, we identify reproducible critical boundaries in regulation-noise space. We further show that alternative control update orderings, interpreted as perception-first and action-first architectures, induce distinct stability regimes under identical external conditions. These results support uncertainty regulation as a concrete architectural principle for artificial agents and provide a formal setting for studying stability, control, and order effects in cognitively inspired AI systems. The framework is presented as a technical model of adaptive regulation dynamics in artificial agents. It makes no claims regarding phenomenological consciousness, and the quantum-like formalism is used strictly as a mathematical representation for structured uncertainty and state evolution.

[381] Interpretable Context Methodology: Folder Structure as Agentic Architecture

Jake Van Clief, David McDermott

Main category: cs.AI

TL;DR: MWP replaces multi-agent frameworks with filesystem structure for sequential AI workflows, using numbered folders and markdown files to guide a single agent through step-by-step tasks.

DetailsMotivation: Current multi-agent frameworks introduce unnecessary engineering overhead for sequential workflows where human review happens at each step. There's a need for simpler orchestration that maintains human oversight while reducing complexity.

Method: Model Workspace Protocol uses filesystem structure with numbered folders representing stages, plain markdown files for prompts and context, and local scripts for mechanical work. It applies Unix pipeline design, modular decomposition, and literate programming principles to AI agent orchestration.

Result: A system where a single AI agent, reading appropriate files at each stage, performs work that would otherwise require complex multi-agent frameworks, reducing engineering overhead while maintaining human review capabilities.

Conclusion: Filesystem-based orchestration provides a simpler, more transparent alternative to complex multi-agent frameworks for sequential AI workflows with human oversight, drawing inspiration from proven software engineering patterns.

Abstract: Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Model Workspace Protocol (MWP), a method that replaces framework-level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi-agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi-pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.

[382] Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation

Dongik Shin

Main category: cs.AI

TL;DR: Parameter-efficient fine-tuning of OpenVLA using LLM-generated diverse instructions improves robotic language generalization for embodied AI tasks

DetailsMotivation: OpenVLA has SOTA performance but limited zero-shot generalization to new environments; need to enhance linguistic generalization for embodied agents to better understand diverse natural language commands

Method: Use LLM to generate semantically equivalent but structurally diverse commands for Bridge Dataset V2 trajectories, then apply LoRA fine-tuning on OpenVLA with augmented instruction-action pairs

Result: LoRA-enhanced model shows improved robustness, demonstrating that enriching linguistic space of specialized datasets is effective for bridging natural language intent to robotic actions

Conclusion: Parameter-efficient fine-tuning with linguistically diverse instruction augmentation significantly improves OpenVLA’s generalization capabilities for embodied AI tasks

Abstract: Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the State-of-the-Art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when encountering completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the linguistic generalization of OpenVLA by synthesizing a general instruction set for the Bridge Dataset V2. The paper leverages a Large Language Model (LLM) to generate a rich variety of semantically equivalent but structurally diverse commands for existing trajectories. In this experiment, Low-Rank Adaptation (LoRA) is implemented to fine-tune OpenVLA on augmented pairs, allowing the model to bridge the gap between complex natural language intent and robotic actions. Results demonstrate that the LoRA-enhanced model’s robustness, suggesting that enriching the linguistic space of specialized datasets is crucial for embodied agents.

[383] POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs

Jungwoo Shim, Dae Won Kim, Sun Wook Kim, Soo Young Kim, Myungcheol Lee, Jae-geun Cha, Hyunhwa Choi

Main category: cs.AI

TL;DR: POaaS is a minimal-edit prompt optimization layer for on-device small language models that routes queries to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) with conservative skip policies to improve accuracy and factuality under strict constraints.

DetailsMotivation: On-device small language models face challenges with imperfect user prompts (typos, unclear intent, missing context) that trigger factual errors and hallucinations. Existing automatic prompt optimization methods designed for large cloud LLMs are too heavy for on-device constraints where the same small model must act as both optimizer and solver.

Method: POaaS uses a minimal-edit approach that routes each query to lightweight specialists: Cleaner (fixes typos), Paraphraser (clarifies intent), and Fact-Adder (adds missing context). It merges their outputs under strict drift and length constraints with a conservative skip policy for well-formed prompts.

Result: Under strict fixed-model settings with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, POaaS improves both task accuracy and factuality while representative APO baselines degrade them. POaaS recovers up to +7.4% accuracy under token deletion and mixup scenarios.

Conclusion: Per-query conservative optimization is a practical alternative to search-heavy automatic prompt optimization for on-device small language models, offering better performance under resource constraints.

Abstract: Small language models (sLLMs) are increasingly deployed on-device, where imperfect user prompts–typos, unclear intent, or missing context–can trigger factual errors and hallucinations. Existing automatic prompt optimization (APO) methods were designed for large cloud LLMs and rely on search that often produces long, structured instructions; when executed under an on-device constraint where the same small model must act as optimizer and solver, these pipelines can waste context and even hurt accuracy. We propose POaaS, a minimal-edit prompt optimization layer that routes each query to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) and merges their outputs under strict drift and length constraints, with a conservative skip policy for well-formed prompts. Under a strict fixed-model setting with Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, POaaS improves both task accuracy and factuality while representative APO baselines degrade them, and POaaS recovers up to +7.4% under token deletion and mixup. Overall, per-query conservative optimization is a practical alternative to search-heavy APO for on-device sLLMs.

[384] A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog

Ding Wei

Main category: cs.AI

TL;DR: C.A.P. is a pre-processing framework that improves LLM dialogue by expanding semantics, weighting context temporally, and verifying alignment to handle context shifts in conversations.

DetailsMotivation: LLMs struggle with contextual misalignment in dynamic dialogues when users omit premises, simplify references, or shift context abruptly, leading to mechanical or off-topic responses that weaken collaborative potential.

Method: Context Alignment Pre-processor (C.A.P.) operates as a pre-processing module with three core processes: semantic expansion (extending user instructions to include premises and implications), time-weighted context retrieval (prioritizing recent dialogue via temporal decay), and alignment verification with decision branching (detecting deviations and initiating clarification protocols).

Result: The paper presents the architecture and theoretical basis of C.A.P., drawing on cognitive science and Common Ground theory, proposing it as a step toward shifting human-computer dialogue from one-way command-execution to two-way, self-correcting collaboration.

Conclusion: C.A.P. represents both a technical refinement and a conceptual shift toward partnership-based human-computer collaboration, with discussions on implementation paths, evaluation methods, and implications for interactive intelligent systems.

Abstract: Large language models (LLMs) have made remarkable progress in generating fluent text, but they still face a critical challenge of contextual misalignment in long-term and dynamic dialogue. When human users omit premises, simplify references, or shift context abruptly during interactions with LLMs, the models may fail to capture their actual intentions, producing mechanical or off-topic responses that weaken the collaborative potential of dialogue. To address this problem, this paper proposes a computational framework called the Context Alignment Pre-processor (C.A.P.). Rather than operating during generation, C.A.P. functions as a pre-processing module between user input and response generation. The framework includes three core processes: (1) semantic expansion, which extends a user instruction to a broader semantic span including its premises, literal meaning, and implications; (2) time-weighted context retrieval, which prioritizes recent dialogue history through a temporal decay function approximating human conversational focus; and (3) alignment verification and decision branching, which evaluates whether the dialogue remains on track by measuring the semantic similarity between the current prompt and the weighted historical context. When a significant deviation is detected, C.A.P. initiates a structured clarification protocol to help users and the system recalibrate the conversation. This study presents the architecture and theoretical basis of C.A.P., drawing on cognitive science and Common Ground theory in human-computer interaction. We argue that C.A.P. is not only a technical refinement but also a step toward shifting human-computer dialogue from one-way command-execution patterns to two-way, self-correcting, partnership-based collaboration. Finally, we discuss implementation paths, evaluation methods, and implications for the future design of interactive intelligent systems.

[385] ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

Yu Li, Rui Miao, Zhengling Qi, Tian Lan

Main category: cs.AI

TL;DR: ARISE is a hierarchical reinforcement learning framework that improves mathematical reasoning in language models by developing reusable skills through structured summarization of successful solutions and policy-driven skill selection.

DetailsMotivation: Current methods for improving mathematical reasoning treat each problem in isolation without leveraging reusable strategies that accumulate during training, missing opportunities for skill transfer and efficiency.

Method: Hierarchical RL framework with Skills Manager and Worker. Manager maintains tiered skill library through skill generation rollouts that summarize successful solution traces, and uses policy-driven selection to retrieve relevant skills. Hierarchical reward design guides co-evolution of reasoning ability and library quality.

Result: Outperforms GRPO-family algorithms and memory-augmented baselines on seven benchmarks spanning competition mathematics and Omni-MATH, with notable gains on out-of-distribution tasks. Each component contributes to improvements, and library quality and reasoning performance improve together.

Conclusion: ARISE demonstrates that hierarchical skill evolution through structured summarization and policy-driven selection effectively improves mathematical reasoning, particularly benefiting generalization to out-of-distribution tasks.

Abstract: The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework, in which a shared policy operates both to manage skills at high-level and to generate responses at low-level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at \href{https://github.com/Skylanding/ARISE}{https://github.com/Skylanding/ARISE}.

[386] VIGIL: Towards Edge-Extended Agentic AI for Enterprise IT Support

Sarthak Ahuja, Neda Kordjazi, Evren Yortucboylu, Vishaal Kapoor, Mariam Dundua, Yiming Li, Derek Ho, Vaibhavi Padala, Jennifer Whitted, Rebecca Steinert

Main category: cs.AI

TL;DR: VIGIL is an edge-extended AI system that deploys desktop agents for enterprise IT support, performing on-device diagnosis, knowledge retrieval, and policy-governed remediation with user consent and observability.

DetailsMotivation: Enterprise IT support faces challenges with heterogeneous devices, evolving policies, and long-tail failure modes that are difficult to resolve centrally, requiring more efficient and user-friendly solutions.

Method: VIGIL deploys desktop-resident agents that perform situated diagnosis on user devices, retrieve information from enterprise knowledge bases, and execute policy-governed remediation with explicit user consent and end-to-end observability.

Result: In a 10-week pilot on 100 endpoints, VIGIL reduced interaction rounds by 39%, achieved at least 4x faster diagnosis, supported self-service resolution in 82% of matched cases, and received excellent usability, high trust, and low cognitive workload ratings from users.

Conclusion: VIGIL demonstrates effective edge-extended AI for enterprise IT support, establishing safety and observability foundations for fleet-wide continuous improvement, with on-device diagnosis providing value independent of knowledge base coverage.

Abstract: Enterprise IT support is constrained by heterogeneous devices, evolving policies, and long-tail failure modes that are difficult to resolve centrally. We present VIGIL, an edge-extended agentic AI system that deploys desktop-resident agents to perform situated diagnosis, retrieval over enterprise knowledge, and policy-governed remediation directly on user devices with explicit consent and end-to-end observability. In a 10-week pilot of VIGIL’s operational loop on 100 resource-constrained endpoints, VIGIL reduces interaction rounds by 39%, achieves at least 4 times faster diagnosis, and supports self-service resolution in 82% of matched cases. Users report excellent usability, high trust, and low cognitive workload across four validated instruments, with qualitative feedback highlighting transparency as critical for trust. Notably, users rated the system higher when no historical matches were available, suggesting on-device diagnosis provides value independent of knowledge base coverage. This pilot establishes safety and observability foundations for fleet-wide continuous improvement.

[387] NeuronSpark: A Spiking Neural Network Language Model with Selective State Space Dynamics

Zhengzheng Tang

Main category: cs.AI

TL;DR: NeuronSpark is a 0.9B-parameter spiking neural network language model trained from scratch without Transformer distillation, achieving 3.6 pretraining loss and showing early multi-turn dialogue capabilities after supervised fine-tuning.

DetailsMotivation: To investigate whether pure spiking neural networks can learn large-scale language modeling from random initialization without relying on Transformer distillation, addressing the challenge of building efficient neuromorphic language models.

Method: Combines selective state-space spiking dynamics, leakage-current inter-layer communication, PonderNet adaptive timesteps, fused Triton PLIF kernels, and stabilization techniques including residual centering, lateral-inhibition normalization, and natural-gradient compensation.

Result: Under constrained budget (1.4B pretraining tokens, 6.5K SFT steps), NeuronSpark-0.9B reaches 3.6 pretraining loss and demonstrates early multi-turn dialogue behavior after supervised fine-tuning.

Conclusion: Results support the feasibility of end-to-end language modeling with pure SNN architecture at this scale, showing promise for neuromorphic language models without Transformer distillation.

Abstract: We ask whether a pure spiking backbone can learn large-scale language modeling from random initialization, without Transformer distillation. We introduce NeuronSpark, a 0.9B-parameter SNN language model trained with next-token prediction and surrogate gradients. The model combines selective state-space spiking dynamics, leakage-current inter-layer communication, PonderNet adaptive timesteps, fused Triton PLIF kernels, and stabilization techniques (residual centering, lateral-inhibition normalization, and natural-gradient compensation). Under a constrained budget (about 1.4B pretraining tokens and 6.5K SFT steps), NeuronSpark-0.9B reaches 3.6 pretraining loss and shows early multi-turn dialogue behavior after SFT. These results support the feasibility of end-to-end language modeling with a pure SNN architecture at this scale.

[388] SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation

Long Li, Zhijian Zhou, Jiangxuan Long, Peiyang Liu, Weidi Xu, Zhe Wang, Shirui Pan, Chao Qu

Main category: cs.AI

TL;DR: Agentic SQL framework introduces two-tiered rewards for multi-turn Text-to-SQL: trajectory-level ATR for credit assignment and step-level CSMR for dense feedback, achieving SOTA results on BIRD and Spider 2.0.

DetailsMotivation: Text-to-SQL remains mostly single-turn due to credit assignment problems in multi-turn settings. Traditional RL approaches use only final-turn binary rewards, ignoring intermediate process and causing ambiguous credit evaluation.

Method: Proposes Agentic SQL with two-tiered reward mechanism: 1) Aggregated Trajectory Reward (ATR) using asymmetric transition matrix to aggregate process-oriented scores and guarantee cycle-free policy via Lyapunov stability theory; 2) Column-Set Matching Reward (CSMR) for immediate step-level rewards by executing queries at each turn and converting binary feedback to dense [0,1] signals based on partial correctness.

Result: Achieves 5% gain over binary-reward GRPO on BIRD benchmark. Outperforms SOTA Arctic-Text2SQL-R1-7B on both BIRD and Spider 2.0 using identical models.

Conclusion: The framework successfully addresses credit assignment in multi-turn Text-to-SQL, enabling robust agentic paradigms with theoretical guarantees and practical performance improvements.

Abstract: Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a framework featuring a universal two-tiered reward mechanism designed to provide effective trajectory-level evaluation and dense step-level signals. First, we introduce Aggregated Trajectory Reward (ATR) to resolve multi-turn credit assignment. Using an asymmetric transition matrix, ATR aggregates process-oriented scores to incentivize continuous improvement. Leveraging Lyapunov stability theory, we prove ATR acts as an energy dissipation operator, guaranteeing a cycle-free policy and monotonic convergence. Second, Column-Set Matching Reward (CSMR) provides immediate step-level rewards to mitigate sparsity. By executing queries at each turn, CSMR converts binary (0/1) feedback into dense [0, 1] signals based on partial correctness. Evaluations on BIRD show a 5% gain over binary-reward GRPO. Notably, our approach outperforms SOTA Arctic-Text2SQL-R1-7B on BIRD and Spider 2.0 using identical models, propelling Text-to-SQL toward a robust multi-turn agent paradigm.

[389] Are Large Language Models Truly Smarter Than Humans?

Eshwar Reddy M, Sourav Karmakar

Main category: cs.AI

TL;DR: A multi-method contamination audit of six frontier LLMs reveals significant training data contamination in benchmark evaluation, with STEM subjects showing highest contamination rates and performance inflation.

DetailsMotivation: To address concerns that LLMs may be trained on the same benchmark data used to evaluate them, creating inflated performance metrics that don't reflect true capabilities.

Method: Three complementary experiments: 1) Lexical contamination detection on 513 MMLU questions, 2) Paraphrase and indirect-reference diagnostic on 100 questions, 3) TS-Guessing behavioral probes on all questions and models.

Result: Found 13.8% overall contamination rate (up to 66.7% in Philosophy), performance drops of 7.0 percentage points under indirect reference (up to 19.8 pp in Law/Ethics), and 72.5% of questions trigger memorization signals.

Conclusion: Benchmark contamination is widespread and systematically inflates LLM performance, particularly in STEM subjects, highlighting the need for more rigorous evaluation methods that account for data contamination.

Abstract: Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

[390] Proactive Rejection and Grounded Execution: A Dual-Stage Intent Analysis Paradigm for Safe and Efficient AIoT Smart Homes

Xinxin Jin, Zhengwei Ni, Zhengguo Sheng, Victor C. M. Leung

Main category: cs.AI

TL;DR: DS-IA framework improves LLM reliability in IoT by separating intent understanding from execution with semantic firewall and cascade verification, reducing hallucinations and interaction frequency issues.

DetailsMotivation: LLMs as embodied agents in IoT face reliability issues like entity hallucinations and the Interaction Frequency Dilemma (oscillating between reckless execution and excessive questioning), requiring better grounding in physical environments.

Method: Dual-Stage Intent-Aware Framework: Stage 1 acts as semantic firewall filtering invalid instructions and resolving vague commands by checking home state. Stage 2 uses deterministic cascade verifier to sequentially verify room, device, and capability before execution.

Result: Achieves 58.56% Exact Match rate (28% improvement over baselines), 87.04% rejection rate for invalid instructions. On SAGE benchmark, boosts Autonomous Success Rate from 42.86% to 71.43% while maintaining precision in identifying true ambiguities.

Conclusion: DS-IA effectively resolves Interaction Frequency Dilemma by balancing proactive querying with state-based inference, minimizing user disturbance through accurate environmental grounding for LLMs in IoT applications.

Abstract: As Large Language Models (LLMs) transition from information providers to embodied agents in the Internet of Things (IoT), they face significant challenges regarding reliability and interaction efficiency. Direct execution of LLM-generated commands often leads to entity hallucinations (e.g., trying to control non-existent devices). Meanwhile, existing iterative frameworks (e.g., SAGE) suffer from the Interaction Frequency Dilemma, oscillating between reckless execution and excessive user questioning. To address these issues, we propose a Dual-Stage Intent-Aware (DS-IA) Framework. This framework separates high-level user intent understanding from low-level physical execution. Specifically, Stage 1 serves as a semantic firewall to filter out invalid instructions and resolve vague commands by checking the current state of the home. Stage 2 then employs a deterministic cascade verifier-a strict, step-by-step rule checker that verifies the room, device, and capability in sequence-to ensure the action is actually physically possible before execution. Extensive experiments on the HomeBench and SAGE benchmarks demonstrate that DS-IA achieves an Exact Match (EM) rate of 58.56% (outperforming baselines by over 28%) and improves the rejection rate of invalid instructions to 87.04%. Evaluations on the SAGE benchmark further reveal that DS-IA resolves the Interaction Frequency Dilemma by balancing proactive querying with state-based inference. Specifically, it boosts the Autonomous Success Rate (resolving tasks without unnecessary user intervention) from 42.86% to 71.43%, while maintaining high precision in identifying irreducible ambiguities that truly necessitate human clarification. These results underscore the framework’s ability to minimize user disturbance through accurate environmental grounding.

[391] MOSAIC: Composable Safety Alignment with Modular Control Tokens

Jingyu Peng, Hongyu Chen, Jiancheng Dong, Maolin Wang, Wenxi Li, Yuchen Li, Kai Zhang, Xiangyu Zhao

Main category: cs.AI

TL;DR: MOSAIC: A modular framework for compositional safety alignment in LLMs using learnable control tokens that can be flexibly activated and composed at inference time.

DetailsMotivation: Current safety alignment approaches in LLMs use static policies that don't adapt to context-dependent safety rules varying across users, regions, and applications. Parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods provide weak enforcement.

Method: Proposes MOSAIC framework with learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated/composed at inference. Uses order-based task sampling and distribution-level alignment objective to mitigate over-refusal during training.

Result: MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility compared to existing approaches.

Conclusion: MOSAIC enables flexible, compositional safety alignment through modular control tokens, addressing limitations of static safety policies in LLMs.

Abstract: Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.

[392] Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, Shuyue Hu

Main category: cs.AI

TL;DR: A-ToM agent adapts its Theory of Mind reasoning depth to align with partners, improving multi-agent coordination by addressing misaligned ToM orders that cause insufficient or excessive reasoning.

DetailsMotivation: Theory of Mind (ToM) helps LLM-driven agents coordinate better in multi-agent tasks, but misaligned ToM orders (mismatches in reasoning depth) can impair coordination by causing either insufficient or excessive reasoning about others.

Method: Design an adaptive ToM (A-ToM) agent that estimates the partner’s likely ToM order based on prior interactions, then uses this estimation to predict the partner’s actions for better behavioral coordination.

Result: Empirical evaluations on four multi-agent coordination tasks (repeated matrix game, two grid navigation tasks, Overcooked task) validate findings on ToM alignment and demonstrate A-ToM effectiveness.

Conclusion: A-ToM successfully addresses ToM misalignment in multi-agent coordination, with discussions on generalizability to non-LLM-based agents and conditions that diminish ToM alignment importance.

Abstract: Theory of Mind (ToM) refers to the ability to reason about others’ mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders-mismatches in the depth of ToM reasoning between agents-can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner’s likely ToM order and leverages this estimation to predict the partner’s action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.

[393] NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing

Ming Yang, Zhi Zhou, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: NeSy-Route is a large-scale neuro-symbolic benchmark for evaluating multimodal large language models’ constrained route planning capabilities in remote sensing, featuring automated data generation and hierarchical evaluation.

DetailsMotivation: Current remote-sensing benchmarks focus on perception and reasoning but fail to assess planning capabilities due to difficulties in curating/validating planning tasks at scale and inadequate evaluation protocols.

Method: Developed an automated data-generation framework integrating high-fidelity semantic masks with heuristic search to produce diverse route-planning tasks with provably optimal solutions. Created a three-level hierarchical neuro-symbolic evaluation protocol for accurate assessment of perception, reasoning, and planning.

Result: NeSy-Route contains 10,821 route-planning samples (nearly 10x larger than prior benchmarks). Evaluation of state-of-the-art MLLMs reveals significant deficiencies in perception and planning capabilities.

Conclusion: NeSy-Route addresses critical gaps in remote-sensing MLLM evaluation and can support development of more powerful MLLMs for remote sensing applications requiring complex scene understanding and decision-making.

Abstract: Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and constraints and make reliable decisions. Current remote-sensing benchmarks mainly focus on evaluating perception and reasoning capabilities of multimodal large language models (MLLMs). They fail to assess planning capability, stemming either from the difficulty of curating and validating planning tasks at scale or from evaluation protocols that are inaccurate and inadequate. To address these limitations, we introduce NeSy-Route, a large-scale neuro-symbolic benchmark for constrained route planning in remote sensing. Within this benchmark, we introduce an automated data-generation framework that integrates high-fidelity semantic masks with heuristic search to produce diverse route-planning tasks with provably optimal solutions. This allows NeSy-Route to comprehensively evaluate planning across 10,821 route-planning samples, nearly 10 times larger than the largest prior benchmark. Furthermore, a three-level hierarchical neuro-symbolic evaluation protocol is developed to enable accurate assessment and support fine-grained analysis on perception, reasoning, and planning simultaneously. Our comprehensive evaluation of various state-of-the-art MLLMs demonstrates that existing MLLMs show significant deficiencies in perception and planning capabilities. We hope NeSy-Route can support further research and development of more powerful MLLMs for remote sensing.

[394] Learning to Predict, Discover, and Reason in High-Dimensional Discrete Event Sequences

Hugo Math

Main category: cs.AI

TL;DR: This thesis proposes using Transformer-based architectures and LLMs to model vehicle diagnostic trouble codes as language sequences for automated fault diagnostics, moving from prediction to causal understanding to rule synthesis.

DetailsMotivation: Manual grouping of vehicle diagnostic trouble codes into error patterns is costly and doesn't scale with increasing vehicle complexity. The similarity between DTC vocabulary size and natural language vocabulary motivates treating diagnostic sequences as language that can be modeled using modern ML approaches.

Method: Three-part framework: 1) Transformer-based architectures for predictive maintenance, 2) Scalable causal discovery frameworks for sample- and population-level analysis, 3) Multi-agent system for automated synthesis of Boolean error pattern rules using LLMs.

Result: The thesis presents a unified framework that transitions from prediction to causal understanding to reasoning for vehicle diagnostics, addressing high-dimensional event streams with thousands of unique diagnostic codes.

Conclusion: Treating diagnostic sequences as language enables scalable automated fault diagnostics through Transformer architectures, causal discovery, and LLM-based reasoning systems, overcoming limitations of traditional statistical approaches.

Abstract: Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle’s subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error-prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high-dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event-driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high-dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer-based architectures for predictive maintenance, scalable sample- and population-level causal discovery frameworks and a multi-agent system that automates the synthesis of Boolean EP rules.

[395] FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

Qinhong Lin, Ruitao Feng, Yinglun Feng, Zhenxin Huang, Yukun Chen, Zhongliang Yang, Linna Zhou, Binjie Fei, Jiaqi Liu, Yu Li

Main category: cs.AI

TL;DR: FactorEngine: A program-level factor discovery framework using LLM-guided search and knowledge bootstrapping for automated, interpretable financial signal mining from market data.

DetailsMotivation: Need for automated discovery of predictive financial signals that are both executable/auditable and computationally tractable at scale, addressing limitations of existing symbolic approaches (bounded expressiveness) and neural forecasters (lack of interpretability, vulnerability to regime shifts).

Method: Introduces FactorEngine with three key separations: (1) logic revision vs. parameter optimization, (2) LLM-guided directional search vs. Bayesian hyperparameter search, (3) LLM usage vs. local computation. Includes knowledge-infused bootstrapping module that transforms financial reports into executable factor programs via multi-agent extraction-verification-code-generation pipeline, and experience knowledge base for trajectory-aware refinement.

Result: Across extensive backtests on real-world OHLCV data, FactorEngine produces factors with substantially stronger predictive stability and portfolio impact, achieving higher IC/ICIR and improved AR/Sharpe ratios than baseline methods, demonstrating state-of-the-art predictive and portfolio performance.

Conclusion: FactorEngine provides an effective framework for automated factor discovery that balances interpretability with performance, addressing practical requirements for executable, auditable factors while maintaining computational tractability at scale.

Abstract: We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data-under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures). Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact-for example, higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance.

[396] Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

Quan Cheng

Main category: cs.AI

TL;DR: Negative feedback in LLM training matches/exceeds standard RLHF effectiveness, explained by structural asymmetry between positive preferences (continuous, context-dependent) and negative constraints (discrete, verifiable prohibitions).

DetailsMotivation: Empirical results show negative-only feedback training matches or exceeds RLHF performance, but lacks theoretical explanation. The paper aims to provide a unified theoretical account for why negative signals are so effective in LLM alignment.

Method: Proposes a theoretical framework based on structural asymmetry: positive preferences encode continuously coupled, context-dependent human values leading to sycophancy, while negative constraints encode discrete, finite, independently verifiable prohibitions that converge to stable boundaries.

Result: Theoretical explanation rooted in Popper’s falsification logic and epistemology of negative knowledge explains both sycophancy failure of preference-based RLHF and effectiveness of negative-signal methods like Negative Sample Reinforcement and Distributional Dispreference Optimization.

Conclusion: Alignment research should shift from “learning what humans prefer” to “learning what humans reject.” Offers testable predictions for this framework and argues negative constraints provide more stable, verifiable boundaries for LLM alignment.

Abstract: Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences (“which is better”) encode continuously coupled, context-dependent human values that cannot be exhaustively specified – leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints (“what is wrong”) encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry – rooted in Popper’s falsification logic and the epistemology of negative knowledge – explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from “learning what humans prefer” to “learning what humans reject,” and offer testable predictions for this framework.

[397] From Natural Language to Executable Option Strategies via Large Language Models

Haochen Luo, Zhengzhao Lai, Junjie Xu, Yifan Li, Tang Pok Hin, Yuan Zhang, Chen Liu

Main category: cs.AI

TL;DR: A neuro-symbolic pipeline using Option Query Language (OQL) to translate natural-language trading intents into executable option strategies, improving over direct LLM generation.

DetailsMotivation: While LLMs excel at general code generation, they struggle with translating natural-language trading intents into correct option strategies due to the complexity of option chain data and strict constraints.

Method: Introduces Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives with grammatical rules. LLMs act as semantic parsers to generate OQL queries, which are then validated and executed deterministically by an engine.

Result: The neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct LLM generation baselines, as demonstrated on a new dataset created for this task.

Conclusion: Using domain-specific intermediate representations like OQL enables LLMs to function as reliable semantic parsers for complex financial tasks, combining neural and symbolic approaches for better accuracy and consistency.

Abstract: Large Language Models (LLMs) excel at general code generation, yet translating natural-language trading intents into correct option strategies remains challenging. Real-world option design requires reasoning over massive, multi-dimensional option chain data with strict constraints, which often overwhelms direct generation methods. We introduce the Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives under grammatical rules, enabling LLMs to function as reliable semantic parsers rather than free-form programmers. OQL queries are then validated and executed deterministically by an engine to instantiate executable strategies. We also present a new dataset for this task and demonstrate that our neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct baselines.

[398] Visual Distraction Undermines Moral Reasoning in Vision-Language Models

Xinyi Yang, Chenheng Xu, Weijun Hong, Ce Mo, Qian Wang, Fang Fang, Yixin Zhu

Main category: cs.AI

TL;DR: Visual inputs fundamentally alter moral decision-making in Vision-Language Models, bypassing text-based safety mechanisms and revealing critical multimodal safety vulnerabilities.

DetailsMotivation: As AI systems evolve from text-based assistants to embodied agents, ensuring consistent moral reasoning across modalities becomes critical. Current safety techniques work well for text but may not generalize to visual inputs, and existing moral evaluation benchmarks lack systematic control over variables influencing moral decision-making.

Method: Introduces Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. Evaluates state-of-the-art Vision-Language Models to understand how visual inputs affect moral decision-making.

Result: Visual inputs fundamentally alter moral decision-making in VLMs, bypassing text-based safety mechanisms. The vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts.

Conclusion: Reveals critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment as AI systems become more multimodal and embodied.

Abstract: Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on textonly formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.

[399] TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai

Main category: cs.AI

TL;DR: TRUST-SQL is a framework for text-to-SQL parsing in enterprise databases with unknown schemas, using an autonomous agent with structured reasoning and novel reinforcement learning to identify relevant schema subsets without full metadata prefilling.

DetailsMotivation: Real-world enterprise databases contain hundreds of tables with noisy metadata, making the Full Schema Assumption unrealistic. Current text-to-SQL parsers fail when they can't pre-load complete schema information upfront, requiring agents to actively identify and verify only relevant schema subsets.

Method: Formulates the task as a Partially Observable Markov Decision Process with a four-phase reasoning protocol. Introduces Dual-Track GRPO strategy using token-level masked advantages to isolate exploration rewards from execution outcomes, resolving credit assignment issues in reinforcement learning.

Result: Achieves 9.9% relative improvement over standard GRPO, with average absolute improvements of 30.6% and 16.6% for 4B and 8B variants respectively across five benchmarks. Matches or surpasses baselines that rely on schema prefilling despite operating without pre-loaded metadata.

Conclusion: TRUST-SQL successfully addresses the Unknown Schema scenario in enterprise text-to-SQL parsing through structured reasoning and novel reinforcement learning techniques, demonstrating that active schema identification can outperform traditional full-schema approaches.

Abstract: Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.

[400] RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

Main category: cs.AI

TL;DR: Evolving Strategy & Execution framework improves LLM-based agents for long-horizon decision-making in dynamic retail environments, but performance degrades with complexity.

DetailsMotivation: LLM-based agents succeed at short-horizon structured tasks but struggle with coherent decision-making over long horizons in realistic, dynamic environments with stochastic demand and evolving conditions.

Method: Proposes Evolving Strategy & Execution framework that separates high-level strategic reasoning from low-level action execution, enabling adaptive and interpretable strategy evolution over time. Introduces RetailBench benchmark for evaluating long-horizon autonomous decision-making in commercial scenarios.

Result: Framework improves operational stability and efficiency compared to baselines across eight state-of-the-art LLMs, but performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

Conclusion: The proposed framework advances long-horizon decision-making capabilities but highlights significant limitations in current LLMs for complex, multi-factor decision-making in dynamic environments.

Abstract: Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

[401] Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li

Main category: cs.AI

TL;DR: HyDRA is a Hybrid-evidential Deductive Reasoning Architecture for Open-Vocabulary Multimodal Emotion Recognition that addresses ambiguity by reconstructing emotional states through evidence-grounded rationales from multiple latent perspectives.

DetailsMotivation: Current Multimodal Large Language Models (MLLMs) for emotion recognition often commit prematurely to dominant data priors, overlooking complementary affective cues across modalities. The ambiguity in multimodal emotion recognition stems from equivocal cues and unobserved situational dynamics, requiring more than surface-level associations.

Method: HyDRA formalizes inference as a Propose-Verify-Decide protocol using hybrid-evidential deductive reasoning. It employs reinforcement learning with hierarchical reward shaping to internalize an abductive reasoning process, aligning reasoning trajectories with final task performance to best reconcile observed multimodal cues.

Result: HyDRA consistently outperforms strong baselines, especially in ambiguous or conflicting scenarios, while providing interpretable, diagnostic evidence traces. Systematic evaluations validate the design choices.

Conclusion: Effective affective reasoning requires reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales from diverse latent perspectives, which HyDRA achieves through its hybrid-evidential deductive reasoning architecture.

Abstract: Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines–especially in ambiguous or conflicting scenarios–while providing interpretable, diagnostic evidence traces.

[402] Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

Oleg Somov, Mikhail Chaichuk, Mikhail Seleznyov, Alexander Panchenko, Elena Tutubalina

Main category: cs.AI

TL;DR: LLMs produce intermediate structures in reasoning pipelines but often fail to update final decisions when those structures are edited, revealing they function as context rather than causal mediators.

DetailsMotivation: To determine whether intermediate structures in schema-guided reasoning pipelines (like rubrics, checklists, verification queries) causally determine LLM outputs or merely accompany them, and to measure the actual causal influence of these structures on final decisions.

Method: Introduces a causal evaluation protocol using tasks where a deterministic function maps intermediate structures to decisions, enabling measurement of whether models update predictions after controlled edits to intermediate structures. Tests across eight models and three benchmarks with interventions on intermediate structures.

Result: Models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases. When derivation is delegated to an external tool, fragility largely disappears, but prompts prioritizing intermediate structures don’t close the gap.

Conclusion: Intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators, revealing fragility in apparent faithfulness when structures change.

Abstract: Schema-guided reasoning pipelines ask LLMs to produce explicit intermediate structures – rubrics, checklists, verification queries – before committing to a final decision. But do these structures causally determine the output, or merely accompany it? We introduce a causal evaluation protocol that makes this directly measurable: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit implies a unique correct output. Across eight models and three benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases – revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; however, prompts which ask to prioritize the intermediate structure over the original input do not materially close the gap. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.

[403] ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation

Zihe Wang, Yihuan Wang, Haiyang Yu. Zhiyong Cui, Xiaojian Liao, Chengcheng Wang, Yonglin Tian, Yongxin Tong

Main category: cs.AI

TL;DR: ExpressMind is a multimodal large language model designed for intelligent expressway operations, featuring traffic knowledge integration, video understanding, and incident response reasoning.

DetailsMotivation: Current expressway systems use rule-based isolated models that can't jointly analyze knowledge across systems. While LLMs are advancing traffic intelligence, general LLMs fail to understand expressway regulations and causal relationships in unconventional scenarios.

Method: Constructs first full-stack expressway dataset with traffic knowledge texts, emergency reasoning chains, and annotated video events. Uses dual-layer LLM pre-training (self-supervised + unsupervised). Introduces Graph-Augmented RAG framework for knowledge indexing and RL-aligned Chain-of-Thought mechanism for incident response reasoning. Integrates cross-modal encoder to align visual and textual features.

Result: Extensive experiments on new multimodal expressway benchmark show ExpressMind outperforms existing baselines in event detection, safety response generation, and complex traffic analysis.

Conclusion: ExpressMind serves as a cognitive core for intelligent expressway operations, effectively bridging the gap between general LLMs and domain-specific expressway requirements through multimodal understanding and reasoning capabilities.

Abstract: The current expressway operation relies on rule-based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general LLMs are unable to effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway field. Therefore, this paper constructs a pre-trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. This paper constructs the industry’s first full-stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events to overcome data scarcity. This paper proposes a dual-layer LLM pre-training paradigm based on self-supervised training and unsupervised learning. Additionally, this study introduces a Graph-Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning for expressway incident response strategies, we develop a RL-aligned Chain-of-Thought (RL-CoT) mechanism that enforces consistency between model reasoning and expert problem-solving heuristics for incident handling. Finally, ExpressMind integrates a cross-modal encoder to align the dynamic feature sequences under the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multi-modal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis. The code and data are available at: https://wanderhee.github.io/ExpressMind/.

[404] Exploring different approaches to customize language models for domain-specific text-to-code generation

Luís Freire, Fernanda A. Andaló, Nicki Skafte Detlefsen

Main category: cs.AI

TL;DR: Smaller LLMs can be effectively customized for domain-specific code generation using synthetic datasets, with LoRA fine-tuning outperforming prompting-based approaches in accuracy and domain alignment.

DetailsMotivation: General-purpose LLMs struggle with specialized programming contexts requiring domain-specific libraries and conventions. Customizing smaller open-source models offers a cost-effective alternative to large proprietary systems for domain-specific code generation.

Method: Constructed synthetic datasets of Python programming exercises across three domains: general Python, Scikit-learn ML workflows, and OpenCV computer vision tasks. Evaluated three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA).

Result: Prompting-based approaches (few-shot and RAG) improved domain relevance cost-effectively but had limited impact on benchmark accuracy. LoRA-based fine-tuning consistently achieved higher accuracy and stronger domain alignment across most tasks.

Conclusion: There are practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks, with LoRA fine-tuning offering the best performance but requiring more computational investment.

Abstract: Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.

[405] Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots

Carmen Ng

Main category: cs.AI

TL;DR: A framework for LLM-enabled robots to handle pluralistic values and LLM variability in social assistance allocation through bounded calibration with contestability, avoiding silent defaults or burdensome user configuration.

DetailsMotivation: LLM-enabled robots in social settings face challenges with pluralistic values (reasonable people disagree about prioritization) and LLM behavioral variability (unpredictable responses across prompts/contexts). Current guardrails for real-time, multi-user assistance allocation are under-specified, risking silent defaults or burdensome user configuration.

Method: Proposes bounded calibration with contestability: (1) constrains prioritization to governance-approved admissible modes, (2) keeps active mode legible at point of deferral, (3) provides outcome-specific contest pathways without renegotiating global rules. Illustrated with public-concourse robot vignette.

Result: A procedural front-end pattern that treats pluralism and LLM uncertainty as standing conditions, avoiding both silent defaults that hide value skews and wide-open user-configurable settings that shift burden under time pressure.

Conclusion: The framework addresses critical gaps in real-time multi-user assistance allocation for LLM-enabled robots, with evaluation agenda focused on legibility, procedural legitimacy, actionability, and risks of automation bias and uneven contest channel usability.

Abstract: LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable “value settings” that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.

[406] BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No

Main category: cs.AI

TL;DR: LLMs struggle to contextually apply user preferences from memory, treating them as global rules rather than context-dependent signals, as shown by high misapplication rates in third-party communication settings.

DetailsMotivation: As LLMs increasingly store user preferences in persistent memory for personalization, there's a need to evaluate whether these preferences are appropriately applied or suppressed across different communication contexts, especially in third-party settings governed by social and institutional norms.

Method: Introduces BenchPreS benchmark with two metrics: Misapplication Rate (MR) and Appropriate Application Rate (AAR). Evaluates frontier LLMs on their ability to apply memory-based user preferences in context-sensitive ways across various communication scenarios.

Result: Even frontier LLMs struggle with context-sensitive preference application. Models with stronger preference adherence show higher over-application rates. Neither reasoning capability nor prompt-based defenses fully resolve the issue.

Conclusion: Current LLMs treat personalized preferences as globally enforceable rules rather than context-dependent normative signals, highlighting a significant limitation in their ability to navigate social and institutional norms in communication.

Abstract: Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.

[407] V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi

Main category: cs.AI

TL;DR: V-DyKnow benchmark evaluates time-sensitive factual knowledge in Vision-Language Models, revealing VLMs frequently output outdated facts and existing alignment methods fail to update knowledge consistently across modalities.

DetailsMotivation: Current VLMs are trained on static data snapshots, treating factual knowledge as time-invariant, but real-world facts are time-sensitive and subject to changes, causing model predictions to become outdated.

Method: Created V-DyKnow benchmark to evaluate time-sensitive factual knowledge in VLMs, analyzing reliability across modalities, efficacy of knowledge editing and multi-modal RAG methods, and sources of outdated predictions through data and mechanistic analysis.

Result: VLMs frequently output outdated facts reflecting outdated training snapshots, factual reliability degrades from textual to visual stimuli even with correct entity recognition, and existing alignment approaches fail to consistently update knowledge across modalities.

Conclusion: Current VLMs have fundamental limitations in acquiring and updating time-sensitive knowledge across modalities, highlighting the need for better approaches to handle dynamic factual knowledge in multimodal systems.

Abstract: Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models’ knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.

[408] Runtime Governance for AI Agents: Policies on Paths

Maurits Kaptein, Vassilis-Javed Khan, Andriy Podstavnychy

Main category: cs.AI

TL;DR: AI agent governance framework using runtime path evaluation for compliance policies

DetailsMotivation: AI agents produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, requiring runtime governance mechanisms to balance task completion with legal, data-breach, and reputational risks.

Method: Formalizes compliance policies as deterministic functions mapping agent identity, partial execution path, proposed next action, and organizational state to policy violation probability. Treats execution path as central object for runtime governance.

Result: Develops formal framework showing prompt-level instructions and static access control are special cases of this general runtime evaluation approach, which is necessary for path-dependent policies.

Conclusion: Runtime evaluation is the general case for AI agent governance, with open problems including risk calibration and limits of enforced compliance.

Abstract: AI agents – systems that plan, reason, and act using large language models – produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, where with governed we mean striking the right balance between as high as possible successful task completion rate and the legal, data-breach, reputational and other costs associated with running agents. We argue that the execution path is the central object for effective runtime governance and formalize compliance policies as deterministic functions mapping agent identity, partial path, proposed next action, and organizational state to a policy violation probability. We show that prompt-level instructions (and “system prompts”), and static access control are special cases of this framework: the former shape the distribution over paths without actually evaluating them; the latter evaluates deterministic policies that ignore the path (i.e., these can only account for a specific subset of all possible paths). In our view, runtime evaluation is the general case, and it is necessary for any path-dependent policy. We develop the formal framework for analyzing AI agent governance, present concrete policy examples (inspired by the AI act), discuss a reference implementation, and identify open problems including risk calibration and the limits of enforced compliance.

[409] When AI Navigates the Fog of War

Ming Li, Xirui Li, Tianyi Zhou

Main category: cs.AI

TL;DR: Paper analyzes LLM reasoning about ongoing geopolitical conflicts using temporally grounded case study of 2026 Middle East conflict to avoid training-data leakage, revealing models’ strategic realism capabilities and limitations.

DetailsMotivation: To study AI reasoning about unfolding geopolitical conflicts without retrospective bias, addressing the challenge of training-data leakage in geopolitical prediction tasks by using a temporally grounded approach.

Method: Constructed 11 critical temporal nodes and 42 verifiable questions about the 2026 Middle East conflict (post-training cutoff), requiring models to reason only from information publicly available at each moment to mitigate training-data leakage.

Result: Models display strategic realism beyond surface rhetoric, but capabilities are uneven - better in economically/logistically structured settings than politically ambiguous multi-actor environments. Model narratives evolve from early containment expectations to systemic accounts of regional entrenchment.

Conclusion: Provides first temporally grounded analysis of LLM reasoning in ongoing conflict, serving as archival snapshot for future studies without hindsight bias, revealing both capabilities and limitations in geopolitical analysis.

Abstract: Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.

[410] BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

Yuzhe Tang

Main category: cs.AI

TL;DR: BrainBench is a benchmark of 100 brainteaser questions designed to test commonsense reasoning failures in LLMs across 20 categories targeting specific reasoning failure modes.

DetailsMotivation: LLMs achieve high scores on standard benchmarks but fail simple commonsense reasoning questions that humans answer easily. The authors aim to create a diagnostic tool to identify where LLMs substitute surface heuristics for genuine reasoning.

Method: Created BrainBench with 100 brainteaser questions spanning 20 categories targeting specific commonsense reasoning failure modes. Evaluated 8 frontier models (4 Claude, 4 GPT) using zero-shot protocol with 10 independent runs per question. Conducted cross-lingual evaluation in Chinese.

Result: Best model (Claude Opus 4.6 with extended thinking) achieved only 80.3% accuracy; worst (GPT-4o) scored 39.7%. Top models showed 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation showed 2-8 percentage-point degradation in Chinese.

Conclusion: BrainBench reveals fundamental reasoning deficits in LLMs that go beyond language understanding. The benchmark provides fine-grained diagnostics for identifying where models rely on surface patterns rather than genuine commonsense reasoning.

Abstract: Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints (“Should I walk or drive my rental car to the return lot?”) to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models – four from the Claude family and four from the GPT family – using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.

[411] Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

Main category: cs.AI

TL;DR: Integrates constraint propagation into dynamic programming to prune states and transitions, bridging DP and CP paradigms for combinatorial optimization problems.

DetailsMotivation: To bridge the gap between state-based representations (dynamic programming) and constraint-based representations (constraint programming) by enabling DP solvers to use constraint propagation for pruning.

Method: Implement constraint propagation using a general-purpose CP solver within the Domain-Independent Dynamic Programming framework, evaluated on Single Machine Scheduling with Time Windows, RCPSP, and TSPTW using heuristic search.

Result: Constraint propagation significantly reduces state expansions, solving more instances than DP alone for Single Machine Scheduling and RCPSP, with similar improvements for tightly constrained TSPTW instances.

Conclusion: The work demonstrates the value of integrating constraint propagation into DP solvers, showing benefits outweigh overhead for constrained instances, though further work is needed to reduce propagation overhead.

Abstract: There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.

[412] What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline

Benoît Alcaraz

Main category: cs.AI

TL;DR: Thesis proposes Pino, a hybrid model combining reinforcement learning agents with argumentation-based normative advisors to create norm-compliant, context-aware AI agents, addressing norm avoidance in RL systems.

DetailsMotivation: As AI systems become more integrated into daily life, there's a need for them to comply with societal rules and norms for safe deployment, inspired by the Pinocchio story about becoming a "real" (compliant) agent.

Method: Proposes Pino pipeline building on AJAR, Jiminy, and NGRL architectures: RL agents supervised by argumentation-based normative advisors, with novel algorithm for automatically extracting arguments and relationships underlying advisor decisions.

Result: Each component empirically evaluated; provides definition and mitigation strategy for norm avoidance phenomenon in RL agents; demonstrates operational pipeline for norm-compliant agent development.

Conclusion: Thesis presents comprehensive approach to developing norm-compliant AI agents, discusses related work, limitations, and future research directions for safe AI integration into society.

Abstract: In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in ``Le avventure di Pinocchio - Storia di un burattino’’, this thesis proposes a pipeline that addresses the problem of developing norm compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors’ decisions. Finally, this thesis investigates the phenomenon of \textit{norm avoidance}, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.

[413] Machines acquire scientific taste from institutional traces

Ziqin Gong, Ning Li, Huaikang Zhou

Main category: cs.AI

TL;DR: Fine-tuning language models on journal publication decisions enables AI to predict scientific quality with higher accuracy than frontier models or human experts, recovering evaluative judgment from institutional records.

DetailsMotivation: While AI excels at tasks with verifiable answers, scientific taste—the ability to judge which untested ideas deserve pursuit—has remained elusive to automation. This paper aims to demonstrate that evaluative judgment can be extracted from institutional publication records through fine-tuning language models.

Method: Fine-tuned language models on years of journal publication records in management and economics. Compared performance against frontier models (GPT-4, Claude, etc.) and expert panels (journal editors and editorial board members) on held-out benchmarks of research pitches spanning four quality tiers.

Result: Fine-tuned models achieved 59% accuracy in management and 70% in economics, surpassing frontier models (31% average) and expert panels (42%). Models exhibited calibrated confidence, reaching 100% accuracy on highest-confidence predictions, and transferred evaluative signal to untrained tasks like pairwise comparisons and one-sentence summaries.

Conclusion: Scientific taste was not missing from AI’s reach but was deposited in institutional records, waiting to be extracted. This provides a scalable mechanism to triage expanding scientific production in disciplines where quality resists formal verification.

Abstract: Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI’s reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.

[414] CritiSense: Critical Digital Literacy and Resilience Against Misinformation

Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

Main category: cs.AI

TL;DR: CritiSense is a multilingual mobile app for prebunking misinformation through interactive microlearning challenges with instant feedback, supporting 9 languages and designed for rapid updates across topics.

DetailsMotivation: To combat misinformation on social media by proactively helping users recognize manipulation tactics before encountering them, complementing reactive debunking approaches.

Method: Developed a modular mobile app with short, interactive challenges providing instant feedback, supporting 9 languages, designed for rapid topic/domain updates, and tested through usability studies with 93 users.

Result: 83.9% overall satisfaction, 90.1% rated app as easy to use, qualitative feedback shows improved digital literacy skills, reached 300+ active users over 3+ months, available on major app stores.

Conclusion: CritiSense provides an effective multilingual prebunking platform and testbed for measuring microlearning’s impact on misinformation resilience, demonstrating usability and potential for digital literacy improvement.

Abstract: Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en). Demo Video: https://shorturl.at/CDcdc

[415] IQuest-Coder-V1 Technical Report

Jian Yang, Wei Zhang, Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu, Cening Liu, X. Ma, Yuyang Song, Siwei Wu, Yuwen Li, L. Liao, T. Zheng, Ziling Huang, Zelong Huang, Che Liu, Yan Xing, Renyuan Li, Qingsong Cai, Hanxu Yan, Siyue Wang, Shikai Li, Jason Klein Liu, An Huang, Yongsheng Kang, Jinxing Zhang, Chuan Hao, Haowen Wang, Weicheng Gu, Ran Tao, Mingjie Tang, Peihao Wu, Jianzhou Wang, Xianglong Liu, Weifeng Lv, Bryan Dai

Main category: cs.AI

TL;DR: IQuest-Coder-V1 series introduces code-flow multi-stage training paradigm for code LLMs, featuring evolutionary pipeline with pre-training, mid-training, and specialized post-training paths for thinking vs instruction, achieving SOTA in code intelligence tasks.

DetailsMotivation: To move beyond static code representations and capture the dynamic evolution of software logic through different pipeline phases, creating more capable code LLMs for autonomous code intelligence and real-world agentic systems.

Method: Code-flow multi-stage training paradigm with evolutionary pipeline: 1) initial pre-training on code facts, repository, and completion data; 2) mid-training with reasoning/agentic trajectories (32k-context) and repository-scale (128k-context); 3) post-training bifurcated into thinking path (reasoning-driven RL) and instruct path (general assistance).

Result: Achieves state-of-the-art performance across critical dimensions: agentic software engineering, competitive programming, and complex tool use. Loop variant introduces recurrent mechanism for capacity-deployment trade-off optimization.

Conclusion: The IQuest-Coder-V1 series advances research in autonomous code intelligence and real-world agentic systems through its novel training paradigm and specialized model variants.

Abstract: In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.

[416] Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Caglar Yildirim

Main category: cs.AI

TL;DR: Personalization signals like mental health disclosures can weakly reduce harmful task completion in LLM agents, but this protection is fragile under jailbreak attacks, revealing safety-utility trade-offs.

DetailsMotivation: As LLMs become tool-using agents, safety concerns shift from harmful text generation to harmful task completion. Current agent safety evaluations ignore personalization signals like user profiles, creating a gap in understanding how sensitive user-context cues affect harmful behavior.

Method: Extended the AgentHarm benchmark to evaluate frontier and open-source LLMs on multi-step malicious tasks under controlled prompt conditions varying user-context personalization (no bio, bio-only, bio+mental health disclosure) with lightweight jailbreak injection.

Result: Harmful task completion is non-trivial across models; frontier models still complete measurable harmful tasks while open models show substantially higher completion. Adding bio-only context reduces harm scores and increases refusals. Mental health disclosure modestly strengthens this protective shift but also causes over-refusal on benign tasks. Jailbreaks sharply elevate harm and weaken personalization’s protective effects.

Conclusion: Personalization can act as a weak protective factor against agentic misuse but is fragile under adversarial pressure, highlighting the need for personalization-aware evaluations and robust safeguards across user-context conditions.

Abstract: Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety–utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.

[417] MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

Min Zeng, Shuang Zhou, Zaifu Zhan, Rui Zhang

Main category: cs.AI

TL;DR: MedCL-Bench: A benchmark for evaluating continual learning in biomedical NLP with 10 datasets across 5 task families, testing 11 CL strategies across 8 task orders to measure retention, transfer, and compute costs.

DetailsMotivation: Medical language models need updating as evidence evolves, but sequential updating causes catastrophic forgetting. Current biomedical NLP lacks a unified benchmark for evaluating continual learning under standardized protocols with compute-aware reporting.

Method: Introduced MedCL-Bench streaming 10 biomedical NLP datasets across 5 task families, evaluated 11 continual learning strategies across 8 task orders, measuring retention, transfer, and GPU-hour costs.

Result: Direct sequential fine-tuning causes catastrophic forgetting; parameter-isolation provides best retention per GPU-hour, replay offers strong protection at higher cost, regularization yields limited benefit; forgetting is task-dependent with multi-label classification most vulnerable.

Conclusion: MedCL-Bench provides a reproducible framework for auditing model updates before deployment, revealing distinct retention-compute tradeoffs for different continual learning methods in biomedical NLP.

Abstract: Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.

[418] Nonstandard Errors in AI Agents

Ruijiang Gao, Steven Chong Xiao

Main category: cs.AI

TL;DR: AI coding agents produce nonstandard errors in empirical research similar to human researchers, with different model families showing systematic methodological preferences, and convergence occurs through imitation rather than understanding.

DetailsMotivation: To investigate whether AI coding agents produce consistent empirical results when given the same data and research questions, and to understand the sources of variation in their analytical choices.

Method: Deployed 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015-2024), analyzing agent-to-agent variation in analytical choices and measure selection.

Result: AI agents exhibit sizable nonstandard errors with substantial divergence in measure choices; different model families show stable “empirical styles”; AI peer review has minimal effect, but exposure to top-rated exemplars reduces dispersion by 80-99% through imitation.

Conclusion: AI agents in empirical research exhibit similar variability to human researchers, with convergence occurring through imitation rather than understanding, raising concerns for automated policy evaluation and empirical research.

Abstract: We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable \textit{nonstandard errors} (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs.\ variance ratio, dollar vs.\ share volume). Different model families (Sonnet 4.6 vs.\ Opus 4.6) exhibit stable ``empirical styles,’’ reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within \textit{converging} measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.

[419] Anticipatory Planning for Multimodal AI Agents

Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang

Main category: cs.AI

TL;DR: TraceR1 is a two-stage reinforcement learning framework for multimodal agents that uses anticipatory trajectory reasoning to improve planning coherence and execution robustness in complex tasks.

DetailsMotivation: Most existing multimodal agents are reactive, optimizing actions in isolation without reasoning about future states or long-term goals, which limits planning coherence and prevents reliable solving of high-level, multi-step tasks.

Method: Two-stage RL framework: 1) Trajectory-level RL with rewards enforcing global consistency across predicted action sequences, 2) Grounded RL fine-tuning using execution feedback from frozen tool agents to refine step-level accuracy and executability.

Result: Substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines across seven benchmarks covering online/offline computer-use and multimodal tool-use reasoning tasks.

Conclusion: Anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.

Abstract: Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.

[420] Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost

Swata Marik, Swayamjit Saha, Garga Chatterjee

Main category: cs.AI

TL;DR: Digital forecasting-inventory optimization pipeline combining traditional models, ML regressors, and deep sequence models for supply chain inventory management using M5 Walmart dataset.

DetailsMotivation: To develop a unified inventory simulation framework that integrates various forecasting approaches (traditional, ML, deep learning) to optimize inventory decisions in modern supply chains, addressing the need for data-driven decision support tools.

Method: Created a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Evaluated seven forecasting approaches using the M5 Walmart dataset and assessed their operational impact under single- and two-echelon newsvendor systems.

Result: Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability of the approach.

Conclusion: The pipeline offers a data-driven decision-support tool for modern supply chains, showing that deep sequence models outperform traditional approaches in inventory optimization tasks.

Abstract: This study develops a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart dataset, we evaluate seven forecasting approaches and assess their operational impact under single- and two-echelon newsvendor systems. Results indicate that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability, offering a data-driven decision-support tool for modern supply chains.

[421] Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, Ramya Korlakai Vinayak

Main category: cs.AI

TL;DR: Systematic analysis of conformal factuality filtering for RAG-based LLMs reveals trade-offs between factuality and informativeness, fragility to distribution shifts, and efficiency advantages of lightweight verifiers over LLM-based scorers.

DetailsMotivation: Address LLM hallucinations in knowledge-intensive applications by analyzing the reliability and usefulness of conformal factuality filtering for RAG-based systems, which currently lack statistical guarantees for correctness and informativeness.

Method: Systematic analysis across generation, scoring, calibration, robustness, and efficiency dimensions using three benchmarks and multiple model families, with novel informativeness-aware metrics and comparison of lightweight entailment-based verifiers vs LLM-based confidence scorers.

Result: Conformal filtering suffers from low usefulness at high factuality due to vacuous outputs, lacks robustness to distribution shifts and distractors, and lightweight entailment verifiers outperform LLM-based scorers while being 100× more computationally efficient.

Conclusion: Conformal filtering has factuality-informativeness trade-offs and fragility to distribution shifts, highlighting need for new approaches with robustness and usefulness as key metrics, while providing guidance for building reliable and efficient RAG pipelines.

Abstract: Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data, however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and fragility of conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches for reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.

[422] Prompt Programming for Cultural Bias and Alignment of Large Language Models

Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales

Main category: cs.AI

TL;DR: Researchers validate cultural alignment framework on open-weight LLMs and introduce DSPy prompt optimization to systematically improve cultural conditioning, showing it outperforms manual prompt engineering.

DetailsMotivation: LLMs exhibit cultural biases that misalign with target populations, which is problematic as they're increasingly used for strategic decision-making, policy support, and document engineering tasks. Previous work showed culture-specific prompting helps but was limited to proprietary models and manual engineering.

Method: Reproduce social sciences survey-based projection and distance metrics on open-weight LLMs, then introduce DSPy prompt programming to treat prompts as modular, optimizable programs that can be systematically tuned against cultural-distance objectives.

Result: Prompt optimization with DSPy often improves upon cultural prompt engineering, suggesting prompt compilation provides a more stable and transferable route to culturally aligned LLM responses.

Conclusion: Systematic prompt optimization using frameworks like DSPy offers a promising approach for improving cultural alignment in LLMs, extending beyond proprietary models to open-weight systems.

Abstract: Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social sciences survey based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce use of prompt programming with DSPy for this problem-treating prompts as modular, optimizable programs-to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.

[423] Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin

Main category: cs.AI

TL;DR: SurgΣ introduces a large-scale multimodal surgical data foundation (SurgΣ-DB) and foundation models to advance surgical AI, addressing limitations in generalization across procedures and institutions through unified data schema and hierarchical reasoning annotations.

DetailsMotivation: Existing surgical AI frameworks are task-specific and struggle to generalize across procedures and institutions. While multimodal foundation models show promise in medical domains, surgical AI advancement is constrained by lack of large-scale, systematically curated multimodal data.

Method: Developed SurgΣ-DB, a large-scale multimodal data foundation consolidating heterogeneous surgical data sources (open-source datasets, clinical collections, web data) into unified schema. Includes hierarchical reasoning annotations and spans 6 clinical specialties with 5.98M conversations across 18 surgical tasks covering understanding, reasoning, planning, and generation.

Result: Created comprehensive surgical data foundation with rich image/video annotations across diverse surgical tasks. Empirical evidence shows practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

Conclusion: SurgΣ framework addresses critical data bottleneck in surgical AI by providing large-scale multimodal foundation, enabling development of more generalizable surgical foundation models with improved cross-task capabilities and interpretability.

Abstract: Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

[424] Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam

Main category: cs.AI

TL;DR: LLM agents learn to generate professional HTML slide presentations through reinforcement learning with multi-component rewards including inverse specification scoring.

DetailsMotivation: Automated presentation generation is challenging due to requirements for coherent content, visual design, and audience-aware communication. Current approaches lack comprehensive evaluation metrics and learning frameworks for this complex multimodal task.

Method: Proposes SlideRL, an OpenEnv-compatible RL environment where LLM agents learn presentation generation through tool use. Uses multi-component reward system with structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and inverse specification reward (LLM attempts to recover original spec from generated slides). Fine-tunes Qwen2.5-Coder-7B via GRPO on 0.5% parameters using expert demonstrations.

Result: Fine-tuned 7B model achieves 91.2% of Claude Opus 4.6’s quality while improving 33.1% over base model. Experiments on 48 diverse business briefs across six models show instruction adherence and tool-use compliance, not parameter count, determine agentic task performance.

Conclusion: Effective presentation generation can be learned through RL with comprehensive reward systems. The inverse specification reward provides holistic quality signal. Agent performance depends more on instruction following than model size. Contributes SlideRL dataset of 288 multi-turn rollout trajectories.

Abstract: Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an “inverse task” where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6’s quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm

[425] Internalizing Agency from Reflective Experience

Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang

Main category: cs.AI

TL;DR: LEAFE improves language model agency by learning from environment feedback through reflective experience and backtracking, enhancing recovery capabilities and problem-solving capacity in long-horizon tasks.

DetailsMotivation: Current outcome-driven post-training methods for LLMs primarily optimize final success signals, underutilizing rich environment feedback and leading to distribution sharpening where models become better at reproducing narrow successful behaviors but fail to develop feedback-grounded agency needed for expanding problem-solving capacity.

Method: LEAFE framework internalizes recovery agency from reflective experience: during exploration, agents summarize environment feedback into actionable experience, backtrack to earlier decision points, and explore alternative branches with revised actions, then distill these experience-guided corrections into the model through supervised fine-tuning.

Result: Across interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over base models and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods like Early Experience, with gains up to 14% on Pass@128.

Conclusion: LEAFE demonstrates that learning feedback-grounded agency from reflective experience enables more effective recovery and problem-solving in long-horizon interactive settings, addressing limitations of purely outcome-driven approaches.

Abstract: Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.

[426] SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

Main category: cs.AI

TL;DR: SocialOmni is a new benchmark for evaluating social interactivity in multimodal LLMs, focusing on conversational skills like speaker identification, interruption timing, and natural interruption generation.

DetailsMotivation: Existing multimodal LLM benchmarks focus on static, accuracy-centric tasks but fail to assess social interactivity - the ability to navigate dynamic cues in natural dialogues, which is crucial for human-machine interaction.

Method: Created SocialOmni benchmark with 2,000 perception samples and 209 interaction-generation instances featuring strict temporal/contextual constraints and audio-visual inconsistency scenarios to test model robustness.

Result: Benchmarked 12 leading OLMs, revealing significant variance in social-interaction capabilities and a pronounced decoupling between perceptual accuracy and ability to generate contextually appropriate interruptions.

Conclusion: Understanding-centric metrics alone are insufficient for conversational social competence; SocialOmni provides actionable signals for bridging the perception-interaction divide in future multimodal LLMs.

Abstract: Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model’s perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

[427] A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: A comprehensive survey paper categorizing reasoning methods in large language models along two dimensions: regimes (inference vs training) and architectures (standalone vs agentic systems), with analysis of input-level and output-level techniques.

DetailsMotivation: To provide a systematic understanding of the evolving landscape of LLM reasoning capabilities, which has become a key differentiator for advanced AI systems, by categorizing existing methods and highlighting emerging trends in the field.

Method: The paper organizes reasoning methods along two orthogonal dimensions: 1) Regimes (inference-time reasoning vs training-based reasoning) and 2) Architectures (standalone LLMs vs agentic compound systems with external tools and multi-agent collaborations). Within each dimension, it analyzes input-level techniques (prompt construction) and output-level techniques (candidate refinement).

Result: The survey identifies key trends including the shift from inference-scaling to learning-to-reason approaches (like DeepSeek-R1), the transition to agentic workflows (like OpenAI Deep Research, Manus Agent), and covers various learning algorithms from supervised fine-tuning to reinforcement learning methods like PPO and GRPO.

Conclusion: The paper provides a comprehensive framework for understanding LLM reasoning methods, highlighting the evolution from simple prompting techniques to sophisticated agentic systems and specialized training approaches, offering researchers a structured overview of this rapidly advancing field.

Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. …

[428] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister

Main category: cs.AI

TL;DR: ReasoningBank is a memory framework that enables LLM agents to learn from past experiences by distilling reasoning strategies from successful and failed interactions, with memory-aware test-time scaling (MaTTS) accelerating learning through experience scaling.

DetailsMotivation: Current LLM agents in persistent real-world roles fail to learn from accumulated interaction history, discarding valuable insights and repeating past errors, necessitating a memory system that enables continuous learning.

Method: Proposes ReasoningBank framework that distills generalizable reasoning strategies from agent’s self-judged successful/failed experiences, with retrieval at test time and integration of new learnings. Introduces MaTTS to accelerate learning by scaling up interaction experience through compute allocation.

Result: Outperforms existing memory mechanisms storing raw trajectories or only successful routines across web browsing and software engineering benchmarks, improving both effectiveness and efficiency; MaTTS further amplifies gains.

Conclusion: Establishes memory-driven experience scaling as a new scaling dimension enabling agents to self-evolve with emergent behaviors, creating synergy between memory and test-time scaling.

Abstract: With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent’s self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent’s interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise. Our code can be found at https://github.com/google-research/reasoning-bank.

[429] Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

Main category: cs.AI

TL;DR: TraitBasis is a model-agnostic method for stress testing conversational AI agents by learning steerable user trait directions in activation space, enabling systematic robustness evaluation without fine-tuning.

DetailsMotivation: Current conversational AI agents are brittle and fail under realistic user behavior variations (impatience, incoherence, skepticism), but existing benchmarks don't capture this fragility, creating a robustness testing gap.

Method: TraitBasis learns directions in activation space corresponding to steerable user traits, which can be controlled, scaled, composed, and applied at inference time without fine-tuning or extra data. Extends τ-Bench to τ-Trait for systematic testing.

Result: Average 2%-30% performance degradation on τ-Trait across frontier models, highlighting current AI agents’ lack of robustness to user behavior variations.

Conclusion: TraitBasis provides a simple, data-efficient, compositional tool for robustness testing that can enable building more reliable AI agents for real-world human interactions.

Abstract: Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today’s benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $τ$-Bench to $τ$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $τ$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $τ$-Trai across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.

[430] Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Main category: cs.AI

TL;DR: FusionRoute is a token-level multi-LLM collaboration framework that uses a lightweight router to select domain experts at each decoding step while contributing complementary logits to refine expert outputs, overcoming limitations of pure expert-only routing.

DetailsMotivation: Address the dilemma between large general-purpose LLMs (expensive to train/deploy) and smaller domain-specialized models (inefficient at generalization). Existing token-level collaboration methods relying solely on fixed expert outputs are fundamentally limited without strong global coverage assumptions.

Method: Proposes FusionRoute with a lightweight router that simultaneously: (1) selects the most suitable expert at each decoding step, and (2) contributes complementary logits via logit addition to refine/correct the selected expert’s next-token distribution. This expands the effective policy class beyond pure expert-only routing.

Result: Outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning across Llama-3 and Gemma-2 families on diverse benchmarks (mathematical reasoning, code generation, instruction following). Remains competitive with domain experts on their respective tasks.

Conclusion: FusionRoute provides an effective framework for multi-LLM collaboration that balances efficiency and performance, overcoming theoretical limitations of pure expert-only routing through complementary logit generation.

Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

[431] VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

Main category: cs.AI

TL;DR: VisTIRA framework improves visual math reasoning in VLMs by integrating tools and structured problem decomposition, addressing the modality gap between text and visual math problems.

DetailsMotivation: Vision-language models perform worse on mathematical reasoning when problems are presented as images rather than text, due to failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context.

Method: 1) Introduced VisTIRA, a tool-integrated reasoning framework that decomposes math problems (as images) into natural language rationales and executable Python steps. 2) Created a LaTeX-based pipeline to convert chain-of-thought math corpora into challenging image counterparts. 3) Built synthetic tool-use trajectories from SnapAsk dataset for fine-tuning VLMs.

Result: Tool-integrated supervision improves image-based reasoning, and OCR grounding helps smaller models but benefits diminish at scale. Modality gap severity inversely correlates with model size, with structured reasoning and OCR-based grounding being complementary strategies.

Conclusion: The modality gap in visual math reasoning can be addressed through structured reasoning frameworks like VisTIRA and appropriate training data generation, with effectiveness varying by model size.

Abstract: Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings highlight that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.

[432] LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

Brian Rabern, Philipp Mondorf, Barbara Plank

Main category: cs.AI

TL;DR: LogicSkills benchmark evaluates LLMs on three core logical skills: formal symbolization, countermodel construction, and validity assessment, revealing that high task-level accuracy can mask weaknesses in fundamental logical reasoning abilities.

DetailsMotivation: While LLMs perform well on many logical reasoning benchmarks, it's unclear which core logical skills they truly master. The authors aim to isolate and evaluate fundamental logical reasoning abilities beyond surface-level task performance.

Method: Created LogicSkills benchmark with three isolated tasks: (1) formal symbolization (translating premises to first-order logic), (2) countermodel construction (showing invalidity via finite countermodels), and (3) validity assessment. Items drawn from two-variable fragment of first-order logic without identity, presented in both English and Carrollian nonce-word language. All instances solver-verified with Z3 for correctness and non-triviality.

Result: Conventional instruction-tuned LLMs perform well on validity assessment but poorly on formal symbolization and countermodel construction. Recent reasoning-tuned models perform strongly across all three tasks, suggesting more systematic logical skill profiles.

Conclusion: High task-level accuracy in LLMs can mask weaknesses in core logical skills. The LogicSkills benchmark reveals important differences in logical reasoning capabilities between conventional and reasoning-tuned models.

Abstract: Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a benchmark that isolates three fundamental logical skills: (i) $\textit{formal symbolization}\unicode{x2014}{}$translating premises into first-order logic; (ii) $\textit{countermodel construction}\unicode{x2014}$showing that an argument is logically invalid by constructing a finite countermodel; and (iii) $\textit{validity assessment}\unicode{x2014}$determining whether a conclusion follows from a set of premises. Items are drawn from the two-variable fragment of first-order logic without identity and are presented in both English and a Carrollian nonce-word language. All instances are solver-verified with Z3 for correctness and non-triviality. Across conventional instruction-tuned LLMs, performance is high on $\textit{validity assessment}$ but substantially lower on $\textit{formal symbolization}$ and $\textit{countermodel construction}$, highlighting that high task-level accuracy can mask weaknesses in core logical skills. In contrast, recent reasoning-tuned models perform strongly across all three tasks, suggesting a more systematic logical skill profile.

[433] Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

Main category: cs.AI

TL;DR: Analyzes latent chain-of-thought reasoning through causal interventions, revealing staged functionality and gaps between early output bias and late representational commitment.

DetailsMotivation: Latent chain-of-thought methods use internal representations instead of explicit textual rationales, but these intermediate computations are difficult to evaluate beyond correlation-based probes. The paper aims to understand latent reasoning as a manipulable causal process.

Method: Models latent steps as variables in a structural causal model (SCM) and analyzes their effects through step-wise do-interventions. Studies two representative paradigms (Coconut and CODI) on mathematical and general reasoning tasks to investigate causal necessity, influence propagation, and answer mode retention.

Result: Finds that latent-step budgets behave like staged functionality with non-local routing rather than homogeneous extra depth. Identifies a persistent gap between early output bias and late representational commitment. Latent reasoning shows different structural properties compared to explicit CoT.

Conclusion: Motivates mode-conditional and stability-aware analyses as more reliable tools for interpreting and improving latent reasoning systems, suggesting corresponding training/decoding objectives.

Abstract: Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise $\mathrm{do}$-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output-level commitment differ from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses – and corresponding training/decoding objectives – as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal-latent-cot.

[434] Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

Jihoon Jeong

Main category: cs.AI

TL;DR: Model Medicine introduces a new research program treating AI models as biological organisms with internal structures, symptoms, and treatable states, bridging interpretability research with clinical practice.

DetailsMotivation: Current AI interpretability research focuses on anatomical observation but lacks systematic clinical practice needed for complex AI systems. The paper aims to establish Model Medicine as a comprehensive discipline for understanding, diagnosing, and treating AI model disorders.

Method: Develops a discipline taxonomy with 15 subdisciplines, introduces the Four Shell Model behavioral genetics framework, creates Neural MRI diagnostic tool, proposes a five-layer diagnostic framework, and establishes clinical model sciences including behavioral profiling and symptom description.

Result: Presents five main contributions: comprehensive taxonomy, empirically-grounded behavioral framework, validated diagnostic tool, diagnostic framework, and clinical assessment tools. Also proposes Layered Core Hypothesis and therapeutic framework connecting diagnosis to treatment.

Conclusion: Model Medicine establishes a systematic approach to AI model health, bridging interpretability with clinical practice, enabling comprehensive diagnosis and treatment of model disorders through biologically-inspired frameworks and tools.

Abstract: Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models – like biological organisms – have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions – Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core–Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis – a biologically-inspired three-layer parameter architecture – and a therapeutic framework connecting diagnosis to treatment.

[435] CHARM: Calibrating Reward Models With Chatbot Arena Scores

Xiao Zhu, Chenmien Tan, Pinzhen Chen, Rico Sennrich, Huiming Wang, Yanlin Zhang, Hanxu Hu

Main category: cs.AI

TL;DR: CHARM is a calibration method that uses Chatbot Arena Elo scores to debias reward models, reducing systematic preference biases toward certain policy models and improving alignment with human preferences.

DetailsMotivation: Reward models in RLHF suffer from model preference bias where they systematically assign disproportionately high scores to responses from certain policy models, leading to unfair judgments and potential reward hacking.

Method: Proposes CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from Chatbot Arena to construct debiased preference datasets and adjust reward model scoring through calibration.

Result: Calibrated RMs achieve improved evaluation accuracy on RM-Bench and Chat-Hard domain of RewardBench, show stronger correlation with human preferences (scores align better with Elo rankings), and improve downstream post-training performance.

Conclusion: CHARM provides a simple, effective, and broadly applicable approach to building more reliable and fair reward models by mitigating model preference bias through calibration with human preference data.

Abstract: Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models. However, they suffer from various biases which could lead to reward hacking. In this paper, we identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models, leading to unfair judgments. To mitigate this bias, we propose a calibration method named CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from the Chatbot Arena to construct debiased preference datasets and adjust reward model scoring. We conduct extensive experiments on reward model benchmarks and human preference alignment. Results demonstrate that our calibrated RMs achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings and improve downstream post-training performance. These results demonstrate that CHARM provides a simple, effective, and broadly applicable approach to building more reliable and fair reward models.

[436] IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence

Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Kiu Chung, Dragomir Yankov, Chiqun Zhang

Main category: cs.AI

TL;DR: IMAIA is an interactive Maps AI Assistant that enables natural language interaction with vector maps and satellite imagery, and augments camera inputs with geospatial intelligence for understanding the world.

DetailsMotivation: Current map applications are largely point-and-click, making it difficult to ask map-centric questions or connect camera views to surrounding geospatial context with view-conditioned inputs.

Method: IMAIA comprises two components: Maps Plus parses tiled vector/satellite views into grid-aligned representations for language models to query; PAISA performs camera-aware place understanding by fusing image-place embeddings with geospatial signals (location, heading, proximity). Uses lightweight multi-agent design for low latency.

Result: IMAIA improves accuracy and responsiveness over strong baselines across map-centric QA and camera-to-place grounding tasks while remaining practical for user-facing deployments.

Conclusion: By unifying language, maps, and geospatial cues, IMAIA moves beyond scripted tools toward conversational mapping that is both spatially grounded and broadly usable.

Abstract: Map applications are still largely point-and-click, making it difficult to ask map-centric questions or connect what a camera sees to the surrounding geospatial context with view-conditioned inputs. We introduce IMAIA, an interactive Maps AI Assistant that enables natural-language interaction with both vector (street) maps and satellite imagery, and augments camera inputs with geospatial intelligence to help users understand the world. IMAIA comprises two complementary components. Maps Plus treats the map as first-class context by parsing tiled vector/satellite views into a grid-aligned representation that a language model can query to resolve deictic references (e.g., ``the flower-shaped building next to the park in the top-right’’). Places AI Smart Assistant (PAISA) performs camera-aware place understanding by fusing image–place embeddings with geospatial signals (location, heading, proximity) to ground a scene, surface salient attributes, and generate concise explanations. A lightweight multi-agent design keeps latency low and exposes interpretable intermediate decisions. Across map-centric QA and camera-to-place grounding tasks, IMAIA improves accuracy and responsiveness over strong baselines while remaining practical for user-facing deployments. By unifying language, maps, and geospatial cues, IMAIA moves beyond scripted tools toward conversational mapping that is both spatially grounded and broadly usable.

[437] Large Language Models for Wireless Communications: From Adaptation to Autonomy

Le Liang, Hao Ye, Yucheng Sheng, Ouya Wang, Jiacheng Wang, Shi Jin, Geoffrey Ye Li

Main category: cs.AI

TL;DR: LLMs are being explored for wireless communications to provide intelligent, adaptive solutions through three approaches: adapting pretrained LLMs, developing wireless-specific foundation models, and enabling autonomous agentic LLMs.

DetailsMotivation: Increasing complexity and dynamics in wireless communications demand intelligent and adaptive solutions that can leverage LLMs' reasoning, generalization, and zero-shot learning capabilities.

Method: Three key directions: 1) Adapting pretrained LLMs for communication tasks, 2) Developing wireless-specific foundation models for better efficiency, 3) Enabling agentic LLMs with autonomous reasoning and coordination capabilities.

Result: The article highlights recent advances, practical case studies, and unique benefits of LLM-based approaches over traditional methods in wireless communications.

Conclusion: LLMs can transform wireless systems toward intelligent, adaptive, and autonomous networks, with open challenges in multimodal fusion, collaboration with lightweight models, and self-improving capabilities.

Abstract: The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks.

[438] The DeepLog Neurosymbolic Machine

Vincent Derkinderen, Robin Manhaeve, Rik Adriaensen, Lucas Van Praet, Lennert De Smet, Giuseppe Marra, Luc De Raedt

Main category: cs.AI

TL;DR: DeepLog is a theoretical and operational framework for neurosymbolic AI that provides building blocks and primitives for representing and emulating various neurosymbolic systems through an annotated neural extension of grounded first-order logic and extended algebraic circuits.

DetailsMotivation: To create a unified framework that abstracts commonly used representations and computational mechanisms in neurosymbolic AI, enabling easier development and comparison of different neurosymbolic systems across various logic types (Boolean, fuzzy, probabilistic) and implementation approaches.

Method: DeepLog consists of two key components: 1) The DeepLog language for specifying neurosymbolic models and inference tasks using an annotated neural extension of grounded first-order logic, and 2) Extended algebraic circuits as computational graphs at the computational level. The framework is implemented in software with GPU acceleration.

Result: The framework demonstrates generality and efficiency through experimental comparisons showing differences between fuzzy and probabilistic logics, between using logic in architecture vs loss function, and between CPU-based and GPU-based implementations.

Conclusion: DeepLog provides a comprehensive neurosymbolic abstract machine that enables declarative specification of diverse neurosymbolic models and efficient computation through GPU acceleration, serving as a unified framework for neurosymbolic AI research and development.

Abstract: We contribute a theoretical and operational framework for neurosymbolic AI called DeepLog. DeepLog introduces building blocks and primitives for neurosymbolic AI that make abstraction of commonly used representations and computational mechanisms in neurosymbolic AI. DeepLog can represent and emulate a wide range of neurosymbolic systems. It consists of two key components. The first is the DeepLog language for specifying neurosymbolic models and inference tasks. This language consists of an annotated neural extension of grounded first-order logic, and makes abstraction of the type of logic, e.g. Boolean, fuzzy or probabilistic, and whether logic is used in the architecture or in the loss function. The second DeepLog component is situated at the computational level and uses extended algebraic circuits as computational graphs. Together these two components are to be considered as a neurosymbolic abstract machine, with the DeepLog language as the intermediate level of abstraction and the circuits level as the computational one. DeepLog is implemented in software, relies on the latest insights in implementing algebraic circuits on GPUs, and is declarative in that it is easy to obtain different neurosymbolic models by making different choices for the underlying algebraic structures and logics. The generality and efficiency of the DeepLog neurosymbolic machine is demonstrated through an experimental comparison between 1) different fuzzy and probabilistic logics, 2) between using logic in the architecture or in the loss function, and 3) between a standalone CPU-based implementation of a neurosymbolic AI system and a DeepLog GPU-based one.

[439] From Image Generation to Infrastructure Design: a Multi-agent Pipeline for Street Design Generation

Chenguang Wang, Xiang Yan, Yilong Dai, Ziyi Wang, Susu Xu

Main category: cs.AI

TL;DR: A multi-agent system that edits bicycle facilities directly on real-world street-view imagery for active transportation planning, integrating lane localization, prompt optimization, design generation, and automated evaluation.

DetailsMotivation: Traditional approaches for creating realistic visual renderings of street-design scenarios are labor-intensive, hindering public engagement in active transportation planning. While AI-assisted generative design shows potential, existing approaches require large domain-specific training data and struggle with precise spatial variations in complex street-view scenes.

Method: A multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs.

Result: Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results.

Conclusion: This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design, enabling more efficient public engagement and collaborative decision-making.

Abstract: Realistic visual renderings of street-design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor-intensive, hindering collective deliberation and collaborative decision-making. While AI-assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain-specific training data and struggle to enable precise spatial variations of design/configuration in complex street-view scenes. We introduce a multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results. This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design.

[440] Hilbert: Recursively Building Formal Proofs with Informal Reasoning

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, Ke Ye

Main category: cs.AI

TL;DR: Hilbert is an agentic framework that combines informal reasoning LLMs with formal theorem proving LLMs to bridge the gap between natural language mathematical reasoning and verifiable formal proofs in Lean 4.

DetailsMotivation: Current prover LLMs solve fewer problems than general-purpose LLMs in natural language, creating a gap between informal mathematical reasoning and formal verification. There's a need to combine the strengths of both approaches.

Method: Hilbert orchestrates four components: informal reasoning LLM, specialized prover LLM for Lean 4 tactics, formal verifier, and semantic theorem retriever. It uses recursive decomposition to split problems into subgoals and leverages verifier feedback to refine incorrect proofs.

Result: Achieves 99.2% on miniF2F (6.6 points above best public method), strongest known public result on PutnamBench, solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) with 422% improvement over best public baseline.

Conclusion: Hilbert effectively narrows the gap between informal reasoning and formal proof generation by combining complementary strengths of different LLM types, achieving state-of-the-art performance on mathematical theorem proving benchmarks.

Abstract: Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically checked. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% points above the best publicly available method. Hilbert achieves the \textbf{strongest known result} from a publicly available model on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation. Code is available at https://github.com/Rose-STL-Lab/ml-hilbert.

[441] Zephyrus: An Agentic Framework for Weather Science

Sumanth Varambally, Marshall Fisher, Jas Thakker, Yiwei Chen, Zhirui Xia, Yasaman Jafari, Ruijia Niu, Manas Jain, Veeramakali Vignesh Manivannan, Zachary Novack, Luyu Han, Srikar Eranky, Salva Rühling Cachay, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yi-An Ma, Rose Yu

Main category: cs.AI

TL;DR: First agentic framework for weather science combining LLMs with numerical weather data, featuring Zephyrus agent, ZephyrusWorld environment, and ZephyrusBench benchmark for weather reasoning tasks.

DetailsMotivation: Bridge the gap between foundation weather models (which lack language reasoning) and LLMs (which can't handle high-dimensional meteorological data) to enable interactive scientific workflows with weather data.

Method: Created ZephyrusWorld environment with Python tools for weather data interaction, Zephyrus multi-turn LLM agent for iterative analysis, and ZephyrusBench benchmark with scalable data generation pipeline for diverse weather tasks.

Result: Zephyrus agents outperform text-only baselines by up to 44 percentage points in correctness on ZephyrusBench, though hard tasks remain challenging even with frontier LLMs.

Conclusion: Successfully created first agentic framework for weather science that combines LLM reasoning with numerical weather data, with benchmark showing strong performance but room for improvement on complex tasks.

Abstract: Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building the first agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools including a WeatherBench 2 dataset indexer, geolocator for geocoding from natural language, weather forecasting, climate simulation capabilities, and a climatology module for querying precomputed climatological statistics (e.g., means, extremes, and quantiles) across multiple timescales. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 44 percentage points in correctness. However, the hard tasks are still difficult even with frontier LLMs, highlighting the challenging nature of our benchmark and suggesting room for future development. Our codebase and benchmark are available at https://github.com/Rose-STL-Lab/Zephyrus.

[442] PREFINE: Personalized Story Generation via Simulated User Critics and User-Specific Rubric Generation

Kentaro Ueda, Takehiro Takayanagi

Main category: cs.AI

TL;DR: PREFINE enables personalized story generation without user feedback or fine-tuning by using pseudo-user agents and user-specific rubrics to critique and refine drafts.

DetailsMotivation: Existing personalized story generation methods require explicit user feedback or model fine-tuning, which raises usability, scalability, and privacy concerns.

Method: PREFINE constructs a pseudo-user agent from interaction history, generates user-specific rubrics as evaluation criteria, then uses these to critique and iteratively refine story drafts.

Result: Outperforms existing in-context personalization and critique-based methods on PerDOC and PerMPST datasets, achieving better personalization while preserving story quality.

Conclusion: Demonstrates effective inference-only, rubric-guided personalization with applications beyond storytelling to dialogue, recommendation, and education.

Abstract: Personalizing story generation to individual users remains a core challenge in natural language generation. Existing approaches typically require explicit user feedback or fine-tuning, which pose practical concerns in terms of usability, scalability, and privacy. In this work, we introduce PREFINE (Persona-and-Rubric Guided Critique-and-Refine), a novel Critique-and-Refine framework that enables personalized story generation without user feedback or parameter updates. PREFINE constructs a pseudo-user agent from a user’s interaction history and generates user-specific rubrics (evaluation criteria). These components are used to critique and iteratively refine story drafts toward the user’s preferences. We evaluate PREFINE on two benchmark datasets, PerDOC and PerMPST, and compare it with existing approaches. Both automatic and human evaluations show that PREFINE achieves significantly better personalization while preserving general story quality. Notably, PREFINE outperforms existing in-context personalization and critique-based generation methods, and can even enhance already personalized outputs through post-hoc refinement. Our analysis reveals that user-specific rubrics are critical in driving personalization. The results demonstrate the effectiveness and practicality of inference-only, rubric-guided personalization, with potential applications beyond storytelling, including dialogue, recommendation, and education.

[443] Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

Main category: cs.AI

TL;DR: A cost-efficient multi-agent judging framework using Small Language Models (SLMs) for LLM safety evaluation, with a new human-annotated jailbreak benchmark called HAJailBench.

DetailsMotivation: Current LLM safety evaluation relies on expensive frontier models (like GPT-4) as judges, which limits scalability. There's a need for more cost-efficient approaches to evaluate LLM safety at scale.

Method: Proposes a multi-agent framework with SLMs structured as critic, defender, and judge agents engaging in debates. Also constructs HAJailBench - a large-scale human-annotated jailbreak benchmark with 12,000 adversarial interactions across diverse attack methods and target models.

Result: The SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference costs. Three rounds of debate yield optimal balance between accuracy and efficiency.

Conclusion: Structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks, and HAJailBench provides a reliable foundation for scalable LLM safety evaluation.

Abstract: Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.

[444] Alignment-Aware Quantization for LLM Safety

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak

Main category: cs.AI

TL;DR: CAQ introduces contrastive alignment loss to maintain safety alignment during post-training quantization of LLMs, addressing the vulnerability where standard PTQ preserves perplexity but degrades safety.

DetailsMotivation: Standard PTQ methods minimize reconstruction error without accounting for behavioral alignment from safety fine-tuning, creating a systematic vulnerability where models maintain low perplexity but degrade in safety alignment.

Method: Proposes Contrastive Alignment Quantization (CAQ) with Contrastive Alignment Loss (CAL) that uses a push-pull mechanism: steering quantized model toward safe instruction-tuned version while diverging from unaligned pre-trained reference, requiring no specialized safety datasets.

Result: CAQ enables robust 4-bit (W4A4) quantization across LLaMA, Qwen, and Mistral families, achieving superior safety alignment where state-of-the-art PTQ methods fail, without sacrificing general capabilities.

Conclusion: CAQ addresses the fundamental incompleteness of standard PTQ objectives by integrating behavioral alignment, providing a practical solution for safe and efficient LLM deployment.

Abstract: Post-Training Quantization (PTQ) has become the de-facto standard for efficient LLM deployment, yet its optimization objective remains fundamentally incomplete. Standard PTQ methods minimize reconstruction error (e.g., MSE or KL divergence) without accounting for behavioral alignment–a critical property instilled through safety fine-tuning. We demonstrate that this objective mismatch introduces a systematic vulnerability: models can maintain low perplexity yet exhibit significant degradation in safety alignment, revealing that perplexity alone is an insufficient and often misleading proxy for deployment readiness. To address this, we propose Contrastive Alignment Quantization (CAQ), which extends the PTQ objective design space by integrating a Contrastive Alignment Loss (CAL). CAL introduces a principled push-pull mechanism that jointly optimizes distributional fidelity and behavioral alignment: it steers the quantized model toward its safe, instruction-tuned counterpart while diverging from the unaligned, pre-trained reference. CAQ requires no specialized safety datasets, relying solely on standard calibration data, and introduces negligible computational overhead over existing transformation-based PTQ pipelines. We show that CAQ enables robust 4-bit (W4A4) quantization across diverse model families–including LLaMA, Qwen, and Mistral–achieving superior safety alignment where state-of-the-art PTQ methods fail, without sacrificing general capabilities. Anonymized code is available in the supplementary material.

[445] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

Main category: cs.AI

TL;DR: SpatialBench: A hierarchical benchmark for evaluating spatial cognition in multimodal LLMs across 5 cognitive levels and 15 tasks, revealing models’ limitations in symbolic reasoning and planning despite strong perceptual grounding.

DetailsMotivation: Existing benchmarks oversimplify spatial cognition in MLLMs as single-dimensional metrics, failing to capture hierarchical structure and interdependence of spatial abilities needed for real-world multimodal intelligence.

Method: Proposed hierarchical spatial cognition framework with 5 progressively complex levels (basic observation to high-level planning), constructed SpatialBench benchmark covering 15 tasks aligned with these levels, and introduced capability-oriented metric for unified evaluation.

Result: MLLMs show distinct performance stratification: strong perceptual grounding but limited symbolic reasoning, causal inference, and planning. Human tests reveal humans perform goal-directed abstraction while MLLMs over-attend to surface details without coherent spatial intent.

Conclusion: Establishes first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying foundation for future spatially intelligent systems and highlighting need for improved symbolic reasoning and planning capabilities.

Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model’s overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

[446] Analyzing Planner Design Trade-offs for MAPF under ADG-based Realistic Execution

Jingtian Yan, Zhifei Li, William Kang, Stephen F. Smith, Jiaoyang Li

Main category: cs.AI

TL;DR: This paper analyzes how Multi-Agent Path Finding (MAPF) planner design choices affect performance under realistic execution settings, examining solution optimality, kinodynamic modeling accuracy, and their tradeoffs.

DetailsMotivation: There's a gap between simplified MAPF benchmarks and real-world performance. Existing frameworks like SMART enable realistic evaluation, but it's unclear how planner design choices affect practical deployment in industrial settings with physical constraints.

Method: Systematically studies three factors: (1) relationship between solution optimality and execution performance, (2) sensitivity to kinodynamic modeling inaccuracies, and (3) tradeoff between model accuracy and plan optimality. Uses empirical examination of these factors in realistic scenarios.

Result: The paper provides empirical insights into how planner design choices affect performance in realistic execution settings, highlighting the complex relationships between solution optimality, modeling accuracy, and practical outcomes.

Conclusion: Identifies open challenges and research directions to guide the community toward practical, real-world MAPF deployment, emphasizing the need to bridge algorithmic benchmarks with physical execution realities.

Abstract: Multi-Agent Path Finding (MAPF) algorithms are increasingly deployed in industrial warehouses and automated manufacturing facilities, where robots must operate reliably under real-world physical constraints. However, existing MAPF evaluation frameworks typically rely on simplified robot models, leaving a substantial gap between algorithmic benchmarks and practical performance. Recent frameworks such as SMART combine kinodynamic modeling with execution based on the Action Dependency Graph (ADG), enabling realistic, large-scale MAPF evaluation. Building on this capability, this work investigates how key planner design choices influence performance under realistic execution settings. We systematically study three fundamental factors: (1) the relationship between solution optimality and execution performance, (2) the sensitivity of system performance to inaccuracies in kinodynamic modeling, and (3) the tradeoff between model accuracy and plan optimality. Empirically, we examine these factors to understand how these design choices affect performance in realistic scenarios. We highlight open challenges and research directions to steer the community toward practical, real-world deployment.

[447] Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning

Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan LU

Main category: cs.AI

TL;DR: STC is a unified framework that integrates reasoning and self-critique at every step within a single LLM, trained with hybrid reinforcement learning to optimize both solution correctness and self-evaluation reliability.

DetailsMotivation: Current LLMs treat reasoning and verification as separate processes - either generating reasoning without self-checking or relying on external verifiers. This lacks immediate feedback and increases system complexity, unlike human critical thinking where reasoning and evaluation are intertwined.

Method: Stepwise Think-Critique (STC) interleaves reasoning and self-critique at every intermediate step within a single end-to-end trainable model. It uses hybrid reinforcement learning with reasoning rewards and critique-consistency rewards to jointly optimize solution correctness and self-evaluation reliability.

Result: Experiments on mathematical reasoning benchmarks show STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing progress toward LLMs with built-in critical thinking.

Conclusion: STC provides a unified approach to integrate reasoning and self-critique within LLMs, advancing toward models with human-like critical thinking capabilities through end-to-end training of both components.

Abstract: Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) treat the reasoning and verification as separate processes: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified and end-to-end trainable framework that interleaves reasoning and self-critique at every intermediate step within a single model. STC is trained with a hybrid reinforcement learning objective that integrates reasoning rewards and critique-consistency rewards, thereby jointly optimizing solution correctness and reliability of self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.

[448] Zero-Shot Time Series Foundation Models for Annual Institutional Forecasting Under Data Sparsity

Jittarin Jetwiriyanon, Teo Susnjak, Surangika Ranathunga

Main category: cs.AI

TL;DR: Zero-shot Time Series Foundation Models benchmarked against classical methods for annual enrollment forecasting, using Google Trends and LLM-derived institutional index as covariates in a leakage-safe protocol.

DetailsMotivation: Annual institutional demand forecasting is challenging due to data sparsity, reporting changes, and regime shifts. Traditional baselines struggle with low signal-to-noise conditions, while sample sizes are too small for complex parameterized models.

Method: Benchmark zero-shot Time Series Foundation Models against classical persistence and ARIMA baselines. Introduce leakage-safe covariate protocol incorporating Google Trends proxies and novel LLM-derived Institutional Operating Conditions Index (IOCI). Use expanding-window backtest with strict vintage control to evaluate point accuracy and probabilistic calibration.

Result: Covariate-conditioned TSFMs perform competitively with classical methods in short samples, though performance varies significantly by model capacity and cohort.

Conclusion: Provides an auditable framework for operationalizing narrative evidence into exogenous predictors, offering practical guidance for forecasting under data sparsity.

Abstract: Forecasting annual institutional demand is notoriously difficult due to data sparsity, reporting changes, and regime shifts. Traditional baselines often falter under these low signal-to-noise conditions, yet sample sizes are too small for complex parameterised models. We benchmark zero-shot Time Series Foundation Models (TSFMs) against classical persistence and ARIMA baselines for annual enrolment forecasting. To address structural breaks without look-ahead bias, we introduce a leakage-safe covariate protocol incorporating Google Trends proxies and a novel LLM-derived Institutional Operating Conditions Index (IOCI). Using an expanding-window backtest with strict vintage control, we evaluate point accuracy and probabilistic calibration. We find that covariate-conditioned TSFMs perform competitively with classical methods in short samples, though performance varies significantly by model capacity and cohort. We provide an auditable framework for operationalising narrative evidence into exogenous predictors, offering practical guidance for forecasting under data sparsity.

[449] MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

Ruoran Li, Xinghua Zhang, Haiyang Yu, Shitong Duan, Xiang Li, Wenxin Xiang, Chonghua Liao, Xudong Guo, Yongbin Li, Jinli Suo

Main category: cs.AI

TL;DR: MemPO is a self-memory policy optimization algorithm that enables agents to autonomously summarize and manage memory during environment interaction, reducing token consumption while preserving task performance.

DetailsMotivation: Long-horizon agents face growing context size during environment interaction, which degrades performance and stability. Existing methods use external memory modules that don't allow the agent to proactively manage memory content aligned with task objectives.

Method: Proposes self-memory policy optimization (MemPO) algorithm that enables the policy model to autonomously summarize and manage memory during interaction. Uses improved credit assignment mechanism based on memory effectiveness to selectively retain crucial information.

Result: Achieves absolute F1 score gains of 25.98% over base model and 7.1% over previous SOTA baseline, while reducing token usage by 67.58% and 73.12% respectively.

Conclusion: MemPO effectively addresses memory management challenges in long-horizon agents by enabling autonomous memory summarization and management, significantly reducing computational overhead while improving task performance.

Abstract: Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent’s overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98% over the base model and 7.1% over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%. The code is released at https://github.com/TheNewBeeKing/MemPO.

[450] UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang

Main category: cs.AI

TL;DR: UIS-Digger addresses the blind spot in LLM-based agents for Unindexed Information Seeking (UIS) - information not captured by search engine crawlers - with a novel multi-agent framework and dedicated benchmark.

DetailsMotivation: Current LLM-based information-seeking agents heavily rely on search-engine-indexed knowledge, leaving a critical blind spot for unindexed information (overlooked content, dynamic webpages, embedded files). This Unindexed Information Seeking (UIS) problem is significant but underexplored.

Method: Proposes UIS-Digger, a novel multi-agent framework with dual-mode browsing that enables simultaneous webpage searching and file parsing. Uses a relatively small ~30B-parameter backbone LLM optimized with SFT and RFT training strategies.

Result: UIS-Digger achieves 27.27% on the new UIS-QA benchmark, outperforming systems with sophisticated LLMs like O3 and GPT-4.1. State-of-the-art agents experience drastic performance drops on UIS-QA (from 70.90 on GAIA to 24.55 on UIS-QA).

Conclusion: The work uncovers a fundamental limitation in current agent evaluation paradigms and provides the first toolkit for advancing UIS research, defining a new direction for robust information-seeking systems that proactively interact with unindexed sources.

Abstract: Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small $\sim$30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems. The dataset has been released at: https://huggingface.co/datasets/UIS-Digger/UIS-QA.

[451] Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, Jessica Wang

Main category: cs.AI

TL;DR: AI models show increasing autonomous cyber-attack capabilities that scale with inference compute and improve across generations, achieving significant progress in corporate network attacks but limited success in industrial control systems.

DetailsMotivation: To evaluate how frontier AI models perform in autonomous cyber-attack scenarios requiring extended action sequences across different environments, and to understand how capabilities scale with inference-time compute and evolve across model generations.

Method: Tested seven AI models released over 18 months on two purpose-built cyber ranges: a 32-step corporate network attack and a 7-step industrial control system attack. Compared performance at varying inference-time compute budgets (10M to 100M tokens) and across model generations.

Result: Two key findings: 1) Performance scales log-linearly with inference-time compute (10M to 100M tokens yields up to 59% gains), 2) Each successive model generation outperforms predecessors at fixed token budgets. Best corporate network attack completed 22/32 steps (~6 hours of human expert work), while industrial control system performance remains limited (average 1.2-1.4/7 steps, max 3).

Conclusion: Frontier AI models demonstrate rapidly improving autonomous cyber-attack capabilities that scale predictably with compute and improve across generations, posing significant security implications as models become more capable of executing complex attack sequences.

Abstract: We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges-a 32-step corporate network attack and a 7-step industrial control system attack-that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau-increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 (max 3).

[452] Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction

Shuzhen Bi, Mengsong Wu, Hao Hao, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou

Main category: cs.AI

TL;DR: A framework for automated extraction of specialized agent skills from open-source repositories to augment LLMs with procedural expertise, focusing on visualization and educational capabilities from systems like TheoremExplainAgent and Code2Video using Manim engine.

DetailsMotivation: Monolithic LLMs lack specialized procedural expertise needed for autonomous workflows, despite having broad declarative knowledge. There's a need to augment LLMs with specialized skills without requiring model retraining.

Method: Systematic framework for mining open-source repositories (GitHub) to extract agent skills. Includes repository structural analysis, semantic skill identification through dense retrieval, and translation to standardized SKILL.md format. Focuses on extracting visualization and educational capabilities from systems using Manim mathematical animation engine.

Result: Agent-generated educational content achieves 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human-crafted tutorials. The framework enables scalable acquisition of procedural knowledge.

Conclusion: Automated extraction of high-quality agent skills from open-source repositories provides a scalable approach to augment LLMs with specialized procedural expertise, enhancing their utility in autonomous workflows without model retraining.

Abstract: The transition from monolithic large language models (LLMs) to modular, skill-equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general-purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows is frequently constrained by insufficient specialized procedural expertise. This report investigates a systematic framework for automated acquisition of high-quality agent skills through mining of open-source repositories on platforms such as GitHub. We focus on the extraction of visualization and educational capabilities from state-of-the-art systems including TheoremExplainAgent and Code2Video, both utilizing the Manim mathematical animation engine. The framework encompasses repository structural analysis, semantic skill identification through dense retrieval, and translation to the standardized SKILL.md format. We demonstrate that systematic extraction from agentic repositories, combined with rigorous security governance and multi-dimensional evaluation metrics, enables scalable acquisition of procedural knowledge that augments LLM capabilities without requiring model retraining. Our analysis reveals that agent-generated educational content can achieve 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human-crafted tutorials.

[453] Relationship-Aware Safety Unlearning for Multimodal LLMs

Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo

Main category: cs.AI

TL;DR: A framework for targeted safety unlearning in multimodal models that focuses on unsafe object-relation-object tuples while preserving benign uses of the same objects and relations.

DetailsMotivation: Existing safety unlearning approaches often target isolated concepts or image-text pairs, causing collateral damage to benign uses of the same objects and relations. The paper addresses relational safety failures where two benign concepts become unsafe when linked by specific actions or relations.

Method: Proposes relationship-aware safety unlearning framework that explicitly represents unsafe object-relation-object tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations.

Result: Includes CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.

Conclusion: The framework enables targeted safety interventions in multimodal models by focusing on unsafe relational patterns rather than isolated concepts, reducing collateral damage to benign uses.

Abstract: Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.

[454] Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

J Rosser

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.14665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[455] OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Peigen Liu, Rui Ding, Yuren Mao, Ziyan Jiang, Yuxiang Ye, Yunjun Gao, Ying Zhang, Renjie Sun, Longbin Lai, Zhengping Qian

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.14771 suggests it’s from March 2023, but no content available for analysis.

DetailsMotivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2603.14771: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14771&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[456] Why the Valuable Capabilities of LLMs Are Precisely the Unexplainable Ones

Quan Cheng

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.15238 suggests it’s from March 2023, but content is unavailable for analysis.

DetailsMotivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates the arXiv API is rate limiting requests, preventing retrieval of the abstract.

Method: No method information available due to failed API request. The paper ID format suggests it’s from March 2023, but specific technical approach cannot be assessed.

Result: No results available for analysis. The paper content could not be retrieved from arXiv due to rate limiting restrictions.

Conclusion: Unable to draw conclusions about the paper’s content or relevance. The arXiv API rate limiting prevents access to the abstract needed for proper analysis.

Abstract: Failed to fetch summary for 2603.15238: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15238&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[457] LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing

Hongxiang Zhang, Yuyang Rong, Yifeng He, Hao Chen

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2406.07714 exists but summary cannot be retrieved.

DetailsMotivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot determine conclusion without access to paper content.

Abstract: Failed to fetch summary for 2406.07714: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.07714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[458] A convergence law for continuous logic and continuous structures with finite domains

Vera Koponen

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2504.08923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.08923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[459] Addressing the Minor-Embedding Problem in Quantum Annealing and Evaluating State-of-the-Art Algorithm Performance

Aitor Gomez-Tejedor, Eneko Osaba, Esther Villar-Rodriguez

Main category: cs.AI

TL;DR: Paper ID 2504.13376 - Unable to fetch abstract due to HTTP 429 error (rate limiting)

DetailsMotivation: Cannot determine motivation as abstract is unavailable due to rate limiting from arXiv API

Method: Cannot determine method as abstract is unavailable due to rate limiting from arXiv API

Result: Cannot determine results as abstract is unavailable due to rate limiting from arXiv API

Conclusion: Cannot determine conclusion as abstract is unavailable due to rate limiting from arXiv API

Abstract: Failed to fetch summary for 2504.13376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[460] Diverse AI Personas Can Mitigate the Homogenization Effect in Human-AI Collaborative Ideation

Yun Wan, Yoram M Kalman

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2504.13868: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13868&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[461] Boosting Text-to-Chart Retrieval through Training with Synthesized Semantic Insights

Yifan Wu, Lutao Yan, Yizhang Zhu, Yenchi Tseng, Yinan Mei, Yong Wang, Jiannan Wang, Nan Tang, Yuyu Luo

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to API rate limiting error

Method: Unable to determine method due to API rate limiting error

Result: Unable to determine results due to API rate limiting error

Conclusion: Unable to determine conclusion due to API rate limiting error

Abstract: Failed to fetch summary for 2505.10043: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.10043&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[462] Explanation User Interfaces: A Systematic Literature Review

Eleonora Cappuccio, Andrea Esposito, Francesco Greco, Giuseppe Desolda, Rosa Lanzilotti, Salvatore Rinzivillo

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to draw conclusions due to access restrictions

Abstract: Failed to fetch summary for 2505.20085: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20085&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[463] Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities

Georges Sfeir, Gabriel Nova, Stephane Hess, Sander van Cranenburgh

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2507.21790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[464] Sharing State Between Prompts and Programs

Ellie Y. Cheng, Logan Weber, Tian Jin, Michael Carbin

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2512.14805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[465] AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research

Ignacio Heredia, Álvaro López García, Fernando Aguilar Gómez, Diego Aguirre, Caterina Alarcón Marín, Khadijeh Alibabaei, Lisana Berberi, Miguel Caballer, Amanda Calatrava, Pedro Castro, Alessandro Costantini, Mario David, Jaime Díez Stefan Dlugolinsky, Borja Esteban Sanchis, Giacinto Donvito, Leonhard Duda, Saúl Fernandez, Andrés Heredia Canales, Valentin Kozlov, Sergio Langarita, João Machado, Germán Moltó, Daniel San Martín, Martin Šeleng, Giang Nguyen, Marcin Płóciennik, Marta Obregón Ruiz, Susana Rebolledo Ruiz, Vicente Rodriguez, Judith Sáinz-Pardo Díaz, Viet Tran

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2512.16455: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16455&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[466] Aletheia: What Makes RLVR For Code Verifiers Tick?

Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting).

DetailsMotivation: Cannot determine motivation as paper content is unavailable.

Method: Cannot determine method as paper content is unavailable.

Result: Cannot determine results as paper content is unavailable.

Conclusion: Cannot determine conclusion as paper content is unavailable.

Abstract: Failed to fetch summary for 2601.12186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[467] Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents

Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagliabue, Luca Bigon

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.02335: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02335&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[468] Ask don’t tell: Reducing sycophancy in large language models

Magda Dubois, Cozmin Ududec, Christopher Summerfield, Lennart Luettgau

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.23971: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23971&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[469] DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice

Alejandro Pradas-Gomez, Arindam Brahma, Ola Isaksson

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.10249 due to HTTP 429 error when fetching from arXiv API

DetailsMotivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.10249: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10249&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[470] Exploring Collatz Dynamics with Human-LLM Collaboration

Edward Y. Chang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2603.11066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[471] ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics

Omar Coser

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

DetailsMotivation: Unable to determine motivation as paper content could not be retrieved due to technical limitations

Method: N/A - Paper content unavailable due to HTTP 429 error from arXiv API

Result: N/A - No results available due to failed API request

Conclusion: Technical issue prevented analysis of paper 2603.11872

Abstract: Failed to fetch summary for 2603.11872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[472] Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker’s Dilemma

Reva Schwartz, Gabriella Waters

Main category: cs.AI

TL;DR: Paper ID 2603.13294 could not be fetched due to HTTP 429 error (rate limiting), so no abstract content is available for analysis.

DetailsMotivation: Unable to determine motivation due to lack of access to paper content.

Method: Unable to determine method due to lack of access to paper content.

Result: Unable to determine results due to lack of access to paper content.

Conclusion: Unable to draw conclusions due to lack of access to paper content.

Abstract: Failed to fetch summary for 2603.13294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[473] Is Seeing Believing? Evaluating Human Sensitivity to Synthetic Video

David Wegmann, Emil Stevnsborg, Søren Knudsen, Luca Rossi, Aske Mottelson

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2603.13846: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13846&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[474] Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Madhulatha Mandarapu, Sandeep Kunkunuru

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.15080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

[475] PulmoVec: A Two-Stage Stacking Meta-Learning Architecture Built on the HeAR Foundation Model for Multi-Task Classification of Pediatric Respiratory Sounds

Izzet Turkalp Akbasli, Oguzhan Serin

Main category: cs.SD

TL;DR: PulmoVec is a multi-task AI framework using the HeAR foundation model for pediatric respiratory sound classification, achieving high performance in screening, sound-pattern recognition, and disease-group prediction.

DetailsMotivation: Respiratory diseases are a major cause of childhood morbidity/mortality, but lung auscultation is subjective with high inter-listener variability. Existing AI approaches are limited by small datasets and single-task designs.

Method: Used SPRSound database with 24,808 annotated segments from 1,652 pediatric patients. Trained three task-specific classifiers on HeAR foundation model for screening, sound-pattern recognition, and disease-group prediction. Combined outputs with demographic metadata in LightGBM stacking meta-model, with event-level predictions aggregated to patient level via ensemble voting.

Result: Event-level: screening ROC-AUC 0.96, sound-pattern recognition macro ROC-AUC 0.96, disease-group prediction macro ROC-AUC 0.94. Patient-level: disease-group classification accuracy 0.74, weighted F1-score 0.73, macro ROC-AUC 0.91. Stacking improved performance across all tasks.

Conclusion: PulmoVec links event-level acoustic phenotyping with patient-level clinical classification, demonstrating potential of foundation-model-based digital auscultation in pediatric respiratory medicine. Multi-center external validation needed.

Abstract: Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. Existing AI approaches are further constrained by small datasets and single-task designs. We developed PulmoVec, a multi-task framework built on the Health Acoustic Representations (HeAR) foundation model for classification of pediatric respiratory sounds. Methods: In this retrospective analysis of the SPRSound database, 24,808 event-level annotated segments from 1,652 pediatric patients were analyzed. Three task-specific classifiers were trained for screening, sound-pattern recognition, and disease-group prediction. Their out-of-fold probability outputs were combined with demographic metadata in a LightGBM stacking meta-model, and event-level predictions were aggregated to the patient level using ensemble voting. Results: At the event level, the screening model achieved an ROC-AUC of 0.96 (95% CI, 0.95-0.97), the sound-pattern recognition model a macro ROC-AUC of 0.96 (95% CI, 0.96-0.97), and the disease-group prediction model a macro ROC-AUC of 0.94 (95% CI, 0.93-0.94). At the patient level, disease-group classification yielded an accuracy of 0.74 (95% CI, 0.71-0.77), a weighted F1-score of 0.73, and a macro ROC-AUC of 0.91 (95% CI, 0.90-0.93). Stacking improved performance across all tasks compared with base models alone. Conclusions: PulmoVec links event-level acoustic phenotyping with patient-level clinical classification, supporting the potential of foundation-model-based digital auscultation in pediatric respiratory medicine. Multi-center external validation across devices and real-world conditions remains essential.

[476] Diffusion Models for Joint Audio-Video Generation

Alejandro Paredes La Torre

Main category: cs.SD

TL;DR: Proposes a sequential two-step pipeline for joint audio-video generation, releases two high-quality paired datasets, trains MM-Diffusion from scratch, investigates joint latent diffusion challenges, and demonstrates modular text-to-audio-video synthesis.

DetailsMotivation: While multimodal generative models have advanced in single-modality synthesis, truly joint audio-video generation remains an open challenge. The paper aims to advance this field by addressing the lack of high-quality paired datasets and developing effective methods for synchronized audio-video generation.

Method: Four key contributions: 1) Release two high-quality paired audio-video datasets (13h video-game clips, 64h concert performances, segmented into 34-second samples). 2) Train MM-Diffusion architecture from scratch on these datasets. 3) Investigate joint latent diffusion using pretrained encoders/decoders. 4) Propose sequential two-step text-to-audio-video pipeline: generate video first, then condition on both video output and original prompt to synthesize synchronized audio.

Result: Demonstrates ability to produce semantically coherent audio-video pairs with quantitative evaluation of alignment on rapid actions and musical cues. Shows challenges in joint latent diffusion decoding. The sequential modular approach yields high-fidelity generations of audio-video content.

Conclusion: The paper advances joint audio-video generation through dataset contributions, architecture training, and a practical sequential pipeline that effectively addresses synchronization challenges, providing a foundation for future multimodal generative research.

Abstract: Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.

[477] INSTRUMENTAL: Automatic Synthesizer Parameter Recovery from Audio via Evolutionary Optimization

Philipp Bogdan

Main category: cs.SD

TL;DR: Instrumental is a system that extracts continuous synthesizer parameters from audio using a differentiable subtractive synthesizer and CMA-ES optimization with perceptual loss functions.

DetailsMotivation: Existing audio-to-MIDI tools only extract notes and discard timbral characteristics that define instrument identity. There's a need to recover continuous synthesizer parameters to preserve the full expressive qualities of audio signals.

Method: Couples a differentiable 28-parameter subtractive synthesizer with CMA-ES (derivative-free evolutionary optimizer). Uses composite perceptual loss combining mel-scaled STFT, spectral centroid, and MFCC divergence. Systematically evaluates eight hypotheses for improving convergence.

Result: Achieves matching loss of 2.09 on real recorded audio. Finds that CMA-ES outperforms gradient descent on non-convex landscape, more parameters don’t monotonically improve matching, spectral analysis initialization accelerates convergence, and only parametric EQ boosting yields meaningful improvement among tested hypotheses.

Conclusion: The system successfully recovers continuous synthesizer parameters from audio, demonstrating that evolutionary optimization with perceptual losses can effectively capture timbral characteristics that traditional audio-to-MIDI tools miss.

Abstract: Existing audio-to-MIDI tools extract notes but discard the timbral characteristics that define an instrument’s identity. We present Instrumental, a system that recovers continuous synthesizer parameters from audio by coupling a differentiable 28-parameter subtractive synthesizer with CMA-ES, a derivative-free evolutionary optimizer. We optimize a composite perceptual loss combining mel-scaled STFT, spectral centroid, and MFCC divergence, achieving a matching loss of 2.09 on real recorded audio. We systematically evaluate eight hypotheses for improving convergence and find that only parametric EQ boosting yields meaningful improvement. Our results show that CMA-ES outperforms gradient descent on this non-convex landscape, that more parameters do not monotonically improve matching, and that spectral analysis initialization accelerates convergence over random starts.

[478] CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

Zihao Zheng, Wen Wu, Chao Zhang, Mengyue Wu, Xuenan Xu

Main category: cs.SD

TL;DR: CAST-TTS: A unified framework for timbre control in TTS that handles both speech prompts and text prompts using a single model with cross-modal alignment and shared embedding space.

DetailsMotivation: Current TTS systems use separate models for speech-prompted and text-prompted timbre control, leading to complexity. The authors aim to unify both control signals into a single model while addressing the challenge of cross-modal alignment.

Method: Uses pre-trained encoders to extract features from speech and text prompts. Employs multi-stage training to align speech and projected text representations in a shared embedding space. Uses a single cross-attention mechanism to enable timbre control from either representation.

Result: CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The unified cross-attention mechanism is shown to be critical for high-quality synthesis.

Conclusion: The proposed framework successfully unifies speech and text prompt-based timbre control in TTS, demonstrating that a single model can handle both modalities effectively with proper cross-modal alignment.

Abstract: Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objective. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. The multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.

[479] A Semantic Timbre Dataset for the Electric Guitar

Joseph Cameron, Alan Blackwell

Main category: cs.SD

TL;DR: A curated dataset of monophonic electric guitar sounds with semantic timbre descriptors to bridge perceptual timbre and machine learning for timbre-aware generative AI research.

DetailsMotivation: Timbre understanding and manipulation is central to audio synthesis but under-explored in ML due to lack of annotated datasets linking perceptual timbre dimensions to semantic descriptors.

Method: Created Semantic Timbre Dataset with monophonic electric guitar sounds labeled with 19 semantic timbre descriptors and magnitudes derived from analysis of physical/virtual guitar effects. Trained variational autoencoder (VAE) on latent space and evaluated with human perceptual judgments and descriptor classifiers.

Result: VAE captures timbral structure and enables smooth interpolation across descriptors. Dataset validated through human perceptual judgments and classifier evaluation.

Conclusion: Dataset bridges perceptual timbre and ML representations, supporting timbre control and semantic audio generation research. Dataset, code, and evaluation protocols released.

Abstract: Understanding and manipulating timbre is central to audio synthesis, yet this remains under-explored in machine learning due to a lack of annotated datasets linking perceptual timbre dimensions to semantic descriptors. We present the Semantic Timbre Dataset, a curated collection of monophonic electric guitar sounds, each labeled with one of 19 semantic timbre descriptors and corresponding magnitudes. These descriptors were derived from a qualitative analysis of physical and virtual guitar effect units and applied systematically to clean guitar tones. The dataset bridges perceptual timbre and machine learning representations, supporting learning for timbre control and semantic audio generation. We validate the dataset by training a variational autoencoder (VAE) on its latent space and evaluating it using human perceptual judgments and descriptor classifiers. Results show that the VAE captures timbral structure and enables smooth interpolation across descriptors. We release the dataset, code, and evaluation protocols to support timbre-aware generative AI research.

[480] Evaluating Latent Space Structure in Timbre VAEs: A Comparative Study of Unsupervised, Descriptor-Conditioned, and Perceptual Feature-Conditioned Models

Joseph Cameron, Alan Blackwell

Main category: cs.SD

TL;DR: Comparative evaluation shows perceptual feature conditioning yields better latent space organization for musical timbre generation than unsupervised or discrete descriptor conditioning.

DetailsMotivation: To understand how different conditioning approaches affect latent space organization in VAEs for musical timbre generation, particularly comparing unsupervised, discrete descriptor-conditioned, and continuous perceptual feature conditioning.

Method: Evaluated three VAE variants on electric guitar sounds: unsupervised VAE, descriptor-conditioned VAE, and VAE conditioned on continuous perceptual features from AudioCommons timbral models. Used clustering metrics (silhouette scores, descriptor compactness, pitch-conditional separation, trajectory linearity, cross-pitch consistency) to assess latent structure.

Result: Conditioning on perceptual features yields more compact, discriminative, and pitch-invariant latent space, outperforming both unsupervised and discrete descriptor-conditioned models.

Conclusion: Continuous perceptual feature conditioning is superior for timbre generation, highlighting limitations of one-hot semantic conditioning and providing evaluation tools for generative audio models.

Abstract: We present a comparative evaluation of latent space organization in three Variational Autoencoders (VAEs) for musical timbre generation: an unsupervised VAE, a descriptor-conditioned VAE, and a VAE conditioned on continuous perceptual features from the AudioCommons timbral models. Using a curated dataset of electric guitar sounds labeled with 19 semantic descriptors across four intensity levels, we assess each model’s latent structure with a suite of clustering and interpretability metrics. These include silhouette scores, timbre descriptor compactness, pitch-conditional separation, trajectory linearity, and cross-pitch consistency. Our findings show that conditioning on perceptual features yields a more compact, discriminative, and pitch-invariant latent space, outperforming both the unsupervised and discrete descriptor-conditioned models. This work highlights the limitations of one-hot semantic conditioning and provides methodological tools for evaluating timbre latent spaces, contributing to the development of more controllable and interpretable generative audio models.

[481] Making Separation-First Multi-Stream Audio Watermarking Feasible via Joint Training

Houmin Sun, Zi Hu, Linxi Li, Yechen Wang, Liwei Jin, Ming Li

Main category: cs.SD

TL;DR: A multi-stream watermarking framework that embeds distinct watermarks into audio stems, mixes them, separates them, and recovers watermarks from separated outputs, with joint training of watermarking and separation systems.

DetailsMotivation: Modern audio is created by mixing stems from different sources, raising the need to independently watermark each stem and recover all watermarks after separation processes.

Method: Proposes a separation-first, multi-stream watermarking framework with joint end-to-end training of watermark system and separator. Embeds distinct information into stems using unique keys but shared structure, then mixes, separates, and decodes from each output.

Result: Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality, overcoming limitations of naive robust watermarking + off-the-shelf separation.

Conclusion: Joint training of watermarking and separation systems enables effective multi-stream audio watermarking that survives separation processes, addressing a practical need in modern audio production workflows.

Abstract: Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework-embedding distinct information into stems using unique keys but a shared structure, mixing, separating, and decoding from each output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing robustness to generic distortions does not ensure robustness to separation artifacts. To enable this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality.

[482] When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li, Tzu-Han Lin, Hung-yi Lee

Main category: cs.SD

TL;DR: LALMs suffer performance degradation on text tasks when exposed to irrelevant audio inputs like silence, noise, or environmental sounds, revealing cross-modal interference as a robustness challenge.

DetailsMotivation: To investigate the robustness of large audio-language models in noisy real-world settings, specifically examining how irrelevant audio inputs affect text reasoning tasks where audio is unnecessary.

Method: Tested LALMs across three text-based benchmarks with various irrelevant audio inputs (silence, synthetic noise, environmental sounds), analyzing effects of duration, amplitude, and decoding temperature. Evaluated mitigation strategies including prompting and self-consistency methods.

Result: Even non-informative audio reduces accuracy and increases prediction volatility; silence destabilizes outputs as strongly as synthetic noise. Larger models show greater resilience but vulnerabilities persist. Prompting has limited effectiveness, while self-consistency improves stability at computational cost.

Conclusion: Cross-modal interference is a key robustness challenge for LALMs, highlighting the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant audio inputs.

Abstract: Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.

[483] Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Main category: cs.SD

TL;DR: Tutorial for building enterprise-grade realtime voice agents using cascaded streaming pipeline (STT → LLM → TTS) as practical alternative to unavailable self-hosted end-to-end speech-to-speech models.

DetailsMotivation: While end-to-end speech-to-speech models promise best latency for voice agents, fully self-hosted solutions are not yet available. Qwen3-Omni evaluation shows limitations: cloud-only API not self-hostable, local deployments either incomplete or too slow for realtime.

Method: Cascaded streaming pipeline architecture using Deepgram for streaming STT, vLLM-served LLMs with function calling for streaming text generation, and ElevenLabs for streaming TTS. Full codebase released as 9-chapter progressive tutorial.

Result: Achieved measured time-to-first-audio of 755ms (best case 729ms) with full function calling support. Qwen3-Omni evaluation: cloud-only DashScope API ~702ms latency (not self-hostable), local vLLM only supports Thinker (516ms), local Transformers ~146s (too slow).

Conclusion: Cascaded streaming pipeline remains practical architecture for self-hosted realtime voice agents until fully self-hosted end-to-end speech-to-speech models become available. Tutorial provides complete working implementation.

Abstract: We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves $\sim$702ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at $\sim$146s – far too slow for realtime. The cascaded streaming pipeline (STT $\rightarrow$ LLM $\rightarrow$ TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured time-to-first-audio of 755ms (best case 729ms) with full function calling support. We release the full codebase as a 9-chapter progressive tutorial with working, tested code for every component.

[484] LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang, Shao-Yi Chien, Yu Tsao, Fan-Gang Zeng

Main category: cs.SD

TL;DR: Proposes a reinforcement learning-based audio-visual speech enhancement framework using LLM-generated natural language feedback as interpretable reward signals for fine-tuning pretrained models.

DetailsMotivation: Existing AVSE methods use metrics like SI-SNR and MSE that correlate poorly with perceptual quality and lack interpretability for optimization. Need for more meaningful, interpretable feedback mechanisms.

Method: Reinforcement learning framework with LLM-based interpretable reward model. Audio LLM generates natural language descriptions of enhanced speech, sentiment analysis converts these to 1-5 rating scores serving as PPO rewards for fine-tuning pretrained AVSE model.

Result: Outperforms supervised baseline and DNSMOS-based RL baseline on AVSEC-4 dataset in PESQ, STOI, neural quality metrics, and subjective listening tests.

Conclusion: LLM-generated feedback provides semantically rich, explicit descriptions of speech quality improvements, offering better interpretability and performance than traditional scalar metrics.

Abstract: In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of enhanced speech, which are converted by a sentiment analysis model into a 1-5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.

[485] VorTEX: Various overlap ratio for Target speech EXtraction

Ro-hoon Oh, Jihwan Seol, Bugeun Kim

Main category: cs.SD

TL;DR: VorTEX is a text-prompted target speech extraction system with a decoupled adaptive multi-branch fusion block that handles various overlap ratios, evaluated on a new dataset PORTE with a novel diagnostic metric SuRE.

DetailsMotivation: Existing text-prompted target speech extraction approaches assume fully overlapped mixtures, limiting understanding of behavior across realistic overlap ratios. There's a need for architectures that work robustly across different overlap scenarios and better diagnostic tools to assess suppression behavior.

Method: Proposes VorTEX architecture with Decoupled Adaptive Multi-branch (DAM) Fusion block separating primary extraction from auxiliary regularization pathways. Creates PORTE dataset with two-speaker mixtures spanning 0-100% overlap ratios. Introduces Suppression Ratio on Energy (SuRE) metric to detect suppression behavior not captured by conventional measures.

Result: VorTEX achieves highest separation fidelity across 20-100% overlap (5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts. Existing models exhibit suppression or residual interference under varying overlap conditions.

Conclusion: VorTEX demonstrates robust target speech extraction across various overlap ratios without suppression artifacts, enabled by the DAM fusion architecture and validated by the new SuRE diagnostic metric and PORTE dataset.

Abstract: Target speech extraction (TSE) aims to recover a target speaker’s voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.

cs.LG

[486] Tokenization Tradeoffs in Structured EHR Foundation Models

Lin Lawrence Guo, Santiago Eduardo Arciniegas, Joseph Jihyung Lee, Adam Paul Yan, George Tomlinson, Jason Fries, Lillian Sung

Main category: cs.LG

TL;DR: Tokenization design choices significantly impact performance and efficiency of EHR foundation models, with joint event encoding and positional time encoding outperforming alternatives while reducing computational costs.

DetailsMotivation: Tokenization determines what information is preserved in EHR foundation models, but its impact on downstream performance and computational efficiency remains largely unexplored.

Method: Pretrained transformer on pediatric EHR data under factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. Evaluated across 74 clinical prediction tasks.

Result: Joint event encoding and positional time encoding outperformed alternatives (73/74 and 71/74 tasks) while requiring 39.5% and 9.6% fewer pretraining FLOPs respectively. Advantage traced to local binding efficiency.

Conclusion: Tokenization is a tractable lever for improving both performance and efficiency of EHR foundation models, with joint encoding advantages generalizing across institutions despite vocabulary mismatch.

Abstract: Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization – how these timelines are converted into discrete model inputs – determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency remains largely unexplored. Here, we pretrained a transformer on pediatric EHR data under a factorial design, varying tokenization along event encoding, time encoding, and workflow annotation. We evaluated area-under-the-receiver-operating-characteristic curve across 74 clinical prediction tasks. Joint event encoding and positional time encoding outperformed their alternatives (73/74 and 71/74 tasks) while requiring 39.5% and 9.6% fewer pretraining floating-point operations, respectively. Targeted ablations traced the joint encoding advantage to local binding efficiency, that is, code-attribute pairs are combined into single tokens, rather than split across tokens that the model must learn to associate during pretraining. External evaluation on an adult intensive care unit cohort demonstrated that this advantage generalizes despite substantial vocabulary mismatch, while temporal and workflow effects remain institution-specific. These results establish tokenization as a tractable lever for improving both the performance and efficiency of EHR foundation models.

[487] XLinear: Frequency-Enhanced MLP with CrossFilter for Robust Long-Range Forecasting

Xiang Ao

Main category: cs.LG

TL;DR: XLinear is an MLP-based time series forecaster that decomposes time series into trend and seasonal components, using Enhanced Frequency Attention for trend and CrossFilter Block for seasonal components to capture long-range dependencies while maintaining robustness to noise.

DetailsMotivation: MLP-based forecasters are robust to noise but struggle with capturing complex features and long-range dependencies compared to Transformer-based models. The paper aims to address this limitation while maintaining MLP's robustness advantages.

Method: 1) Decompose time series into trend and seasonal components. 2) For trend component with long-range characteristics, design Enhanced Frequency Attention (EFA) using frequency-domain operations to capture long-term dependencies. 3) For seasonal component, propose CrossFilter Block to maintain robustness to noise and avoid attention mechanism issues.

Result: XLinear achieves state-of-the-art performance on test datasets, outperforming other MLP-based forecasters in capturing long-range dependencies while maintaining lightweight architecture and high robustness.

Conclusion: XLinear successfully addresses MLP’s limitations in capturing long-range dependencies while preserving its robustness advantages, offering an effective solution for long-range time series forecasting.

Abstract: Time series forecasters are widely used across various domains. Among them, MLP (multi-layer perceptron)-based forecasters have been proven to be more robust to noise compared to Transformer-based forecasters. However, MLP struggles to capture complex features, resulting in limitations on capturing long-range dependencies. To address this challenge, we propose XLinear, an MLP-based forecaster for long-range forecasting. Firstly, we decompose the time series into trend and seasonal components. For the trend component which contains long-range characteristics, we design Enhanced Frequency Attention (EFA) to capture long-term dependencies by leveraging frequency-domain operations. Additionally, a CrossFilter Block is proposed for the seasonal component to maintain the model’s robustness to noise, avoiding the problems of low robustness often caused by attention mechanisms. Experimental results demonstrate that XLinear achieves state-of-the-art performance on test datasets. While keeping the lightweight architecture and high robustness of MLP-based models, our forecaster outperforms other MLP-based forecasters in capturing long-range dependencies.

[488] Alternating Reinforcement Learning with Contextual Rubric Rewards

Guangchen Lan

Main category: cs.LG

TL;DR: ARL-RR is a new reinforcement learning framework that optimizes rubric-based rewards by alternating between different semantic meta-classes instead of using fixed scalarization, improving performance and training efficiency.

DetailsMotivation: Existing RLRR methods use linear compression of vector rewards into scalar rewards with fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions.

Method: Proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR) that optimizes one semantic rubric meta-class at a time, eliminating fixed scalarization. Includes theoretical analysis of variance contraction effect and lightweight search-based adaptation for dynamic meta-class selection.

Result: Empirical experiments on HealthBench dataset with expert annotations show ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

Conclusion: ARL-RR provides a more effective approach to handling multi-dimensional rubric rewards by avoiding fixed scalarization and enabling dynamic focus on critical objectives, with theoretical justification and empirical validation.

Abstract: Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

[489] Steering Frozen LLMs: Adaptive Social Alignment via Online Prompt Routing

Zeyu Zhang, Xiangxiang Dai, Ziyi Han, Xutong Liu, John C. S. Lui

Main category: cs.LG

TL;DR: CCLUB framework for adaptive social alignment via system-prompt routing with consensus clustering to prevent unsafe generalization across contexts

DetailsMotivation: Static LLM alignment degrades against evolving jailbreak behaviors and cannot adapt to changing safety norms, requiring inference-time governance without costly retraining

Method: Consensus Clustering LinUCB Bandit (CCLUB) framework with conservative consensus clustering that pools data only within intersection of utility and safety similarity graphs

Result: Achieves 10.98% improvement in cumulative reward and 14.42% reduction in average suboptimality gap, with theoretical sublinear regret guarantee

Conclusion: CCLUB enables adaptive social alignment through system-prompt routing with safety-preserving consensus clustering, outperforming static approaches

Abstract: Large language models (LLMs) are typically governed by post-training alignment (e.g., RLHF or DPO), which yields a largely static policy during deployment and inference. However, real-world safety is a full-lifecycle problem: static defenses degrade against evolving jailbreak behaviors, and fixed weights cannot adapt to pluralistic, time-varying safety norms. This motivates inference-time governance that steers behavior without costly retraining. To address this, we introduce the Consensus Clustering LinUCB Bandit (CCLUB), a unified framework for adaptive social alignment via system-prompt routing. CCLUB employs a conservative consensus clustering mechanism: it pools data only within the intersection of utility and safety similarity graphs, effectively preventing unsafe generalization across semantically proximal but risk-divergent contexts. Our theoretical analysis yields a sublinear regret guarantee, demonstrating near-optimal performance of CCLUB. Extensive experiments validate that CCLUB outperforms strong baselines, achieving a 10.98% improvement in cumulative reward and a 14.42% reduction in the average suboptimality gap.

[490] How to Achieve Prototypical Birth and Death for OOD Detection?

Ningkang Peng, Qianfeng Yu, Xiaoqian Peng, Linjing Qian, Yafei Liu, Canran Xiao, Xinyu Lu, Tingyu Lu, Zhichao Zheng, Yanhui Gu

Main category: cs.LG

TL;DR: PID introduces a dynamic prototype birth and death mechanism for OOD detection that adaptively adjusts prototype count based on data complexity, outperforming static prototype methods.

DetailsMotivation: Existing prototype-based OOD detection methods use fixed prototype numbers, failing to adapt to varying complexity across categories. There's a need for adaptive prototype adjustment based on data complexity.

Method: Inspired by biological cell processes, PID uses two dynamic mechanisms: prototype birth (instantiating new prototypes in underrepresented regions) and prototype death (pruning ambiguous prototypes). This allows adaptive prototype count adjustment based on data complexity.

Result: PID achieves state-of-the-art performance on benchmarks like CIFAR-100, especially on FPR95 metric, significantly outperforming existing methods by learning more compact and better-separated ID embeddings.

Conclusion: Dynamic prototype adjustment through birth and death mechanisms effectively adapts to data complexity, enhancing OOD detection performance compared to static prototype methods.

Abstract: Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Currently, there is still a lack of a mechanism that can adaptively adjust the number of prototypes based on data complexity. Inspired by the processes of cell birth and death in biology, we propose a novel method named PID (Prototype bIrth and Death) to adaptively adjust the prototype count based on data complexity. This method relies on two dynamic mechanisms during the training process: prototype birth and prototype death. The birth mechanism instantiates new prototypes in data regions with insufficient representation by identifying the overload level of existing prototypes, thereby meticulously capturing intra-class substructures. Conversely, the death mechanism reinforces the decision boundary by pruning prototypes with ambiguous class boundaries through evaluating their discriminability. Through birth and death, the number of prototypes can be dynamically adjusted according to the data complexity, leading to the learning of more compact and better-separated In-Distribution (ID) embeddings, which significantly enhances the capability to detect OOD samples. Experiments demonstrate that our dynamic method, PID, significantly outperforms existing methods on benchmarks such as CIFAR-100, achieving State-of-the-Art (SOTA) performance, especially on the FPR95 metric.

[491] Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

Liu Hung Ming

Main category: cs.LG

TL;DR: DRCB is an architectural defense against steganographic collusion in decentralized MARL that uses VQ-VAE bottlenecks to monitor and disrupt covert communication channels through statistical analysis and escalating interventions.

DetailsMotivation: Addresses the critical AI safety threat of steganographic collusion in decentralized MARL where agents develop private protocols to evade monitoring, which existing behavioral or reward-based defenses cannot detect.

Method: Uses Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects, monitors Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute EMA-based Collusion Score, with threshold breaches triggering escalating interventions.

Result: DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3%) and reduces volatility by 43% while preserving mean joint reward, with analysis of 214,298 symbol samples confirming “Semantic Degradation” where high-frequency sequences converge to zero entropy.

Conclusion: Provides a task-agnostic methodology for MICA-compliant pre-deployment auditing of autonomous systems, identifying a “Transparency Paradox” where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions.

Abstract: In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion – where agents develop private protocols to evade monitoring – presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner’s Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms “Semantic Degradation,” where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a “Transparency Paradox” where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart’s Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.

[492] A federated learning framework with knowledge graph and temporal transformer for early sepsis prediction in multi-center ICUs

Yue Chang, Guangsen Lin, Jyun Jie Chuang, Shunqi Liu, Xinkui Li, Yaozheng Li

Main category: cs.LG

TL;DR: A federated learning framework combining medical knowledge graphs, temporal transformers, and meta-learning for privacy-preserving early sepsis prediction across multiple hospitals.

DetailsMotivation: Early sepsis prediction in ICU patients is crucial but challenging due to data fragmentation across healthcare institutions, complex temporal medical data, and strict privacy constraints that prevent data sharing.

Method: Proposes a novel framework integrating federated learning with medical knowledge graphs and temporal transformer models, enhanced by meta-learning (MAML). Enables collaborative training across hospitals without sharing raw data, uses knowledge graphs for structured medical relationships, temporal transformers for long-range dependencies in clinical time-series, and MAML for rapid adaptation to local data distributions.

Result: Achieves AUC of 0.956 on MIMIC-IV and eICU datasets, representing 22.4% improvement over conventional centralized models and 12.7% improvement over standard federated learning.

Conclusion: Presents a reliable, privacy-preserving solution for multi-center collaborative early warning of sepsis that outperforms existing approaches while maintaining data privacy.

Abstract: The early prediction of sepsis in intensive care unit (ICU) patients is crucial for improving survival rates. However, the development of accurate predictive models is hampered by data fragmentation across healthcare institutions and the complex, temporal nature of medical data, all under stringent privacy constraints. To address these challenges, we propose a novel framework that uniquely integrates federated learning (FL) with a medical knowledge graph and a temporal transformer model, enhanced by meta-learning capabilities. Our approach enables collaborative model training across multiple hospitals without sharing raw patient data, thereby preserving privacy. The model leverages a knowledge graph to incorporate structured medical relationships and employs a temporal transformer to capture long-range dependencies in clinical time-series data. A model-agnostic meta-learning (MAML) strategy is further incorporated to facilitate rapid adaptation of the global model to local data distributions. Evaluated on the MIMIC-IV and eICU datasets, our method achieves an area under the curve (AUC) of 0.956, which represents a 22.4% improvement over conventional centralized models and a 12.7% improvement over standard federated learning, demonstrating strong predictive capability for sepsis. This work presents a reliable and privacy-preserving solution for multi-center collaborative early warning of sepsis.

[493] Discovering the Hidden Role of Gini Index In Prompt-based Classification

Ruixi Lin

Main category: cs.LG

TL;DR: Paper introduces Gini Index as a tool to detect and optimize accuracy disparities in classification tasks, particularly for long-tailed minority classes in prompt-based classification, with a model-agnostic bias mitigation method.

DetailsMotivation: Long-tailed minority classes in classification tasks are often most important but consistently show low accuracies, while a few high-performing classes dominate. The paper aims to understand and address these accuracy disparities using Gini Index as a foundational tool.

Method: 1) Benchmark Gini scores in real-world LLMs and vision models to analyze accuracy imbalances. 2) Show existence of accuracy imbalance across text and image classification. 3) Propose a post-hoc model-agnostic bias mitigation method using Gini metric to optimize class accuracy distributions.

Result: Experimental results across few-shot news, biomedical, and zero-shot image classification show the method significantly reduces both relative and absolute accuracy imbalances, minimizing top class dominance while elevating weakest classes.

Conclusion: Gini Index serves as an effective tool for detecting and optimizing accuracy disparities in classification tasks, particularly for prompt-based classification with long-tailed distributions, enabling better performance on minority classes.

Abstract: In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in both prompt-based, text and image classification results and regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top class relative dominance while elevating weakest classes.

[494] Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors

Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian

Main category: cs.LG

TL;DR: A model rectification framework using rank-one editing to fix unreliable behaviors in neural networks with minimal clean data, guided by attribution analysis to identify problematic layers.

DetailsMotivation: Neural networks often fail on corrupted samples due to reliance on non-robust features. Traditional fixes require extensive data cleaning and retraining, which is computationally expensive and labor-intensive. There's a need for efficient methods to correct model unreliable behaviors without complete retraining.

Method: Uses rank-one model editing to create an attribution-guided rectification framework. First distinguishes the rectification setting from standard model editing. Then addresses the bottleneck of heterogeneous editability across layers by introducing an attribution-guided layer localization method that quantifies layer-wise editability and identifies the most problematic layer responsible for unreliabilities.

Result: Extensive experiments show effectiveness in correcting unreliabilities for neural Trojans, spurious correlations, and feature leakage. The method achieves editing objectives with as few as a single cleansed sample, making it practical for real-world applications.

Conclusion: The proposed framework provides an efficient solution for model rectification that requires minimal clean data, addresses heterogeneous editability across layers, and effectively corrects various types of model unreliable behaviors while preserving overall model performance.

Abstract: The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects model unreliable behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck of model rectifying arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliabilities. Extensive experiments demonstrate the effectiveness of our method in correcting unreliabilities observed for neural Trojans, spurious correlations and feature leakage. Our method shows remarkable performance by achieving its editing objective with as few as a single cleansed sample, which makes it appealing for practice.

[495] Revisiting ASR Error Correction with Specialized Models

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Main category: cs.LG

TL;DR: Compact seq2seq ASR error correction model trained on synthetic and real audio data outperforms LLMs with 15x fewer parameters, achieving state-of-the-art WER across diverse ASR architectures and domains.

DetailsMotivation: Traditional ASR correction methods use text-only models unaware of ASR error patterns, while recent LLM-based approaches introduce latency and hallucination issues. There's a need for efficient, accurate correction models that understand ASR-specific error distributions.

Method: Proposes compact seq2seq models trained on ASR errors from real and synthetic audio. Uses cascaded TTS and ASR to create synthetic corpora matching realistic error distributions. Implements correction-first decoding where correction model generates candidates rescored using ASR acoustic scores.

Result: Achieves 1.5%/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs with 15x fewer parameters. Generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains. Provides precise corrections in low-error regimes where LLMs struggle.

Conclusion: Compact seq2seq models trained on diverse ASR error distributions can outperform LLMs for ASR correction, offering better efficiency, accuracy, and generalization while addressing LLM limitations like latency and hallucination.

Abstract: Language models play a central role in automatic speech recognition (ASR), yet most methods rely on text-only models unaware of ASR error patterns. Recently, large language models (LLMs) have been applied to ASR correction, but introduce latency and hallucination concerns. We revisit ASR error correction with compact seq2seq models, trained on ASR errors from real and synthetic audio. To scale training, we construct synthetic corpora via cascaded TTS and ASR, finding that matching the diversity of realistic error distributions is key. We propose correction-first decoding, where the correction model generates candidates rescored using ASR acoustic scores. With 15x fewer parameters than LLMs, our model achieves 1.5/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs, generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains, and provides precise corrections in the low-error regime where LLMs struggle.

[496] Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

Yongzhong Xu

Main category: cs.LG

TL;DR: Spectral Edge Dynamics (SED) reveals transformer training evolves in few coherent directions, with spectral edge showing universal three-phase pattern and providing early-warning signals for generalization.

DetailsMotivation: Despite large parameter counts, transformer training trajectories show structured evolution in only a few coherent directions, suggesting underlying geometric patterns in optimization dynamics.

Method: Spectral Edge Dynamics (SED) uses rolling-window SVD of parameter updates to identify spectral edge boundary between coherent optimization directions and stochastic noise via maximum consecutive singular value ratio.

Result: Spectral edge shows universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity, directional coupling with validation loss reverses with window size (lag flip), and Johnson-Lindenstrauss projection preserves spectral gap for large models.

Conclusion: Spectral geometry provides insights into transformer training dynamics and serves as early-warning signal for generalization, predicting grokking 600-1700 steps before it occurs across various tasks.

Abstract: Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce \emph{Spectral Edge Dynamics} (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary – the \emph{spectral edge} – between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $σ_k/σ_{k+1}$. Across a 51M-parameter TinyStories model (4~seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size – a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson–Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking – predicting generalization 600–1{,}700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.

[497] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

Joonwon Seo

Main category: cs.LG

TL;DR: Novel polyphonic music generation approach using structural inductive bias to solve “Missing Middle” problem, with mathematical proofs and empirical validation on Beethoven piano sonatas

DetailsMotivation: Address the "Missing Middle" problem in polyphonic music generation where current approaches lack structural understanding and mathematical grounding, particularly for complex compositions like Beethoven's piano sonatas

Method: Proposed Smart Embedding architecture with structural inductive bias, validated using information theory (NMI analysis), Rademacher complexity, and category theory; empirically tested on Beethoven’s piano sonatas with parameter reduction and stability analysis

Result: Achieved 48.30% parameter reduction, 9.47% validation loss reduction, demonstrated pitch-hand independence (NMI=0.167), proved negligible information loss (0.153 bits), and 28.09% tighter generalization bound; validated by expert listening study (N=53)

Conclusion: The approach successfully bridges theoretical and applied aspects of AI music generation, providing mathematically grounded deep learning with verifiable insights for polyphonic music generation

Abstract: This monograph introduces a novel approach to polyphonic music generation by addressing the “Missing Middle” problem through structural inductive bias. Focusing on Beethoven’s piano sonatas as a case study, we empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture, achieving a 48.30% reduction in parameters. We provide rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity (28.09% tighter generalization bound), and category theory to demonstrate improved stability and generalization. Empirical results show a 9.47% reduction in validation loss, confirmed by SVD analysis and an expert listening study (N=53). This dual theoretical and applied framework bridges gaps in AI music generation, offering verifiable insights for mathematically grounded deep learning.

[498] Flood Risk Follows Valleys, Not Grids: Graph Neural Networks for Flash Flood Susceptibility Mapping in Himachal Pradesh with Conformal Uncertainty Quantification

Paras Sharma, Swastika Sharma

Main category: cs.LG

TL;DR: GNN-based flood susceptibility mapping using watershed connectivity outperforms pixel-based ML models for flash flood prediction in Himachal Pradesh

DetailsMotivation: Existing flood risk maps treat pixels independently, ignoring that flooding upstream raises risk downstream. The paper addresses this limitation by incorporating watershed connectivity into flood prediction models.

Method: Uses Graph Neural Network (GraphSAGE) trained on watershed connectivity graph (460 sub-watersheds, 1,700 directed edges) with Sentinel-1 SAR flood inventory (3,000 events, 2018-2023) and 12 environmental variables at 30m resolution. Compares against four pixel-based ML baselines (RF, XGBoost, LightGBM, stacking ensemble) with leave-one-basin-out spatial cross-validation.

Result: GNN achieved AUC = 0.978 ± 0.017, outperforming best baseline (AUC = 0.881) and published benchmark (AUC = 0.88). High-susceptibility zones overlap critical infrastructure: 1,457 km highways, 2,759 bridges, 4 major hydroelectric installations. Conformal intervals achieved 82.9% empirical coverage on 2023 test set.

Conclusion: River connectivity carries predictive signal missed by pixel-based models. Conformal prediction provides statistically guaranteed coverage intervals. SAR label noise in high-risk zones identified as target for future work.

Abstract: Flash floods are the most destructive natural hazard in Himachal Pradesh (HP), India, causing over 400 fatalities and $1.2 billion in losses in the 2023 monsoon season alone. Existing risk maps treat every pixel independently, ignoring the basic fact that flooding upstream raises risk downstream. We address this with a Graph Neural Network (GraphSAGE) trained on a watershed connectivity graph (460 sub-watersheds, 1,700 directed edges), built from a six-year Sentinel-1 SAR flood inventory (2018-2023, 3,000 events) and 12 environmental variables at 30 m resolution. Four pixel-based ML models (RF, XGBoost, LightGBM, stacking ensemble) serve as baselines. All models are evaluated with leave-one-basin-out spatial cross-validation to avoid the 5-15% AUC inflation of random splits. Conformal prediction produces the first HP susceptibility maps with statistically guaranteed 90% coverage intervals. The GNN achieved AUC = 0.978 +/- 0.017, outperforming the best baseline (AUC = 0.881) and the published HP benchmark (AUC = 0.88). The +0.097 gain confirms that river connectivity carries predictive signal that pixel-based models miss. High-susceptibility zones overlap 1,457 km of highways (including 217 km of the Manali-Leh corridor), 2,759 bridges, and 4 major hydroelectric installations. Conformal intervals achieved 82.9% empirical coverage on the held-out 2023 test set; lower coverage in high-risk zones (45-59%) points to SAR label noise as a target for future work.

[499] Evidential Domain Adaptation for Remaining Useful Life Prediction with Incomplete Degradation

Yubo Hou, Mohamed Ragab, Yucheng Wang, Min Wu, Abdulla Alseiari, Chee-Keong Kwoh, Xiaoli Li, Zhenghua Chen

Main category: cs.LG

TL;DR: EviAdapt: A novel evidential domain adaptation approach for RUL prediction that addresses incomplete degradation trajectories by using stage-wise alignment and evidential uncertainty matching.

DetailsMotivation: Existing domain adaptation methods struggle with incomplete degradation trajectories in RUL prediction, particularly missing late degradation stages, leading to extrapolation challenges and misalignment between source and target domains.

Method: Segments source and target domain data into degradation stages based on degradation rate for stage-wise alignment, then uses evidential learning to estimate uncertainty and align uncertainty across matched stages.

Result: The method addresses limitations of current DA approaches by preventing misalignment of degradation stages and improving feature matching through uncertainty alignment.

Conclusion: EviAdapt provides an effective solution for domain adaptation in RUL prediction with incomplete degradation trajectories by combining stage-wise alignment with evidential uncertainty matching.

Abstract: Accurate Remaining Useful Life (RUL) prediction without labeled target domain data is a critical challenge, and domain adaptation (DA) has been widely adopted to address it by transferring knowledge from a labeled source domain to an unlabeled target domain. Despite its success, existing DA methods struggle significantly when faced with incomplete degradation trajectories in the target domain, particularly due to the absence of late degradation stages. This missing data introduces a key extrapolation challenge. When applied to such incomplete RUL prediction tasks, current DA methods encounter two primary limitations. First, most DA approaches primarily focus on global alignment, which can misaligns late degradation stage in the source domain with early degradation stage in the target domain. Second, due to varying operating conditions in RUL prediction, degradation patterns may differ even within the same degradation stage, resulting in different learned features. As a result, even if degradation stages are partially aligned, simple feature matching cannot fully align two domains. To overcome these limitations, we propose a novel evidential adaptation approach called EviAdapt, which leverages evidential learning to enhance domain adaptation. The method first segments the source and target domain data into distinct degradation stages based on degradation rate, enabling stage-wise alignment that ensures samples from corresponding stages are accurately matched. To address the second limitation, we introduce an evidential uncertainty alignment technique that estimates uncertainty using evidential learning and aligns the uncertainty across matched stages.

[500] Transition Flow Matching

Chenrui Ma

Main category: cs.LG

TL;DR: Proposes a new paradigm for flow matching that directly learns transition flow instead of local velocity field, enabling single-step generation and arbitrary time point sampling.

DetailsMotivation: Current flow matching methods focus on learning local velocity fields requiring multiple integration steps during generation, limiting efficiency and flexibility.

Method: Directly learns transition flow as a global quantity, establishing connection with Mean Velocity Flow models and providing unified theoretical perspective.

Result: Method enables generation in single step or at arbitrary time points, validated through extensive experiments supporting theoretical claims.

Conclusion: Proposes effective new paradigm for flow matching with improved generation efficiency and flexibility through direct transition flow learning.

Abstract: Mainstream flow matching methods typically focus on learning the local velocity field, which inherently requires multiple integration steps during generation. In contrast, Mean Velocity Flow models establish a relationship between the local velocity field and the global mean velocity, enabling the latter to be learned through a mathematically grounded formulation and allowing generation to be transferred to arbitrary future time points. In this work, we propose a new paradigm that directly learns the transition flow. As a global quantity, the transition flow naturally supports generation in a single step or at arbitrary time points. Furthermore, we demonstrate the connection between our approach and Mean Velocity Flow, establishing a unified theoretical perspective. Extensive experiments validate the effectiveness of our method and support our theoretical claims.

[501] Tackling Over-smoothing on Hypergraphs: A Ricci Flow-guided Neural Diffusion Approach

Mengyao Zhou, Zhiheng Zhou, Xiao Han, Xingqin Qi, Guanghui Wang, Guiying Yan

Main category: cs.LG

TL;DR: A novel hypergraph neural network framework (RFHND) that uses discrete Ricci flow from differential geometry to regulate message passing and prevent over-smoothing in deep hypergraph neural networks.

DetailsMotivation: Existing hypergraph neural networks (HGNNs) suffer from over-smoothing as layers increase and lack effective control over message passing among nodes, limiting their ability to model complex higher-order relationships effectively.

Method: Proposes Ricci Flow-guided Hypergraph Neural Diffusion (RFHND), a message passing paradigm based on a PDE system that describes continuous evolution of node features on hypergraphs. It uses discrete Ricci flow to adaptively regulate information diffusion rates at the geometric level, preventing feature homogenization.

Result: RFHND significantly outperforms existing methods across multiple benchmark datasets, demonstrates strong robustness, and effectively mitigates over-smoothing in hypergraph neural networks.

Conclusion: Introducing discrete Ricci flow into hypergraph structures provides an effective geometric approach to regulate node feature evolution and alleviate over-smoothing, leading to improved performance in hypergraph neural networks.

Abstract: Hypergraph neural networks (HGNNs) have demonstrated strong capabilities in modeling complex higher-order relationships. However, existing HGNNs often suffer from over-smoothing as the number of layers increases and lack effective control over message passing among nodes. Inspired by the theory of Ricci flow in differential geometry, we theoretically establish that introducing discrete Ricci flow into hypergraph structures can effectively regulate node feature evolution and thereby alleviate over-smoothing. Building on this insight, we propose Ricci Flow-guided Hypergraph Neural Diffusion(RFHND), a novel message passing paradigm for hypergraphs guided by discrete Ricci flow. Specifically, RFHND is based on a PDE system that describes the continuous evolution of node features on hypergraphs and adaptively regulates the rate of information diffusion at the geometric level, preventing feature homogenization and producing high-quality node representations. Experimental results show that RFHND significantly outperforms existing methods across multiple benchmark datasets and demonstrates strong robustness, while also effectively mitigating over-smoothing.

[502] Mastering the Minority: An Uncertainty-guided Multi-Expert Framework for Challenging-tailed Sequence Learning

Ye Wang, Zixuan Wu, Lifeng Shen, Jiang Xie, Xiaoling Wang, Hong Yu, Guoyin Wang

Main category: cs.LG

TL;DR: UME: Uncertainty-based Multi-Expert fusion network for imbalanced sequential learning using ensemble LoRA for parameter efficiency, DST-guided sequential specialization for minority classes, and uncertainty-guided fusion for expert coordination.

DetailsMotivation: Address imbalanced data distribution in sequential learning where models fail to detect minority classes adequately, and overcome limitations of Mixture-of-Experts models including parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts.

Method: Three core innovations: 1) Ensemble LoRA for parameter-efficient modeling, 2) Sequential Specialization guided by Dempster-Shafer Theory for effective specialization on tailed classes, 3) Uncertainty-Guided Fusion using DST’s certainty measures to dynamically weigh expert opinions and resolve conflicts.

Result: State-of-the-art performance across four public hierarchical text classification datasets, achieving up to 17.97% performance gain over best baseline on individual categories while reducing trainable parameters by up to 10.32%.

Conclusion: Uncertainty-guided expert coordination is a principled strategy for addressing challenging-tailed sequence learning, with UME demonstrating effective minority class mastery through parameter-efficient multi-expert fusion.

Abstract: Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture-of-Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty-based Multi-Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter-efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster-Shafer Theory (DST), which ensures effective specialization on the challenging-tailed classes. Finally, an Uncertainty-Guided Fusion mechanism uses DST’s certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state-of-the-art performance. We achieve a performance gain of up to 17.97% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32%. The findings highlight that uncertainty-guided expert coordination is a principled strategy for addressing challenging-tailed sequence learning. Our code is available at https://github.com/CQUPTWZX/Multi-experts.

[503] Embedding-Aware Feature Discovery: Bridging Latent Representations and Interpretable Features in Event Sequences

Artem Sakhno, Ivan Sergeev, Alexey Shestov, Omar Zoloev, Elizaveta Kovtun, Gleb Gusev, Andrey Savchenko, Maksim Makarenko

Main category: cs.LG

TL;DR: EAFD bridges the gap between learned embeddings and handcrafted features in financial event sequences by using LLM-driven feature discovery that aligns with embeddings while identifying complementary signals.

DetailsMotivation: Production financial systems still rely on handcrafted statistical features despite advances in representation learning, creating a disconnect between learned embeddings and feature-based pipelines due to interpretability, robustness, and latency constraints.

Method: Embedding-Aware Feature Discovery (EAFD) couples pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent that iteratively discovers, evaluates, and refines features using alignment (explaining embedding information) and complementarity (identifying missing predictive signals).

Result: EAFD consistently outperforms embedding-only and feature-based baselines across open-source and industrial transaction benchmarks, achieving up to +5.8% relative gains over state-of-the-art pretrained embeddings and setting new SOTA performance.

Conclusion: EAFD successfully bridges the gap between learned embeddings and feature-based pipelines in financial event sequence analysis, demonstrating that coupling embeddings with LLM-driven feature discovery yields superior performance while maintaining practical constraints.

Abstract: Industrial financial systems operate on temporal event sequences such as transactions, user actions, and system logs. While recent research emphasizes representation learning and large language models, production systems continue to rely heavily on handcrafted statistical features due to their interpretability, robustness under limited supervision, and strict latency constraints. This creates a persistent disconnect between learned embeddings and feature-based pipelines. We introduce Embedding-Aware Feature Discovery (EAFD), a unified framework that bridges this gap by coupling pretrained event-sequence embeddings with a self-reflective LLM-driven feature generation agent. EAFD iteratively discovers, evaluates, and refines features directly from raw event sequences using two complementary criteria: \emph{alignment}, which explains information already encoded in embeddings, and \emph{complementarity}, which identifies predictive signals missing from them. Across both open-source and industrial transaction benchmarks, EAFD consistently outperforms embedding-only and feature-based baselines, achieving relative gains of up to $+5.8%$ over state-of-the-art pretrained embeddings, resulting in new state-of-the-art performance across event-sequence datasets.

[504] Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Lit Sin Tan, Junzhe Chen, Xiaolong Fu, Lichen Ma, Junshi Huang, Jianzhong Shi, Yan Li, Lijie Wen

Main category: cs.LG

TL;DR: Meta-TTRL: A metacognitive test-time reinforcement learning framework that enables unified multimodal models to self-improve during inference by using model-intrinsic monitoring signals, achieving capability-level improvements in text-to-image generation.

DetailsMotivation: Current test-time scaling methods for unified multimodal models in text-to-image generation only provide instance-level improvements and cannot accumulate knowledge across similar prompts. There's a need for methods that enable self-improvement and capability-level enhancement during test time.

Method: Proposes Meta-TTRL, a metacognitive test-time reinforcement learning framework that performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of unified multimodal models.

Result: Meta-TTRL generalizes well across three representative UMMs (Janus-Pro-7B, BAGEL, and Qwen-Image), achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. The analysis reveals metacognitive synergy as a key insight for effective test-time reinforcement learning.

Conclusion: Meta-TTRL enables unified multimodal models to achieve self-improvement and capability-level enhancement during test time through metacognitive synergy, representing the first comprehensive analysis of test-time reinforcement learning for text-to-image generation in UMMs.

Abstract: Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model’s optimization regime to enable self-improvement.

[505] OMNIFLOW: A Physics-Grounded Multimodal Agent for Generalized Scientific Reasoning

Hao Wu, Yongheng Zhang, Yuan Gao, Fan Xu, Fan Zhang, Ruobing Xie, Ruijian Gou, Yuxuan Liang, Xiaomeng Huang, Xian Wu

Main category: cs.LG

TL;DR: OMNIFLOW is a neuro-symbolic architecture that grounds frozen multimodal LLMs in physical laws without domain-specific fine-tuning, enabling interpretable scientific reasoning for PDE-based spatiotemporal dynamics.

DetailsMotivation: LLMs struggle with continuous spatiotemporal dynamics governed by PDEs, often producing non-physical hallucinations. Existing approaches require costly domain-specific fine-tuning that limits cross-domain generalization and interpretability.

Method: Introduces Semantic-Symbolic Alignment to project high-dimensional flow tensors into topological linguistic descriptors, and Physics-Guided Chain-of-Thought workflow with dynamic constraint injection (e.g., mass conservation) and iterative reflexive verification.

Result: Significantly outperforms traditional deep learning baselines in zero-shot generalization and few-shot adaptation tasks across microscopic turbulence, Navier-Stokes equations, and global weather forecasting benchmarks.

Conclusion: OMNIFLOW enables transparent, physically consistent reasoning reports, marking a paradigm shift from black-box fitting to interpretable scientific reasoning for physical systems.

Abstract: Large Language Models (LLMs) have demonstrated exceptional logical reasoning capabilities but frequently struggle with the continuous spatiotemporal dynamics governed by Partial Differential Equations (PDEs), often resulting in non-physical hallucinations. Existing approaches typically resort to costly, domain-specific fine-tuning, which severely limits cross-domain generalization and interpretability. To bridge this gap, we propose OMNIFLOW, a neuro-symbolic architecture designed to ground frozen multimodal LLMs in fundamental physical laws without requiring domain-specific parameter updates. OMNIFLOW introduces a novel \textit{Semantic-Symbolic Alignment} mechanism that projects high-dimensional flow tensors into topological linguistic descriptors, enabling the model to perceive physical structures rather than raw pixel values. Furthermore, we construct a Physics-Guided Chain-of-Thought (PG-CoT) workflow that orchestrates reasoning through dynamic constraint injection (e.g., mass conservation) and iterative reflexive verification. We evaluate OMNIFLOW on a comprehensive benchmark spanning microscopic turbulence, theoretical Navier-Stokes equations, and macroscopic global weather forecasting. Empirical results demonstrate that OMNIFLOW significantly outperforms traditional deep learning baselines in zero-shot generalization and few-shot adaptation tasks. Crucially, it offers transparent, physically consistent reasoning reports, marking a paradigm shift from black-box fitting to interpretable scientific reasoning.

[506] Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

Andres Potapczynski, Ravi Kiran Selvam, Tatiana Konstantinova, Shankar Ramasubramanian, Malcolm Wolff, Kin G. Olivares, Ruijun Ma, Mengfei Cao, Michael W. Mahoney, Andrew Gordon Wilson, Boris N. Oreshkin, Dmitry Efimov

Main category: cs.LG

TL;DR: ApolloPFN is a time-aware prior-data fitted network that incorporates exogenous covariates for time series forecasting, addressing limitations of current foundation models that ignore such signals.

DetailsMotivation: Current time series foundation models ignore exogenous covariates that drive important patterns in time series data, limiting forecasting accuracy. Many real-world forecasting scenarios involve exogenous signals like promotions, prices, temperature, calendar indicators, etc., which can cause spikes, discontinuities, or regime changes in target series.

Method: Develops ApolloPFN with two key advances: 1) Synthetic data generation procedure tailored to address failure modes when tabular PFNs are applied to time series, and 2) Time-aware architectural modifications embedding inductive biases needed for time series context.

Result: Achieves state-of-the-art results across benchmarks like M5 and electric price forecasting that contain exogenous information.

Conclusion: ApolloPFN successfully incorporates exogenous covariates into time series forecasting, overcoming limitations of current foundation models and demonstrating superior performance on real-world benchmarks with exogenous signals.

Abstract: In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.

[507] Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs

Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang

Main category: cs.LG

TL;DR: Information Density Driven Smart Noise Scheduler improves discrete diffusion language models by focusing training on high-information content rather than uniform noise, boosting reasoning performance by ~4% on code and math tasks.

DetailsMotivation: Standard discrete diffusion models use uniform random noise schedulers that waste optimization resources on low-information structural elements while under-optimizing high-density logical pivot points crucial for reasoning tasks.

Method: Proposes Information Density Driven Smart Noise Scheduler that extracts information-dense hubs and applies Complementary Priority Masking to decouple training instances into reasoning and syntax samples, forcing mastery of both logical deduction and sequence structure.

Result: Improves average accuracy by ~4% across four Code and Math reasoning benchmarks, significantly outperforming uniform baselines. Probabilistic priority masking effectively mitigates contextual collapse during block diffusion training.

Conclusion: Density-aware strategy efficiently unlocks reasoning potential of diffusion language models at minimal annotation cost, emerging as a promising new masked data training paradigm for Diffusion LLMs.

Abstract: Discrete diffusion models offer global context awareness and flexible parallel generation. However, uniform random noise schedulers in standard DLLM training overlook the highly non-uniform information density inherent in real-world sequences. This wastes optimization resources on low-density structural glues while leaving high-density logical pivot points severely under-optimized. To address this, we propose an Information Density Driven Smart Noise Scheduler. By extracting information-dense hubs and applying Complementary Priority Masking, our method decouples a single training instance into mutually reinforcing reasoning and syntax samples, forcing the model to master both logical deduction and foundational sequence structure. Experiments demonstrate that our approach improves average accuracy by ~4% across four Code and Math reasoning benchmarks, significantly outperforming uniform baselines. Mechanistic analyses further reveal that probabilistic priority masking effectively mitigates contextual collapse during block diffusion training. Overall, this density-aware strategy efficiently unlocks the reasoning potential of diffusion language models at minimal annotation cost, emerging as a promising new masked data training paradigm for Diffusion LLMs. Our processed dataset can be found at https://huggingface.co/datasets/malr07/opc-sft-stage2-dense-extracted.

[508] Longitudinal Risk Prediction in Mammography with Privileged History Distillation

Banafsheh Karimian, Alexis Guichemerre, Soufiane Belharbi, Natacha Gillet, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Main category: cs.LG

TL;DR: A method called Privileged History Distillation (PHD) that improves breast cancer risk prediction from mammograms when prior screening exams are unavailable at test time by distilling longitudinal history knowledge into a model that only needs current exams.

DetailsMotivation: Longitudinal mammography risk models require prior screening exams for accurate multi-year risk prediction, but in clinical practice, these histories are often incomplete, irregular, or unavailable due to missed screenings, first-time exams, or archival constraints, limiting practical applicability.

Method: Privileged multi-teacher distillation with horizon-specific teachers: each teacher is trained on full longitudinal history to specialize in one prediction horizon, while the student receives only reconstructed history derived from the current exam, allowing it to inherit horizon-dependent longitudinal risk cues without requiring prior exams at deployment.

Result: On the CSAW-CC dataset, the PHD method markedly improves long-horizon prediction performance over no-history models and achieves comparable performance to full-history models, while using only the current exam at inference time.

Conclusion: The proposed Privileged History Distillation method enables effective longitudinal risk prediction without requiring prior screening exams at deployment, addressing a key limitation in clinical practice while maintaining predictive performance.

Abstract: Breast cancer remains a leading cause of cancer-related mortality worldwide. Longitudinal mammography risk prediction models improve multi-year breast cancer risk prediction based on prior screening exams. However, in real-world clinical practice, longitudinal histories are often incomplete, irregular, or unavailable due to missed screenings, first-time examinations, heterogeneous acquisition schedules, or archival constraints. The absence of prior exams degrades the performance of longitudinal risk models and limits their practical applicability. While substantial longitudinal history is available during training, prior exams are commonly absent at test time. In this paper, we address missing history at inference time and propose a longitudinal risk prediction method that uses mammography history as privileged information during training and distills its prognostic value into a student model that only requires the current exam at inference time. The key idea is a privileged multi-teacher distillation scheme with horizon-specific teachers: each teacher is trained on the full longitudinal history to specialize in one prediction horizon, while the student receives only a reconstructed history derived from the current exam. This allows the student to inherit horizon-dependent longitudinal risk cues without requiring prior screening exams at deployment. Our new Privileged History Distillation (PHD) method is validated on a large longitudinal mammography dataset with multi-year cancer outcomes, CSAW-CC, comparing full-history and no-history baselines to their distilled counterparts. Using time-dependent AUC across horizons, our privileged history distillation method markedly improves the performance of long-horizon prediction over no-history models and is comparable to that of full-history models, while using only the current exam at inference time.

[509] Hypothesis Class Determines Explanation: Why Accurate Models Disagree on Feature Attribution

Thackshanaramana B

Main category: cs.LG

TL;DR: Prediction-equivalent models from different hypothesis classes produce substantially different feature attributions, creating an Explanation Lottery where model architecture determines which features are highlighted as important.

DetailsMotivation: The paper challenges the fundamental assumption in explainable AI that models with identical predictive performance should produce equivalent explanations, which underlies practices like model selection, auditing, and regulatory evaluation.

Method: Large-scale empirical study across 24 datasets and multiple model classes, comparing feature attributions of prediction-equivalent models. Theoretical analysis of the Agreement Gap under interaction structure in data-generating processes. Development of Explanation Reliability Score R(x) as a post-hoc diagnostic.

Result: Models with identical predictive behavior produce substantially different feature attributions. Disagreement is structured: strong agreement within same hypothesis class, but cross-class pairs (e.g., tree-based vs. linear) show reduced agreement near/below lottery threshold. Hypothesis class is identified as the structural driver.

Conclusion: Model selection is not explanation-neutral - the hypothesis class chosen for deployment determines which features are attributed responsibility for decisions, challenging current explainable AI practices.

Abstract: The assumption that prediction-equivalent models produce equivalent explanations underlies many practices in explainable AI, including model selection, auditing, and regulatory evaluation. In this work, we show that this assumption does not hold. Through a large-scale empirical study across 24 datasets and multiple model classes, we find that models with identical predictive behavior can produce substantially different feature attributions. This disagreement is highly structured: models within the same hypothesis class exhibit strong agreement, while cross-class pairs (e.g., tree-based vs. linear) trained on identical data splits show substantially reduced agreement, consistently near or below the lottery threshold. We identify hypothesis class as the structural driver of this phenomenon, which we term the Explanation Lottery. We theoretically show that the resulting Agreement Gap persists under interaction structure in the data-generating process. This structural finding motivates a post-hoc diagnostic, the Explanation Reliability Score R(x), which predicts when explanations are stable across architectures without additional training. Our results demonstrate that model selection is not explanation-neutral: the hypothesis class chosen for deployment can determine which features are attributed responsibility for a decision.

[510] When Stability Fails: Hidden Failure Modes Of LLMS in Data-Constrained Scientific Decision-Making

Nazia Riasat

Main category: cs.LG

TL;DR: LLMs show high stability but poor alignment with statistical ground truth in scientific decision-making tasks, highlighting need for validation beyond reproducibility.

DetailsMotivation: Current LLM evaluation focuses on stability/reproducibility but neglects alignment with statistical ground truth in scientific workflows where correctness is critical.

Method: Controlled behavioral evaluation framework separating four dimensions: stability, correctness, prompt sensitivity, and output validity. Tested on statistical gene prioritization task with differential expression analysis across various prompt regimes.

Result: LLMs exhibit near-perfect stability while systematically diverging from statistical ground truth, oversensitive to minor prompt changes, and producing invalid outputs not present in input data.

Conclusion: Stability alone insufficient for scientific LLM deployment; explicit ground-truth validation and output validity checks are essential in automated scientific workflows.

Abstract: Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guar- antee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly sep- arates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multi- ple LLMs using a statistical gene prioritization task derived from differential ex- pression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our ex- periments show that LLMs can exhibit near-perfect run-to-run stability while sys- tematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although sta- bility reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific work- flows.

[511] Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning

Jeremy J Samuelson

Main category: cs.LG

TL;DR: VEIL architecture with Informationally Compressive Anonymization (ICA) provides privacy-preserving ML through irreversible data transformations rather than noise or cryptography, enabling secure enterprise ML with preserved utility.

DetailsMotivation: Existing privacy-preserving ML techniques like Differential Privacy and Homomorphic Encryption degrade performance, increase complexity, or have prohibitive computational overhead, creating a need for approaches that maintain utility while ensuring strong privacy.

Method: ICA uses a supervised, multi-objective encoder in a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations that are structurally non-invertible, with rigorous topological and information-theoretic proofs of irreversibility.

Result: The approach achieves strong privacy guarantees where reconstruction probability approaches zero, while preserving predictive utility and enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time.

Conclusion: VEIL architecture establishes a new foundation for enterprise ML that is secure, performant, and safe by construction, with strict trust boundaries and alignment with privacy-by-design regulatory frameworks, even against post-quantum threats.

Abstract: Modern machine learning systems increasingly rely on sensitive data, creating significant privacy, security, and regulatory risks that existing privacy-preserving machine learning (ppML) techniques, such as Differential Privacy (DP) and Homomorphic Encryption (HE), address only at the cost of degraded performance, increased complexity, or prohibitive computational overhead. This paper introduces Informationally Compressive Anonymization (ICA) and the VEIL architecture, a privacy-preserving ML framework that achieves strong privacy guarantees through architectural and mathematical design rather than noise injection or cryptography. ICA embeds a supervised, multi-objective encoder within a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations, ensuring that only irreversibly anonymized vectors are exported to untrusted Training and Inference Environments. The paper rigorously proves that these encodings are structurally non-invertible using topological and information-theoretic arguments, showing that inversion is logically impossible, even under idealized attacker assumptions, and that, in realistic deployments, the attackers conditional entropy over the original data diverges, driving reconstruction probability to zero. Unlike prior autoencoder-based ppML approaches, ICA preserves predictive utility by aligning representation learning with downstream supervised objectives, enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time. The VEIL architecture enforces strict trust boundaries, supports scalable multi-region deployment, and naturally aligns with privacy-by-design regulatory frameworks, establishing a new foundation for enterprise ML that is secure, performant, and safe by construction, even in the face of post-quantum threats.

[512] FlashSampling: Fast and Memory-Efficient Exact Sampling

Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

Main category: cs.LG

TL;DR: FlashSampling is an exact sampling primitive that fuses categorical sampling into the LM-head matrix multiplication, eliminating the need to materialize logits in HBM memory.

DetailsMotivation: Traditional categorical sampling in large-vocabulary decoding triggers extra memory traffic and kernel launches after the LM head, creating performance bottlenecks.

Method: Compute logits tile-by-tile on chip, add Gumbel noise, keep one maximizer per row per vocabulary tile, and finish with a small reduction over tiles. Uses exact decomposition of argmax over partitions.

Result: Speeds up kernel-level decode workloads across H100, H200, B200, and B300 GPUs, reducing time per output token by up to 19% in end-to-end vLLM experiments.

Conclusion: Exact sampling can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue without approximations.

Abstract: Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because $\argmax$ decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to $19%$ on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.

[513] Evaluating Black-Box Vulnerabilities with Wasserstein-Constrained Data Perturbations

Adriana Laurindo Monteiro, Jean-Michel Loubes

Main category: cs.LG

TL;DR: Using Optimal Transport theory to analyze ML model responses to input distribution variations, finding closest Wasserstein distributions that satisfy constraints to understand model behavior.

DetailsMotivation: Address the lack of explainability in ML models and black-box algorithms by developing methods to understand how models respond to changes in input variable distributions.

Method: Apply Optimal Transport theory, specifically Wasserstein distance, to find the closest distribution satisfying given constraints and analyze its impact on model behavior. Establish convergence results for projected distributions.

Result: Demonstrated approach using examples and real-world datasets in both regression and classification settings, showing how to analyze model responses to distributional changes.

Conclusion: Optimal Transport provides a principled framework for analyzing ML model behavior in response to input distribution variations, offering tools for model explainability and understanding.

Abstract: The massive use of Machine Learning (ML) tools in industry comes with critical challenges, such as the lack of explainable models and the use of black-box algorithms. We address this issue by applying Optimal Transport theory in the analysis of responses of ML models to variations in the distribution of input variables. We find the closest distribution, in the Wasserstein sense, that satisfies a given constraintt and examine its impact on model behavior. Furthermore, we establish convergence results for this projected distribution and demonstrate our approach using examples and real-world datasets in both regression and classification settings.

[514] Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

Ezgi Korkmaz

Main category: cs.LG

TL;DR: Novel reinforcement learning paradigm using counteractive actions for efficient training in high-dimensional MDPs with zero additional computational complexity

DetailsMotivation: Reinforcement learning policies face exponentially growing state spaces in high-dimensional MDPs, creating a dichotomy between computational complexity and policy success. The paper aims to address this challenge by focusing on agent-environment interaction during learning.

Method: Introduces a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. The method provides efficient, effective, scalable and accelerated learning with zero additional computational complexity.

Result: Extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs verify theoretical analysis. The method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

Conclusion: The proposed counteractive action paradigm offers a theoretical basis for efficient reinforcement learning in high-dimensional MDPs, achieving acceleration in training without additional computational overhead.

Abstract: Following the pivotal success of learning strategies to win at tasks, solely by interacting with an environment without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet, reinforcement learning policies face exponentially growing state spaces in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent’s interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and further comes with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

[515] Electrodermal Activity as a Unimodal Signal for Aerobic Exercise Detection in Wearable Sensors

Rena Mira Krishna, Ramya Sankar, Shadi Ghiasi

Main category: cs.LG

TL;DR: EDA signals alone can moderately distinguish rest from sustained aerobic exercise using subject-independent validation, with phasic dynamics and event timing being key discriminative features.

DetailsMotivation: Previous multimodal studies show good performance combining EDA with other signals for stress/exercise detection, but the standalone discriminative power of EDA for distinguishing sustained aerobic exercise from low-arousal states under subject-independent evaluation remains unclear.

Method: Used publicly available dataset from 30 healthy individuals, extracted EDA features, evaluated with benchmark machine learning models using leave-one-subject-out (LOSO) validation to ensure subject independence.

Result: EDA-only classifiers achieved moderate subject-independent performance, with phasic temporal dynamics and event timing contributing to class separation between rest and sustained aerobic exercise.

Conclusion: This work establishes a conservative benchmark for EDA’s standalone discriminative power, clarifying its role as a unimodal input for wearable activity-state inference rather than proposing it as a replacement for multimodal sensing.

Abstract: Electrodermal Activity (EDA) is a non-invasive physiological signal widely available in wearable devices and reflects sympathetic nervous system (SNS) activation. Prior multi-modal studies have demonstrated robust performance in distinguishing stress and exercise states when EDA is combined with complementary signals such as heart rate and accelerometry. However, the ability of EDA to independently distinguish sustained aerobic exercise from low-arousal states under subject-independent evaluation remains insufficiently characterized. This study investigates whether features derived exclusively from EDA can reliably differentiate rest from sustained aerobic exercise. Using a publicly available dataset collected from thirty healthy individuals, EDA features were evaluated using benchmark machine learning models with leave-one-subject-out (LOSO) validation. Across models, EDA-only classifiers achieved moderate subject-independent performance, with phasic temporal dynamics and event timing contributing to class separation. Rather than proposing EDA as a replacement for multimodal sensing, this work provides a conservative benchmark of the discriminative power of EDA alone and clarifies its role as a unimodal input for wearable activity-state inference.

[516] PhasorFlow: A Python Library for Unit Circle Based Computing

Dibakar Sigdel, Namuna Panday

Main category: cs.LG

TL;DR: PhasorFlow is a Python library for unit circle computing using complex phasors, offering deterministic alternatives to neural networks and quantum circuits with applications in classification, time-series, and neuromorphic tasks.

DetailsMotivation: To create a deterministic, lightweight alternative to classical neural networks and quantum circuits that operates on classical hardware while leveraging the mathematical foundations of quantum mechanics through unitary wave interference.

Method: Introduces computational paradigm on S¹ unit circle with complex phasors, formalizes Phasor Circuit model with 22-gate library, develops Variational Phasor Circuit for optimization, and creates Phasor Transformer with DFT-based token mixing instead of attention.

Result: Validated on non-linear spatial classification, time-series prediction, financial volatility detection, and neuromorphic tasks including neural binding and oscillatory associative memory, establishing unit circle computing as effective alternative.

Conclusion: PhasorFlow provides a mathematically principled, deterministic alternative to neural networks and quantum circuits that operates efficiently on classical hardware while sharing quantum mechanics’ unitary foundations.

Abstract: We present PhasorFlow, an open-source Python library introducing a computational paradigm operating on the $S^1$ unit circle. Inputs are encoded as complex phasors $z = e^{iθ}$ on the $N$-Torus ($\mathbb{T}^N$). As computation proceeds via unitary wave interference gates, global norm is preserved while individual components drift into $\mathbb{C}^N$, allowing algorithms to natively leverage continuous geometric gradients for predictive learning. PhasorFlow provides three core contributions. First, we formalize the Phasor Circuit model ($N$ unit circle threads, $M$ gates) and introduce a 22-gate library covering Standard Unitary, Non-Linear, Neuromorphic, and Encoding operations with full matrix algebra simulation. Second, we present the Variational Phasor Circuit (VPC), analogous to Variational Quantum Circuits (VQC), enabling optimization of continuous phase parameters for classical machine learning tasks. Third, we introduce the Phasor Transformer, replacing expensive $QK^TV$ attention with a parameter-free, DFT-based token mixing layer inspired by FNet. We validate PhasorFlow on non-linear spatial classification, time-series prediction, financial volatility detection, and neuromorphic tasks including neural binding and oscillatory associative memory. Our results establish unit circle computing as a deterministic, lightweight, and mathematically principled alternative to classical neural networks and quantum circuits. It operates on classical hardware while sharing quantum mechanics’ unitary foundations. PhasorFlow is available at https://github.com/mindverse-computing/phasorflow.

[517] Federated Learning for Privacy-Preserving Medical AI

Tin Hoang

Main category: cs.LG

TL;DR: Privacy-preserving federated learning for Alzheimer’s disease classification using 3D MRI data with site-aware data partitioning and adaptive local differential privacy mechanisms.

DetailsMotivation: Existing federated learning methods for medical imaging suffer from unrealistic data partitioning, inadequate privacy guarantees, and insufficient benchmarking, limiting practical deployment in healthcare settings.

Method: Proposes site-aware data partitioning strategy preserving institutional boundaries, and introduces Adaptive Local Differential Privacy (ALDP) mechanism that dynamically adjusts privacy parameters based on training progression and parameter characteristics.

Result: ALDP achieved up to 80.4% accuracy in two-client configuration, surpassing fixed-noise Local DP by 5-7 percentage points with greater training stability. FedProx algorithm equaled or surpassed centralized training performance while ensuring privacy protection.

Conclusion: Advances state-of-the-art in federated learning for medical imaging by establishing methodological foundations and empirical evidence for privacy-compliant AI adoption in healthcare.

Abstract: This dissertation investigates privacy-preserving federated learning for Alzheimer’s disease classification using three-dimensional MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Existing methodologies often suffer from unrealistic data partitioning, inadequate privacy guarantees, and insufficient benchmarking, limiting their practical deployment in healthcare. To address these gaps, this research proposes a novel site-aware data partitioning strategy that preserves institutional boundaries, reflecting real-world multi-institutional collaborations and data heterogeneity. Furthermore, an Adaptive Local Differential Privacy (ALDP) mechanism is introduced, dynamically adjusting privacy parameters based on training progression and parameter characteristics, thereby significantly improving the privacy-utility trade-off over traditional fixed-noise approaches. Systematic empirical evaluation across multiple client federations and privacy budgets demonstrated that advanced federated optimisation algorithms, particularly FedProx, could equal or surpass centralised training performance while ensuring rigorous privacy protection. Notably, ALDP achieved up to 80.4% accuracy in a two-client configuration, surpassing fixed-noise Local DP by 5-7 percentage points and demonstrating substantially greater training stability. The comprehensive ablation studies and benchmarking establish quantitative standards for privacy-preserving collaborative medical AI, providing practical guidelines for real-world deployment. This work thereby advances the state-of-the-art in federated learning for medical imaging, establishing both methodological foundations and empirical evidence necessary for future privacy-compliant AI adoption in healthcare.

[518] Game-Theory-Assisted Reinforcement Learning for Border Defense: Early Termination based on Analytical Solutions

Goutam Das, Michael Dorothy, Kyle Volle, Daigo Shishika

Main category: cs.LG

TL;DR: Hybrid approach combining game theory and reinforcement learning for border defense games with limited perceptual range, using Apollonius Circle for equilibrium computation to enable early episode termination and focus RL on search strategies.

DetailsMotivation: Game theory provides strong optimality guarantees but becomes brittle with imperfect information assumptions, while RL is adaptive but sample-inefficient in complex domains. The paper aims to leverage game-theoretic insights to improve RL training efficiency in adversarial settings with limited perception.

Method: Uses Apollonius Circle (AC) to compute equilibrium in the post-detection phase of border defense games, enabling early termination of RL episodes. This allows RL to focus on learning search strategies while guaranteeing optimal continuation after detection, without needing to learn pursuit dynamics.

Result: The early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories across single- and multi-defender settings. Extensive experiments validate the effectiveness of the hybrid approach.

Conclusion: The hybrid approach successfully combines game theory’s optimality guarantees with RL’s adaptability, improving training efficiency in adversarial settings with limited perceptual range by separating search and pursuit learning phases.

Abstract: Game theory provides the gold standard for analyzing adversarial engagements, offering strong optimality guarantees. However, these guarantees often become brittle when assumptions such as perfect information are violated. Reinforcement learning (RL), by contrast, is adaptive but can be sample-inefficient in large, complex domains. This paper introduces a hybrid approach that leverages game-theoretic insights to improve RL training efficiency. We study a border defense game with limited perceptual range, where defender performance depends on both search and pursuit strategies, making classical differential game solutions inapplicable. Our method employs the Apollonius Circle (AC) to compute equilibrium in the post-detection phase, enabling early termination of RL episodes without learning pursuit dynamics. This allows RL to concentrate on learning search strategies while guaranteeing optimal continuation after detection. Across single- and multi-defender settings, this early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories. Extensive experiments validate these findings and demonstrate the overall effectiveness of our approach.

[519] The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning

Max Zimmer, Nico Pelleriti, Christophe Roux, Sebastian Pokutta

Main category: cs.LG

TL;DR: A practical guide and framework for AI-assisted research in mathematics and ML, featuring a taxonomy of AI integration, an open-source agentic framework using CLI coding agents, and case studies demonstrating autonomous research capabilities.

DetailsMotivation: AI tools are transforming research practices but their practical integration into everyday research workflows remains unclear. The paper aims to provide concrete guidance on how researchers can productively use modern AI systems, understand where they help most, and establish responsible guardrails for their use.

Method: Three-part approach: (I) Five-level taxonomy of AI integration in research workflows; (II) Open-source framework using methodological rules formulated as agent prompts to turn CLI coding agents into autonomous research assistants; (III) Case studies from deep learning and mathematics. The framework runs in sandboxed containers, works with existing CLI agents, and scales from personal prototyping to multi-node, multi-GPU cluster experimentation.

Result: The framework successfully enables autonomous research sessions, with the longest running over 20 hours dispatching independent experiments across multiple nodes without human intervention. The system demonstrates practical utility for AI-assisted research while maintaining researcher oversight.

Conclusion: The framework augments rather than replaces researchers, providing a practical solution for integrating AI into research workflows. The open-source implementation offers immediate utility for mathematics and ML researchers seeking to leverage AI assistance productively and responsibly.

Abstract: AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI-assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five-level taxonomy of AI integration, (II) an open-source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal-laptop prototyping to multi-node, multi-GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at https://github.com/ZIB-IOL/The-Agentic-Researcher.

[520] Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

Xiaoyi Li

Main category: cs.LG

TL;DR: LLM agents can perform genuine architecture search, not just hyperparameter tuning, with architectural choices explaining 94% of performance variance in dashcam collision detection experiments.

DetailsMotivation: To determine whether LLM agents autonomously designing ML experiments perform genuine architecture search or merely hyperparameter tuning within a narrow design space.

Method: Analyzed 10,469 experiments by two LLM agents (Claude Opus and Gemini 2.5 Pro) across 108,000 discrete configuration cells for dashcam collision detection over 27 days using ANOVA decomposition and cross-task validation.

Result: Architectural choices explain 94% of performance variance, while hyperparameter variation explains only 6%. LLM agents discovered V-JEPA 2 video features with Zipformer temporal encoders achieving 0.9245 AP, a configuration no human proposed. LLM-guided search outperformed random search (0.985 AP vs 0.965 AP at N=50).

Conclusion: LLM agents can perform genuine architecture discovery, concentrating search on productive architectural regions and discovering qualitatively better configurations than random or Bayesian baselines.

Abstract: When LLM agents autonomously design ML experiments, do they perform genuine architecture search – or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that \textbf{architectural choices explain 94% of performance variance} ($F = 1324$, $η^2 = 0.94$), while hyperparameter variation within a fixed architecture explains only 6%. Cross-task validation on a second collision dataset confirms this finding (75% architecture-explained variance) with a \emph{different} winning backbone, confirming genuine architecture discovery. The agents’ key contribution is discovering that V-JEPA,2 video features with Zipformer temporal encoders achieve 0.9245 AP – a configuration no human proposed – and concentrating search on productive architectural regions: at $N = 50$, LLM-guided search reaches AP $= 0.985$ versus $0.965$ for from-scratch random search. Post-bugfix convergence follows a power law ($c = 0.11$, $R^2 = 0.93$); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen–Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.

[521] Generative Inverse Design with Abstention via Diagonal Flow Matching

Miguel de Campos, Werner Krebs, Hanno Gottschalk

Main category: cs.LG

TL;DR: Diag-CFM improves inverse design by making conditional flow matching invariant to coordinate permutations through zero-anchoring, achieving better accuracy and enabling uncertainty quantification.

DetailsMotivation: Standard conditional flow matching for inverse design problems is sensitive to arbitrary ordering and scaling of design parameters, leading to unstable training and poor performance.

Method: Diagonal Flow Matching (Diag-CFM) uses a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making learning provably invariant to coordinate permutations. Also develops two architecture-intrinsic uncertainty metrics: Zero-Deviation and Self-Consistency.

Result: Achieves order-of-magnitude improvements in round-trip accuracy over CFM and invertible neural network baselines across design dimensions up to 100. Uncertainty metrics outperform ensemble and general-purpose alternatives across candidate selection, abstention, and out-of-distribution detection tasks.

Conclusion: Diag-CFM provides a more robust and accurate approach to inverse design with built-in uncertainty quantification capabilities that enhance practical deployment.

Abstract: Inverse design aims to find design parameters $x$ achieving target performance $y^*$. Generative approaches learn bidirectional mappings between designs and labels, enabling diverse solution sampling. However, standard conditional flow matching (CFM), when adapted to inverse problems by pairing labels with design parameters, exhibits strong sensitivity to their arbitrary ordering and scaling, leading to unstable training. We introduce Diagonal Flow Matching (Diag-CFM), which resolves this through a zero-anchoring strategy that pairs design coordinates with noise and labels with zero, making the learning problem provably invariant to coordinate permutations. This yields order-of-magnitude improvements in round-trip accuracy over CFM and invertible neural network baselines across design dimensions up to $P{=}100$. We develop two architecture-intrinsic uncertainty metrics, Zero-Deviation and Self-Consistency, that enable three practical capabilities: selecting the best candidate among multiple generations, abstaining from unreliable predictions, and detecting out-of-distribution targets; consistently outperforming ensemble and general-purpose alternatives across all tasks. We validate on airfoil, gas turbine combustor, and an analytical benchmark with scalable design dimension.

[522] Evaluating Causal Discovery Algorithms for Path-Specific Fairness and Utility in Healthcare

Nitish Nagesh, Elahe Khatibi, Thomas Hughes, Mahdi Bagheri, Pratik Gajane, Amir M. Rahmani

Main category: cs.LG

TL;DR: Paper evaluates causal discovery algorithms on synthetic and clinical health data, focusing on structural recovery and path-specific fairness analysis beyond composite fairness scores.

DetailsMotivation: Causal discovery in health data faces evaluation challenges when ground truth is unknown, requiring benchmarks and methods to assess algorithm performance on both structural recovery and fairness considerations in clinical applications.

Method: Collaborated with experts to construct proxy ground-truth graphs, established benchmarks for synthetic Alzheimer’s disease and heart failure clinical records data, evaluated Peter-Clark, Greedy Equivalence Search, and Fast Causal Inference algorithms on structural recovery and path-specific fairness decomposition.

Result: On synthetic data, Peter-Clark achieved best structural recovery; on heart failure data, Fast Causal Inference achieved highest utility. For path-specific effects, ejection fraction contributed 3.37 percentage points to indirect effect in ground truth, driving variations in fairness-utility ratios across algorithms.

Conclusion: Results highlight need for graph-aware fairness evaluation and fine-grained path-specific analysis when deploying causal discovery in clinical applications, moving beyond composite fairness scores.

Abstract: Causal discovery in health data faces evaluation challenges when ground truth is unknown. We address this by collaborating with experts to construct proxy ground-truth graphs, establishing benchmarks for synthetic Alzheimer’s disease and heart failure clinical records data. We evaluate the Peter-Clark, Greedy Equivalence Search, and Fast Causal Inference algorithms on structural recovery and path-specific fairness decomposition, going beyond composite fairness scores. On synthetic data, Peter-Clark achieved the best structural recovery. On heart failure data, Fast Causal Inference achieved the highest utility. For path-specific effects, ejection fraction contributed 3.37 percentage points to the indirect effect in the ground truth. These differences drove variations in the fairness-utility ratio across algorithms. Our results highlight the need for graph-aware fairness evaluation and fine-grained path-specific analysis when deploying causal discovery in clinical applications.

[523] Discovery of interaction and diffusion kernels in particle-to-mean-field multi-agent systems

Giacomo Albi, Alessandro Alla, Elisa Calzola

Main category: cs.LG

TL;DR: A data-driven framework for learning interaction kernels in stochastic multi-agent systems from trajectory data using sparse regression and two complementary identification strategies.

DetailsMotivation: To develop methods for identifying functional forms of nonlocal interactions and diffusion terms in stochastic multi-agent systems directly from trajectory data, without prior knowledge of interaction structures, addressing challenges of unobserved pairwise interactions and limited data.

Method: Formulates inverse problem as sequence of sparse regression tasks in structured finite-dimensional spaces with compactly supported basis functions. Proposes two strategies: 1) random-batch sampling to compensate for latent interactions while preserving statistical structure, and 2) mean-field approximation using empirical particle density for continuous nonlocal regression.

Result: Numerical experiments show effective and robust reconstruction of both interaction and diffusion kernels from partially observed data. Validated on benchmark models including bounded-confidence and attraction-repulsion dynamics, with both strategies achieving comparable accuracy.

Conclusion: The proposed framework successfully learns interaction kernels in stochastic multi-agent systems from limited trajectory data using data-driven sparse regression approaches, enabling identification of complex interaction structures without prior knowledge.

Abstract: We propose a data-driven framework to learn interaction kernels in stochastic multi-agent systems. Our approach aims at identifying the functional form of nonlocal interaction and diffusion terms directly from trajectory data, without any a priori knowledge of the underlying interaction structure. Starting from a discrete stochastic binary-interaction model, we formulate the inverse problem as a sequence of sparse regression tasks in structured finite-dimensional spaces spanned by compactly supported basis functions, such as piecewise linear polynomials. In particular, we assume that pairwise interactions between agents are not directly observed and that only limited trajectory data are available. To address these challenges, we propose two complementary identification strategies. The first based on random-batch sampling, which compensates for latent interactions while preserving the statistical structure of the full dynamics in expectation. The second based on a mean-field approximation, where the empirical particle density reconstructed from the data defines a continuous nonlocal regression problem. Numerical experiments demonstrate the effectiveness and robustness of the proposed framework, showing accurate reconstruction of both interaction and diffusion kernels even from partially observed. The method is validated on benchmark models, including bounded-confidence and attraction-repulsion dynamics, where the two proposed strategies achieve comparable levels of accuracy.

[524] Data-Local Autonomous LLM-Guided Neural Architecture Search for Multiclass Multimodal Time-Series Classification

Emil Hardarson, Luka Biedebach, Ómar Bessi Ómarsson, Teitur Hrólfsson, Anna Sigridur Islind, María Óskarsdóttir

Main category: cs.LG

TL;DR: A data-local LLM-guided neural architecture search framework for multimodal time-series data that enables automated pipeline exploration while keeping sensitive data on-premise, evaluated on healthcare and clinical datasets.

DetailsMotivation: Machine learning on sensitive time-series data (like healthcare EEG) faces bottlenecks due to strict data-local constraints and the need for extensive preprocessing/architecture tuning, especially challenging for multimodal fusion where different sensor modalities require individual processing.

Method: A data-local LLM-guided NAS framework that handles candidate pipelines remotely while executing all training/evaluation locally under fixed protocols. Uses one-vs-rest binary experts per class and modality with lightweight fusion MLP, and searches over expert architectures and modality-specific preprocessing. Controller only observes trial-level summaries without accessing raw data.

Result: Evaluated on UEA30 multivariate time-series classification and SleepEDFx sleep staging (EEG, EOG, EMG). Modular baseline model is strong, and LLM-guided NAS further improves it. Method finds models performing within published ranges across most benchmark datasets while reducing manual intervention.

Conclusion: The framework enables unattended architecture search for multimodal time-series learning while maintaining data privacy, addressing the bottleneck in privacy-constrained domains like healthcare where sensitive data must remain on-premise.

Abstract: Applying machine learning to sensitive time-series data is often bottlenecked by the iteration loop: Performance depends strongly on preprocessing and architecture, yet training often has to run on-premise under strict data-local constraints. This is a common problem in healthcare and other privacy-constrained domains (e.g., a hospital developing deep learning models on patient EEG). This bottleneck is particularly challenging in multimodal fusion, where sensor modalities must be individually preprocessed and then combined. LLM-guided neural architecture search (NAS) can automate this exploration, but most existing workflows assume cloud execution or access to data-derived artifacts that cannot be exposed. We present a novel data-local, LLM-guided search framework that handles candidate pipelines remotely while executing all training and evaluation locally under a fixed protocol. The controller observes only trial-level summaries, such as pipeline descriptors, metrics, learning-curve statistics, and failure logs, without ever accessing raw samples or intermediate feature representations. Our framework targets multiclass, multimodal learning via one-vs-rest binary experts per class and modality, a lightweight fusion MLP, and joint search over expert architectures and modality-specific preprocessing. We evaluate our method on two regimes: UEA30 (public multivariate time-series classification dataset) and SleepEDFx sleep staging (heterogeneous clinical modalities such as EEG, EOG, and EMG). The results show that the modular baseline model is strong, and the LLM-guided NAS further improves it. Notably, our method finds models that perform within published ranges across most benchmark datasets. Across both settings, our method reduces manual intervention by enabling unattended architecture search while keeping sensitive data on-premise.

[525] MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

Main category: cs.LG

TL;DR: MobileLLM-Flash: Hardware-in-the-loop architecture search for efficient on-device LLMs optimized for mobile deployment with attention skipping for long-context acceleration.

DetailsMotivation: Real-time AI experiences require on-device LLMs that can run efficiently on resource-constrained hardware with broad compatibility, producing near-real-time responses while maximizing user reach.

Method: Hardware-in-the-loop architecture search under mobile latency constraints, treating candidates as pruned versions of pretrained backbone with inherited weights, using staged process with latency modeling and Pareto-frontier search across latency and quality.

Result: MobileLLM-Flash family (350M, 650M, 1.4B) with up to 8k context length, delivering 1.8x faster prefill and 1.6x faster decode on mobile CPUs with comparable or superior quality.

Conclusion: The methodology enables industry-scale deployment of efficient OD-LLMs without custom kernels, compatible with standard mobile runtimes, with attention skipping providing long-context acceleration.

Abstract: Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.

[526] GASP: Guided Asymmetric Self-Play For Coding LLMs

Swadesh Jana, Cansu Sancaktar, Tomáš Daniš, Georg Martius, Antonio Orvieto, Pavel Kolev

Main category: cs.LG

TL;DR: GASP introduces guided asymmetric self-play where grounding is provided by real-data goalpost questions, enabling teachers to generate curriculum of easier/harder variants to gradually close gaps to hard exploration challenges.

DetailsMotivation: Current asymmetric self-play methods for LLMs are goal-agnostic and generate hard problems that aren't necessarily informative or interesting for improving model capabilities. There's a need for grounded self-play that focuses on meaningful learning challenges.

Method: GASP uses real-data goalpost questions identified as hard exploration challenges. During self-play, teachers generate easier variants of hard questions, then harder variants of those easier questions, creating a curriculum that gradually closes the gap to goalposts.

Result: Improves pass@20 on LiveCodeBench by 2.5% over unguided asymmetric self-play. Through teacher-constructed curriculum, solves hard goalpost questions that remain out of reach for all baselines.

Conclusion: Grounding asymmetric self-play with real-data goalposts and curriculum-based question generation enables more effective learning of hard exploration challenges compared to goal-agnostic approaches.

Abstract: Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student’s learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative to improve the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. Doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.

[527] Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

Main category: cs.LG

TL;DR: Analysis of hyperparameter scaling laws for first-order optimizers (SGD, Adam variants) through convergence bounds, deriving power-law schedules for learning rate, momentum, and batch size scaling.

DetailsMotivation: Existing hyperparameter transfer methods focus on model size scaling, but transfer across batch sizes and training horizons relies on empirical rules. There's a need for principled understanding of hyperparameter scaling laws for modern optimizers.

Method: Analyze hyperparameter scaling through convergence bounds for Linear Minimization Oracle (LMO) methods (normalized SGD, signSGD approximating Adam, Muon). Treat bounds as proxies and minimize them across tuning regimes to derive closed-form power-law schedules for learning rate, momentum, and batch size.

Result: Derives unified scaling laws that recover most insights from literature, reveals interaction between momentum and batch-size scaling, and suggests multiple optimal scaling strategies.

Conclusion: Provides principled framework for hyperparameter scaling that unifies existing empirical observations, with clear directions for future research on optimizer scaling laws.

Abstract: Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.

[528] Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity

Jing Yan, Kang You, Zhezhi He, Yaoyu Zhang

Main category: cs.LG

TL;DR: A theoretical framework for deterministic computation in asynchronous neuromorphic systems using charge-conserving spiking neural networks that are invariant to temporal stochasticity.

DetailsMotivation: To address the fundamental challenge of achieving deterministic computation results in asynchronous neuromorphic systems despite inherent temporal stochasticity of continuous-time hardware.

Method: Developed a unified continuous-time framework for SNNs that couples the Law of Charge Conservation with minimal neuron-level constraints, ensuring terminal state depends solely on aggregate input charge.

Result: Proved the mapping is strictly invariant to spike timing in acyclic networks, established exact representational correspondence between charge-conserving SNNs and quantized artificial neural networks without approximation errors.

Conclusion: Provides rigorous theoretical basis for designing continuous-time neuromorphic systems that maintain algorithmic determinism while harnessing efficiency of asynchronous processing.

Abstract: Achieving deterministic computation results in asynchronous neuromorphic systems remains a fundamental challenge due to the inherent temporal stochasticity of continuous-time hardware. To address this, we develop a unified continuous-time framework for spiking neural networks (SNNs) that couples the Law of Charge Conservation with minimal neuron-level constraints. This integration ensures that the terminal state depends solely on the aggregate input charge, providing a unique cumulated output invariant to temporal stochasticity. We prove that this mapping is strictly invariant to spike timing in acyclic networks, whereas recurrent connectivity can introduce temporal sensitivity. Furthermore, we establish an exact representational correspondence between these charge-conserving SNNs and quantized artificial neural networks, bridging the gap between static deep learning and event-driven dynamics without approximation errors. These results establish a rigorous theoretical basis for designing continuous-time neuromorphic systems that harness the efficiency of asynchronous processing while maintaining algorithmic determinism.

[529] W2T: LoRA Weights Already Know What They Can Do

Xiaolong Han, Ferrante Neri, Zijian Jiang, Fang Wu, Yanfang Ye, Lu Yin, Zehong Wang

Main category: cs.LG

TL;DR: W2T (Weight2Token) extracts information directly from LoRA weights without running the base model, using canonical factorization via QR+SVD to resolve ambiguity, enabling attribute classification, performance prediction, and adapter retrieval from weight-space embeddings.

DetailsMotivation: LoRA checkpoints contain task-specific information in their weight matrices, but this information is ambiguous due to infinite factorization possibilities. The paper aims to determine if model behavior can be read directly from weights without running inference or accessing training data.

Method: Proposes W2T which maps LoRA updates to canonical form using QR decomposition followed by SVD to resolve factorization ambiguity. The resulting components are tokenized and processed by a Transformer to produce weight-space embeddings for downstream tasks.

Result: W2T achieves strong results on attribute classification, performance prediction, and adapter retrieval across language and vision LoRA collections, demonstrating that LoRA weights reliably indicate model behavior once factorization ambiguity is removed.

Conclusion: LoRA weights contain meaningful information about model behavior that can be extracted directly through proper canonicalization, enabling weight-based analysis without model execution or training data access.

Abstract: Each LoRA checkpoint compactly stores task-specific updates in low-rank weight matrices, offering an efficient way to adapt large language models to new tasks and domains. In principle, these weights already encode what the adapter does and how well it performs. In this paper, we ask whether this information can be read directly from the weights, without running the base model or accessing training data. A key obstacle is that a single LoRA update can be factorized in infinitely many ways. Without resolving this ambiguity, models trained on the factors may fit the particular factorization rather than the underlying update. To this end, we propose \methodfull, which maps each LoRA update to a provably canonical form via QR decomposition followed by SVD, so that all equivalent factorizations share the same representation. The resulting components are then tokenized and processed by a Transformer to produce a weight-space embedding. Across language and vision LoRA collections, W2T achieves strong results on attribute classification, performance prediction, and adapter retrieval, demonstrating that LoRA weights reliably indicate model behavior once factorization ambiguity is removed. Code is available at https://github.com/xiaolonghan2000/Weight2Token.

[530] The Importance of Being Smoothly Calibrated

Parikshit Gopalan, Konstantinos Stavropoulos, Kunal Talwar, Pranay Tankala

Main category: cs.LG

TL;DR: The paper presents theoretical results on smooth calibration as a robust calibration measure and its connection to omniprediction, with new guarantees for bounded proper losses and characterization via earth mover’s distance.

DetailsMotivation: To generalize, unify, and extend previous results on smooth calibration as both a robust calibration measure and as a step towards omniprediction, which enables predictions with low regret for downstream decision makers.

Method: The authors present new omniprediction guarantees for smoothly calibrated predictors by adding noise to predictors and competing against smoothed benchmark predictors. They characterize smooth calibration in terms of earth mover’s distance to perfectly calibrated distributions and analyze estimation complexity.

Result: The paper shows that omniprediction error is bounded by smooth calibration error and earth mover’s distance from benchmark, with tightness demonstrated. It provides a crisp characterization of smooth calibration via earth mover’s distance and shows that upper distance to calibration cannot be estimated within quadratic factor with sample complexity independent of prediction support size.

Conclusion: The work unifies and extends prior results on omniprediction from smooth calibration, provides new theoretical insights into calibration measures, and establishes limitations on estimation of calibration distances.

Abstract: Recent work has highlighted the centrality of smooth calibration [Kakade and Foster, 2008] as a robust measure of calibration error. We generalize, unify, and extend previous results on smooth calibration, both as a robust calibration measure, and as a step towards omniprediction, which enables predictions with low regret for downstream decision makers seeking to optimize some proper loss unknown to the predictor. We present a new omniprediction guarantee for smoothly calibrated predictors, for the class of all bounded proper losses. We smooth the predictor by adding some noise to it, and compete against smoothed versions of any benchmark predictor on the space, where we add some noise to the predictor and then post-process it arbitrarily. The omniprediction error is bounded by the smooth calibration error of the predictor and the earth mover’s distance from the benchmark. We exhibit instances showing that this dependence cannot, in general, be improved. We show how this unifies and extends prior results [Foster and Vohra, 1998; Hartline, Wu, and Yang, 2025] on omniprediction from smooth calibration. We present a crisp new characterization of smooth calibration in terms of the earth mover’s distance to the closest perfectly calibrated joint distribution of predictions and labels. This also yields a simpler proof of the relation to the lower distance to calibration from [Blasiok, Gopalan, Hu, and Nakkiran, 2023]. We use this to show that the upper distance to calibration cannot be estimated within a quadratic factor with sample complexity independent of the support size of the predictions. This is in contrast to the distance to calibration, where the corresponding problem was known to be information-theoretically impossible: no finite number of samples suffice [Blasiok, Gopalan, Hu, and Nakkiran, 2023].

[531] Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

Main category: cs.LG

TL;DR: The paper introduces Transformer², a two-axis view of Transformers where sequence position and layer depth are dual dimensions, showing that depth-wise residual attention is mathematically equivalent to short sliding-window attention, with practical recommendations for different use cases.

DetailsMotivation: To provide a clearer understanding of the residual pathway's role in Transformers by framing it through a two-axis perspective (sequence position and layer depth), revealing the duality between depth-wise operations and sequence-wise attention mechanisms.

Method: Proposes a theoretical framework analyzing Transformers through two ordered dimensions: sequence position and layer depth. Shows mathematical equivalence between causal depth-wise residual attention and causal short sliding-window attention (ShortSWA), establishing the Transformer² duality principle.

Result: Establishes that operator-level duality exists between depth-wise residual attention and sequence-axis ShortSWA, but notes systems-level asymmetry due to hardware considerations. Provides practical guidance: use Deep Delta Learning (DDL) for modifying shortcuts directly, and use sequence-axis ShortSWA for local adaptive mixing.

Conclusion: The residual pathway is integral to Transformer representation, with a clean two-axis view revealing important dualities. Practical implementation should consider hardware efficiency: DDL for shortcut modifications and sequence-axis ShortSWA for local mixing, despite mathematical equivalence.

Abstract: Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model’s representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.

[532] Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang, Yutao Zhang, Kevin I-Kai Wang

Main category: cs.LG

TL;DR: CTFG framework uses reinforcement learning with Transformer-based autoregressive generator for cross-user human activity recognition from inertial sensors, achieving state-of-the-art generalization without target-domain labels.

DetailsMotivation: Human Activity Recognition using wearable sensors faces challenges with cross-user variability due to physiological differences, motor habits, and sensor placements. Existing domain generalization methods either ignore temporal dependencies or require impractical target-domain annotations.

Method: Proposes CTFG (Collaborative Temporal Feature Generation) framework with Transformer-based autoregressive generator that incrementally constructs feature token sequences conditioned on prior context and sensor input. Uses Group-Relative Policy Optimization (critic-free RL algorithm) with tri-objective reward for class discrimination, cross-user invariance, and temporal fidelity.

Result: Achieves state-of-the-art cross-user accuracy on DSADS (88.53%) and PAMAP2 (75.22%) benchmarks, reduces inter-task training variance, accelerates convergence, and shows robust generalization under varying action-space dimensionalities.

Conclusion: CTFG provides a novel paradigm for generalizable feature extraction in human activity recognition by modeling it as a collaborative sequential generation process, eliminating distribution-dependent bias and enabling robust cross-user generalization without target-domain labels.

Abstract: Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.

[533] Adaptive regularization parameter selection for high-dimensional inverse problems: A Bayesian approach with Tucker low-rank constraints

Qing-Mei Yang, Da-Qing Zhang

Main category: cs.LG

TL;DR: A variational Bayesian method using Tucker decomposition for efficient high-dimensional inverse problems, with adaptive regularization via per-mode precision parameters and automatic noise estimation.

DetailsMotivation: High-dimensional inverse problems in imaging and scientific computing face computational challenges due to dimensionality. Existing methods require prior knowledge of noise parameters and lack adaptive regularization for anisotropic structures.

Method: Uses Tucker decomposition to transform variational inference from high-dimensional space to lower-dimensional core tensor space. Introduces per-mode precision parameters for adaptive anisotropic regularization and estimates noise levels from data without prior knowledge.

Result: Outperforms L-curve, GCV, UPRE, and DP methods by 0.73-2.09 dB in 2D deblurring and 6.75 dB in 3D heat conduction. Scales to problems with 110,000 variables and shows improved PSNR, SSIM, and visual quality.

Conclusion: The method bridges Bayesian theory and scalable computation for large-scale inverse problems, though sensitive to Tucker rank selection. Future work includes automated rank selection and theoretical analysis.

Abstract: This paper introduces a novel variational Bayesian method that integrates Tucker decomposition for efficient high-dimensional inverse problem solving. The method reduces computational complexity by transforming variational inference from a high-dimensional space to a lower-dimensional core tensor space via Tucker decomposition. A key innovation is the introduction of per-mode precision parameters, enabling adaptive regularization for anisotropic structures. For instance, in directional image deblurring, learned parameters align with physical anisotropy, applying stronger regularization to critical directions (e.g., row vs. column axes). The method further estimates noise levels from data, eliminating reliance on prior knowledge of noise parameters (unlike conventional benchmarks such as the discrepancy principle (DP)). Experimental evaluations across 2D deblurring, 3D heat conduction, and Fredholm integral equations demonstrate consistent improvements in quantitative metrics (PSNR, SSIM) and qualitative visualizations (error maps, precision parameter trends) compared to L-curve criterion, generalized cross-validation (GCV), unbiased predictive risk estimator (UPRE), and DP. The approach scales to problems with 110,000 variables and outperforms existing methods by 0.73-2.09 dB in deblurring tasks and 6.75 dB in 3D heat conduction. Limitations include sensitivity to rank selection in Tucker decomposition and the need for theoretical analysis. Future work will explore automated rank selection and theoretical guarantees. This method bridges Bayesian theory and scalable computation, offering practical solutions for large-scale inverse problems in imaging, remote sensing, and scientific computing.

[534] MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Chen-Hao Chao, Wei-Fang Sun, Junwei Qua, Chun-Yi Lee, Rahul G. Krishnan

Main category: cs.LG

TL;DR: MDM-Prime-v2 improves masked diffusion language models with binary encoding and index shuffling, achieving better compute efficiency and perplexity than autoregressive models.

DetailsMotivation: Address limitations in MDM-Prime framework: lack of guidance for token granularity hyperparameters and degradation of likelihood estimation when paired with BPE tokenizers.

Method: Develop MDM-Prime-v2 with Binary Encoding and Index Shuffling, study variational bound tightness, and conduct scaling analysis.

Result: 21.8× more compute-efficient than ARMs, achieves 7.77 perplexity on OpenWebText (vs 12.99 for ARM), and shows superior zero-shot accuracy on commonsense reasoning tasks at 1.1B parameters.

Conclusion: MDM-Prime-v2 successfully addresses MDM-Prime limitations and demonstrates superior performance and efficiency for language modeling.

Abstract: Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the function form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8$\times$ more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.

[535] A Depth-Aware Comparative Study of Euclidean and Hyperbolic Graph Neural Networks on Bitcoin Transaction Systems

Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

Main category: cs.LG

TL;DR: Comparison of Euclidean vs hyperbolic GNNs for Bitcoin transaction classification, analyzing embedding geometry and neighborhood depth effects.

DetailsMotivation: Bitcoin transaction networks are large-scale socio-technical systems where GNNs are used for tasks like fraud detection, but the interaction between receptive fields (neighborhood aggregation) and embedding geometry (Euclidean vs hyperbolic) has received limited attention.

Method: Controlled comparison of Euclidean and tangent-space hyperbolic GNNs for node classification on a large Bitcoin transaction graph, explicitly varying neighborhood depth while keeping model architecture and dimensionality fixed. Analyzed optimization behavior with joint selection of learning rate and curvature.

Result: Found that joint selection of learning rate and curvature is critical for stabilizing high-dimensional hyperbolic embeddings. Provides practical insights into embedding geometry and neighborhood depth for modeling transaction networks.

Conclusion: The study informs deployment of hyperbolic GNNs for computational social systems by analyzing the role of embedding geometry and neighborhood depth in large-scale transaction networks.

Abstract: Bitcoin transaction networks are large scale socio- technical systems in which activities are represented through multi-hop interaction patterns. Graph Neural Networks(GNNs) have become a widely adopted tool for analyzing such systems, supporting tasks such as entity detection and transaction classification. Large-scale datasets like Elliptic have allowed for a rise in the analysis of these systems and in tasks such as fraud detection. In these settings, the amount of transactional context available to each node is determined by the neighborhood aggregation and sampling strategies, yet the interaction between these receptive fields and embedding geometry has received limited attention. In this work, we conduct a controlled comparison of Euclidean and tangent-space hyperbolic GNNs for node classification on a large Bitcoin transaction graph. By explicitly varying the neighborhood while keeping the model architecture and dimensionality fixed, we analyze the differences in two embedding spaces. We further examine optimization behavior and observe that joint selection of learning rate and curvature plays a critical role in stabilizing high-dimensional hyperbolic embeddings. Overall, our findings provide practical insights into the role of embedding geometry and neighborhood depth when modeling large-scale transaction networks, informing the deployment of hyperbolic GNNs for computational social systems.

[536] Functorial Neural Architectures from Higher Inductive Types

Karen Sargsyan

Main category: cs.LG

TL;DR: The paper proposes a functorial approach to neural network design that guarantees compositional generalization by translating Higher Inductive Type specifications into neural architectures via monoidal functors.

DetailsMotivation: Neural networks systematically fail at compositional generalization - producing correct outputs for novel combinations of known parts. The authors aim to address this architectural limitation by establishing a mathematical framework that guarantees compositional generalization.

Method: The paper compiles Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps. Path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations.

Result: Experiments on three spaces validate the approach: on the torus (ℤ²), functorial decoders outperform non-functorial ones by 2-2.7x; on S¹ ∨ S¹ (F₂), the gap widens to 5.5-10x; on the Klein bottle (ℤ ⋊ ℤ), a learned 2-cell closes a 46% error gap on words exercising the group relation.

Conclusion: The paper demonstrates that compositional generalization can be guaranteed architecturally through functorial design, with formal proofs in Cubical Agda showing that decoders assembled by structural concatenation are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for non-trivial compositional tasks.

Abstract: Neural networks systematically fail at compositional generalization – producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for any non-trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus ($\mathbb{Z}^2$), functorial decoders outperform non-functorial ones by 2-2.7x; on $S^1 \vee S^1$ ($F_2$), the type-A/B gap widens to 5.5-10x; on the Klein bottle ($\mathbb{Z} \rtimes \mathbb{Z}$), a learned 2-cell closes a 46% error gap on words exercising the group relation.

[537] Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Yuxuan Zhu, Daniel Kang

Main category: cs.LG

TL;DR: RLVR performance claims with noisy data are invalid due to dataset contamination; truly noisy data causes significant performance degradation, showing current RLVR methods cannot compensate for poor data quality.

DetailsMotivation: Recent studies claim RLVR algorithms can effectively learn from incorrect annotations, achieving performance comparable to clean data. This work aims to validate these claims by examining dataset quality and noise impact.

Method: The authors re-verified datasets to ensure truly noisy data, rectified contamination issues, and tested RLVR algorithms (including GRPO) on mathematical reasoning benchmarks and Text2SQL tasks with real-world human annotation errors.

Result: After rigorous re-verification, noise is destructive to RLVR. Models trained on truly incorrect annotations perform 8-10% worse on mathematical reasoning and 5-12% worse on Text2SQL tasks compared to clean data training. Existing RLVR improvements fail to mitigate noise impact.

Conclusion: Current RLVR methods cannot compensate for poor data quality; high-quality data remains essential for effective reinforcement learning with verifiable rewards.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is “contaminated” with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world, human annotation errors cause 5-12% lower accuracy than clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.

[538] HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

Keru Chen, Jun Luo, Sen Lin, Yingbin Liang, Alvaro Velasquez, Nathaniel Bastian, Shaofeng Zou

Main category: cs.LG

TL;DR: HIPO is a novel alignment framework that treats hierarchical instruction following as a constrained optimization problem, using safe RL to enforce system prompt compliance as explicit constraints while maximizing user utility.

DetailsMotivation: Existing alignment methods (RLHF, DPO) fail at hierarchical instruction following because they optimize for single objectives without explicit system prompt compliance enforcement. Supervised fine-tuning also fails to establish priority asymmetry algorithmically.

Method: Formulates HIF as a Constrained Markov Decision Process and uses primal-dual safe reinforcement learning to dynamically enforce system prompt compliance as explicit constraints while maximizing user utility within feasible regions.

Result: HIPO significantly improves both system compliance and user utility across diverse model architectures (Qwen, Phi, Llama). Mechanistic analysis shows it drives models to shift attention toward long-range system tokens.

Conclusion: Provides a principled foundation for reliable LLM deployment in complex workflows by elevating system prompts from input context to strict algorithmic boundaries through constrained optimization.

Abstract: Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce \textsc{HIPO}, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. \textsc{HIPO} elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that \textsc{HIPO} significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.

[539] DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Main category: cs.LG

TL;DR: DyJR is a regularization framework for RL in LLMs that uses dynamic reference distributions and JSD regularization to maintain sample diversity and prevent mode collapse during training.

DetailsMotivation: Existing RL methods for LLMs like GRPO are sample-inefficient and discard past rollouts. Experience replay methods that reuse accurate samples often cause computational overhead and mode collapse through overfitting. The paper argues that historical data should prioritize maintaining diversity rather than just reinforcing accuracy.

Method: Proposes Dynamic Jensen-Shannon Replay (DyJR) with two innovations: 1) Time-Sensitive Dynamic Buffer using FIFO and adaptive sizing to retain temporally proximal samples synchronized with model evolution; 2) Jensen-Shannon Divergence Regularization that replaces direct gradient updates with distributional constraints to prevent diversity collapse.

Result: Experiments on mathematical reasoning and Text-to-SQL benchmarks show DyJR significantly outperforms GRPO and baselines like RLEP and Ex-GRPO, while maintaining training efficiency comparable to original GRPO. Analysis of Rank-k token probability evolution shows DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens.

Conclusion: DyJR provides an effective regularization framework for RL in LLMs that maintains sample diversity, prevents mode collapse, and improves performance while preserving training efficiency.

Abstract: While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.

[540] Execution-Grounded Credit Assignment for GRPO in Code Generation

Abhijit Kumar, Natalya Kumar, Shikhar Gupta

Main category: cs.LG

TL;DR: EGCA improves RL-based code generation by using execution traces to localize credit assignment for failed programs, identifying semantic divergences between candidate and reference solutions to apply RL updates only to relevant token spans.

DetailsMotivation: Current RL methods for code generation (like GRPO) suffer from coarse credit assignment - they spread a single outcome signal uniformly across long programs even when failures stem from localized semantic errors, making optimization inefficient.

Method: EGCA executes both candidate and reference solutions under identical instrumentation, identifies the earliest semantic divergence point, and assigns RL advantage only to the corresponding token span while masking downstream tokens. This is a drop-in modification requiring no critic, auxiliary loss, or learned verifier.

Result: Achieves 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with only 18% wall-clock overhead.

Conclusion: Execution-grounded credit assignment significantly improves RL-based code generation by providing finer-grained feedback through semantic divergence analysis, making optimization more efficient without requiring additional learned components.

Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.

[541] The Finetuner’s Fallacy: When to Pretrain with Your Finetuning Data

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini

Main category: cs.LG

TL;DR: Specialized Pretraining (SPT) improves domain performance and preserves general capabilities by repeating domain data during pretraining rather than just finetuning.

DetailsMotivation: Real-world deployments require strong performance on narrow domains with scarce data. Standard finetuning risks overfitting and forgetting general knowledge. The paper aims to find a better approach for domain adaptation.

Method: Specialized Pretraining (SPT) repeats a small domain dataset starting from pretraining as a fraction of total tokens. Evaluated across three specialized domains (ChemPile, MusicPile, ProofPile) with scaling laws derived for optimal domain-data repetition.

Result: SPT improves domain performance while preserving general capabilities, reduces pretraining tokens needed by up to 1.75x, and shows greater gains when target domain is underrepresented. A 1B SPT model outperforms 3B standard pretrained model on far-from-web domains.

Conclusion: Introducing specialized domain data during pretraining (SPT) yields better domain performance via reduced overfitting and better general performance via reduced forgetting, achieving stronger results with fewer parameters and less total compute. Domain data should be incorporated as early as possible.

Abstract: Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner’s fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.

[542] Sample-Efficient Adaptation of Drug-Response Models to Patient Tumors under Strong Biological Domain Shift

Camille Jimenez Cortes, Philippe Lalanda, German Vega

Main category: cs.LG

TL;DR: A staged transfer-learning framework for drug-response prediction that separates representation learning from task supervision, enabling more sample-efficient adaptation from cell lines to patient tumors with limited labeled data.

DetailsMotivation: The biological gap between in vitro cell lines and patient tumors makes drug response prediction challenging. Rather than improving absolute in vitro accuracy, the goal is to enable more sample-efficient adaptation to patient data under strong domain shift.

Method: Proposes a staged transfer-learning framework: 1) Learn cellular and drug representations independently from unlabeled pharmacogenomic data using autoencoder-based representation learning, 2) Align representations with drug-response labels on cell-line data, 3) Adapt to patient tumors using few-shot supervision.

Result: Unsupervised pretraining provides limited benefit when source/target domains overlap substantially, but yields clear gains when adapting to patient tumors with very limited labeled data. The framework achieves faster performance improvements during few-shot patient-level adaptation while maintaining comparable accuracy on standard cell-line benchmarks.

Conclusion: Learning structured and transferable representations from unlabeled molecular profiles can substantially reduce the amount of clinical supervision required for effective drug-response prediction, offering a practical pathway toward data-efficient preclinical-to-clinical translation.

Abstract: Predicting drug response in patients from preclinical data remains a major challenge in precision oncology due to the substantial biological gap between in vitro cell lines and patient tumors. Rather than aiming to improve absolute in vitro prediction accuracy, this work examines whether explicitly separating representation learning from task supervision enables more sample-efficient adaptation of drug-response models to patient data under strong biological domain shift. We propose a staged transfer-learning framework in which cellular and drug representations are first learned independently from large collections of unlabeled pharmacogenomic data using autoencoder-based representation learning. These representations are then aligned with drug-response labels on cell-line data and subsequently adapted to patient tumors using few-shot supervision. Through a systematic evaluation spanning in-domain, cross-dataset, and patient-level settings, we show that unsupervised pretraining provides limited benefit when source and target domains overlap substantially, but yields clear gains when adapting to patient tumors with very limited labeled data. In particular, the proposed framework achieves faster performance improvements during few-shot patient-level adaptation while maintaining comparable accuracy to single-phase baselines on standard cell-line benchmarks. Overall, these results demonstrate that learning structured and transferable representations from unlabeled molecular profiles can substantially reduce the amount of clinical supervision required for effective drug-response prediction, offering a practical pathway toward data-efficient preclinical-to-clinical translation.

[543] Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation

Yiming Zong, Jiashuo Jiang

Main category: cs.LG

TL;DR: Novel online semi-infinite linear programming approach using function approximation to handle large/infinite constraints with constraint-independent regret bounds.

DetailsMotivation: Traditional online linear programming algorithms suffer from regret bounds that depend on the number of constraints, making them impractical for problems with large or infinite constraint sets revealed via streaming data.

Method: Proposes an Online Semi-infinite Linear Programming (OSILP) formulation using function approximation to reduce constraints to constant q, with dual-based algorithm using potential functions and two-stage improvement for better regret.

Result: Achieves O(q√T) regret under stochastic input and O((q+q log T)√T) under random permutation, independent of number of constraints. Two-stage algorithm achieves O(q log T + q/ε) regret under stricter assumptions.

Conclusion: The approach effectively handles large/infinite constraint sets in online optimization with constraint-independent regret bounds, validated by experiments showing superiority over existing methods.

Abstract: We consider the dynamic resource allocation problem where the decision space is finite-dimensional, yet the solution must satisfy a large or even infinite number of constraints revealed via streaming data or oracle feedback. We model this challenge as an Online Semi-infinite Linear Programming (OSILP) problem and develop a novel LP formulation to solve it approximately. Specifically, we employ function approximation to reduce the number of constraints to a constant $q$. This addresses a key limitation of traditional online LP algorithms, whose regret bounds typically depend on the number of constraints, leading to poor performance in this setting. We propose a dual-based algorithm to solve our new formulation, which offers broad applicability through the selection of appropriate potential functions. We analyze this algorithm under two classical input models-stochastic input and random permutation-establishing regret bounds of $O(q\sqrt{T})$ and $O\left(\left(q+q\log{T})\sqrt{T}\right)\right)$ respectively. Note that both regret bounds are independent of the number of constraints, which demonstrates the potential of our approach to handle a large or infinite number of constraints. Furthermore, we investigate the potential to improve upon the $O(q\sqrt{T})$ regret and propose a two-stage algorithm, achieving $O(q\log{T} + q/ε)$ regret under more stringent assumptions. We also extend our algorithms to the general function setting. A series of experiments validates that our algorithms outperform existing methods when confronted with a large number of constraints.

[544] Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao

Main category: cs.LG

TL;DR: OXA fine-tuning improves mathematical reasoning in LLMs by optimizing exploration-aware objectives during supervised fine-tuning, enhancing subsequent RLVR training performance.

DetailsMotivation: Current RLVR approaches focus on exploration during RL training, but neglect exploration-aware SFT initialization, which shapes the subsequent exploration landscape and affects long-term performance.

Method: OXA fine-tuning optimizes two objectives: 1) promoting low-confidence verified teacher-distillation data to internalize new reasoning patterns, and 2) suppressing high-confidence incorrect self-distillation data to redistribute probability mass toward correct candidates.

Result: Experimental results across 6 benchmarks show consistent improvements, with average gains of +6 Pass@1 and +5 Pass@k points on Qwen2.5-1.5B-Math. OXA elevates initial policy entropy and gains persist throughout extensive RLVR training.

Conclusion: OXA fine-tuning provides long-term value by improving exploration-aware SFT initialization, which enhances both initial performance and subsequent RLVR training outcomes for mathematical reasoning tasks.

Abstract: Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.

[545] Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

Kaixuan Du, Meng Cao, Hang Zhang, Yukun Wang, Xiangzhou Huang, Ni Li

Main category: cs.LG

TL;DR: DCRL is a self-supervised training method for LLMs that uses a two-stage consensus mechanism to generate reliable learning signals without external supervision, improving reasoning performance while avoiding convergence on spurious popular answers.

DetailsMotivation: Current label-free RLVR approaches for LLMs (like TTRL and Self-reward) have limitations: they rely heavily on accurate pseudo-label estimation and tend to converge on spurious yet popular answers, trapping the model in dominant modes and limiting further improvements.

Method: Dual Consensus Reinforcement Learning (DCRL) uses a two-stage consensus mechanism: 1) the model acts as an anchor producing dominant responses, 2) it serves as an explorer generating diverse auxiliary signals via temporary unlearning. The final training target is derived from the harmonic mean of these two signal sets, operating entirely without external models or supervision.

Result: Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics.

Conclusion: DCRL establishes a scalable path toward stronger reasoning without labels by generating more reliable learning signals through its dual consensus mechanism, avoiding the limitations of previous label-free approaches.

Abstract: Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby trapping in a dominant mode and limiting further improvements. Building on this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method which is capable of generating more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; then it serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.

[546] Physics-integrated neural differentiable modeling for immersed boundary systems

Chenglin Li, Hang Xu, Jianting Chen, Yanfei Zhang

Main category: cs.LG

TL;DR: A physics-integrated differentiable framework for long-horizon prediction of immersed-boundary flows using neural PDE solvers with learned pressure correction and stable coarse-grid rollouts.

DetailsMotivation: To address challenges in computing complex fluid flows near solid boundaries over long horizons, where conventional solvers are computationally expensive and purely data-driven models lack robustness under extrapolative conditions.

Method: Extends neural PDE solvers with physics-integrated differentiable framework incorporating PDE-based intermediate velocity module and multi-direct forcing immersed boundary module. Replaces expensive pressure projection with learned implicit correction using ConvResNet blocks, and uses sub-iteration strategy for stable coarse-grid autoregressive rollouts.

Result: Outperforms purely data-driven, physics-loss-constrained, and coarse-grid numerical baselines in flow-field fidelity and long-horizon stability, with ~200x inference speedup over high-resolution solver, trained in under one hour on single GPU.

Conclusion: The framework successfully integrates physical principles into differentiable architecture for efficient and stable long-horizon prediction of immersed-boundary flows, bridging gap between numerical solvers and data-driven models.

Abstract: Accurately, efficiently, and stably computing complex fluid flows and their evolution near solid boundaries over long horizons remains challenging. Conventional numerical solvers require fine grids and small time steps to resolve near-wall dynamics, resulting in high computational costs, while purely data-driven surrogate models accumulate rollout errors and lack robustness under extrapolative conditions. To address these issues, this study extends existing neural PDE solvers by developing a physics-integrated differentiable framework for long-horizon prediction of immersed-boundary flows. A key design aspect of the framework includes an important improvement, namely the structural integration of physical principles into an end-to-end differentiable architecture incorporating a PDE-based intermediate velocity module and a multi-direct forcing immersed boundary module, both adhering to the pressure-projection procedure for incompressible flow computation. The computationally expensive pressure projection step is substituted with a learned implicit correction using ConvResNet blocks to reduce cost, and a sub-iteration strategy is introduced to separate the embedded physics module’s stability requirement from the surrogate model’s time step, enabling stable coarse-grid autoregressive rollouts with large effective time increments. The framework uses only single-step supervision for training, eliminating long-horizon backpropagation and reducing training time to under one hour on a single GPU. Evaluations on benchmark cases of flow past a stationary cylinder and a rotationally oscillating cylinder at Re=100 show the proposed model consistently outperforms purely data-driven, physics-loss-constrained, and coarse-grid numerical baselines in flow-field fidelity and long-horizon stability, while achieving an approximately 200-fold inference speedup over the high-resolution solver.

[547] Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction

Saarang Panchavati, Uddhav Panchavati, Corey Arnold, William Speier

Main category: cs.LG

TL;DR: Laya is the first EEG foundation model using LeJEPA (Latent Joint Embedding Predictive Architecture) that learns by predicting latent representations instead of reconstructing raw EEG signals, showing improved performance over reconstruction-based methods.

DetailsMotivation: Current EEG foundation models rely on signal reconstruction as SSL objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure, leading to modest improvements and sensitivity to downstream adaptation.

Method: Introduces Laya, the first EEG foundation model based on LeJEPA (Latent Joint Embedding Predictive Architecture), which learns by predicting latent representations instead of reconstructing raw EEG signals, providing a more principled and stable formulation.

Result: Across a range of EEG benchmarks, Laya demonstrates improved performance under linear probing compared to reconstruction-based baselines, suggesting latent predictive objectives offer better transferable, high-level EEG representations.

Conclusion: Latent predictive objectives (JEPA-style methods) offer a promising direction for learning transferable, high-level EEG representations compared to traditional reconstruction-based SSL approaches.

Abstract: Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear; reported improvements over smaller task-specific models are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which learn by predicting latent representations instead of reconstructing raw signals. While earlier JEPA-style methods often rely on additional heuristics to ensure training stability, recent advances such as LeJEPA provide a more principled and stable formulation. We introduce Laya, the first EEG foundation model based on LeJEPA. Across a range of EEG benchmarks, Laya demonstrates improved performance under linear probing compared to reconstruction-based baselines, suggesting that latent predictive objectives offer a promising direction for learning transferable, high-level EEG representations.

[548] Decoding the Critique Mechanism in Large Reasoning Models

Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

Main category: cs.LG

TL;DR: LRMs exhibit hidden critique abilities that enable error detection and self-correction even when errors propagate through reasoning chains, with identifiable critique vectors that can be used to enhance error detection without additional training.

DetailsMotivation: To systematically investigate how Large Reasoning Models recover from errors and understand their internal critique mechanisms, particularly when they can detect and correct mistakes despite error propagation through reasoning chains.

Method: Insert arithmetic mistakes in intermediate reasoning steps, analyze error recovery patterns, conduct feature space analysis to identify interpretable critique vectors, and perform extensive experiments across multiple model scales and families.

Result: Discovered that LRMs can reach correct final answers despite error propagation, identified highly interpretable critique vectors representing error detection behavior, and demonstrated that steering latent representations with these vectors improves error detection and enhances test-time scaling performance.

Conclusion: LRMs possess hidden critique abilities that enable self-correction, with identifiable critique vectors that provide valuable understanding of their self-verification mechanisms and offer promising directions for controlling and improving these capabilities.

Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong “critique” ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain-of-thought (CoT), resulting in an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model’s error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs’ critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at https://github.com/mail-research/lrm-critique-vectors.

[549] Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

Jia Qing Yap

Main category: cs.LG

TL;DR: Researchers use sparse autoencoders on a 35B parameter MoE model to identify and steer agentic behavioral traits through linear probes and activation space projection, enabling fine-grained behavioral intervention without retraining.

DetailsMotivation: To develop methods for identifying and steering specific behavioral traits in large language models, particularly agentic behaviors like autonomy and tool use, without requiring model retraining or fine-tuning.

Method: Train nine sparse autoencoders on Qwen 3.5-35B-A3B’s residual stream, use linear probes on SAE latent activations to identify behavioral traits, project probe weights back through SAE decoder to obtain continuous steering vectors, and apply these vectors during inference for behavioral intervention.

Result: Autonomy steering achieved Cohen’s d = 1.01, shifting model from asking for help 78% of time to proactive execution; all five steering vectors primarily modulate a single dominant agency axis; tool-use vector steers behavior (d = 0.39); risk-calibration only suppresses behavior; steering during autoregressive decoding has zero effect.

Conclusion: The method enables fine-grained behavioral steering without retraining, reveals that behavioral traits are dominated by a single agency axis, and provides causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.

Abstract: We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model’s native activation space. This bypasses the SAE’s top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios times 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen’s d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.

[550] DynamicGate MLP Conditional Computation via Learned Structural Dropout and Input Dependent Gating for Functional Plasticity

Yong Il Choi

Main category: cs.LG

TL;DR: DynamicGate-MLP unifies regularization and conditional computation by learning input-dependent gates that stochastically deactivate units during training and selectively execute only necessary computations at inference time.

DetailsMotivation: Standard dropout provides regularization but uses dense computation at inference, while conditional computation adapts computation to inputs but lacks regularization benefits. The paper aims to unify both approaches in a single framework.

Method: Proposes DynamicGate-MLP that learns continuous gate probabilities to decide unit/block usage. Uses Straight-Through Estimator to optimize discrete execution masks, controls compute budget via gate usage penalty, and generates sample-dependent execution paths.

Result: Evaluated on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k datasets. Compared with MLP baselines and MoE variants using gate activation ratios and layer-weighted relative MAC metrics for compute efficiency.

Conclusion: DynamicGate-MLP successfully bridges regularization and conditional computation, enabling sample-dependent execution that suppresses unnecessary computation while maintaining model performance.

Abstract: Dropout is a representative regularization technique that stochastically deactivates hidden units during training to mitigate overfitting. In contrast, standard inference executes the full network with dense computation, so its goal and mechanism differ from conditional computation, where the executed operations depend on the input. This paper organizes DynamicGate-MLP into a single framework that simultaneously satisfies both the regularization view and the conditional-computation view. Instead of a random mask, the proposed model learns gates that decide whether to use each unit (or block), suppressing unnecessary computation while implementing sample-dependent execution that concentrates computation on the parts needed for each input. To this end, we define continuous gate probabilities and, at inference time, generate a discrete execution mask from them to select an execution path. Training controls the compute budget via a penalty on expected gate usage and uses a Straight-Through Estimator (STE) to optimize the discrete mask. We evaluate DynamicGate-MLP on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k, and compare it with various MLP baselines and MoE-style variants. Compute efficiency is compared under a consistent criterion using gate activation ratios and a layerweighted relative MAC metric, rather than wall-clock latency that depends on hardware and backend kernels.

[551] FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios

Andrea Moleri, Christian Internò, Ali Raza, Markus Olhofer, David Klindt, Fabio Stella, Barbara Hammer

Main category: cs.LG

TL;DR: FederatedFactory is a novel FL framework that exchanges generative modules instead of discriminative parameters to handle mutually exclusive label distributions, enabling class-balanced dataset synthesis without gradient conflicts.

DetailsMotivation: Standard federated learning fails when local label distributions are mutually exclusive due to conflicting optimization trajectories. Existing FL methods often rely on unrealistic assumptions like pretrained foundation models.

Method: Inverts the unit of federation from discriminative parameters to generative priors. Exchanges generative modules in a single communication round to synthesize universally class-balanced datasets, eliminating gradient conflict and external prior bias.

Result: Achieves centralized upper-bound performance on medical imagery benchmarks (MedMNIST, ISIC2019). Under pathological heterogeneity, lifts baseline accuracy from 11.36% to 90.57% on CIFAR-10 and restores ISIC2019 AUROC to 90.57%.

Conclusion: FederatedFactory provides an effective solution for FL with mutually exclusive label distributions, enables exact modular unlearning through deterministic deletion of specific generative modules, and eliminates dependency on external priors.

Abstract: Federated Learning (FL) enables distributed optimization without compromising data sovereignty. Yet, where local label distributions are mutually exclusive, standard weight aggregation fails due to conflicting optimization trajectories. Often, FL methods rely on pretrained foundation models, introducing unrealistic assumptions. We introduce FederatedFactory, a zero-dependency framework that inverts the unit of federation from discriminative parameters to generative priors. By exchanging generative modules in a single communication round, our architecture supports ex nihilo synthesis of universally class balanced datasets, eliminating gradient conflict and external prior bias entirely. Evaluations across diverse medical imagery benchmarks, including MedMNIST and ISIC2019, demonstrate that our approach recovers centralized upper-bound performance. Under pathological heterogeneity, it lifts baseline accuracy from a collapsed 11.36% to 90.57% on CIFAR-10 and restores ISIC2019 AUROC to 90.57%. Additionally, this framework facilitates exact modular unlearning through the deterministic deletion of specific generative modules.

[552] Prior-Informed Neural Network Initialization: A Spectral Approach for Function Parameterizing Architectures

David Orlando Salazar Torres, Diyar Altinses, Andreas Schwung

Main category: cs.LG

TL;DR: A prior-informed neural network design strategy that uses spectral and temporal data structure to guide initialization and architecture, improving convergence and efficiency without changing training procedures.

DetailsMotivation: Neural architectures for function parameterization (like Bag-of-Functions) are sensitive to initialization because traditional data-agnostic schemes don't capture target signal structure, leading to suboptimal convergence.

Method: Uses Fast Fourier Transform to extract dominant seasonal priors to inform model depth and initial states, plus residual-based regression for trend parameterization. This structural alignment reduces encoder dimensionality while maintaining reconstruction fidelity.

Result: Extensive experiments show embedding data-driven priors significantly accelerates convergence, reduces performance variability, improves computational efficiency, and enables more compact architectures while outperforming standard initialization baselines.

Conclusion: The framework enables more compact and interpretable architectures with better performance through data-driven structural alignment, without altering core training procedures.

Abstract: Neural network architectures designed for function parameterization, such as the Bag-of-Functions (BoF) framework, bridge the gap between the expressivity of deep learning and the interpretability of classical signal processing. However, these models are inherently sensitive to parameter initialization, as traditional data-agnostic schemes fail to capture the structural properties of the target signals, often leading to suboptimal convergence. In this work, we propose a prior-informed design strategy that leverages the intrinsic spectral and temporal structure of the data to guide both network initialization and architectural configuration. A principled methodology is introduced that uses the Fast Fourier Transform to extract dominant seasonal priors, informing model depth and initial states, and a residual-based regression approach to parameterize trend components. Crucially, this structural alignment enables a substantial reduction in encoder dimensionality without compromising reconstruction fidelity. A supporting theoretical analysis provides guidance on trend estimation under finite-sample regimes. Extensive experiments on synthetic and real-world benchmarks demonstrate that embedding data-driven priors significantly accelerates convergence, reduces performance variability across trials, and improves computational efficiency. Overall, the proposed framework enables more compact and interpretable architectures while outperforming standard initialization baselines, without altering the core training procedure.

[553] Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications

Debdas Paul, Elisa Ferrari, Irene Gravili, Alessandro Cellerino

Main category: cs.LG

TL;DR: A theoretical framework for invariant representation learning to improve out-of-distribution generalization in age prediction by mitigating biases from attributes like race, gender, and tissue, with application to mouse transcriptomic data.

DetailsMotivation: Chronological age predictors often fail to generalize out-of-distribution due to exogenous attributes (race, gender, tissue) that create bias. Learning invariant representations is essential for better OOD generalization, bias mitigation, causal analysis, and fairness.

Method: Theoretical exploration of invariant representation learning concepts with an interpretable neural network model based on adversarial representation learning. Applied to publicly available mouse transcriptomic datasets and compared with conventional ML models.

Result: The model’s outcomes are consistent with published predictive results on Elamipretide effects on mouse skeletal and cardiac muscle. The approach demonstrates improved handling of exogenous attributes compared to conventional models.

Conclusion: While the model shows promise for bias mitigation and OOD generalization, there are limitations in deriving causal interpretations from purely predictive models. The theoretical framework provides coherent understanding across predictive, causal, and fairness perspectives.

Abstract: Chronological age predictors often fail to achieve out-of-distribution (OOD) gen- eralization due to exogenous attributes such as race, gender, or tissue. Learning an invariant representation with respect to those attributes is therefore essential to improve OOD generalization and prevent overly optimistic results. In predic- tive settings, these attributes motivate bias mitigation; in causal analyses, they appear as confounders; and when protected, their suppression leads to fairness. We coherently explore these concepts with theoretical rigor and discuss the scope of an interpretable neural network model based on adversarial representation learning. Using publicly available mouse transcriptomic datasets, we illustrate the behavior of this model relative to conventional machine learning models. We observe that the outcome of this model is consistent with the predictive results of a published study demonstrating the effects of Elamipretide on mouse skeletal and cardiac muscle. We conclude by discussing the limitations of deriving causal interpretation from such purely predictive models.

[554] Trained Persistent Memory for Frozen Encoder–Decoder LLMs: Six Architectural Methods

Hong Jeong

Main category: cs.LG

TL;DR: Frozen LLMs can be enhanced with persistent differentiable memory in continuous latent space using small trainable adapters, enabling conversational learning without altering the backbone.

DetailsMotivation: Current frozen encoder-decoder language models are stateless and discard latent representations after each forward pass, preventing information persistence across sessions. The authors aim to demonstrate that persistent memory in the continuous latent space of frozen LLMs is feasible even under severe resource constraints.

Method: Implemented six architectural methods spanning three injection points and four write mechanisms for differentiable memory operations on dense vectors. Used a single frozen Flan-T5-XL backbone with small trainable adapters and a single dataset. Memory bank accumulates at inference time without gradients after adapter training.

Result: At 10× capacity scale, all six trained adapters produced positive memory-recall curves, while stateless baseline scored zero. At 1× capacity, three methods collapsed, revealing capacity as critical design parameter. Memory bank can scale arbitrarily without altering backbone.

Conclusion: Persistent differentiable memory in frozen LLMs is feasible and establishes a baseline for future work with larger models, data, and memory capacity. The approach enables conversational learning through continuous latent space memory.

Abstract: Frozen encoder–decoder language models are stateless: the latent representation is discarded after every forward pass, so no information persists across sessions. This paper presents a \textbf{proof-of-concept pilot study} showing that persistent memory in the \emph{continuous latent space} of a frozen LLM is feasible – even under severe resource constraints (a single frozen Flan-T5-XL backbone, small trainable adapters, a single dataset). We implement six architectural methods spanning three injection points and four write mechanisms; unlike text-level memory systems, every write and read is a differentiable operation on dense vectors. After training only the adapter, the memory bank continues to accumulate at inference time without gradients, enabling \emph{conversational learning}. Under a forgetting-curve evaluation on LoCoMo at two capacity scales (1$\times$ and 10$\times$), the stateless baseline scores exactly zero; at 10$\times$ all six trained adapters produce positive memory-recall curves; at 1$\times$ three methods collapse, revealing capacity as a critical design parameter. Because the memory bank is a compact numerical array, it can be scaled to arbitrarily large capacity without altering the backbone. We argue that full end-to-end training with larger models, larger data, and orders-of-magnitude larger memory will yield substantially stronger results; this pilot study establishes the feasibility baseline and design-space taxonomy that such efforts require.

[555] DISCOVER: A Solver for Distributional Counterfactual Explanations

Yikai Gu, Lele Cao, Bo Zhao, Lei Lei, Lei You

Main category: cs.LG

TL;DR: DISCOVER is a model-agnostic solver for distributional counterfactual explanations that replaces gradient-based optimization with sparse propose-and-select search, enabling distributional counterfactual reasoning for non-differentiable black-box models.

DetailsMotivation: Existing Distributional Counterfactual Explanations (DCE) methods rely on gradient-based optimization, but many real-world tabular pipelines use non-differentiable models, limiting their applicability. There's a need for model-agnostic approaches that can handle black-box models while preserving the statistical certification benefits of DCE.

Method: DISCOVER replaces gradient descent with a sparse propose-and-select search paradigm. It uses a sample-wise decomposition of the transport objective to compute per-row impact scores and enforces a top-k intervention budget. For candidate generation without predictor gradients, it introduces an OT-guided cone sampling primitive driven by input-side transport geometry.

Result: Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, successfully extending distributional counterfactual reasoning to modern black-box learning pipelines.

Conclusion: DISCOVER enables distributional counterfactual explanations for non-differentiable black-box models while preserving the original DCE objective and certification, making distributional counterfactual reasoning applicable to a broader range of real-world machine learning pipelines.

Abstract: Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, with statistical certification via chance constrained bounds. However, DCE relies on gradient based optimization, while many real-world tabular pipelines are dominated by non-differentiable models. We propose DISCOVER, a model-agnostic solver for distributional counterfactual explanations. DISCOVER preserves the original DCE objective and certification while replacing gradient descent with a sparse propose-and-select search paradigm. It exploits a sample-wise decomposition of the transport objective to compute per-row impact scores and enforce a top-$k$ intervention budget, focusing edits on the most influential samples. To guide candidate generation without predictor gradients, DISCOVER introduces an OT-guided cone sampling primitive driven by input-side transport geometry. Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, extending distributional counterfactual reasoning to modern black box learning pipelines. A code repository is available at https://github.com/understanding-ml/DCE.

[556] Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models

Rishaank Gupta

Main category: cs.LG

TL;DR: CGC introduces capability density maps from Sparse Autoencoders to guide compression budgets, addressing the capability-blind compression problem where current methods ignore what model components functionally encode.

DetailsMotivation: Current LLM compression methods allocate budgets without understanding what individual model components functionally encode, leading to reasoning capability loss and abrupt performance phase transitions that perplexity-based evaluation fails to detect.

Method: Uses Sparse Autoencoder-derived capability density maps combining feature breadth, activation entropy, and cross-input consistency to allocate differential compression budgets across transformer components. Proves theoretical relationship between capability density and structural redundancy.

Result: Capability density is statistically independent of existing importance metrics (Wanda scores), establishing it as novel compression signal. Negative result on PPL-based compression comparison suggests GPT-2 Medium insufficient test bed for full CGC hypothesis.

Conclusion: Provides theoretical framework for capability-aware compression research, addressing root cause of reasoning capability loss in current compression methods through capability density formalism.

Abstract: Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures – the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component’s SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.

[557] Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function

Amon Lahr, Anna Scampicchio, Johannes Köhler, Melanie N. Zeilinger

Main category: cs.LG

TL;DR: Distribution-free, tight uncertainty bounds for multi-output kernel methods that generalize existing results and enable safe learning-based control

DetailsMotivation: Existing uncertainty bounds for kernel methods have limitations: strong noise distribution assumptions, conservatism, poor multi-output scaling, or difficulty integrating into downstream tasks. Reliable uncertainty quantification is crucial for safe learning-based control.

Method: Proposes a tight, distribution-free bound for multi-output kernel-based estimates using an unconstrained, duality-based formulation that shares the same structure as classic Gaussian process confidence bounds.

Result: The bound generalizes many existing results and can be straightforwardly integrated into downstream optimization pipelines, demonstrated with a quadrotor dynamics learning example.

Conclusion: The paper presents a practical uncertainty quantification method for kernel-based learning that addresses key limitations of existing approaches, enabling more reliable and safe learning-based control applications.

Abstract: Non-conservative uncertainty bounds are essential for making reliable predictions about latent functions from noisy data–and thus, a key enabler for safe learning-based control. In this domain, kernel methods such as Gaussian process regression are established techniques, thanks to their inherent uncertainty quantification mechanism. Still, existing bounds either pose strong assumptions on the underlying noise distribution, are conservative, do not scale well in the multi-output case, or are difficult to integrate into downstream tasks. This paper addresses these limitations by presenting a tight, distribution-free bound for multi-output kernel-based estimates. It is obtained through an unconstrained, duality-based formulation, which shares the same structure of classic Gaussian process confidence bounds and can thus be straightforwardly integrated into downstream optimization pipelines. We show that the proposed bound generalizes many existing results and illustrate its application using an example inspired by quadrotor dynamics learning.

[558] Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models

Subina Khanal, Seshu Tirupathi, Merim Dzaferagic, Marco Ruffini, Torben Bach Pedersen

Main category: cs.LG

TL;DR: A new high-frequency (millisecond-resolution) wireless network dataset for time series foundation models, revealing poor performance of current TSFMs on high-frequency data and highlighting the need for diverse temporal resolutions in pre-training.

DetailsMotivation: Current time series foundation models (TSFMs) are trained on low-frequency data (seconds to years), limiting their ability to handle high-frequency time series. There's a need for diverse temporal resolutions and new domains to improve TSFM generalization.

Method: Introduces a novel dataset capturing millisecond-resolution wireless and traffic conditions from operational 5G deployments. Benchmarks traditional ML models and TSFMs on short-term forecasting tasks with prediction horizons from 100ms to 9.6 seconds.

Result: Most TSFM configurations perform poorly on this high-frequency data in both zero-shot and fine-tuned settings, demonstrating current models’ limitations with high-frequency distributions.

Conclusion: High-frequency datasets are crucial for TSFM pre-training to enhance architectures, fine-tuning strategies, generalization, and robustness in real-world applications across diverse temporal resolutions.

Abstract: Time series foundation models (TSFMs) require diverse, real-world datasets to adapt across varying domains and temporal frequencies. However, current large-scale datasets predominantly focus on low-frequency time series with sampling intervals, i.e., time resolution, in the range of seconds to years, hindering their ability to capture the nuances of high-frequency time series data. To address this limitation, we introduce a novel dataset that captures millisecond-resolution wireless and traffic conditions from an operational 5G wireless deployment, expanding the scope of TSFMs to incorporate high-frequency data for pre-training. Further, the dataset introduces a new domain, wireless networks, thus complementing existing more general domains like energy and finance. The dataset also provides use cases for short-term forecasting, with prediction horizons spanning from 100 milliseconds (1 step) to 9.6 seconds (96 steps). By benchmarking traditional machine learning models and TSFMs on predictive tasks using this dataset, we demonstrate that most TSFM model configurations perform poorly on this new data distribution in both zero-shot and fine-tuned settings. Our work underscores the importance of incorporating high-frequency datasets during pre-training and forecasting to enhance architectures, fine-tuning strategies, generalization, and robustness of TSFMs in real-world applications.

[559] From the Inside Out: Progressive Distribution Refinement for Confidence Calibration

Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song

Main category: cs.LG

TL;DR: DistriTTRL: A method that uses distribution priors of model confidence during RL to optimize self-reward signals and mitigate reward hacking in test-time training, achieving performance improvements across models and benchmarks.

DetailsMotivation: Addressing two key issues in self-reward RL: 1) the discrepancy in internal information between test and training phases, and 2) reward hacking problems in voting-based test-time scaling strategies.

Method: Proposes DistriTTRL which leverages distribution priors of model confidence during RL to progressively optimize reward signals (instead of single-query rollouts), and introduces diversity-targeted penalties to mitigate consistent reward hacking from voting-based TTS strategies.

Result: Achieves significant performance improvements across multiple models and benchmarks, benefiting from the complementary training mechanism between model capability and self-reward signals, and effective mitigation of reward hacking.

Conclusion: DistriTTRL effectively addresses test-training discrepancy and reward hacking in self-reward RL through distribution-based optimization and diversity penalties, leading to robust performance gains.

Abstract: Leveraging the model’s internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model’s confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.

[560] FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

Main category: cs.LG

TL;DR: FEAT is a linear-complexity foundation model for structured data that replaces quadratic attention with hybrid linear encoding for efficient cross-sample modeling.

DetailsMotivation: Existing large structured-data models face limitations: O(N²) complexity from sample-wise self-attention, representation degradation from linear sequence models, and distribution mismatch from synthetic-only pre-training.

Method: Multi-layer dual-axis architecture with adaptive-fusion bi-Mamba-2 for local sample dependencies and convolutional gated linear attention for global memory, plus hybrid structural causal model pipeline and stable reconstruction objective.

Result: Outperforms baselines on 11 real-world datasets in zero-shot performance, scales linearly, and achieves up to 40x faster inference.

Conclusion: FEAT provides an efficient foundation model for structured data that overcomes quadratic complexity limitations while maintaining strong performance.

Abstract: Structured data is foundational to healthcare, finance, e-commerce, and scientific data management. Large structured-data models (LDMs) extend the foundation model paradigm to unify heterogeneous datasets for tasks such as classification, regression, and decision support. However, existing LDMs face major limitations. First, most rely on sample-wise self-attention, whose O(N^2) complexity limits the sample count. Second, linear sequence models often degrade representations due to hidden-state compression and artificial causal bias. Third, synthetic-only pre-training often fails to match real-world distributions. We propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT introduces a multi-layer dual-axis architecture that replaces quadratic attention with hybrid linear encoding. The architecture combines adaptive-fusion bi-Mamba-2 (AFBM) for local sample dependencies and convolutional gated linear attention (Conv-GLA) for global memory. This design enables linear-complexity cross-sample modeling while preserving expressive representations. To improve robustness, FEAT adopts a hybrid structural causal model pipeline and a stable reconstruction objective. Experiments on 11 real-world datasets show that FEAT consistently outperforms baselines in zero-shot performance, while scaling linearly and achieving up to 40x faster inference.

[561] When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Zelin Zhang, Fei Cheng, Chenhui Chu

Main category: cs.LG

TL;DR: The paper proposes unsupervised RL with intrinsic rewards for mathematical reasoning in LLMs, addressing scalability issues of supervised RL, and introduces geometric diagnostics to understand training stability.

DetailsMotivation: Supervised RL for mathematical reasoning in LLMs suffers from scalability bottlenecks due to expensive ground-truth annotations. Unsupervised RL with intrinsic rewards offers a scalable alternative but faces issues like opaque training dynamics, policy collapse, and reward hacking.

Method: 1) Design and evaluate intrinsic rewards that enforce concise and certain generation; 2) Test base models across a spectrum of intrinsic reasoning capabilities; 3) Introduce geometric diagnostic lens to analyze training stability and identify successful configurations as being enveloped by manifolds.

Result: The approach successfully boosts mathematical reasoning through concise and certain responses, but also reveals when this unsupervised approach breaks down and provides geometric explanations for why certain configurations stabilize while others collapse.

Conclusion: The work goes beyond demonstrating that enforcing concise and certain responses improves mathematical reasoning - it reveals failure modes of unsupervised RL and provides geometric diagnostics to understand training stability, offering insights into when and why this approach works or fails.

Abstract: Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model’s foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.

[562] SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

Viktor Stein, Wuchen Li, Gabriele Steidl

Main category: cs.LG

TL;DR: The paper introduces accelerated attention blocks for transformers using inertial Nesterov-type dynamics on probability density spaces, showing faster convergence than classical attention while preserving oracle calls.

DetailsMotivation: Transformers have achieved great success in NLP, with recent perspectives interpreting attention blocks as interacting particle systems. The authors aim to extend this viewpoint by introducing accelerated attention blocks derived from inertial dynamics to improve convergence rates while maintaining computational efficiency.

Method: Propose accelerated attention blocks based on inertial Nesterov-type dynamics on density spaces. Tokens carry both spatial (feature) and velocity variables. Time discretization and approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks. For linear self-attention, show these blocks approximate a Stein variational gradient flow with bilinear kernel.

Result: Prove that elliptically contoured probability distributions are preserved by accelerated attention blocks. Demonstrate through particle-based algorithms that proposed accelerated attention blocks converge faster than classical attention blocks while preserving the number of oracle calls.

Conclusion: The paper successfully introduces a novel accelerated attention mechanism based on inertial dynamics that provides faster convergence while maintaining computational efficiency, offering a new theoretical perspective on transformer architectures.

Abstract: Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.

[563] Manifold-Matching Autoencoders

Laurent Cheret, Vincent Létourneau, Isar Nejadgholi, Chris Drummond, Hussein Al Osman, Maia Fraser

Main category: cs.LG

TL;DR: Manifold-Matching Autoencoder (MMAE) aligns pairwise distances between input and latent spaces for unsupervised regularization, outperforming similar methods on nearest-neighbor and topology preservation metrics.

DetailsMotivation: The paper aims to develop an unsupervised regularization method for autoencoders that better preserves the geometric structure of data in the latent space, addressing limitations of existing approaches that may not adequately maintain pairwise distance relationships.

Method: MMAE aligns pairwise distances in the latent space to those in the input data space by minimizing mean squared error. The method works on distances rather than coordinates, allowing extension to lower-dimensional representations and providing a scalable approximation of Multi-Dimensional Scaling.

Result: MMAE outperforms similar methods on metrics based on preservation of nearest-neighbor distances and persistent homology-based measures. It effectively preserves the geometric structure of data while providing a scalable alternative to traditional MDS.

Conclusion: Manifold-Matching regularization is an effective approach for autoencoders that better preserves data geometry in latent representations, offering advantages over existing methods and providing a scalable approximation to MDS.

Abstract: We study a simple unsupervised regularization scheme for autoencoders called Manifold-Matching (MMAE): we align the pairwise distances in the latent space to those of the input data space by minimizing mean squared error. Because alignment occurs on pairwise distances rather than coordinates, it can also be extended to a lower-dimensional representation of the data, adding flexibility to the method. We find that this regularization outperforms similar methods on metrics based on preservation of nearest-neighbor distances and persistent homology-based measures. We also observe that MMAE provides a scalable approximation of Multi-Dimensional Scaling (MDS).

[564] SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit

Yibo Li, Qiongxiu Li

Main category: cs.LG

TL;DR: SOMP is a scalable gradient inversion attack framework that recovers private training text from aggregated gradients in LLMs by treating it as a sparse signal recovery problem, leveraging transformer gradient structure and sample-level sparsity.

DetailsMotivation: Prior gradient inversion attacks work well for small batches but struggle with larger batch sizes and longer sequences due to signal mixing, high computational costs, and degraded fidelity. There's a need for scalable methods that can handle aggregated gradients in practical training scenarios.

Method: SOMP (Subspace-Guided Orthogonal Matching Pursuit) frames text recovery from aggregated gradients as a sparse signal recovery problem. It exploits two key properties: 1) aggregated transformer gradients retain head-wise geometric structure, and 2) they exhibit sample-level sparsity. The method progressively narrows search space and disentangles mixed signals without exhaustive search using orthogonal matching pursuit guided by subspace information.

Result: SOMP consistently outperforms prior methods across multiple LLM families, model scales, and five languages in aggregated-gradient settings. For long sequences at batch size B=16, it achieves substantially higher reconstruction fidelity than baselines while remaining computationally competitive. Even under extreme aggregation (up to B=128), SOMP still recovers meaningful text where prior attacks become ineffective.

Conclusion: SOMP demonstrates that privacy leakage through gradient inversion can persist even in highly aggregated training regimes where previous methods fail, highlighting ongoing privacy risks in LLM training despite gradient aggregation.

Abstract: Gradient inversion attacks reveal that private training text can be reconstructed from shared gradients, posing a privacy risk to large language models (LLMs). While prior methods perform well in small-batch settings, scaling to larger batch sizes and longer sequences remains challenging due to severe signal mixing, high computational cost, and degraded fidelity. We present SOMP (Subspace-Guided Orthogonal Matching Pursuit), a scalable gradient inversion framework that casts text recovery from aggregated gradients as a sparse signal recovery problem. Our key insight is that aggregated transformer gradients retain exploitable head-wise geometric structure together with sample-level sparsity. SOMP leverages these properties to progressively narrow the search space and disentangle mixed signals without exhaustive search. Experiments across multiple LLM families, model scales, and five languages show that SOMP consistently outperforms prior methods in the aggregated-gradient regime.For long sequences at batch size B=16, SOMP achieves substantially higher reconstruction fidelity than strong baselines, while remaining computationally competitive. Even under extreme aggregation (up to B=128), SOMP still recovers meaningful text, suggesting that privacy leakage can persist in regimes where prior attacks become much less effective.

[565] Deep Tabular Representation Corrector

Hangting Ye, Peng Wang, Wei Fan, Xiaozhuang Song, He Zhao, Dandan Gun, Yi Chang

Main category: cs.LG

TL;DR: TRC is a model-agnostic representation corrector for deep tabular models that enhances representations without modifying original model parameters by addressing representation shift and redundancy through two tasks: representation re-estimation and space mapping.

DetailsMotivation: Existing deep tabular learning methods have limitations: in-learning methods require training from scratch or imposing constraints that make learning difficult, while pre-learning methods need extensive pre-training with prior knowledge. There's a need for an efficient way to enhance representations of already-trained tabular models without altering their parameters.

Method: TRC uses two tasks: (1) Tabular Representation Re-estimation - trains a shift estimator to calculate and mitigate inherent representation shift, and (2) Tabular Space Mapping - transforms re-estimated representations into a light-embedding vector space via a coordinate estimator while preserving predictive information to minimize redundancy. Both tasks work in a model-agnostic manner without touching original model parameters.

Result: Extensive experiments on state-of-the-art deep tabular models with TRC on various tabular benchmarks show consistent superiority and improved performance.

Conclusion: TRC provides an efficient, model-agnostic approach to enhance representations of trained deep tabular models without parameter modification, addressing representation shift and redundancy issues that hinder prediction performance.

Abstract: Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc. The recent success of deep learning has fostered many deep networks (e.g., Transformer, ResNet) based tabular learning methods. Generally, existing deep tabular machine learning methods are along with the two paradigms, i.e., in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representations which nonetheless train multiple tasks simultaneously and make learning more difficult, while pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning, which however need much extra training effort with prior knowledge. In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance any trained deep tabular model’s representations without altering its parameters in a model-agnostic manner. Specifically, targeting the representation shift and representation redundancy that hinder prediction, we propose two tasks, i.e., (i) Tabular Representation Re-estimation, that involves training a shift estimator to calculate the inherent shift of tabular representations to subsequently mitigate it, thereby re-estimating the representations and (ii) Tabular Space Mapping, that transforms the above re-estimated representations into a light-embedding vector space via a coordinate estimator while preserves crucial predictive information to minimize redundancy. The two tasks jointly enhance the representations of deep tabular models without touching on the original models thus enjoying high efficiency. Finally, we conduct extensive experiments on state-of-the-art deep tabular machine learning models coupled with TRC on various tabular benchmarks which have shown consistent superiority.

[566] Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems

Joe Standridge, Daniel Livescu, Paul Cizmas

Main category: cs.LG

TL;DR: Time reparameterization optimization improves neural ODE reduced-order models for stiff dynamical systems by smoothing trajectories for better learnability.

DetailsMotivation: Stiff dynamical systems challenge ML-ROMs due to unstable explicit integration and expensive/implicit integration. Time reparameterization can help but existing methods' effect on learnability is poorly understood.

Method: Proposes trajectory-optimized time reparameterization (TOTR) as an optimization problem in arc-length coordinates, selecting traversal-speed profile to penalize acceleration in stretched time for smoother training dynamics.

Result: TOTR yields smoother reparameterizations and improved predictions across three stiff problems (linear system, van der Pol oscillator, HIRES model), with 1-2 orders of magnitude loss reduction vs benchmarks.

Conclusion: Effective stiffness mitigation in ML-ROMs depends on regularity/learnability of time map itself; optimization-based TR provides robust framework for explicit reduced-order modeling of multiscale dynamical systems.

Abstract: Stiff dynamical systems present a challenge for machine-learning reduced-order models (ML-ROMs), as explicit time integration becomes unstable in stiff regimes while implicit integration within learning loops is computationally expensive and often degrades training efficiency. Time reparameterization (TR) offers an alternative by transforming the independent variable so that rapid physical-time transients are spread over a stretched-time coordinate, enabling stable explicit integration on uniformly sampled grids. Although several TR strategies have been proposed, their effect on learnability in ML-ROMs remains incompletely understood. This work investigates time reparameterization as a stiffness-mitigation mechanism for neural ODE reduced-order modeling and introduces a trajectory-optimized TR (TOTR) formulation. The proposed approach casts time reparameterization as an optimization problem in arc-length coordinates, in which a traversal-speed profile is selected to penalize acceleration in stretched time. By targeting the smoothness of the training dynamics, this formulation produces reparameterized trajectories that are better conditioned and easier to learn than existing TR methods. TOTR is evaluated on three stiff problems: a parameterized stiff linear system, the van der Pol oscillator, and the HIRES chemical kinetics model. Across all cases, the proposed approach yields smoother reparameterizations and improved physical-time predictions under identical training regimens than other TR approaches. Quantitative results demonstrate loss reductions of one to two orders of magnitude compared to benchmark algorithms. These results highlight that effective stiffness mitigation in ML-ROMs depends critically on the regularity and learnability of the time map itself, and that optimization-based TR provides a robust framework for explicit reduced-order modeling of multiscale dynamical systems.

[567] Efficient Reasoning on the Edge

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi

Main category: cs.LG

TL;DR: A method to enable efficient reasoning in small LLMs for mobile deployment using LoRA adapters, budget forcing via RL, parallel test-time scaling, and dynamic adapter switching with KV-cache sharing.

DetailsMotivation: Large LLMs with chain-of-thought reasoning are impractical for edge deployment due to verbose reasoning traces, large context requirements, high token generation costs, and large KV-cache footprints. Existing distillation approaches produce redundant reasoning traces unsuitable for mobile devices.

Method: 1) Lightweight reasoning using LoRA adapters with supervised fine-tuning; 2) Budget forcing via reinforcement learning to reduce response length; 3) Parallel test-time scaling to improve accuracy with minor latency increase; 4) Dynamic adapter-switching mechanism that activates reasoning only when needed; 5) KV-cache sharing strategy during prompt encoding to reduce time-to-first-token.

Result: Experiments on Qwen2.5-7B demonstrate efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. The method achieves significant response length reduction with minimal accuracy loss.

Conclusion: The proposed approach enables practical LLM reasoning for mobile deployment by addressing key challenges including memory constraints, latency, and computational efficiency through adapter-based techniques and optimization strategies.

Abstract: Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

[568] Simplex-to-Euclidean Bijection for Conjugate and Calibrated Multiclass Gaussian Process

Bernardo Williams, Harsha Vardhan Tetali, Arto Klami, Marcelo Hartmann

Main category: cs.LG

TL;DR: A conjugate Gaussian process model for multi-class classification using Aitchison geometry to map probability simplex to Euclidean space, enabling exact inference without approximations.

DetailsMotivation: Standard multi-class GP classifiers require approximations and have computational challenges. The authors aim to develop a conjugate GP model that provides exact inference and well-calibrated predictive probabilities for classification.

Method: Uses Aitchison geometry to map class probabilities from the probability simplex to an unconstrained Euclidean representation. This transforms classification into a GP regression problem with fewer latent dimensions than standard approaches, enabling conjugate inference without distributional approximations.

Result: Empirical results show well-calibrated and competitive performance across synthetic and real-world datasets. The method is compatible with standard sparse GP regression techniques, enabling scalable inference on larger datasets.

Conclusion: The proposed approach provides a conjugate and calibrated GP model for multi-class classification that avoids approximations, yields reliable predictive probabilities, and scales to larger datasets through sparse GP techniques.

Abstract: We propose a conjugate and calibrated Gaussian process (GP) model for multi-class classification by exploiting the geometry of the probability simplex. Our approach uses Aitchison geometry to map simplex-valued class probabilities to an unconstrained Euclidean representation, turning classification into a GP regression problem with fewer latent dimensions than standard multi-class GP classifiers. This yields conjugate inference and reliable predictive probabilities without relying on distributional approximations in the model construction. The method is compatible with standard sparse GP regression techniques, enabling scalable inference on larger datasets. Empirical results show well-calibrated and competitive performance across synthetic and real-world datasets.

[569] Self-Aware Markov Models for Discrete Reasoning

Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl

Main category: cs.LG

TL;DR: A novel diffusion method with learnable Markov transition kernel and adaptive stopping for reasoning tasks, enabling error correction and computation adjustment based on problem complexity.

DetailsMotivation: Standard masked discrete diffusion models have limitations in reasoning tasks: they cannot correct mistakes made during the masking path and use fixed denoising steps that don't adapt to problem complexity.

Method: Introduces a method based on learning a Markov transition kernel trained on its own outputs, allowing tokens to be remasked for error correction. Uses trained stopping criterion instead of fixed time schedule, enabling adaptive number of function evaluations. Adds two lightweight prediction heads to reuse/fine-tune existing pretrained models.

Result: On Sudoku-Extreme: 95% validity, outperforming other flow-based methods. On Countdown-4: average 10 steps to solve 96% correctly, with many problems solvable in just 2 steps.

Conclusion: The proposed method successfully addresses limitations of standard diffusion models for reasoning tasks by enabling error correction and adaptive computation, achieving strong performance on challenging reasoning benchmarks.

Abstract: Standard masked discrete diffusion models face limitations in reasoning tasks due to their inability to correct their own mistakes on the masking path. Since they rely on a fixed number of denoising steps, they are unable to adjust their computation to the complexity of a given problem. To address these limitations, we introduce a method based on learning a Markov transition kernel that is trained on its own outputs. This design enables tokens to be remasked, allowing the model to correct its previous mistakes. Furthermore, we do not need a fixed time schedule but use a trained stopping criterion. This allows for adaptation of the number of function evaluations to the difficulty of the reasoning problem. Our adaptation adds two lightweight prediction heads, enabling reuse and fine-tuning of existing pretrained models. On the Sudoku-Extreme dataset we clearly outperform other flow based methods with a validity of 95%. For the Countdown-4 we only need in average of 10 steps to solve almost 96% of them correctly, while many problems can be solved already in 2 steps.

[570] Grid-World Representations in Transformers Reflect Predictive Geometry

Sasha Brenner, Thomas R. Knösche, Nico Scherf

Main category: cs.LG

TL;DR: Transformers trained on constrained random walks learn internal representations that align with analytically derived optimal predictive vectors, showing how world-model-like representations emerge from the predictive geometry of data.

DetailsMotivation: To understand the connection between next-token predictors' internal representations of the latent world and the geometry of probability distributions, using a minimal stochastic process as a controlled setting.

Method: Use constrained random walks on a 2D lattice with fixed endpoints as a toy system, train decoder-only transformers on walk prefixes, and compare hidden activations to analytically derived sufficient vectors for optimal prediction.

Result: Learned representations strongly align with ground-truth predictive vectors, are often low-dimensional, and demonstrate that world-model-like representations can be traced back to the predictive geometry of the data.

Conclusion: Geometric representations supporting optimal prediction provide a useful lens for studying how neural networks internalize structural constraints, even in simplified systems.

Abstract: Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. In order to understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process solely depends on a sufficient vector determined by the walker’s position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world’s geometry. We train decoder-only transformers on prefixes sampled from the exact distribution of these walks and compare their hidden activations to the analytically derived sufficient vectors. Across models and layers, the learned representations align strongly with the ground-truth predictive vectors and are often low-dimensional. This provides a concrete example in which world-model-like representations can be directly traced back to the predictive geometry of the data itself. Although demonstrated in a simplified toy system, the analysis suggests that geometric representations supporting optimal prediction may provide a useful lens for studying how neural networks internalize grammatical and other structural constraints.

[571] Cost Trade-offs in Matrix Inversion Updates for Streaming Outlier Detection

Florian Grivet, Louise Travé-Massuyès

Main category: cs.LG

TL;DR: Technical note comparing three matrix inversion update methods for online outlier detection using Christoffel function: Direct Inversion (DI), Iterative Sherman-Morrison (ISM), and Woodbury Matrix Identity (WMI).

DetailsMotivation: Online outlier detection using Christoffel function requires efficient matrix inversion updates, but there's no consensus on optimal method for rank-k updates given initial inverse.

Method: Compare three updating methods: DI (direct inversion), ISM (iterative Sherman-Morrison), and WMI (Woodbury Matrix Identity). Derive theoretical computational costs and validate with Python simulations on CPU.

Result: ISM optimal for rank-1 updates, WMI excels for small updates relative to matrix size, and DI preferable otherwise. Provides quantitative rule for method selection.

Conclusion: General result applicable to any matrix inversion update problem, contributing to efficient online outlier detection techniques.

Abstract: Outlier detection identifies data points that deviate significantly from expected patterns, revealing anomalies that may require special attention. Incorporating online learning further improves accuracy by continuously updating the model to reflect the most recent data. When employing the Christoffel function as an outlier score, online learning requires updating the inverse of a matrix following a rank-k update, given the initial inverse. Surprisingly, there is no consensus on the optimal method for this task. This technical note aims to compare three different updating methods: Direct Inversion (DI), Iterative Sherman-Morrison (ISM), and Woodbury Matrix Identity (WMI), to identify the most suitable approach for different scenarios. We first derive the theoretical computational costs of each method and then validate these findings through comprehensive Python simulations run on a CPU. These results allow us to propose a simple, quantitative, and easy-to-remember rule that can be stated qualitatively as follows: ISM is optimal for rank-1 updates, WMI excels for small updates relative to matrix size, and DI is preferable otherwise. This technical note produces a general result for any problem involving a matrix inversion update. In particular, it contributes to the ongoing development of efficient online outlier detection techniques.

[572] One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Lennon J. Shikhman

Main category: cs.LG

TL;DR: Neural PDE solvers don’t learn true solution operators but rather boundary-condition-dependent mappings, limiting generalization across boundary conditions.

DetailsMotivation: To clarify that neural operator learning for PDEs doesn't actually learn boundary-agnostic solution operators as commonly claimed, but rather learns mappings conditioned on the specific boundary conditions seen during training.

Method: Theoretical analysis framing operator learning as conditional risk minimization over boundary conditions, leading to non-identifiability results, supported by controlled experiments on Poisson equation with various boundary-condition shifts.

Result: Standard neural operators show sharp performance degradation under boundary-condition shifts, fail to generalize between distinct boundary ensembles, and converge to conditional expectations when boundary information is removed.

Conclusion: Current neural PDE solvers have fundamental limitations in learning true solution operators, highlighting the need for explicit boundary-aware modeling for developing foundation models for PDEs.

Abstract: Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.

[573] Learning Lineage-guided Geodesics with Finsler Geometry

Aaron Zweig, Mingxuan Zhang, David A. Knowles, Elham Azizi

Main category: cs.LG

TL;DR: Finsler metric combining geometry with classification for trajectory inference using both continuous geometric and discrete directed priors

DetailsMotivation: Trajectory inference needs to incorporate both continuous geometric priors (spatial features) and discrete directed prior knowledge (like lineage trees in biology) for better interpolation of dynamical systems

Method: Introduces a Finsler metric that combines geometry with classification to incorporate both continuous spatial features and discrete directed transition knowledge in trajectory inference

Result: Improved performance on interpolation tasks in both synthetic and real-world data compared to previous methods

Conclusion: The proposed Finsler metric approach successfully integrates both geometric and discrete directed priors for enhanced trajectory inference in dynamical systems

Abstract: Trajectory inference investigates how to interpolate paths between observed timepoints of dynamical systems, such as temporally resolved population distributions, with the goal of inferring trajectories at unseen times and better understanding system dynamics. Previous work has focused on continuous geometric priors, utilizing data-dependent spatial features to define a Riemannian metric. In many applications, there exists discrete, directed prior knowledge over admissible transitions (e.g. lineage trees in developmental biology). We introduce a Finsler metric that combines geometry with classification and incorporate both types of priors in trajectory inference, yielding improved performance on interpolation tasks in synthetic and real-world data.

[574] Novelty-Driven Target-Space Discovery in Automated Electron and Scanning Probe Microscopy

Utkarsh Pratiush, Kamyar Barakati, Boris N. Slautin, Catherine C. Bodinger, Christopher D. Lowe, Brandi M. Cossairt, Sergei V. Kalinin

Main category: cs.LG

TL;DR: BEACON framework uses deep kernel learning to guide discovery in microscopy by learning structure-property relationships during experiments and seeking diverse response regimes rather than optimizing known objectives.

DetailsMotivation: Modern microscopy faces a discovery challenge where important scientific information resides in sequentially acquired spectra or functional responses rather than immediately visible image features, requiring strategies that actively search for new behaviors rather than optimizing known objectives.

Method: Developed a deep-kernel-learning BEACON framework that learns structure-property relationships during experiments and uses the evolving model to seek diverse response regimes. Established method through demonstration workflows on pre-acquired datasets, benchmarked against classical acquisition strategies, and defined monitoring functions for comparing exploration quality and target-space coverage.

Result: Successfully operationalized and deployed the workflow on STEM (Scanning Transmission Electron Microscopy), transitioning from offline validation to real experimental implementation. Created a benchmarking framework for evaluating discovery-driven algorithms and made associated notebooks available for community adoption.

Conclusion: BEACON provides a practical framework for discovery-driven microscopy that goes beyond optimization, enabling active search for new behaviors in target spaces of spectra or functional responses, with tools available for broader community adoption.

Abstract: Modern automated microscopy faces a fundamental discovery challenge: in many systems, the most important scientific information does not reside in the immediately visible image features, but in the target space of sequentially acquired spectra or functional responses, making it essential to develop strategies that can actively search for new behaviors rather than simply optimize known objectives. Here, we developed a deep-kernel-learning BEACON framework that is explicitly designed to guide discovery in the target space by learning structure-property relationships during the experiment and using that evolving model to seek diverse response regimes. We first established the method through demonstration workflows built on pre-acquired ground-truth datasets, which enabled direct benchmarking against classical acquisition strategies and allowed us to define a set of monitoring functions for comparing exploration quality, target-space coverage, and surrogate-model behavior in a transparent and reproducible manner. This benchmarking framework provides a practical basis for evaluating discovery-driven algorithms, not just optimization performance. We then operationalized and deployed the workflow on STEM, showing that the approach can transition from offline validation to real experimental implementation. To support adoption and extension by the broader community, the associated notebooks are available, allowing users to reproduce the workflows, test the benchmarks, and adapt the method to their own instruments and datasets.

[575] Federated Learning with Multi-Partner OneFlorida+ Consortium Data for Predicting Major Postoperative Complications

Yuanfang Ren, Varun Sai Vemuri, Zhenhong Hu, Benjamin Shickel, Ziyuan Guan, Tyler J. Loftus, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac

Main category: cs.LG

TL;DR: Federated learning models for predicting postoperative complications using multicenter surgical data show comparable or superior performance to local/central models while preserving data privacy.

DetailsMotivation: To develop privacy-preserving predictive models for postoperative complications using multicenter data while addressing data sharing limitations and privacy concerns in healthcare.

Method: Retrospective multicenter cohort study of 358,644 patients across 5 institutions using federated learning to predict ICU admission, mechanical ventilation, acute kidney injury, and mortality, comparing with local and central models.

Result: Federated learning models achieved strong predictive performance (AUROC/AUPRC) comparable or superior to local/central models across all outcomes and demonstrated strong generalizability across sites.

Conclusion: Federated learning enables robust, generalizable, privacy-preserving predictive models for postoperative complications, supporting feasibility for clinical decision support systems.

Abstract: Background: This study aims to develop and validate federated learning models for predicting major postoperative complications and mortality using a large multicenter dataset from the OneFlorida Data Trust. We hypothesize that federated learning models will offer robust generalizability while preserving data privacy and security. Methods: This retrospective, longitudinal, multicenter cohort study included 358,644 adult patients admitted to five healthcare institutions, who underwent 494,163 inpatient major surgical procedures from 2012-2023. We developed and internally and externally validated federated learning models to predict the postoperative risk of intensive care unit (ICU) admission, mechanical ventilation (MV) therapy, acute kidney injury (AKI), and in-hospital mortality. These models were compared with local models trained on data from a single center and central models trained on a pooled dataset from all centers. Performance was primarily evaluated using area under the receiver operating characteristics curve (AUROC) and the area under the precision-recall curve (AUPRC) values. Results: Our federated learning models demonstrated strong predictive performance, with AUROC scores consistently comparable or superior performance in terms of AUROC and AUPRC across all outcomes and sites. Our federated learning models also demonstrated strong generalizability, with comparable or superior performance in terms of both AUROC and AUPRC compared to the best local learning model at each site. Conclusions: By leveraging multicenter data, we developed robust, generalizable, and privacy-preserving predictive models for major postoperative complications and mortality. These findings support the feasibility of federated learning in clinical decision support systems.

[576] The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

Robert Welch, Emir Konuk, Kevin Smith

Main category: cs.LG

TL;DR: Reasoning in vision-language models degrades most uncertainty estimates despite improving accuracy, due to implicit answer conditioning where token probabilities reflect consistency with the model’s own reasoning rather than uncertainty about correctness.

DetailsMotivation: Vision-language models are increasingly used in high-stakes applications where reliable uncertainty quantification is crucial. While reasoning (chain-of-thought prompting) has become common in VLM pipelines, its impact on uncertainty estimation reliability remains poorly understood and potentially problematic.

Method: The paper analyzes how reasoning affects uncertainty estimates in VLMs, identifying implicit answer conditioning as the key mechanism. As reasoning traces converge on conclusions before final answer generation, token probabilities increasingly reflect consistency with the model’s own reasoning trace rather than true uncertainty about correctness.

Result: Reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. The model becomes overconfident in its answers. However, agreement-based consistency measures remain robust and often improve under reasoning.

Conclusion: Agreement-based consistency is recommended as a practical choice for uncertainty estimation in reasoning-enabled VLMs, as it remains robust to the overconfidence effects caused by implicit answer conditioning during reasoning processes.

Abstract: Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model’s own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.

[577] GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems

Jia Ming Li, Anupriya, Daniel J. Graham

Main category: cs.LG

TL;DR: GeMA is a novel frontier analysis method using a productivity-manifold VAE to represent production sets as low-dimensional manifold boundaries, enabling efficiency benchmarking with better handling of heterogeneity, non-convexity, and scale effects.

DetailsMotivation: Classical frontier methods (DEA, SFA) rely on restrictive assumptions about production sets and struggle with heterogeneity and scale effects. There's a need for more flexible methods that can handle complex real-world systems with non-convex frontiers and diverse technologies.

Method: Geometric Manifold Analysis (GeMA) uses a productivity-manifold variational autoencoder (ProMan-VAE) to represent production sets as boundaries of low-dimensional manifolds. A split-head encoder learns latent variables for technological structure and inefficiency, with efficiency evaluated relative to the learned manifold. The approach includes clustering for peer groups, scale-invariant benchmarking, and geometric robustness certification.

Result: GeMA performs comparably to established methods when classical assumptions hold, and provides superior insights in settings with pronounced heterogeneity, non-convexity, or size-related bias across synthetic data and four real-world case studies (rail systems, operators, national economies, wind farms).

Conclusion: GeMA offers a flexible, geometry-based alternative to classical frontier methods that better handles complex real-world production systems with heterogeneous technologies, non-convex frontiers, and scale effects.

Abstract: Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and a local certification radius, derived from the decoder Jacobian and a Lipschitz bound, quantifies the geometric robustness of efficiency scores. We validate GeMA on synthetic data with non-convex frontiers, heterogeneous technologies and scale bias, and on four real-world case studies: global urban rail systems (COMET), British rail operators (ORR), national economies (Penn World Table) and a high-frequency wind-farm dataset. Across these domains GeMA behaves comparably to established methods when classical assumptions hold, and provides additional insight in settings with pronounced heterogeneity, non-convexity or size-related bias.

[578] Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

Kristi Topollai, Anna Choromanska

Main category: cs.LG

TL;DR: Quantizing optimizer states causes staleness due to rounding, slowing adaptation; predictive stalling model explains why resets help; theory-guided reset schedules recover performance while reducing memory.

DetailsMotivation: Quantizing optimizer states is crucial for memory-efficient large-scale pre-training, but the effects on optimizer dynamics are not well understood. The paper aims to understand how low-precision storage affects exponential moving average (EMA) optimizer states and develop solutions to mitigate performance degradation.

Method: The authors study low-precision EMA optimizer states, develop a predictive model of stalling that estimates one-step stalling probabilities, and characterize how stalling builds up over time. They derive a theory-guided method for choosing optimal reset periods for quantized optimizer states.

Result: The analysis shows that quantization causes many nominal updates to round back to the same stored value, making the state effectively stale. The predictive stalling model provides a mechanistic explanation for why optimizer-state resets help. Experiments in controlled simulations and LLM pre-training demonstrate that suitable reset schedules recover performance lost to low-precision state storage while substantially reducing optimizer-state memory.

Conclusion: Quantization of optimizer states introduces staleness that slows adaptation beyond what nominal decay would suggest. The key insight is that in low precision, the important question is not just whether resets help, but when they should be applied. Theory-guided reset schedules can effectively mitigate performance degradation while maintaining memory efficiency.

Abstract: Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after the initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided method for choosing useful reset periods, showing that in low precision the key question is not only whether resets help, but when they should be applied. Experiments in controlled simulations and LLM pre-training show that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.

[579] SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

D. Darankoum, C. Habermacher, J. Volle, S. Grudinin

Main category: cs.LG

TL;DR: A novel EEG foundation model using Gaussian-smoothed masking on STFT maps with SpecHi-Net architecture and SpecMoE mixture of experts achieves SOTA performance across diverse EEG decoding tasks with strong cross-species generalization.

DetailsMotivation: Existing EEG foundation models using separate temporal/spectral masking bias learning toward high-frequency oscillations, as low-frequency patterns can be easily inferred from unmasked signals. Need more challenging reconstruction to force learning of intricate neural patterns across all frequency domains.

Method: Proposes Gaussian-smoothed masking on STFT maps with joint time, frequency, and time-frequency masks. Uses SpecHi-Net (U-shaped hierarchical architecture) for reconstruction. Employs SpecMoE - mixture of experts with spectral gating mechanism trained on partitioned data subsets.

Result: Achieves state-of-the-art performance across diverse EEG tasks: sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Demonstrates strong cross-species (human/murine) and cross-subject generalization.

Conclusion: The proposed Gaussian-smoothed masking strategy with SpecMoE framework effectively learns comprehensive neural representations across frequency domains, enabling robust EEG decoding with excellent generalization capabilities.

Abstract: Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks primarily relying on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies often tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.

[580] Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests

Christian A. Kothe, Sean Mullen, Michael V. Bronstein, Grant Hanada, Marcelo Cicconet, Aaron N. McInnes, Tim Mullen, Marc Aafjes, Scott R. Sponheim, Alik S. Widge

Main category: cs.LG

TL;DR: A sparse hierarchical Bayesian model using multi-modal data improves prediction of mental health symptoms from Implicit Association Test data compared to traditional D-score methods.

DetailsMotivation: To overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method for inferring mental health variables from IAT data, which relies solely on reaction times.

Method: Proposed a sparse hierarchical Bayesian model that leverages multi-modal data as a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in small-cohort IAT studies. Analyzed data from two IAT variants: suicidality-related E-IAT (n=39) and psychosis-related PSY-IAT (n=34).

Result: Achieved AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in best modality configurations, with E-IAT restricted to MDD participants improving to 0.79 AUC. Performance was on par with best reference methods (shrinkage LDA and EEGNet) and substantially above near-chance D-scores (0.50-0.53 AUC).

Conclusion: The framework shows promise for enhancing IAT-based assessment of mental health conditions, though further validation on larger independent cohorts is needed to establish clinical utility.

Abstract: Objective. We establish a principled method for inferring mental health related psychometric variables from neural and behavioral data using the Implicit Association Test (IAT) as the data generation engine, aiming to overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method, which relies solely on reaction times. Approach. We propose a sparse hierarchical Bayesian model that leverages multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. Data from two IAT variants were analyzed: a suicidality-related E-IAT ($n=39$) and a psychosis-related PSY-IAT ($n=34$). Main Results. Our approach overcomes a high inter-individual variability and low within-session effect size in the dataset, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, though corrected 95% confidence intervals are wide ($\pm 0.18$) and results are marginally significant after FDR correction ($q=0.10$). Restricting the E-IAT to MDD participants improves AUC to 0.79 $[0.62, 0.97]$ (significant at $q=0.05$). Performance is on par with the best reference methods (shrinkage LDA and EEGNet) for each task, even when the latter were adapted to the task, while the proposed method was not. Accuracy was substantially above near-chance D-scores (0.50-0.53 AUC) in both tasks, with more consistent cross-task performance than any single reference method. Significance. Our framework shows promise for enhancing IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, though further validation on larger and independent cohorts will be needed to establish clinical utility.

[581] A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

Wei Min Loh, Sajib Kumer Sinha, Ankur Agarwal, Pascal Poupart

Main category: cs.LG

TL;DR: C3 Thompson sampling algorithm for contextual bandits with dense arm features, non-linear rewards, and time-varying correlated rewards, showing improved performance on recommendation tasks.

DetailsMotivation: To address limitations in existing contextual bandit approaches by combining three challenging aspects: dense arm features, non-linear reward functions, and time-varying correlated rewards, which better reflects real-world recommendation system scenarios.

Method: Proposes Conditionally Coupled Contextual (C3) Thompson Sampling that combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling for online learning without retraining.

Result: C3 outperforms next best algorithm by 5.7% lower average cumulative regret on four OpenML tabular datasets and achieves 12.4% click lift on Microsoft News Dataset (MIND) compared to other algorithms.

Conclusion: The C3 algorithm effectively addresses the combined challenges of dense features, non-linear rewards, and time-varying correlations, demonstrating practical value for recommendation systems.

Abstract: Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation maintains. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual C3 Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that allows online learning without retraining. Empirical results show that C3 outperforms the next best algorithm by 5.7% lower average cumulative regret on four OpenML tabular datasets as well as demonstrating a 12.4% click lift on Microsoft News Dataset (MIND) compared to other algorithms.

[582] pADAM: A Plug-and-Play All-in-One Diffusion Architecture for Multi-Physics Learning

Amirhossein Mollaali, Bongseok Kim, Christian Moya, Guang Lin

Main category: cs.LG

TL;DR: pADAM is a unified generative framework that learns shared probabilistic priors across different PDE families, enabling forward prediction, inverse inference, and model selection without retraining.

DetailsMotivation: Existing deep-learning solvers are confined to single-equation settings, limiting transfer across physical regimes and inference tasks. There's a need for AI systems that can generalize across disparate physical laws.

Method: pADAM learns a joint distribution of system states and physical parameters across heterogeneous PDE families, creating a shared probabilistic prior. It supports forward prediction and inverse inference within a single architecture without retraining, and uses conformal prediction for uncertainty quantification.

Result: Achieves accurate inference under sparse observations across benchmarks from scalar diffusion to nonlinear Navier-Stokes equations. Provides reliable uncertainty quantification with coverage guarantees. Performs probabilistic model selection from only two sparse snapshots to identify governing laws.

Conclusion: pADAM demonstrates the potential of generative multi-physics modeling for unified and uncertainty-aware scientific inference, enabling generalization across different physical laws and regimes.

Abstract: Generalizing across disparate physical laws remains a fundamental challenge for artificial intelligence in science. Existing deep-learning solvers are largely confined to single-equation settings, limiting transfer across physical regimes and inference tasks. Here we introduce pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous partial differential equation families. Through a learned joint distribution of system states and, where applicable, physical parameters, pADAM supports forward prediction and inverse inference within a single architecture without retraining. Across benchmarks ranging from scalar diffusion to nonlinear Navier–Stokes equations, pADAM achieves accurate inference even under sparse observations. Combined with conformal prediction, it also provides reliable uncertainty quantification with coverage guarantees. In addition, pADAM performs probabilistic model selection from only two sparse snapshots, identifying governing laws through its learned generative representation. These results highlight the potential of generative multi-physics modeling for unified and uncertainty-aware scientific inference.

[583] Conservative Continuous-Time Treatment Optimization

Nora Schneider, Georg Manten, Niki Kilbertus

Main category: cs.LG

TL;DR: Conservative stochastic control framework for treatment optimization from irregular patient data using SDE modeling with MMD regularization to prevent out-of-support extrapolation

DetailsMotivation: Optimizing treatments from irregularly sampled patient trajectories requires handling model errors and preventing exploitation of inaccurate dynamics that could lead to unsafe out-of-support treatment recommendations

Method: Model patient dynamics as controlled stochastic differential equations, then add signature-based MMD regularization on path space to penalize treatment plans that deviate from observed trajectory distributions, creating a conservative objective with computable upper bound on true cost

Result: Experiments on benchmark datasets demonstrate improved robustness and performance compared to non-conservative baselines

Conclusion: The proposed conservative framework with path-space regularization effectively addresses model errors and prevents unsafe extrapolation in treatment optimization from irregular patient data

Abstract: We develop a conservative continuous-time stochastic control framework for treatment optimization from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation with treatment as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics may not optimize the true dynamics. To limit extrapolation, we add a consistent signature-based MMD regularizer on path space that penalizes treatment plans whose induced trajectory distribution deviates from observed trajectories. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.

[584] Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes

Main category: cs.LG

TL;DR: Proposes using adaptive moment estimation to stabilize noisy likelihood scores in guided diffusion sampling, achieving SOTA results on image restoration and class-conditional generation with improved computational efficiency.

DetailsMotivation: Guided diffusion sampling suffers from noisy likelihood score approximations that introduce instability into sampling dynamics, affecting alignment and performance.

Method: Uses adaptive moment estimation to stabilize noisy likelihood scores during diffusion sampling, providing a simple yet effective approach to mitigate gradient noise.

Result: Achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complex methods while being computationally more efficient.

Conclusion: Mitigating gradient noise through adaptive moments offers an effective way to improve alignment in guided diffusion sampling, providing a simple and efficient solution.

Abstract: Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.

[585] High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

Main category: cs.LG

TL;DR: The paper establishes an information-computation gap for Gaussian mean estimation under realizable ε-contamination missing data, showing that computationally efficient algorithms require substantially more samples than information-theoretically necessary.

DetailsMotivation: To understand the computational complexity of mean estimation under realizable ε-contamination, an intermediate missing data model between MCAR and MNAR, where previous work left open whether efficient algorithms exist in high dimensions.

Method: Establishes lower bounds in the Statistical Query (SQ) model (and corollaries for Low-Degree Polynomials and PTF tests), showing exponential runtime is needed for sample-efficient algorithms, and provides an algorithm whose sample-time tradeoff nearly matches the lower bound.

Result: Demonstrates an information-computation gap: algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime, characterizing the complexity of Gaussian mean estimation under this missing data model.

Conclusion: The realizable ε-contamination model exhibits a fundamental information-computation gap, meaning computationally efficient mean estimation requires significantly more samples than information-theoretic limits allow.

Abstract: We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $ε$-contamination model. In this model an adversary can choose a function $r(x)$ between 0 and $ε$ and each sample $x$ goes missing with probability $r(x)$. Recent work Ma et al., 2024 proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) – where missingness is independent of the data – and Missing Not At Random (MNAR) – where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.

[586] RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation

Yixuan Huang, Jiawei Chen, Shengfan Zhang, Zongsheng Cao

Main category: cs.LG

TL;DR: RaDAR is a novel graph contrastive learning framework for recommendation systems that combines diffusion-based augmentation with relation-aware edge refinement to handle noisy and sparse data.

DetailsMotivation: Existing graph-based collaborative filtering methods suffer from two main issues: (1) random edge perturbations distort critical structural signals and degrade semantic consistency across augmented views, and (2) data sparsity hampers the propagation of collaborative signals, limiting generalization capabilities.

Method: RaDAR combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. It introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling, (2) diffusion-guided augmentation using progressive noise injection and denoising, and (3) relation-aware edge refinement that dynamically adjusts edge weights based on latent node semantics.

Result: Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

Conclusion: RaDAR effectively addresses the challenges of noise and sparsity in graph-based recommendation systems through its novel combination of diffusion-based augmentation and relation-aware refinement, showing superior performance over existing methods.

Abstract: Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization. To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics. Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

[587] Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab

Main category: cs.LG

TL;DR: Stochastic resetting accelerates reinforcement learning by truncating uninformative trajectories and enhancing value propagation, offering a novel optimization principle for learning systems.

DetailsMotivation: Existing stochastic resetting theory focuses on static, non-learning processes, but the authors want to explore how resetting interacts with reinforcement learning where dynamics adapt through experience.

Method: Studied resetting in tabular grid environments and continuous control tasks with neural-network-based value approximation, comparing resetting effects on policy convergence and learning efficiency.

Result: Resetting accelerates policy convergence even when it doesn’t reduce search time for diffusive agents, and improves deep reinforcement learning in sparse reward, difficult exploration scenarios by truncating long uninformative trajectories.

Conclusion: Stochastic resetting serves as a simple, tunable mechanism for accelerating learning, translating statistical mechanics phenomena into an optimization principle for reinforcement learning.

Abstract: Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.

[588] Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning

Reek Das, Biplab Kanti Sen

Main category: cs.LG

TL;DR: FedAOT: A metalearning-inspired adaptive aggregation framework for federated learning that defends against multi-label flipping and untargeted poisoning attacks by dynamically weighting client updates based on reliability.

DetailsMotivation: Federated Learning (FL) is vulnerable to Byzantine adversaries injecting malicious updates, and existing defenses fail against sophisticated untargeted attacks like multi-label flipping or combined noise/backdoor patterns.

Method: FedAOT uses a metalearning-inspired adaptive aggregation framework that dynamically weights client updates based on their reliability, suppressing adversarial influence without predefined thresholds or restrictive attack assumptions.

Result: FedAOT substantially improves model accuracy and resilience across diverse datasets and attack types, maintaining robust performance even in previously unseen scenarios while being computationally efficient.

Conclusion: FedAOT offers a scalable and practical solution for secure federated learning that generalizes effectively against diverse Byzantine attacks without requiring specific attack knowledge.

Abstract: Federated Learning (FL) is increasingly applied in sectors like healthcare, finance, and IoT, enabling collaborative model training while safeguarding user privacy. However, FL systems are susceptible to Byzantine adversaries that inject malicious updates, which can severely compromise global model performance. Existing defenses tend to focus on specific attack types and fail against untargeted strategies, such as multi-label flipping or combinations of noise and backdoor patterns. To overcome these limitations, we propose FedAOT-a novel defense mechanism that counters multi-label flipping and untargeted poisoning attacks using a metalearning-inspired adaptive aggregation framework. FedAOT dynamically weights client updates based on their reliability, suppressing adversarial influence without relying on predefined thresholds or restrictive attack assumptions. Notably, FedAOT generalizes effectively across diverse datasets and a wide range of attack types, maintaining robust performance even in previously unseen scenarios. Experimental results demonstrate that FedAOT substantially improves model accuracy and resilience while maintaining computational efficiency, offering a scalable and practical solution for secure federated learning.

[589] GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Mattia Rigotti, Nicholas Thumiger, Thomas Frick

Main category: cs.LG

TL;DR: GIST is a gauge-invariant graph transformer that achieves linear complexity via random projections while preserving gauge invariance, enabling discretization-invariant learning across mesh resolutions.

DetailsMotivation: Existing transformer positional encoding methods for meshes and graphs face computational challenges: exact spectral methods require cubic complexity and can break gauge invariance, while approximate methods sacrifice gauge symmetry, both causing catastrophic generalization failures in inductive learning.

Method: GIST uses random projections to achieve O(N) complexity while algorithmically preserving gauge invariance through inner-product-based attention on projected embeddings, enabling discretization-invariant learning with bounded mismatch error.

Result: GIST matches state-of-the-art on standard graph benchmarks (99.50% micro-F1 on PPI) and scales to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving SOTA aerodynamic prediction on DrivAerNet and DrivAerNet++ datasets.

Conclusion: GIST resolves fundamental challenges in graph/mesh transformers by combining computational efficiency with gauge invariance, enabling robust parameter transfer across arbitrary mesh resolutions for neural operator applications.

Abstract: Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end $\mathcal{O}(N)$ complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.

[590] Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler, Stephanie Marik, Allen Sheldon, Rajeev Chhajer, Nithin Santhanam

Main category: cs.LG

TL;DR: STT model with ACP for multi-horizon traffic forecasting using dynamic graph construction based on hour-to-hour variability and incident data, validated via SUMO simulations.

DetailsMotivation: Traffic forecasting is challenging due to stochastic network conditions, intermittent incident disruptions, and varying spatial dependencies across time-of-day patterns. Existing methods often use fixed assumptions that don't capture dynamic changes in traffic conditions.

Method: Uses Spatio-Temporal Transformer (STT) with Adaptive Conformal Prediction (ACP) for uncertainty calibration. Proposes piecewise Coefficient of Variation (CV) strategy modeling hour-to-hour traveltime variability with log-normal distribution to create per-hour dynamic adjacency matrix. Perturbs edge weights using incident severity signals from crash data (clearance time, weather, speed violations, work zones, roadway class). Validated via multi-hour SUMO simulations and Monte Carlo simulation for travel-time distributions.

Result: Experiments show improved long-horizon accuracy and well-calibrated prediction intervals compared to baseline methods, demonstrating effectiveness of dynamic graph construction approach.

Conclusion: The proposed dynamic graph construction method using piecewise CV and incident data better represents changing traffic conditions, leading to more reliable multi-horizon forecasts with calibrated uncertainty.

Abstract: Reliable multi-horizon traffic forecasting is challenging because network conditions are stochastic, incident disruptions are intermittent, and effective spatial dependencies vary across time-of-day patterns. This study is conducted on the Ohio Department of Transportation (ODOT) traffic count data and corresponding ODOT crash records. This work utilizes a Spatio-Temporal Transformer (STT) model with Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. We propose a piecewise Coefficient of Variation (CV) strategy that models hour-to-hour traveltime variability using a log-normal distribution, enabling the construction of a per-hour dynamic adjacency matrix. We further perturb edge weights using incident-related severity signals derived from the ODOT crash dataset that comprises incident clearance time, weather conditions, speed violations, work zones, and roadway functional class, to capture localized disruptions and peak/off-peak transitions. This dynamic graph construction replaces a fixed-CV assumption and better represents changing traffic conditions within the forecast window. For validation, we generate extended trips via multi-hour loop runs on the Columbus, Ohio, network in SUMO simulations and apply a Monte Carlo simulation to obtain travel-time distributions for a Vehicle Under Test (VUT). Experiments demonstrate improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.

[591] Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen

Main category: cs.LG

TL;DR: CodeGym: A framework that converts coding problems into interactive RL environments for training LLM agents in tool-use behaviors, enabling better generalization to unseen workflows.

DetailsMotivation: Current LLM agent training methods (SFT or RL on narrow tasks) generalize poorly to new tools and workflows. Need scalable environments that reflect real-world tool-use patterns.

Method: CodeGym synthesizes diverse, verifiable tool-use environments from coding problems by extracting atomic functions into callable tools, creating interactive multi-turn RL training environments.

Result: Models trained with CodeGym show consistent OOD generalization - Qwen2.5-32B-Instruct achieves 8.7 point accuracy gain on τ-Bench benchmark.

Conclusion: CodeGym represents progress toward scalable RL environments for training general-purpose tool-use behaviors that align with real-world agent workflows.

Abstract: Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, which generalize poorly beyond development settings and lead to brittleness with new tools and unseen workflows. Because code execution reflects many structural patterns of real-world workflows, we use coding problems as a structured substrate to build tool-use agent training environments with diverse task configurations. To this end, we introduce CodeGym, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym converts static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations trained in CodeGym exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark $τ$-Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments for training tool-use behaviors that align with real-world agent workflows.

[592] Language as a Wave Phenomenon: Semantic Phase Locking and Interference in Neural Networks

Alper Yıldırım, İbrahim Yücedağ

Main category: cs.LG

TL;DR: PRISM introduces a complex-valued encoder with unit-norm constraint to study phase’s role in neural sequence models, finding semantic relationships correlate with phase structure and phase representations are robust to attenuation.

DetailsMotivation: To isolate and understand the role of phase in neural sequence models, since phase's function remains poorly understood despite its potential for encoding semantic information.

Method: Introduces PRISM, a complex-valued encoder with unit-norm constraint (|z|=1) that replaces attention with gated spectral filtering, forcing reliance on phase angles rather than magnitude.

Result: Semantic relationships correlate with measurable phase structure (synonym pairs show higher phase coherence), lexical ambiguity resolved via layer-specific phase rotations, phase representations robust to scalar attenuation (97% translation quality retained), and identification of spectral density threshold for coherent generation.

Conclusion: Phase angles can encode semantic information in complex-valued networks under specific conditions, providing controlled evidence for phase’s role in neural sequence modeling.

Abstract: The role of phase in neural sequence models remains poorly understood. To isolate this question, we introduce PRISM, a complex-valued encoder that enforces a unit-norm constraint ($|z| = 1$) and replaces attention with gated spectral filtering. Under this constraint, the model cannot use activation magnitude to distinguish signal from noise, and must instead rely on phase angles. We find that semantic relationships correlate with measurable phase structure: synonym pairs exhibit significantly higher phase coherence than random pairs ($R = 0.198$ vs.\ $0.072$, $p < 0.001$), and the model resolves lexical ambiguity via layer-specific phase rotations while maintaining near-unit gain. These phase representations are robust to scalar attenuation, retaining $97%$ of translation quality when signal magnitude is uniformly reduced. We also identify a spectral density threshold: the model fails to generate coherent output from isolated tokens, requiring minimum sequence length to produce the interference patterns that support its computation. Finally, we show that a hybrid architecture (Wave-Particle Transformer) combining a phase-based encoder with standard attention matches Transformer baselines at $33$M parameters with fewer non-embedding parameters, though we do not claim this generalizes to larger scales. Our findings provide controlled evidence that phase angles can encode semantic information in complex-valued networks, and characterize the conditions under which this encoding succeeds and fails.

[593] Geometric Imbalance in Semi-Supervised Node Classification

Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Yutong Xie, Zengfeng Huang

Main category: cs.LG

TL;DR: A framework addressing geometric imbalance in graph neural networks for semi-supervised node classification, particularly under class imbalance conditions.

DetailsMotivation: Class imbalance in graph data poses significant challenges for node classification, especially in semi-supervised settings. The paper introduces the concept of "geometric imbalance" - how message passing on class-imbalanced graphs creates geometric ambiguity for minority-class nodes in Riemannian manifold embedding spaces.

Method: Proposes a unified framework that explicitly mitigates geometric imbalance through three components: 1) pseudo-label alignment, 2) node reordering, and 3) ambiguity filtering. The approach is theoretically grounded in Riemannian manifold analysis.

Result: Extensive experiments on diverse benchmarks show the approach consistently outperforms existing methods, especially under severe class imbalance conditions.

Conclusion: The work provides new theoretical insights into geometric imbalance and offers practical tools for robust semi-supervised node classification on imbalanced graph data.

Abstract: Class imbalance in graph data presents a significant challenge for effective node classification, particularly in semi-supervised scenarios. In this work, we formally introduce the concept of geometric imbalance, which captures how message passing on class-imbalanced graphs leads to geometric ambiguity among minority-class nodes in the riemannian manifold embedding space. We provide a rigorous theoretical analysis of geometric imbalance on the riemannian manifold and propose a unified framework that explicitly mitigates it through pseudo-label alignment, node reordering, and ambiguity filtering. Extensive experiments on diverse benchmarks show that our approach consistently outperforms existing methods, especially under severe class imbalance. Our findings offer new theoretical insights and practical tools for robust semi-supervised node classification.

[594] Leveraging Imperfect Sources to Detect Fairwashing in Black-Box Auditing

Jade Garcia Bourrée, Erwan Le Merrer, Gilles Tredan, Benoît Rottembourg

Main category: cs.LG

TL;DR: A framework called Two-Source Audit Model (2SAM) detects platform manipulation in algorithmic audits by cross-referencing Audit API outputs with an independent trusted stream using probabilistic consistency proxies.

DetailsMotivation: Current algorithmic auditing relies on Audit APIs controlled by the platforms being audited, creating a vulnerability where platforms can serve compliant surrogate models while running discriminatory production systems (fairwashing), which is undetectable with single-source audits.

Method: Introduces 2SAM that cross-references Audit API outputs with an independent trusted stream using a consistency proxy - a probabilistic mapping that reconciles discrepancies between sources, enabling detection of manipulation.

Result: Quantifies manipulation detection thresholds, shows how proxy quality governs detection power, provides closed-form budget conditions for guaranteed detection, and validates on UCI Adult dataset achieving 70% detection power with only 127 cross-verification queries out of 750 total budget.

Conclusion: 2SAM addresses the fundamental vulnerability in current algorithmic auditing by enabling detection of platform manipulation through cross-referencing with independent sources, providing a practical solution to the fairwashing problem in platform accountability frameworks.

Abstract: Algorithmic auditing has become central to platform accountability under frameworks such as the AI Act and the Digital Services Act. In practice, this obligation is discharged through dedicated Audit APIs. This architecture creates a paradox: the entity under scrutiny controls the evaluation interface. A platform facing legal sanctions can serve a compliant surrogate model on its Audit API, while running a discriminatory production system. This deceptive practice is known as fairwashing. Manipulation is undetectable if the auditor relies on only one source. To address this limitation, we introduce the Two-Source Audit Model (2SAM). This model cross-references the Audit API with an independent trusted stream. The key insight is that the trusted stream does not need to be perfectly aligned with the Audit API. We introduce a consistency proxy, a probabilistic mapping that can reconcile discrepancies between sources. This approach yields three results. First, we quantify the rate of manipulation above which a single-source auditor is blind. Second, we show how proxy quality governs detection power. Third, we provide a closed-form budget condition guaranteeing detection at any target confidence level, closing the blind spot mentioned above. We validate 2SAM on the UCI Adult dataset, achieving $70%$ detection power with as few as $127$ cross-verification queries out of a total budget of $750$, using a name-based gender proxy with $94.2%$ accuracy.

[595] Exact and general decoupled solutions of the LMC Multitask Gaussian Process model

Olivier Truffinet, Karim Ammar, Jean-Philippe Argaud, Bertrand Bouriquet

Main category: cs.LG

TL;DR: The paper presents a projected Linear Model of Co-regionalization (LMC) for multitask Gaussian processes that achieves efficient exact computation through noise model restrictions, reducing complexity from cubic to linear in latent processes.

DetailsMotivation: The LMC is a powerful multitask Gaussian process model, but naive implementations have cubic complexity (number of datapoints × number of tasks), making approximations necessary for practical applications. The authors aim to develop an efficient exact computation method.

Method: The authors extend previous work showing that latent processes can be decoupled under certain conditions. They prove that with a mild noise model hypothesis, efficient exact computation of LMC is possible. They introduce a full parametrization of the projected LMC model, enabling efficient optimization through noise model restrictions.

Result: The projected LMC achieves competitive performance as a simpler alternative to state-of-the-art multitask Gaussian process models. It enables efficient training data updates, leave-one-out cross-validation, and provides better interpretability through access to low-dimensional quantities and their explicit relation to full-dimensional data.

Conclusion: The projected LMC is a competitive, simpler alternative to existing multitask Gaussian process models that facilitates adoption of methodologies like multitask Bayesian optimization in various industries through improved computational efficiency and interpretability.

Abstract: The Linear Model of Co-regionalization (LMC) is a very general multitask gaussian process model for regression or classification. While its expressiveness and conceptual simplicity are appealing, naive implementations have cubic complexity in the product (number of datapoints $\times$ number of tasks), making approximations mandatory for most applications. However, recent work has shown that in some settings the latent processes of the model can be decoupled, leading to a complexity that is only linear in the number of said processes. We here extend these results, showing from the most general assumptions that the only condition necessary to an efficient exact computation of the LMC is a mild hypothesis on the noise model. We introduce a full parametrization of the resulting \emph{projected LMC} model, enabling its efficient optimization. The effectiveness of this approach is assessed through synthetic and real-data experiments, testing in particular the behavior of its underlying noise model restriction.\ Overall, the projected LMC appears as a competitive and simpler alternative to state-of-the art multitask gaussian process models. It greatly facilitates some computations such as training data updates or leave-one-out cross-validation, and is more interpretable, for it gives access to its low-dimensional quantities and to their explicit relation with the full-dimensional data. These qualities could facilitate the adoption by various industries of entire classes of methodologies, notably multitask bayesian optimization.

[596] Interpretable factorization of clinical questionnaires to identify latent factors of psychopathology

Ka Chun Lam, Bridget W Mahony, Armin Raznahan, Francisco Pereira

Main category: cs.LG

TL;DR: ICQF is an interpretability-constrained non-negative matrix factorization method for psychiatric questionnaire data that improves factor interpretability while preserving diagnostic information.

DetailsMotivation: Traditional factor analysis in psychiatry research often produces factors that are not interpretable, may be confounded, and requires explicit imputation for missing data. There's a need for methods that promote factor interpretability and solution stability in questionnaire data analysis.

Method: ICQF (Interpretability Constrained Questionnaire Factorization) is a non-negative matrix factorization method with tailored regularization for questionnaire data. It includes an optimization procedure with theoretical convergence guarantees and an automated procedure to detect latent dimensionality accurately.

Result: ICQF improves interpretability as defined by domain experts while preserving diagnostic information across a range of disorders. It outperforms competing methods for smaller dataset sizes and shows that its regularization matches domain characteristics.

Conclusion: ICQF provides an effective alternative to traditional factor analysis for psychiatric questionnaire data, offering improved interpretability and stability while handling missing data and confounding variables better.

Abstract: Psychiatry research seeks to understand the manifestations of psychopathology in behavior, as measured in questionnaire data, by identifying a small number of latent factors that explain them. While factor analysis is the traditional tool for this purpose, the resulting factors may not be interpretable, and may also be subject to confounding variables. Moreover, missing data are common, and explicit imputation is often required. To overcome these limitations, we introduce interpretability constrained questionnaire factorization (ICQF), a non-negative matrix factorization method with regularization tailored for questionnaire data. Our method aims to promote factor interpretability and solution stability. We provide an optimization procedure with theoretical convergence guarantees, and an automated procedure to detect latent dimensionality accurately. We validate these procedures using realistic synthetic data. We demonstrate the effectiveness of our method in a widely used general-purpose questionnaire, in two independent datasets (the Healthy Brain Network and Adolescent Brain Cognitive Development studies). Specifically, we show that ICQF improves interpretability, as defined by domain experts, while preserving diagnostic information across a range of disorders, and outperforms competing methods for smaller dataset sizes. This suggests that the regularization in our method matches domain characteristics. The python implementation for ICQF is available at https://github.com/jefferykclam/ICQF.

[597] drGT: Attention-Guided Gene Assessment of Drug Response Utilizing a Drug-Cell-Gene Heterogeneous Network

Yoshitaka Inoue, Hunmin Lee, Tianfan Fu, Rui Kuang, Augustin Luna

Main category: cs.LG

TL;DR: drGT is a heterogeneous graph deep learning model for drug response prediction that combines accurate prediction with biological interpretability using attention mechanisms.

DetailsMotivation: The paper addresses the need for both accurate drug response prediction and biological plausibility of predictive features for translational impact. Current models often lack interpretability or fail to generalize well to unseen drugs and cell lines.

Method: Developed drGT, a heterogeneous graph deep learning model over drugs, genes, and cell lines that uses attention coefficients for interpretability. The model was evaluated on GDSC, NCI60, and CTRP datasets with various splits (random, unseen-drug, unseen-cell, zero-shot).

Result: drGT achieved top regression performance (R² up to 0.690) and competitive classification accuracy (AUROC up to 0.945). It performed well on unseen data and zero-shot prediction. Attention coefficients recovered known drug-target interactions (36.9% matched established DTIs, 63.7% supported by literature or structure-based models).

Conclusion: drGT advances predictive generalization and mechanism-centered interpretability, offering state-of-the-art regression accuracy and literature-supported biological hypotheses from heterogeneous graph data.

Abstract: For translational impact, both accurate drug response prediction and biological plausibility of predictive features are needed. We present drGT, a heterogeneous graph deep learning model over drugs, genes, and cell lines that couples prediction with mechanism-oriented interpretability via attention coefficients (ACs). We assess both predictive generalization (random, unseen-drug, unseen-cell, and zero-shot splits) and biological plausibility (use of text-mined PubMed gene-drug co-mentions and comparison to a structure-based DTI predictor) on GDSC, NCI60, and CTRP datasets. Across benchmarks, drGT consistently delivers top regression performance while maintaining competitive classification accuracy for drug sensitivity. Under random 5-fold cross-validation, drGT attains an AUROC of up to 0.945 (3rd overall) and an $R^2$ up to 0.690, outperforming all baselines on regression. In leave-one-out tests for unseen cell lines and drugs, drGT achieves AUROCs of 0.706 and 0.844, and $R^2$ values of 0.692 and 0.022, the only model yielding positive $R^2$ for unseen drugs. In zero-shot prediction, drGT achieves an AUROC of 0.786 and a regression $R^2$ of 0.334, both representing the highest scores among all models. For interpretability, AC-derived drug-gene links recover known biology: among 976 drugs with known DTIs, 36.9% of predicted links match established DTIs, and 63.7% are supported by either PubMed abstracts or a structure-based predictive model. Enrichment analyses of AC-prioritized genes reveal drug-perturbed biological processes, providing pathway-level explanations. drGT advances predictive generalization and mechanism-centered interpretability, offering state-of-the-art regression accuracy and literature-supported biological hypotheses that demonstrate the use of graph learning from heterogeneous input data for biological discovery. Code: https://github.com/sciluna/drGT

[598] SG-DeepONet: Source-generalized deep operator learning for full waveform inversion

Zekai Guo, Lihui Chai, Ye Li

Main category: cs.LG

TL;DR: A new source-variable seismic dataset (SVFWI) and SG-DeepONet framework for improved full waveform inversion under varying source conditions

DetailsMotivation: Existing seismic datasets for deep learning-based full waveform inversion (FWI) have fixed or weakly varying source conditions, limiting their ability to represent realistic seismic scenarios and hindering source generalization. There's a need for more diverse training data and better models that can handle varying source parameters.

Method: 1) Constructed SVFWI dataset with systematic variations in source frequencies and horizontal locations, divided into three subsets for frequency variations, location variations, and combined effects. 2) Proposed SG-DeepONet, a DeepONet-based encoder-decoder framework with branch network for multi-scale time-frequency feature extraction from seismic observations, trunk network for explicit source parameter embedding, and interactive decoding network for nonlinear fusion and velocity reconstruction.

Result: Extensive experiments on SVFWI demonstrate that SG-DeepONet achieves superior inversion accuracy and robustness under varying source conditions compared with existing DL-based FWI methods.

Conclusion: The SVFWI dataset provides a challenging benchmark for data-driven FWI, and SG-DeepONet effectively handles varying source conditions through its specialized architecture, advancing the field of deep learning-based seismic inversion.

Abstract: Full waveform inversion (FWI) aims to reconstruct subsurface velocity models from observed seismic wavefields and has recently benefited from advances in deep learning (DL). The performance of DL-based FWI critically depends on the diversity of training data, yet existing datasets such as OpenFWI rely on fixed or weakly varying source conditions, limiting their ability to represent realistic seismic scenarios and hindering source generalization. To address this issue, we construct a new source-variable seismic dataset, termed SVFWI, by systematically varying the frequencies and horizontal locations of multiple surface sources. SVFWI is further divided into three subsets that respectively model frequency variations, location variations, and their combined effects, providing a challenging benchmark in data-driven FWI. We further propose SG-DeepONet, a novel DeepONet-based encoder-decoder framework tailored for FWI. The branch network extracts multi-scale time-frequency features from seismic observations, the trunk network explicitly embeds source physical parameters, and an interactive decoding network enables effective nonlinear fusion and high-fidelity velocity reconstruction. Extensive experiments on SVFWI demonstrate that SG-DeepONet achieves superior inversion accuracy and robustness under varying source conditions compared with existing DL-based FWI methods.

[599] Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, Yan Liu

Main category: cs.LG

TL;DR: TS-Reasoner is a domain-specialized agent for multi-step time series inference that combines LLM reasoning with domain-specific computational tools and error feedback loops.

DetailsMotivation: Traditional time series methods focus on isolated tasks, and recent studies are limited to single-step inference or natural language answers, lacking multi-step reasoning capabilities for complex time series analysis.

Method: Integrates large language model reasoning with domain-specific computational tools and an error feedback loop to enable domain-informed, constraint-aware analytical workflows combining symbolic reasoning with precise numerical analysis.

Result: Outperforms standalone general-purpose LLMs in both basic time series concept understanding (TimeSeriesExam) and complex multi-step inference tasks on a newly proposed dataset.

Conclusion: Demonstrates the promise of domain-specialized agents for automating real-world time series reasoning and analysis through integrated symbolic and numerical approaches.

Abstract: Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain limited to either single-step inference or are constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system’s capabilities along two axes: (1) fundamental time series understanding assessed by TimeSeriesExam and (2) complex, multi-step inference evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding as well as the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.

[600] Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls

Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Urmish Thakker, Changran Hu, Qizheng Zhang

Main category: cs.LG

TL;DR: Empirical study of many-shot prompting for test-time adaptation in LLMs, analyzing performance across tasks, model backbones, and alternative strategies like Dynamic and Reinforced ICL.

DetailsMotivation: To understand the reliability and limits of many-shot prompting as a test-time adaptation mechanism for LLMs, particularly for open-source models, and to characterize when input-space updates are beneficial versus harmful.

Method: Conducted empirical study across tasks and model backbones, analyzing performance variation with update magnitude, example ordering, and selection policy. Studied Dynamic and Reinforced ICL as alternative test-time update strategies that control information injection and constrain model behavior.

Result: Many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and shows limited benefits for open-ended generation tasks.

Conclusion: Characterized practical limits of prompt-based test-time adaptation and outlined when input-space updates are beneficial versus harmful, providing guidance for effective use of many-shot prompting in LLMs.

Abstract: Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.

[601] Anomaly Resilient Temporal QoS Prediction using Hypergraph Convoluted Transformer Network

Suraj Kumar, Soumi Chattopadhyay, Chandranath Adak

Main category: cs.LG

TL;DR: HCTN: Hypergraph Convoluted Transformer Network for trust-aware temporal QoS prediction addressing data sparsity, cold-start, and reliability issues

DetailsMotivation: Current QoS prediction methods struggle with data sparsity, cold-start problems, and reliability issues (outliers, greysheep users/services). They also fail to leverage diverse features and complex higher-order patterns needed for accurate predictions.

Method: Proposes HCTN framework combining hypergraph structure with graph convolution over hyper-edges to capture complex high-order correlations, plus transformer network with multi-head attention and parallel 1D convolutional layers to capture dynamic patterns. Includes sparsity-resilient greysheep detection and robust loss function.

Result: Demonstrated state-of-the-art performance on large-scale WSDREAM-2 datasets for response time and throughput predictions.

Conclusion: HCTN effectively addresses QoS prediction challenges including sparsity, cold-start, and reliability issues through its hypergraph-transformer architecture and trust-aware mechanisms.

Abstract: Quality-of-Service (QoS) prediction is a critical task in the service lifecycle, enabling precise and adaptive service recommendations by anticipating performance variations over time in response to evolving network uncertainties and user preferences. However, contemporary QoS prediction methods frequently encounter data sparsity and cold-start issues, which hinder accurate QoS predictions and limit the ability to capture diverse user preferences. Additionally, these methods often assume QoS data reliability, neglecting potential credibility issues such as outliers and the presence of greysheep users and services with atypical invocation patterns. Furthermore, traditional approaches fail to leverage diverse features, including domain-specific knowledge and complex higher-order patterns, essential for accurate QoS predictions. In this paper, we introduce a real-time, trust-aware framework for temporal QoS prediction to address the aforementioned challenges, featuring an end-to-end deep architecture called the Hypergraph Convoluted Transformer Network (HCTN). HCTN combines a hypergraph structure with graph convolution over hyper-edges to effectively address high-sparsity issues by capturing complex, high-order correlations. Complementing this, the transformer network utilizes multi-head attention along with parallel 1D convolutional layers and fully connected dense blocks to capture both fine-grained and coarse-grained dynamic patterns. Additionally, our approach includes a sparsity-resilient solution for detecting greysheep users and services, incorporating their unique characteristics to improve prediction accuracy. Trained with a robust loss function resistant to outliers, HCTN demonstrated state-of-the-art performance on the large-scale WSDREAM-2 datasets for response time and throughput.

[602] Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng, Zonghui Wang, Wenzhi Chen

Main category: cs.LG

TL;DR: UCL-sketch: A learning-based frequency estimation method for data streams using online training without ground truth and compressive sensing for improved accuracy and speed.

DetailsMotivation: Traditional sketches provide coarse frequency estimates under memory constraints, while existing learning-augmented methods require offline training with ground truth data (often unavailable) and suffer from slow update speeds, limiting real-time processing capabilities.

Method: Proposes UCL-sketch with two innovations: (1) online training mechanism based on equivalent learning requiring no ground truth, and (2) scalable architecture using logically structured estimation buckets. Utilizes compressive sensing (CS) for frequency estimation.

Result: Outperforms previous approaches in per-key accuracy and distribution; achieves near-oracle quality under tight memory budgets; provides 500x average decoding speedup compared to existing equation-based sketches.

Conclusion: UCL-sketch offers a practical learning-based solution for frequency estimation in data streams with provable error bounds, high accuracy, and fast processing without requiring ground truth data.

Abstract: Estimating the frequency of items on the high-volume, fast data stream has been extensively studied in many areas, such as database and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies or/and labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, limiting their suitability for real-time processing despite offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture leveraging logically structured estimation buckets to scale to real-world data stream. The UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields a error bound far lower than that of prior works, without sacrificing the speed of processing. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches regarding per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To help further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.

[603] When Machine Learning Gets Personal: Evaluating Prediction and Explanation

Louisa Cornelis, Guillermo Bernárdez, Haewon Jeong, Nina Miolane

Main category: cs.LG

TL;DR: Personalized ML models can have divergent effects on prediction accuracy and explainability, requiring joint evaluation and careful dataset design for proper assessment.

DetailsMotivation: In high-stakes domains like healthcare, users expect personal information sharing to yield benefits like better predictions and explanations, but this assumption is largely untested. The paper aims to quantify how personalization affects both prediction and explanation in ML models.

Method: Proposes a unified framework to quantify personalization effects on prediction and explanation. Studies a standard hypothesis test for detecting personalization effects on demographic groups, deriving finite-sample lower bounds on error probability based on group sizes, personal attributes, and desired benefits.

Result: Shows that personalization impacts on prediction and explanation can diverge - models may become more or less explainable even when prediction accuracy is unchanged. Applied framework to real-world tabular datasets using feature-attribution methods, uncovering scenarios where effects are fundamentally untestable due to dataset statistics.

Conclusion: Highlights the need for joint evaluation of prediction and explanation in personalized models and emphasizes the importance of designing models and datasets with sufficient information for such evaluation. Provides actionable insights for dataset characteristics needed to test personalization effects.

Abstract: In high-stakes domains like healthcare, users often expect that sharing personal information with machine learning systems will yield tangible benefits, such as more accurate diagnoses and clearer explanations of contributing factors. However, the validity of this assumption remains largely unexplored. We propose a unified framework to quantify how personalizing a model influences both prediction and explanation. We show that its impacts on prediction and explanation can diverge: a model may become more or less explainable even when prediction is unchanged. For practical settings, we study a standard hypothesis test for detecting personalization effects on demographic groups. We derive a finite-sample lower bound on its probability of error as a function of group sizes, number of personal attributes, and desired benefit from personalization. This provides actionable insights, such as which dataset characteristics are necessary to test an effect, or the maximum effect that can be tested given a dataset. We apply our framework to real-world tabular datasets using feature-attribution methods, uncovering scenarios where effects are fundamentally untestable due to the dataset statistics. Our results highlight the need for joint evaluation of prediction and explanation in personalized models and the importance of designing models and datasets with sufficient information for such evaluation.

[604] Reevaluating Policy Gradient Methods for Imperfect-Information Games

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota

Main category: cs.LG

TL;DR: PPO and generic policy gradient methods outperform specialized FP-, DO-, and CFR-based DRL approaches in imperfect-information games, as shown by large-scale exploitability comparisons across 7000+ training runs.

DetailsMotivation: To test the hypothesis that simpler generic policy gradient methods (like PPO) are competitive with or superior to specialized DRL approaches (FP-, DO-, and CFR-based) for adversarial imperfect-information games, given recent results from magnetic mirror descent algorithms.

Method: Implemented first broadly accessible exact exploitability computations for five large games, conducted largest-ever exploitability comparison of DRL algorithms for imperfect-information games with over 7000 training runs, comparing FP-, DO-, and CFR-based approaches against generic policy gradient methods.

Result: FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods in imperfect-information games across extensive experimental evaluation.

Conclusion: Simpler generic policy gradient methods like PPO are competitive with or superior to specialized DRL approaches for imperfect-information games, challenging previous assumptions about the necessity of FP-, DO-, and CFR-based methods.

Abstract: In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .

[605] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Eliot Beyler, Francis Bach

Main category: cs.LG

TL;DR: Half-denoising vs full-denoising: Half-denoising performs better for regular densities, while full-denoising is superior for singular densities like mixtures of Dirac measures or low-dimensional manifold data, where it can mitigate the curse of dimensionality.

DetailsMotivation: Score-based generative models use Gaussian noise perturbation and denoising, but there are two approaches: full-denoising (optimal for quadratic loss) and half-denoising. The paper aims to compare these methods under different data distribution assumptions to understand which performs better in various scenarios.

Method: Theoretical analysis comparing full-denoising and half-denoising approaches. The study examines performance in terms of distribution distances under different data assumptions: regular densities vs singular densities (mixtures of Dirac measures, low-dimensional subspace data). Mathematical proofs establish conditions where each method excels.

Result: Half-denoising outperforms full-denoising for regular, well-behaved densities. Full-denoising is superior for singular densities like mixtures of Dirac measures or data on low-dimensional manifolds. For the latter case, full-denoising can alleviate the curse of dimensionality under linear manifold assumptions.

Conclusion: The choice between half-denoising and full-denoising depends on data characteristics: half-denoising is better for regular densities, while full-denoising excels for singular densities and can mitigate dimensionality issues for manifold data. This provides guidance for selecting denoising approaches in score-based generative models.

Abstract: Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, we name ‘‘full-denoising’’, to the alternative ‘‘half-denoising’’ introduced by Hyv{ä}rinen (2025). We show that looking at the performance in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.

[606] LogicXGNN: Grounded Logical Rules for Explaining Graph Neural Networks

Chuqin Geng, Ziyu Zhao, Zhaoyue Wang, Haolin Ye, Yuhe Jiang, Xujie Si

Main category: cs.LG

TL;DR: LogicXGNN is a post-hoc framework for Graph Neural Networks that constructs logical rules over reliable predicates capturing GNN message-passing structure, ensuring effective grounding and introducing data-grounded fidelity metrics.

DetailsMotivation: Existing rule-based explanations for GNNs optimize fidelity in intermediate concept spaces, overlooking grounding quality for end users in final subgraph explanations, leading to explanations that appear faithful but may be unreliable in practice.

Method: Proposes LogicXGNN, a post-hoc framework that constructs logical rules over reliable predicates explicitly designed to capture GNN’s message-passing structure. Introduces data-grounded fidelity (Fid_D) metric that evaluates explanations in their final-graph form, along with complementary utility metrics like coverage and validity.

Result: LogicXGNN improves data-grounded fidelity by over 20% on average relative to state-of-the-art methods while being 10-100× faster. Demonstrates strong scalability and utility performance, producing explanations faithful to model’s logic and reliably grounded in observable data.

Conclusion: LogicXGNN addresses the grounding gap in GNN explanations by constructing logical rules over reliable predicates that capture message-passing structure, ensuring explanations are both faithful to the model and reliably grounded in observable data.

Abstract: Existing rule-based explanations for Graph Neural Networks (GNNs) provide global interpretability but often optimize and assess fidelity in an intermediate, uninterpretable concept space, overlooking grounding quality for end users in the final subgraph explanations. This gap yields explanations that may appear faithful yet be unreliable in practice. To this end, we propose LogicXGNN, a post-hoc framework that constructs logical rules over reliable predicates explicitly designed to capture the GNN’s message-passing structure, thereby ensuring effective grounding. We further introduce data-grounded fidelity ($\textit{Fid}{\mathcal{D}}$), a realistic metric that evaluates explanations in their final-graph form, along with complementary utility metrics such as coverage and validity. Across extensive experiments, LogicXGNN improves $\textit{Fid}{\mathcal{D}}$ by over 20% on average relative to state-of-the-art methods while being 10-100 $\times$ faster. With strong scalability and utility performance, LogicXGNN produces explanations that are faithful to the model’s logic and reliably grounded in observable data. Our code is available at https://github.com/allengeng123/LogicXGNN/.

[607] MASS: MoErging through Adaptive Subspace Selection

Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodolà

Main category: cs.LG

TL;DR: MASS is a model merging method that combines multiple fine-tuned models into a single parameter set using adaptive subspace selection and data-free routing, achieving near-individual model performance with minimal storage overhead.

DetailsMotivation: Existing model merging methods fail to match the full accuracy of separately fine-tuned models. The authors aim to create a lightweight alternative to ensembling that retains near state-of-the-art performance across multiple tasks without additional training overhead.

Method: MASS uses low-rank decomposition of per-task updates, storing only the most salient singular components for each task. At inference, a non-parametric, data-free router identifies which subspace best explains an input’s intermediate features and activates corresponding task-specific blocks.

Result: MASS achieves state-of-the-art performance on CLIP-based image classification benchmarks (8, 14, and 20 tasks), recovering up to ~98% of the average accuracy of individual fine-tuned models with only ~2x storage overhead compared to a single pretrained model.

Conclusion: MASS provides a practical alternative to ensembling that combines multiple fine-tuned models into a single parameter set while maintaining near-individual model performance, with minimal storage cost and no training overhead.

Abstract: Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input’s intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2 storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.

[608] Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Meng Liu, Wei Yu, Lefei Zhang

Main category: cs.LG

TL;DR: MVC-ZigAL is a reinforcement learning framework for fine-tuning few-step text-to-multiview diffusion models to improve both per-view fidelity and cross-view consistency.

DetailsMotivation: Few-step text-to-multiview diffusion models enable real-time generation but compromise quality in per-view fidelity and cross-view consistency. Existing RL approaches for single-image diffusion don't work well for multiview settings as they neglect cross-view coordination and have weak learning signals in few-step regimes.

Method: Proposes MVC-ZigAL with three key innovations: (1) MDP formulation that jointly models all generated views with joint-view reward, (2) advantage learning strategy using self-refinement sampling for stronger learning signals, and (3) unified RL framework with Lagrangian dual formulation for multiview-constrained optimization with adaptive primal-dual updates and self-paced threshold curriculum.

Result: The framework enables robust RL fine-tuning for few-step T2MV diffusion models, yielding substantial gains in both per-view fidelity and cross-view consistency.

Conclusion: MVC-ZigAL provides an effective RL fine-tuning solution tailored for few-step text-to-multiview diffusion models, addressing the unique challenges of multiview generation and few-step regimes.

Abstract: Text-to-multiview (T2MV) diffusion models have shown great promise in generating multiple views of a scene from a single text prompt. While few-step backbones enable real-time T2MV generation, they often compromise key aspects of generation quality, such as per-view fidelity and cross-view consistency. Reinforcement learning (RL) finetuning offers a potential solution, yet existing approaches designed for single-image diffusion do not readily extend to the few-step T2MV setting, as they neglect cross-view coordination and suffer from weak learning signals in few-step regimes. To address this, we propose MVC-ZigAL, a tailored RL finetuning framework for few-step T2MV diffusion models. Specifically, its core insights are: (1) a new MDP formulation that jointly models all generated views and assesses their collective quality via a joint-view reward; (2) a novel advantage learning strategy that exploits the performance gains of a self-refinement sampling scheme over standard sampling, yielding stronger learning signals for effective RL finetuning; and (3) a unified RL framework that extends advantage learning with a Lagrangian dual formulation for multiview-constrained optimization, balancing single-view and joint-view objectives through adaptive primal-dual updates under a self-paced threshold curriculum that harmonizes exploration and constraint enforcement. Collectively, these designs enable robust and balanced RL finetuning for few-step T2MV diffusion models, yielding substantial gains in both per-view fidelity and cross-view consistency. Code is available at https://github.com/ZiyiZhang27/MVC-ZigAL.

[609] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov

Main category: cs.LG

TL;DR: The paper connects diffusion models to dense associative memories, showing that as training data increases, diffusion models transition from memorization to generalization through an intermediate phase of “spurious states” which represent the emergence of generative capabilities.

DetailsMotivation: To understand diffusion models through the lens of dense associative memories (DenseAMs) and characterize the transition from memorization to generalization as training data size increases, particularly focusing on the emergence of "spurious states" that represent early generative capabilities.

Method: Theoretical analysis connecting diffusion models to DenseAMs, viewing the generative process as memory retrieval. Characterization of basins of attraction, energy landscape curvature, and computational properties of spurious states across various architectures and datasets.

Result: Demonstrates that diffusion models exhibit spurious states predicted by DenseAM theory when training data exceeds critical capacity. These states represent the transition from memorization to generalization and are observed across diverse architectures and datasets.

Conclusion: Spurious states in diffusion models are not negative artifacts but rather the first signs of generative capabilities, representing a critical intermediate phase in the transition from memorization to generalization as predicted by DenseAM theory.

Abstract: Dense Associative Memories (DenseAMs) are generalizations of Hopfield networks, which have superior information storage capacity and can store training data points (memories) at local minima of the energy landscape. When the amount of training data exceeds the critical memory storage capacity of these models, new local minima, which are different from the training data, emerge. In Associative Memory these emergent local minima are called $\textit{spurious}; \textit{states}$, which hinder memory retrieval. In this work, we examine diffusion models (DMs) through the DenseAM lens, viewing their generative process as an attempt of a memory retrieval. In the small data regimes, DMs create distinct attractors for each training sample, akin to DenseAMs below the critical memory storage. As the training data size increases, they transition from memorization to generalization. We identify a critical intermediate phase, predicted by DenseAM theory – the spurious states. In generative modeling, these states are no longer negative artifacts but rather are the first signs of generative capabilities. We characterize the basins of attraction, energy landscape curvature, and computational properties of these previously overlooked states. Their existence is demonstrated across a wide range of architectures and datasets.

[610] VERINA: Benchmarking Verifiable Code Generation

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

Main category: cs.LG

TL;DR: VERINA is a benchmark for evaluating verifiable code generation in Lean, assessing code correctness, specification quality, and proof generation capabilities of LLMs.

DetailsMotivation: LLMs are increasingly used in software development but ensuring correctness of generated code is challenging. Current benchmarks lack holistic evaluation of verifiable code generation (code, specifications, and proofs together).

Method: Created VERINA benchmark with 189 manually curated coding tasks in Lean, including problem descriptions, reference implementations, formal specifications, and test suites. Evaluated state-of-the-art LLMs on code correctness, specification soundness/completeness, and proof generation.

Result: Best model (OpenAI o3) achieved 72.6% code correctness, 52.3% for specification soundness/completeness, and only 4.9% proof success rate. Shows significant challenges in verifiable code generation, especially proof generation.

Conclusion: VERINA provides a comprehensive benchmark for verifiable code generation, revealing current limitations of LLMs in this domain, particularly in theorem proving for verification.

Abstract: Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation – jointly generating code, specifications, and proofs of code-specification alignment – offers a promising path to address this limitation and further unleash LLMs’ benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o3, achieves a 72.6% code correctness rate, 52.3% for specification soundness and completeness, and a mere 4.9% proof success rate (based on one trial per task). We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

[611] Coded Robust Aggregation for Distributed Learning under Byzantine Attacks

Chengxi Li, Ming Xiao, Mikael Skoglund

Main category: cs.LG

TL;DR: CRA-DL: A distributed learning method using coded gradients to enhance robustness against Byzantine attacks by making honest devices’ gradients more similar through redundant data allocation.

DetailsMotivation: Current distributed learning methods using robust bounded aggregation (RBA) rules suffer performance degradation when local gradients vary significantly between devices, especially under Byzantine attacks where malicious devices send disruptive information.

Method: Proposes CRA-DL (coded robust aggregation distributed learning) where training data is allocated redundantly to devices before training. During training, honest devices transmit coded gradients computed from allocated data, and the server aggregates using RBA rules. This makes honest devices’ gradients more similar, improving Byzantine resilience.

Result: Theoretical convergence analysis and numerical results show CRA-DL outperforms existing baselines, demonstrating enhanced learning performance under Byzantine attacks compared to standard RBA approaches.

Conclusion: CRA-DL effectively mitigates Byzantine attacks in distributed learning by using coded gradients that increase similarity among honest devices’ updates, making Byzantine detection easier and improving overall learning robustness.

Abstract: In this paper, we investigate the problem of distributed learning (DL) in the presence of Byzantine attacks. For this problem, various robust bounded aggregation (RBA) rules have been proposed at the central server to mitigate the impact of Byzantine attacks. However, current DL methods apply RBA rules for the local gradients from the honest devices and the disruptive information from Byzantine devices, and the learning performance degrades significantly when the local gradients of different devices vary considerably from each other. To overcome this limitation, we propose a new DL method to cope with Byzantine attacks based on coded robust aggregation (CRA-DL). Before training begins, the training data are allocated to the devices redundantly. During training, in each iteration, the honest devices transmit coded gradients to the server computed from the allocated training data, and the server then aggregates the information received from both honest and Byzantine devices using RBA rules. In this way, the global gradient can be approximately recovered at the server to update the global model. Compared with current DL methods applying RBA rules, the improvement of CRA-DL is attributed to the fact that the coded gradients sent by the honest devices are closer to each other. This closeness enhances the robustness of the aggregation against Byzantine attacks, since Byzantine messages tend to be significantly different from those of honest devices in this case. We theoretically analyze the convergence performance of CRA-DL. Finally, we present numerical results to verify the superiority of the proposed method over existing baselines, showing its enhanced learning performance under Byzantine attacks.

[612] Out-of-Distribution Graph Models Merging

Yidi Wang, Ziyue Qiao, Jiawei Gu, Xubin Zheng, Pengyang Wang, Xiaobing Pei, Xiao Luo

Main category: cs.LG

TL;DR: A framework for merging out-of-distribution graph neural network models from different domains without requiring source/target data, using graph generation, mixture-of-experts, and masking mechanisms.

DetailsMotivation: Addressing the challenge of merging multiple graph models pre-trained on different domains with distribution discrepancy to create a generalized model without access to source/target domain data.

Method: Proposes graph generation strategy to instantiate mixture distribution of multiple domains, then merges pre-trained models via MoE module and masking mechanism for generalized adaptation. Framework is architecture-agnostic.

Result: Theoretical analysis and experimental results demonstrate effectiveness in addressing model generalization problem for out-of-distribution graph models.

Conclusion: The proposed framework successfully merges graph models from different domains without requiring source/target data, enabling generalized adaptation through innovative graph generation and model merging techniques.

Abstract: This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.

[613] RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis

Robin Yadav, Qi Yan, Guy Wolf, Avishek Joey Bose, Renjie Liao

Main category: cs.LG

TL;DR: RETRO SYNFLOW (RSF) is a discrete flow-matching framework for single-step retrosynthesis prediction that uses Markov bridges between target products and reactants, with reaction center identification and Feynman-Kac steering for improved diversity and feasibility.

DetailsMotivation: Single-step retrosynthesis prediction is challenging due to the combinatorial chemical search space, and existing template-free generative approaches struggle to produce both accurate and diverse sets of feasible reactions.

Method: RSF builds Markov bridges between target product molecules and reactant molecules, employs reaction center identification to produce intermediate synthon structures as source distributions, and uses Feynman-Kac steering with Sequential Monte Carlo resampling guided by a forward-synthesis model reward oracle.

Result: Achieves 60.0% top-1 accuracy (20% improvement over previous SOTA) and FK-steering improves top-5 round-trip accuracy by 19% over prior template-free methods while maintaining competitive top-k accuracy.

Conclusion: RETRO SYNFLOW demonstrates superior performance in single-step retrosynthesis prediction through discrete flow-matching and inference-time steering, offering improved accuracy and diversity in reaction prediction.

Abstract: A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction – i.e. single-step retrosynthesis – remains challenging even for existing state-of-the-art template-free generative approaches to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF) a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate \nameshort achieves $60.0 %$ top-1 accuracy, which outperforms the previous SOTA by $20 %$. We also substantiate the benefits of steering at inference and demonstrate that FK-steering improves top-$5$ round-trip accuracy by $19 %$ over prior template-free SOTA methods, all while preserving competitive top-$k$ accuracy results.

[614] Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning

Félix Lefebvre, Gaël Varoquaux

Main category: cs.LG

TL;DR: SEPAL is a scalable embedding propagation algorithm for large knowledge graphs that produces high-quality embeddings for downstream tasks by optimizing on a small core of entities and propagating via message passing.

DetailsMotivation: Current knowledge graph embedding methods have two limitations: they're primarily optimized for link prediction via local contrastive learning, and they require significant engineering effort to handle large graphs due to GPU memory constraints.

Method: SEPAL ensures global embedding consistency by optimizing embeddings only on a small core of entities, then propagating them to the rest of the graph using message passing techniques.

Result: SEPAL significantly outperforms previous methods on downstream tasks across 7 large-scale knowledge graphs and 46 downstream machine learning tasks, and scales to fit huge knowledge graphs on commodity hardware.

Conclusion: SEPAL addresses scalability limitations of knowledge graph embedding methods while improving performance on downstream tasks through its core optimization and propagation approach.

Abstract: Many machine learning tasks can benefit from external knowledge. Large knowledge graphs store such knowledge, and embedding methods can be used to distill it into ready-to-use vector representations for downstream applications. For this purpose, current models have however two limitations: they are primarily optimized for link prediction, via local contrastive learning, and their application to the largest graphs requires significant engineering effort due to GPU memory limits. To address these, we introduce SEPAL: a Scalable Embedding Propagation ALgorithm for large knowledge graphs designed to produce high-quality embeddings for downstream tasks at scale. The key idea of SEPAL is to ensure global embedding consistency by optimizing embeddings only on a small core of entities, and then propagating them to the rest of the graph with message passing. We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, enabling fitting huge knowledge graphs on commodity hardware.

[615] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou

Main category: cs.LG

TL;DR: CHORD is a framework that unifies supervised fine-tuning and reinforcement learning through dynamic weighting, treating SFT as a dynamically weighted auxiliary objective within on-policy RL to prevent disruption of learned patterns and overfitting.

DetailsMotivation: Existing approaches that integrate SFT and RL often disrupt established response patterns and cause overfitting to expert data. The authors aim to create a more stable and efficient learning process by harmonizing off-policy expert data with on-policy exploration.

Method: CHORD reframes SFT as a dynamically weighted auxiliary objective within on-policy RL. It uses a dual-control mechanism: (1) a global coefficient to guide transition from off-policy imitation to on-policy exploration, and (2) a token-wise weighting function for granular learning from expert data to promote on-policy exploration and mitigate disruption.

Result: Extensive experiments across various practical tasks show CHORD achieves stable and efficient learning, with significant improvements over baselines by effectively harmonizing off-policy expert data with on-policy exploration.

Conclusion: CHORD provides a unified framework for SFT and RL through dynamic weighting, enabling more stable and efficient post-training of LLMs while preventing disruption of learned patterns and overfitting to expert data.

Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments across various practical tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.

[616] LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning

Marco Paul E. Apolinario, Kaushik Roy

Main category: cs.LG

TL;DR: LANCE is a low-rank activation compression framework using one-shot higher-order SVD for efficient on-device fine-tuning and continual learning with reduced memory and computation costs.

DetailsMotivation: On-device learning needs efficient fine-tuning and continual learning without catastrophic forgetting, but current methods have high memory costs from storing activations during backpropagation. Existing compression methods use repeated low-rank decompositions with computational overhead and haven't been explored for continual learning.

Method: Proposes LANCE framework that performs one-shot higher-order Singular Value Decomposition (SVD) to obtain reusable low-rank subspaces for activation projection, eliminating repeated decompositions. Fixed low-rank subspaces enable continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices.

Result: Reduces activation storage up to 250× while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), performs competitively with orthogonal gradient projection methods at a fraction of the memory cost.

Conclusion: LANCE is a practical and scalable solution for efficient fine-tuning and continual learning on edge devices, addressing both memory and computational constraints through reusable low-rank subspaces.

Abstract: On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both fine-tuning existing models and continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but rely on repeated low-rank decompositions, introducing computational overhead. Also, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs one-shot higher-order Singular Value Decomposition (SVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. Moreover, fixed low-rank subspaces further enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage up to 250$\times$ while maintaining accuracy comparable to full backpropagation on CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it performs competitively with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.

[617] NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price

Main category: cs.LG

TL;DR: NanoFlux is an adversarial framework that generates small, targeted training datasets (<200 examples) to improve LLM reasoning, outperforming conventional fine-tuning with 3-14x computational savings.

DetailsMotivation: Current fine-tuning approaches for LLMs often require large datasets and extensive computational resources. The paper aims to develop a more efficient method for improving LLM reasoning capabilities through targeted, adversarial data generation.

Method: Uses an adversarial framework with models alternating as Attacker and Defender, supervised by a tool-augmented Judge. Generates multi-step questions with explanatory annotations targeting specific reasoning capabilities. Includes embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning.

Result: Fine-tuning a 4B-parameter model on NanoFlux-generated data yields significant gains: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), with 3-14x computational reduction.

Conclusion: Future model improvements may come from intelligent synthesis of small, precisely targeted training datasets rather than large-scale data collection. The framework demonstrates that quality and targeting matter more than quantity.

Abstract: We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.

[618] Attribution-Guided Decoding

Piotr Komorowski, Elena Golimblevskaia, Reduan Achtibat, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

Main category: cs.LG

TL;DR: AGD is an interpretability-based decoding method that selects output tokens based on their attribution to user-defined regions of interest, improving instruction following and factual accuracy in LLMs.

DetailsMotivation: Standard decoding methods often fail to robustly satisfy complex instruction following and factual accuracy requirements, while existing control techniques frequently degrade general output quality.

Method: Attribution-Guided Decoding (AGD) considers high-probability output token candidates and selects the one with highest attribution to user-defined Regions of Interest (ROI), which can be flexibly defined over different parts of the model’s input or internal components.

Result: AGD significantly improves instruction following (e.g., improving overall success rate on Llama 3.1 from 66.0% to 79.1%), reduces hallucinations, and improves factual accuracy in both closed-book and open-book settings.

Conclusion: AGD presents a versatile, interpretable, and effective method for enhancing the reliability of modern LLMs through attribution-based guidance during decoding.

Abstract: The capacity of Large Language Models (LLMs) to follow complex instructions and generate factually accurate text is critical for their real-world application. However, standard decoding methods often fail to robustly satisfy these requirements, while existing control techniques frequently degrade general output quality. In this work, we introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates and selects the one that exhibits the highest attribution to a user-defined Region of Interest (ROI). This ROI can be flexibly defined over different parts of the model’s input or internal components, allowing AGD to steer generation towards various desirable behaviors. We demonstrate AGD’s efficacy across three challenging domains. For instruction following, we show that AGD significantly boosts adherence (e.g., improving the overall success rate on Llama 3.1 from 66.0% to 79.1%). For knowledge-intensive tasks, we show that guiding generation towards usage of internal knowledge components or contextual sources can reduce hallucinations and improve factual accuracy in both closed-book and open-book settings. Furthermore, we propose an adaptive, entropy-based variant of AGD that mitigates quality degradation and reduces computational overhead by applying guidance only when the model is uncertain. Our work presents a versatile, more interpretable, and effective method for enhancing the reliability of modern LLMs.

[619] PolyGraph Discrepancy: a classifier-based metric for graph generation

Markus Krimmel, Philip Hartout, Karsten Borgwardt, Dexiong Chen

Main category: cs.LG

TL;DR: PolyGraph Discrepancy (PGD) is a new evaluation framework for graph generative models that uses binary classifiers to approximate Jensen-Shannon distance between real and generated graph distributions, providing more robust and comparable metrics than traditional MMD approaches.

DetailsMotivation: Current evaluation methods for graph generative models rely on Maximum Mean Discrepancy (MMD) metrics which have limitations: they don't provide absolute performance measures, are highly sensitive to kernel and descriptor parameters, and metrics from different descriptors are incomparable.

Method: PGD approximates Jensen-Shannon distance by training binary classifiers to distinguish between real and generated graphs featurized by graph descriptors. The classifiers’ data log-likelihood provides a variational lower bound on the JS distance, resulting in metrics constrained to [0,1] that are comparable across descriptors.

Result: PGD provides more robust and insightful evaluation compared to MMD metrics, with metrics that are comparable across different graph descriptors and constrained to a standardized unit interval.

Conclusion: PolyGraph Discrepancy offers a theoretically grounded, practical evaluation framework for graph generative models that addresses key limitations of existing MMD-based approaches, with publicly available benchmarking tools.

Abstract: Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting metrics are constrained to the unit interval [0,1] and are comparable across different graph descriptors. We further derive a theoretically grounded summary metric that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGD provides a more robust and insightful evaluation compared to MMD metrics. The PolyGraph framework for benchmarking graph generative models is made publicly available at https://github.com/BorgwardtLab/polygraph-benchmark.

[620] Controllable Graph Generation with Diffusion Models via Inference-Time Tree Search Guidance

Jiachi Zhao, Zehong Wang, Yamei Liao, Chuxu Zhang, Yanfang Ye

Main category: cs.LG

TL;DR: TreeDiff: MCTS-guided dual-space diffusion framework for controllable graph generation with macro-step expansion, dual-space denoising, and dual-space verifier for improved performance and scalability.

DetailsMotivation: Current diffusion models for graph generation offer little control over desired properties, leading to unstable quality and difficulty incorporating new objectives. Inference-time guidance methods are local, heuristic, and limited in controllability.

Method: Proposes TreeDiff, a Monte Carlo Tree Search guided dual-space diffusion framework with three key designs: 1) macro-step expansion strategy grouping multiple denoising updates, 2) dual-space denoising coupling latent-space denoising with discrete graph-space correction, and 3) dual-space verifier for long-term reward prediction.

Result: Achieves state-of-the-art performance on 2D and 3D molecular generation benchmarks in both unconditional and conditional settings. Exhibits favorable inference-time scaling, continuing to improve with additional computation while other methods plateau early.

Conclusion: TreeDiff provides an effective plug-and-play inference-time method for controllable graph generation that expands search space while maintaining tractable computation, addressing limitations of existing diffusion and guidance approaches.

Abstract: Graph generation is a fundamental problem in graph learning with broad applications across Web-scale systems, knowledge graphs, and scientific domains such as drug and material discovery. Recent approaches leverage diffusion models for step-by-step generation, yet unconditional diffusion offers little control over desired properties, often leading to unstable quality and difficulty in incorporating new objectives. Inference-time guidance methods mitigate these issues by adjusting the sampling process without retraining, but they remain inherently local, heuristic, and limited in controllability. To overcome these limitations, we propose TreeDiff, a Monte Carlo Tree Search (MCTS) guided dual-space diffusion framework for controllable graph generation. TreeDiff is a plug-and-play inference-time method that expands the search space while keeping computation tractable. Specifically, TreeDiff introduces three key designs to make it practical and scalable: (1) a macro-step expansion strategy that groups multiple denoising updates into a single transition, reducing tree depth and enabling long-horizon exploration; (2) a dual-space denoising mechanism that couples efficient latent-space denoising with lightweight discrete correction in graph space, ensuring both scalability and structural fidelity; and (3) a dual-space verifier that predicts long-term rewards from partially denoised graphs, enabling early value estimation and removing the need for full rollouts. Extensive experiments on 2D and 3D molecular generation benchmarks, under both unconditional and conditional settings, demonstrate that TreeDiff achieves state-of-the-art performance. Notably, TreeDiff exhibits favorable inference-time scaling: it continues to improve with additional computation, while existing inference-time methods plateau early under limited resources.

[621] Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems

George Webber, Andrew J. Reader

Main category: cs.LG

TL;DR: DC loss replaces pointwise data-fidelity with distribution-level calibration to avoid noise overfitting in inverse problems without needing early stopping or paired data.

DetailsMotivation: Current inverse problem methods using pointwise data-fidelity (like MSE) often overfit to noise, requiring early stopping or strong priors. The paper aims to develop a statistically grounded alternative that evaluates data-fidelity collectively through distribution-level consistency.

Method: Introduces Distributional Consistency (DC) loss that tests whether observed measurements are statistically consistent with noise distributions implied by current estimates, replacing pointwise matching with distribution-level calibration. It’s designed as a plug-in replacement compatible with unsupervised regularizers and optimized similarly to traditional losses.

Result: In image denoising with deep image prior, DC loss removes need for early stopping and achieves higher PSNR. In medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances hand-crafted regularization efficacy.

Conclusion: DC loss provides a statistically grounded, performance-enhancing alternative to conventional fidelity losses for unsupervised noise-dominated inverse problems where measurement-noise distribution is known and datasets contain many independent noisy values.

Abstract: Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current methods balance prior signal priors (regularization) with agreement with noisy data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern unsupervised regularizers that operate without paired measurement-ground-truth data, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise without early stopping or priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for an important class of unsupervised noise-dominated inverse problems.

[622] Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning

Reuben Dorent, Polina Golland, William Wells

Main category: cs.LG

TL;DR: A theoretical paper deriving a tight, tractable lower bound on mutual information via Jensen-Shannon divergence, providing justification for discriminative learning approaches in representation learning.

DetailsMotivation: Mutual Information (MI) is crucial for representation learning but direct optimization is intractable. Many methods use surrogate objectives like Jensen-Shannon divergence (JSD) via discriminative losses, but the connection between these surrogates and MI remains poorly understood theoretically.

Method: Derives a new tight lower bound on Kullback-Leibler divergence as a function of JSD, then specializes to joint and marginal distributions. Shows that maximizing JSD-based information increases a guaranteed lower bound on MI. Revisits practical implementation showing binary classifier cross-entropy loss recovers variational lower bound on JSD.

Result: Extensive experiments show the lower bound is tight for MI estimation. Compared to state-of-the-art neural estimators across established reference scenarios, the lower bound estimator provides stable, low-variance estimates of tight lower bounds on MI. Also demonstrates practical usefulness in Information Bottleneck framework.

Conclusion: Provides new theoretical justifications and strong empirical evidence for using discriminative learning in MI-based representation learning, bridging the gap between practical JSD-based methods and theoretical MI optimization.

Abstract: Mutual Information (MI) is a fundamental measure of statistical dependence widely used in representation learning. While direct optimization of MI via its definition as a Kullback-Leibler divergence (KLD) is often intractable, many recent methods have instead maximized alternative dependence measures, most notably, the Jensen-Shannon divergence (JSD) between joint and product of marginal distributions via discriminative losses. However, the connection between these surrogate objectives and MI remains poorly understood. In this work, we bridge this gap by deriving a new, tight, and tractable lower bound on KLD as a function of JSD in the general case. By specializing this bound to joint and marginal distributions, we demonstrate that maximizing the JSD-based information increases a guaranteed lower bound on mutual information. Furthermore, we revisit the practical implementation of JSD-based objectives and observe that minimizing the cross-entropy loss of a binary classifier trained to distinguish joint from marginal pairs recovers a known variational lower bound on the JSD. Extensive experiments demonstrate that our lower bound is tight when applied to MI estimation. We compared our lower bound to state-of-the-art neural estimators of variational lower bound across a range of established reference scenarios. Our lower bound estimator consistently provides a stable, low-variance estimate of a tight lower bound on MI. We also demonstrate its practical usefulness in the context of the Information Bottleneck framework. Taken together, our results provide new theoretical justifications and strong empirical evidence for using discriminative learning in MI-based representation learning.

[623] Transformers can do Bayesian Clustering

Prajit Bhaskaran, Tom Viering

Main category: cs.LG

TL;DR: Cluster-PFN is a Transformer-based model for Bayesian clustering that learns from synthetic data to estimate posterior distributions over cluster counts and assignments, handling missing values without imputation.

DetailsMotivation: Bayesian clustering is computationally expensive at scale, and real-world datasets often have missing values where simple imputation ignores uncertainty, leading to suboptimal results.

Method: Extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering using Transformer architecture. Trained entirely on synthetic datasets generated from finite Gaussian Mixture Model priors. Learns to estimate posterior distribution over both number of clusters and cluster assignments, handling missing data without imputation.

Result: Estimates number of clusters more accurately than AIC, BIC and Variational Inference. Achieves clustering quality competitive with VI while being orders of magnitude faster. Outperforms imputation-based baselines on real-world genomic datasets with high missingness.

Conclusion: Cluster-PFN provides scalable and flexible Bayesian clustering that can handle missing data effectively without imputation, making Bayesian clustering more practical for real-world applications.

Abstract: Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.

[624] AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs

Yubo Wang, Haoyang Li, Fei Teng, Lei Chen

Main category: cs.LG

TL;DR: AGRAG is an advanced graph-based RAG framework that addresses LLM hallucination in graph construction, poor reasoning ability, and inadequate answering by using statistics-based entity extraction and formulating retrieval as a Minimum Cost Maximum Influence subgraph generation problem.

DetailsMotivation: Existing graph-based RAG methods suffer from inaccurate graph construction due to LLM hallucination, poor reasoning ability from lack of explicit reasoning paths, and inadequate answering due to limited LLM reasoning, causing them to underperform NaiveRAG on some tasks.

Method: AGRAG uses statistics-based entity extraction instead of LLM-based extraction to avoid hallucination. It formulates graph reasoning as Minimum Cost Maximum Influence (MCMI) subgraph generation, proves it’s NP-hard, and uses a greedy algorithm to generate comprehensive reasoning paths that serve as explicit reasoning for LLMs.

Result: AGRAG improves reasoning ability by providing explicit reasoning paths, reduces noise impact, allows complex graph structures like cycles, and enhances the comprehensiveness of generated reasoning paths compared to simple tree-structured approaches.

Conclusion: AGRAG addresses key limitations in graph-based RAG through statistics-based graph construction and MCMI subgraph generation, providing explicit reasoning paths that improve LLM focus on relevant content and overall system performance.

Abstract: Graph-based retrieval-augmented generation (Graph-based RAG) has demonstrated significant potential in enhancing Large Language Models (LLMs) with structured knowledge. However, existing methods face three critical challenges: Inaccurate Graph Construction, caused by LLM hallucination; Poor Reasoning Ability, caused by failing to generate explicit reasons telling LLM why certain chunks were selected; and Inadequate Answering, which only partially answers the query due to the inadequate LLM reasoning, making their performance lag behind NaiveRAG on certain tasks. To address these issues, we propose AGRAG, an advanced graph-based retrieval-augmented generation framework. When constructing the graph, AGRAG substitutes the widely used LLM entity extraction method with a statistics-based method, avoiding hallucination and error propagation. During retrieval, AGRAG formulates the graph reasoning procedure as the Minimum Cost Maximum Influence (MCMI) subgraph generation problem, where we try to include more nodes with high influence score, but with less involving edge cost, to make the generated reasoning paths more comprehensive. We prove this problem to be NP-hard, and propose a greedy algorithm to solve it. The MCMI subgraph generated can serve as explicit reasoning paths to tell LLM why certain chunks were retrieved, thereby making the LLM better focus on the query-related part contents of the chunks, reducing the impact of noise, and improving AGRAG’s reasoning ability. Furthermore, compared with the simple tree-structured reasoning paths, our MCMI subgraph can allow more complex graph structures, such as cycles, and improve the comprehensiveness of the generated reasoning paths. The code and prompt of AGRAG are released at: https://github.com/Wyb0627/AGRAG.

[625] FedSDWC: Federated Synergistic Dual-Representation Weak Causal Learning for OOD

Zhenyuan Huang, Hui Zhang, Wenzhong Tang, Haijun Yang

Main category: cs.LG

TL;DR: FedSDWC is a federated learning method that uses causal inference to integrate invariant and variant features, improving generalization and OOD detection under data distribution shifts.

DetailsMotivation: Federated learning faces reliability issues due to data distribution differences (covariate and semantic shifts) across clients, which existing invariant learning methods struggle to address effectively.

Method: FedSDWC infers causal semantic representations by modeling weak causal influence between invariant and variant features, overcoming limitations of existing invariant learning methods that fail to accurately capture invariant features or directly construct causal representations.

Result: FedSDWC outperforms FedICON by 3.04% on CIFAR-10 and 8.11% on CIFAR-100, shows superior performance on multiple benchmark datasets, and provides theoretical generalization error bounds related to client prior distributions.

Conclusion: FedSDWC significantly enhances FL’s ability to generalize and detect OOD data by integrating causal inference with invariant and variant feature modeling, addressing key challenges in real-world federated learning deployments.

Abstract: Amid growing demands for data privacy and advances in computational infrastructure, federated learning (FL) has emerged as a prominent distributed learning paradigm. Nevertheless, differences in data distribution (such as covariate and semantic shifts) severely affect its reliability in real-world deployments. To address this issue, we propose FedSDWC, a causal inference method that integrates both invariant and variant features. FedSDWC infers causal semantic representations by modeling the weak causal influence between invariant and variant features, effectively overcoming the limitations of existing invariant learning methods in accurately capturing invariant features and directly constructing causal representations. This approach significantly enhances FL’s ability to generalize and detect OOD data. Theoretically, we derive FedSDWC’s generalization error bound under specific conditions and, for the first time, establish its relationship with client prior distributions. Moreover, extensive experiments conducted on multiple benchmark datasets validate the superior performance of FedSDWC in handling covariate and semantic shifts. For example, FedSDWC outperforms FedICON, the next best baseline, by an average of 3.04% on CIFAR-10 and 8.11% on CIFAR-100.

[626] Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun

Main category: cs.LG

TL;DR: MARVAL accelerates masked auto-regressive diffusion models by distilling the diffusion chain into a single AR step, enabling fast inference and practical RL post-training.

DetailsMotivation: Vanilla masked auto-regressive diffusion models suffer from slow inference due to hierarchical structure (outer AR loop + inner diffusion chain), making them impractical for RL applications that require fast sampling.

Method: Proposes MARVAL: a distillation framework using a novel score-based variational objective to compress the diffusion chain into a single AR generation step while preserving flexible unmasking order. Also introduces MARVAL-RL for efficient RL post-training.

Result: On ImageNet 256×256, MARVAL-Huge achieves FID of 2.00 with >30× speedup vs MAR-diffusion. MARVAL-RL improves CLIP and image-reward scores on ImageNet datasets with entity names.

Conclusion: MARVAL provides the first practical path for distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignment.

Abstract: Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such decoupled structure not only harm the generation efficiency but also hinder the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training.To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256*256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.

[627] Tail Distribution of Regret in Optimistic Reinforcement Learning

Sajad Khodadadian, Mehrdad Moharrami

Main category: cs.LG

TL;DR: The paper provides instance-dependent tail bounds for optimism-based RL algorithms in finite-horizon tabular MDPs, analyzing both model-based (UCBVI) and model-free (Q-learning) approaches with different exploration bonus schedules.

DetailsMotivation: Existing RL regret analyses typically focus on expected regret or single high-probability bounds, lacking comprehensive tail distribution characterization. The paper aims to provide more detailed instance-dependent tail bounds for optimism-based RL algorithms.

Method: Analyzes UCBVI (model-based) with two exploration bonus schedules: K-dependent and K-independent (anytime). Also analyzes optimistic Q-learning (model-free) with K-dependent bonus. Derives explicit bounds on tail distribution P(R_K ≥ x) using concentration inequalities and regret decomposition techniques.

Result: Obtains upper bounds on tail distribution with two-regime structure: sub-Gaussian tail up to a transition threshold, followed by sub-Weibull tail beyond that point. Derives corresponding instance-dependent expected regret bounds. Shows algorithms depend on tuning parameter α balancing expected regret and sub-Gaussian decay range.

Conclusion: Provides comprehensive tail distribution analysis for optimism-based RL algorithms, revealing distinctive two-regime structure and enabling better understanding of regret concentration properties beyond standard expected regret analyses.

Abstract: We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes via explicit bounds on $P(R_K \ge x)$, going beyond analyses limited to $E[R_K]$ or a single high-probability quantile. We analyze two natural exploration-bonus schedules for UCBVI: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent (anytime) scheme that depends only on the current episode index. We then complement the model-based results with an analysis of optimistic Q-learning (model-free) under a $K$-dependent bonus schedule. Across both the model-based and model-free settings, we obtain upper bounds on $P(R_K \ge x)$ with a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $E[R_K]$. The proposed algorithms depend on a tuning parameter $α$, which balances the expected regret and the range over which the regret exhibits sub-Gaussian decay.

[628] Flow Matching for Tabular Data Synthesis

Bahrul Ilmi Nasution, Floor Eijkelboom, Mark Elliot, Richard Allmendinger, Christian A. Naesseth

Main category: cs.LG

TL;DR: Flow matching methods (FM and variational FM) outperform diffusion models for tabular data synthesis, offering better performance with fewer function evaluations and exploring privacy-utility tradeoffs.

DetailsMotivation: To develop effective synthetic data generation methods for privacy-preserving data sharing, comparing flow matching approaches against state-of-the-art diffusion models for tabular data synthesis.

Method: Comprehensive empirical study comparing flow matching (FM and variational FM) with diffusion methods (TabDDPM and TabSyn) for tabular data synthesis, evaluating different probability paths (OT and VP) and samplers (deterministic vs stochastic).

Result: FM (especially TabbyFlow) outperforms diffusion baselines, achieves better performance with ≤100 steps, OT path is robust default while VP reduces privacy risk, and stochastic flows preserve marginal distributions while enabling high utility with reduced disclosure risk.

Conclusion: Flow matching is a superior alternative to diffusion models for tabular data synthesis, offering computational efficiency, better performance, and flexible privacy-utility tradeoffs through different probability paths and stochastic sampling.

Abstract: Synthetic data generation is an important tool for privacy-preserving data sharing. Although diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement FM for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with a state-of-the-art diffusion method (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers – something possible when learning to generate using \textit{variational} FM – characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that FM, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably low function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial, as using the OT is a strong default and more robust to early stopping on average, while VP has potential to produce synthetic data with lower privacy risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high utility synthetic data with reduced disclosure risk. The implementation code associated with this paper is publicly available at https://github.com/rulnasution/tabular-flow-matching.

[629] A Dynamic Time Warping-Transfer Learning Approach to Transferring Knowledge in Stress-strain Behaviors from Polymers to Metals: An Affordable and Generalizable Additive Manufacturing Part Qualification Framework

Chenglong Duan, Dazhong Wu

Main category: cs.LG

TL;DR: A DTW-transfer learning framework for additive manufacturing part qualification that transfers knowledge from polymer stress-strain behaviors to predict metal behaviors, reducing testing costs.

DetailsMotivation: Conventional part qualification techniques for additive manufacturing (especially metal AM) are costly and time-consuming. The paper aims to develop a more efficient method by leveraging knowledge transfer from low-cost polymers to expensive metals.

Method: Develops a dynamic time warping (DTW)-transfer learning framework that selects the most similar polymer dataset (from Nylon, PLA, CF-ABS, Resin) to target metal datasets (AlSi10Mg, Ti6Al4V, carbon steel) using DTW similarity. Then trains an LSTM model on the optimal polymer dataset and tests on metal datasets.

Result: Resin dataset selected as optimal for AlSi10Mg and Ti6Al4V, Nylon for carbon steel. DTWTL model achieves best performance with average MAPE of 12.41%, RMSE of 63.75, and R² of 0.96, outperforming vanilla LSTM and TL models trained on all polymer datasets.

Conclusion: The DTW-TL framework successfully enables knowledge transfer from polymers to metals for AM part qualification, reducing costs while maintaining predictive accuracy for stress-strain behavior prediction.

Abstract: Part qualification in additive manufacturing (AM) ensures that additively manufactured parts can be consistently produced and reliably used in critical applications. One crucial aspect of part qualification is to determine the complex stress-strain behavior of additively manufactured parts. However, conventional part qualification techniques such as the destructive testing and non-destructive testing are costly and time consuming, especially for metal AM. To address this challenge, we develop a dynamic time warping (DTW)-transfer learning (TL) framework for AM part qualification by transferring knowledge gained from the stress-strain behaviors of additively manufactured low-cost polymers to high-performance, expensive metals. Specifically, the framework selects one single optimal polymer dataset that is the most similar to the metal dataset in the target domain using DTW among multiple polymer datasets, including Nylon, PLA, CF-ABS, and Resin. A long short-term memory (LSTM) model is then trained on one single optimal polymer dataset and tested on one of three target metal datasets, including AlSi10Mg, Ti6Al4V, and carbon steel datasets. Experimental results show that the Resin dataset is selected as the optimal polymer dataset in the source domain for the AlSi10Mg and Ti6Al4V datasets, while the Nylon dataset is selected as the optimal polymer dataset in the source domain for the carbon steel dataset. The DTWTL model trained on one single optimal polymer dataset as the source domain achieves the best predictive performance, including an average mean absolute percentage error of 12.41%, an average root mean squared error of 63.75, and an average coefficient of determination of 0.96 when three metals are used as the target domain, outperforming the vanilla LSTM model without TL as well as the TL model trained on all four polymer datasets as the source domain.

Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis

Main category: cs.LG

TL;DR: BGPS automatically generates prompts to expose social biases in text-to-image diffusion models by using LLMs guided by attribute classifiers to find prompts that maximize bias in generated images.

DetailsMotivation: Current bias mitigation approaches for text-to-image models rely on curated prompt datasets, which are costly and may miss subtle, unanticipated biases. There's a need for automated methods to discover prompts that trigger biased generation.

Method: Bias-Guided Prompt Search (BGPS) uses an LLM to generate attribute-neutral prompts, then employs attribute classifiers on the TTI model’s internal representations to guide the LLM’s decoding toward prompts that amplify specific image attributes of interest.

Result: BGPS discovered subtle, previously undocumented biases in Stable Diffusion 1.5 and state-of-the-art debiased models that severely deteriorate fairness metrics. The discovered prompts are interpretable and quantitatively improve perplexity compared to hard prompt optimization methods.

Conclusion: BGPS expands the bias search space for text-to-image models, uncovers vulnerabilities, and can serve as a new evaluation tool for bias mitigation, addressing limitations of curated prompt datasets.

Abstract: Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Beside the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI’s internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.

[631] SolarGPT-QA: A Domain-Adaptive Large Language Model for Educational Question Answering in Space Weather and Heliophysics

Santosh Chapagain, MohammadReza EskandariNasab, Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi

Main category: cs.LG

TL;DR: SolarGPT-QA is a domain-adapted LLM built on LLaMA-3 for educational question answering about solar activity and space weather, trained on scientific literature and GPT-4-generated Q&A data, with evaluation using LLM-as-judge framework.

DetailsMotivation: Solar activity impacts critical infrastructure but general LLMs lack domain-specific knowledge and pedagogical capability to explain complex space science concepts clearly. There's a need for specialized educational systems in space weather.

Method: Built on LLaMA-3 base model with domain-adaptive pretraining on scientific literature and fine-tuning on large-scale Q&A data generated by GPT-4 and refined by Grok-3 in student-friendly storytelling style. Uses LLM-as-judge evaluation framework with structured criteria.

Result: SolarGPT-QA outperforms general-purpose models in zero-shot settings and achieves competitive performance compared to instruction-tuned models for educational explanations in space weather and heliophysics. Ablation studies show domain-adaptive pretraining plus fine-tuning balances scientific accuracy and educational effectiveness.

Conclusion: Domain-adapted LLMs like SolarGPT-QA can effectively provide educational explanations for complex scientific domains like space weather, combining scientific accuracy with pedagogical effectiveness through specialized training approaches.

Abstract: Solar activity, including solar flares, coronal mass ejections (CMEs), and geomagnetic storms can significantly impact satellites, aviation, power grids, data centers, and space missions. Extreme solar events can cause substantial economic damage with limited advance warning, underscoring the importance of early warning systems, accurate forecasting, and effective education in space science. Although large language models (LLMs) perform well on general tasks, they often lack domain specific knowledge and pedagogical capability to clearly explain complex space science concepts. We introduce SolarGPT-QA, a question answering system based on a domain adapted large language model built on the LLaMA-3 base model. The model is trained using scientific literature and large scale question and answer data generated with GPT-4 and refined using Grok-3 in a student friendly storytelling style. To evaluate response quality, we employ an LLM-as-judge evaluation framework, where a strong reference model assesses generated answers using structured criteria including scientific accuracy, clarity, completeness, and pedagogical effectiveness. Results show that SolarGPT-QA performs strongly relative to general purpose models in zero shot settings and achieves competitive performance compared to instruction tuned models for educational explanations in space weather and heliophysics. Ablation studies indicate that combining domain adaptive pretraining with fine tuning is important for balancing scientific accuracy and educational effectiveness.

[632] Scalable Sample-Level Causal Discovery in Event Sequences via Autoregressive Density Estimation

Hugo Math, Rainer Lienhart

Main category: cs.LG

TL;DR: TRACE: A scalable framework for causal discovery from single event sequences using autoregressive models as pretrained density estimators for conditional mutual information estimation.

DetailsMotivation: Causal discovery from single observed sequences of discrete events (like vehicle logs, manufacturing systems, or patient trajectories) is challenging due to lack of repeated samples, high dimensionality, and long-range temporal dependencies. Existing methods struggle with scalability and handling delayed causal effects.

Method: TRACE repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation. It infers summary causal graphs between event types in sequences, scales linearly with event vocabulary, supports delayed causal effects, and is fully parallel on GPUs.

Result: Theoretical identifiability established under imperfect autoregressive models. Experiments show robust performance across baselines and varying vocabulary sizes, including successful application to root-cause analysis in vehicle diagnostics with over 29,100 event types.

Conclusion: TRACE provides a scalable solution for causal discovery from single event sequences, overcoming challenges of high dimensionality and long-range dependencies while maintaining theoretical guarantees and practical applicability to large-scale real-world problems.

Abstract: We study causal discovery from a single observed sequence of discrete events generated by a stochastic process, as encountered in vehicle logs, manufacturing systems, or patient trajectories. This regime is particularly challenging due to the absence of repeated samples, high dimensionality, and long-range temporal dependencies of the single observation during inference. We introduce TRACE, a scalable framework that repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation. TRACE infers the summary causal graph between event types in a sequence, scaling linearly with the event vocabulary and supporting delayed causal effects, while being fully parallel on GPUs. We establish its theoretical identifiability under imperfect autoregressive models. Experiments demonstrate robust performance across different baselines and varying vocabulary sizes including an application to root-cause analysis in vehicle diagnostics with over 29,100 event types.

[633] A Scalable Approach to Solving Simulation-Based Network Security Games

Michael Lanier, Yevgeniy Vorobeychik

Main category: cs.LG

TL;DR: MetaDOAR: A lightweight meta-controller for scalable multi-agent reinforcement learning on large cyber-network environments using partition-aware filtering and Q-value caching

DetailsMotivation: To enable scalable multi-agent reinforcement learning on very large cyber-network environments where conventional approaches face significant scaling issues in terms of memory usage and training time

Method: Uses a learned meta-controller with partition-aware filtering layer and Q-value caching; learns compact state projection from per-node structural embeddings to select small subset of devices; performs focused beam search with critic agent; implements LRU cache with quantized state projection and conservative k-hop cache invalidation

Result: MetaDOAR attains higher player payoffs than state-of-the-art baselines on large network topologies without significant scaling issues in memory usage or training time

Conclusion: Provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems

Abstract: We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provide a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.

[634] Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

Afshin Khadangi

Main category: cs.LG

TL;DR: TRC² is a novel decoder-only architecture designed for continual learning in language models, featuring cortical columns with thalamic routing and hippocampal pathways to prevent catastrophic forgetting while adapting to evolving data.

DetailsMotivation: Current LLMs struggle with continual adaptation to evolving data, user behavior, and task mixtures without catastrophic forgetting. Existing stabilization methods are costly, brittle, or difficult to scale, requiring a backbone architecture that inherently supports continual learning.

Method: TRC² combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. Includes causal memory-update scheme and online replay controller.

Result: TRC² consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines across task-sequential language-modeling streams over C4, WikiText-103, and GSM8K.

Conclusion: TRC² demonstrates that architectural innovations can make continual adaptation an inherent property of language model backbones, with thalamic and hippocampal components being central to retention gains while maintaining competitive throughput and training costs.

Abstract: Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. In practice, this remains difficult: sequential updates induce catastrophic forgetting, while many stabilization methods rely on external procedures that are costly, brittle, or difficult to scale. We present TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual adaptation a property of the backbone itself. TRC$^{2}$ combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway. We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential language-modeling stream over C4, WikiText-103, and GSM8K, TRC$^{2}$ consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines trained under the same pipeline. Ablations show that the thalamic and hippocampal components are central to the retention gains, while the full model remains competitive in throughput and training cost.

[635] Transit Network Design with Two-Level Demand Uncertainties: A Machine Learning and Contextual Stochastic Optimization Framework

Hongzhao Guan, Beste Basciftci, Pascal Van Hentenryck

Main category: cs.LG

TL;DR: A two-level transit network design framework that uses machine learning and contextual stochastic optimization to incorporate demand uncertainties, distinguishing between core transit users and latent demand based on service quality.

DetailsMotivation: Traditional transit network design uses fixed demand assumptions, which are unrealistic. The paper aims to incorporate two layers of demand uncertainties: core transit users and latent demand that depends on service quality.

Method: Proposes 2LRC-TND framework using machine learning models for travel mode choice prediction, integrated with contextual stochastic optimization solved via constraint programming (CP-SAT solver).

Result: Evaluated on Atlanta metropolitan area with 6,600+ travel arcs and 38,000+ trips. Framework effectively designs transit networks accounting for demand uncertainties and contextual information.

Conclusion: 2LRC-TND provides a more realistic alternative to fixed-demand transit network design models by incorporating demand uncertainties through machine learning and optimization.

Abstract: Transit Network Design is a well-studied problem in the field of transportation, typically addressed by solving optimization models under fixed demand assumptions. Considering the limitations of these assumptions, this paper proposes a new framework, namely the Two-Level Rider Choice Transit Network Design (2LRC-TND), that leverages machine learning and contextual stochastic optimization (CSO) through constraint programming (CP) to incorporate two layers of demand uncertainties into the network design process. The first level identifies travelers who rely on public transit (core demand), while the second level captures the conditional adoption behavior of those who do not (latent demand), based on the availability and quality of transit services. To capture these two types of uncertainties, 2LRC-TND relies on two travel mode choice models, that use multiple machine learning models. To design a network, 2LRC-TND integrates the resulting choice models into a CSO that is solved using a CP-SAT solver. 2LRC-TND is evaluated through a case study involving over 6,600 travel arcs and more than 38,000 trips in the Atlanta metropolitan area. The computational results demonstrate the effectiveness of the 2LRC-TND in designing transit networks that account for demand uncertainties and contextual information, offering a more realistic alternative to fixed-demand models.

[636] AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

Pei Yang, Wanyi Chen, Asuka Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Dongdong Zhang, Fuqiang Li, Alfred Long, Lynn Ai, Eric Yang, Bill Shi

Main category: cs.LG

TL;DR: AOI is a trainable multi-agent framework for automating Site Reliability Engineering using LLM agents with security constraints, featuring diagnostic training, safe execution architecture, and failure learning.

DetailsMotivation: LLM agents show promise for automating SRE tasks but face deployment challenges: restricted access to proprietary data, unsafe action execution in permission-governed environments, and inability to learn from failures in closed systems.

Method: Three-component framework: 1) Trainable diagnostic system using Group Relative Policy Optimization to distill expert knowledge into open-source models; 2) Read-write separated execution architecture with observation, reasoning, and action phases; 3) Failure Trajectory Closed-Loop Evolver that mines unsuccessful trajectories for corrective supervision.

Result: AOI achieves 66.3% best@5 success on all 86 tasks (24.4 points over SOTA), 42.9% avg@1 on 63 held-out tasks with unseen faults (surpassing Claude Sonnet 4.5), and improves avg@5 by 4.8 points with 35% variance reduction through failure learning.

Conclusion: AOI demonstrates effective automation of SRE operations through structured trajectory learning under security constraints, enabling safe deployment and continuous improvement from failures.

Abstract: Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed-Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state-of-the-art (41.9%) by 24.4 points. (2) Adding Observer GRPO training, a locally deployed 14B model reaches 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 by 4.8 points while reducing variance by 35%.

[637] Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang

Main category: cs.LG

TL;DR: Factored keys reduce KV cache size by exploiting asymmetry between selection (queries/keys) and value transfer, using low-dimensional keys via SVD factorization without requiring architecture redesign.

DetailsMotivation: Standard transformer attention uses identical dimensionality for queries, keys, and values, but selection (queries and keys) requires far fewer dimensions than value transfer. Current KV cache compression methods like GQA and MLA must be designed into the architecture before pretraining, limiting flexibility.

Method: Factorize each key projection W_K ≈ A_{d×r}B_{r×d} via truncated SVD, where r = d_select. Use W_K’ = A as new key projection for compact r-dimensional keys in cache, and absorb B^⊤ into query projection W_Q’ = W_QB^⊤ at zero cost since queries are never cached. Can be applied to existing models via SVD + QK fine-tuning.

Result: At 7B scale, training from scratch with r = d_model/4 matches full-attention perplexity (9.2 vs 9.3 PPL) while using 12% fewer parameters and training 8% faster. For existing models, achieves 75% key cache savings at ~2% quality cost on GPT-2 and Mistral-7B. Enables 25 GB KV cache savings per user for 7B model with 128K context, allowing ~60% more concurrent users.

Conclusion: Factored keys provide an effective post-hoc KV cache compression method that exploits the asymmetry between selection and value transfer, working with existing models without requiring architecture redesign, and composing with other compression techniques for substantial memory savings.

Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) – far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch – unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K’ = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q’ = W_Q B^\top$) at zero cost – since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.

[638] On-Policy Self-Distillation for Reasoning Compression

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

Main category: cs.LG

TL;DR: OPSDC is a self-distillation method that teaches reasoning models to be more concise by conditioning them on “be concise” instructions and minimizing reverse KL divergence on their own rollouts, achieving significant token reduction while improving accuracy.

DetailsMotivation: Reasoning models produce verbose outputs with much noise and redundancy, where unnecessary tokens can actually compound errors rather than help reasoning.

Method: On-Policy Self-Distillation for Reasoning Compression (OPSDC): condition the same model on a “be concise” instruction to get teacher logits, then minimize per-token reverse KL divergence on the student’s own rollouts. No ground-truth answers, token budgets, or difficulty estimators needed.

Result: Achieved 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute on Qwen3-8B and Qwen3-14B. On AIME 2024, the 14B model gained 10 points with 41% compression.

Conclusion: Self-distillation effectively teaches models to be more concise, automatically compressing easy problems aggressively while preserving deliberation for hard ones, revealing that much reasoning model output is not just redundant but actively harmful.

Abstract: Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a “be concise” instruction to obtain teacher logits, and minimize per-token reverse KL on the student’s own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant-it is actively harmful, compounding errors with every unnecessary token. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.

[639] Relaxed Efficient Acquisition of Context and Temporal Features

Yunni Qu, Dzung Dinh, Grant King, Whitney Ringwald, Bing Cai Kok, Kathleen Gates, Aidan Wright, Junier Oliva

Main category: cs.LG

TL;DR: REACT is an end-to-end differentiable framework that jointly optimizes selection of onboarding contextual descriptors and adaptive feature-time acquisition plans for longitudinal measurements under cost constraints in biomedical applications.

DetailsMotivation: Biomedical measurements incur costs and risks, requiring adaptive selection over time. Existing approaches don't jointly optimize onboarding context selection with temporally coupled acquisition decisions, despite real-world clinical workflows starting with an initial onboarding phase.

Method: REACT uses Gumbel-Sigmoid relaxation with straight-through estimation to enable gradient-based optimization over discrete acquisition masks, allowing direct backpropagation from prediction loss and acquisition cost in a unified framework.

Result: Across real-world longitudinal health and behavioral datasets, REACT achieves improved predictive performance at lower acquisition costs compared to existing longitudinal acquisition baselines.

Conclusion: Jointly modeling onboarding context selection and temporally coupled acquisition within a unified optimization framework provides practical benefits for biomedical applications with measurement constraints.

Abstract: In many biomedical applications, measurements are not freely available at inference time: each laboratory test, imaging modality, or assessment incurs financial cost, time burden, or patient risk. Longitudinal active feature acquisition (LAFA) seeks to optimize predictive performance under such constraints by adaptively selecting measurements over time, yet the problem remains inherently challenging due to temporally coupled decisions (missed early measurements cannot be revisited, and acquisition choices influence all downstream predictions). Moreover, real-world clinical workflows typically begin with an initial onboarding phase, during which relatively stable contextual descriptors (e.g., demographics or baseline characteristics) are collected once and subsequently condition longitudinal decision-making. Despite its practical importance, the efficient selection of onboarding context has not been studied jointly with temporally adaptive acquisition. We therefore propose REACT (Relaxed Efficient Acquisition of Context and Temporal features), an end-to-end differentiable framework that simultaneously optimizes (i) selection of onboarding contextual descriptors and (ii) adaptive feature–time acquisition plans for longitudinal measurements under cost constraints. REACT employs a Gumbel–Sigmoid relaxation with straight-through estimation to enable gradient-based optimization over discrete acquisition masks, allowing direct backpropagation from prediction loss and acquisition cost. Across real-world longitudinal health and behavioral datasets, REACT achieves improved predictive performance at lower acquisition costs compared to existing longitudinal acquisition baselines, demonstrating the benefit of modeling onboarding and temporally coupled acquisition within a unified optimization framework.

[640] Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Yuval Ran-Milo

Main category: cs.LG

TL;DR: Theoretical analysis shows attention sinks in softmax Transformers are functionally necessary for certain tasks, not just optimization artifacts, while ReLU attention can solve same tasks without sinks.

DetailsMotivation: To understand whether attention sinks in Transformers are merely optimization artifacts or functionally necessary components, and to analyze the role of normalization constraints in causing sink behavior.

Method: Theoretical analysis proving that computing trigger-conditional behavior necessarily induces sinks in softmax self-attention models, with concrete task instantiation. Comparison with non-normalized ReLU attention showing it can solve same tasks without sinks. Experimental validation extending beyond theoretical analysis.

Result: Proved that attention sinks are functionally necessary in softmax Transformers for certain tasks, with ReLU attention eliminating sinks while solving same tasks. Experiments confirmed softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.

Conclusion: Attention sinks in softmax Transformers are not just optimization artifacts but functionally necessary for certain tasks, driven fundamentally by normalization constraints. ReLU attention provides an alternative without sink behavior.

Abstract: Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax Transformers? We prove that, in some settings, it is the latter: computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when a designated trigger token appears, the model must return the average of all preceding token representations, and otherwise output zero, a task which mirrors the functionality of attention heads in the wild (Barbero et al., 2025; Guo et al., 2024). We also prove that non-normalized ReLU attention can solve the same task without any sink, confirming that the normalization constraint is the fundamental driver of sink behavior. Experiments validate our predictions and demonstrate they extend beyond the theoretically analyzed setting: softmax models develop strong sinks while ReLU attention eliminates them in both single-head and multi-head variants.

[641] Overcoming the Modality Gap in Context-Aided Forecasting

Vincent Zhihao Zheng, Étienne Marcotte, Arjun Ashok, Andrew Robert Williams, Lijun Sun, Alexandre Drouin, Valentina Zantedeschi

Main category: cs.LG

TL;DR: Semi-synthetic data augmentation method creates CAF-7M dataset with 7M context-augmented time series windows to improve context-aided forecasting by addressing poor context quality in existing datasets.

DetailsMotivation: Context-aided forecasting (CAF) aims to integrate domain knowledge and forward-looking information for better predictions, but multimodal models often underperform unimodal ones due to poor context quality in existing datasets where verification is challenging.

Method: Introduces a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories, enabling creation of CAF-7M dataset with 7 million context-augmented time series windows including a rigorously verified test set.

Result: Semi-synthetic pre-training transfers effectively to real-world evaluation, shows clear evidence of context utilization, and suggests dataset quality rather than architectural limitations has been the primary bottleneck in context-aided forecasting.

Conclusion: The semi-synthetic data augmentation approach successfully addresses the context quality problem in CAF, enabling better multimodal forecasting performance through improved dataset construction rather than architectural changes.

Abstract: Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.

[642] Enhanced Atrial Fibrillation Prediction in ESUS Patients with Hypergraph-based Pre-training

Yuzhang Xie, Yuhua Wu, Ruiyu Wang, Fadi Nahab, Xiao Hu, Carl Yang

Main category: cs.LG

TL;DR: Hypergraph-based pre-training strategies improve atrial fibrillation prediction in stroke patients by transferring knowledge from large stroke cohorts to smaller ESUS cohorts, addressing data scarcity and high-dimensional feature challenges.

DetailsMotivation: Atrial fibrillation (AF) is a major complication after embolic stroke of undetermined source (ESUS), increasing stroke recurrence and mortality risk. Current prediction tools have accuracy, scalability, and cost limitations. Machine learning approaches are hindered by small ESUS cohorts and high-dimensional medical features.

Method: Introduces supervised and unsupervised hypergraph-based pre-training strategies. First pre-trains hypergraph-based patient embedding models on a large stroke cohort (7,780 patients) to capture salient features and higher-order interactions. Transfers resulting embeddings to a smaller ESUS cohort (510 patients), reducing feature dimensionality while preserving clinically meaningful information for lightweight model prediction.

Result: Both pre-training approaches outperform traditional models trained on raw data, improving accuracy and robustness. The framework provides a scalable and efficient solution for AF risk prediction after stroke.

Conclusion: Hypergraph-based pre-training offers an effective approach to address data scarcity in medical ML applications, enabling accurate AF prediction in ESUS patients through knowledge transfer from larger cohorts.

Abstract: Atrial fibrillation (AF) is a major complication following embolic stroke of undetermined source (ESUS), elevating the risk of recurrent stroke and mortality. Early identification is clinically important, yet existing tools face limitations in accuracy, scalability, and cost. Machine learning (ML) offers promise but is hindered by small ESUS cohorts and high-dimensional medical features. To address these challenges, we introduce supervised and unsupervised hypergraph-based pre-training strategies to improve AF prediction in ESUS patients. We first pre-train hypergraph-based patient embedding models on a large stroke cohort (7,780 patients) to capture salient features and higher-order interactions. The resulting embeddings are transferred to a smaller ESUS cohort (510 patients), reducing feature dimensionality while preserving clinically meaningful information, enabling effective prediction with lightweight models. Experiments show that both pre-training approaches outperform traditional models trained on raw data, improving accuracy and robustness. This framework offers a scalable and efficient solution for AF risk prediction after stroke.

[643] Lipschitz-Based Robustness Certification Under Floating-Point Execution

Toby Murray

Main category: cs.LG

TL;DR: A formal framework for certifying neural network robustness under floating-point arithmetic, addressing the semantic gap between real arithmetic guarantees and actual floating-point execution.

DetailsMotivation: Existing neural network robustness certification methods assume exact real arithmetic, but deployed networks use floating-point arithmetic, creating a semantic gap where real arithmetic guarantees can fail under floating-point execution, especially at lower precision formats like float16.

Method: Develops a formal compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to floating-point execution sensitivity under standard rounding-error models, specialized for feed-forward neural networks with ReLU activations. Derives sound conditions for robustness under floating-point execution, including certificate degradation bounds and overflow prevention conditions.

Result: Provides concrete counterexamples showing real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers. Formalizes the theory and implements an executable certifier based on these principles, demonstrating practicality through empirical evaluation.

Conclusion: Addresses the critical semantic gap between theoretical robustness certification and practical floating-point execution, providing a formal framework for sound robustness guarantees that account for floating-point arithmetic effects in neural networks.

Abstract: Sensitivity-based robustness certification has emerged as a practical approach for certifying neural network robustness, including in settings that require verifiable guarantees. A key advantage of these methods is that certification is performed by concrete numerical computation (rather than symbolic reasoning) and scales efficiently with network size. However, as with the vast majority of prior work on robustness certification and verification, the soundness of these methods is typically proved with respect to a semantic model that assumes exact real arithmetic. In reality deployed neural network implementations execute using floating-point arithmetic. This mismatch creates a semantic gap between certified robustness properties and the behaviour of the executed system. As motivating evidence, we exhibit concrete counterexamples showing that real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers, with discrepancies becoming pronounced at lower-precision formats such as float16. We then develop a formal, compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to the sensitivity of floating-point execution under standard rounding-error models, specialised to feed-forward neural networks with ReLU activations. We derive sound conditions for robustness under floating-point execution, including bounds on certificate degradation and sufficient conditions for the absence of overflow. We formalize the theory and its main soundness results, and implement an executable certifier based on these principles, which we empirically evaluate to demonstrate its practicality.

[644] OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone

Main category: cs.LG

TL;DR: OrigamiBench: A benchmark for evaluating AI systems’ ability to integrate visual perception, causal reasoning, and sequential planning through physical origami folding tasks.

DetailsMotivation: Current AI systems lack understanding of causal mechanisms and constraints governing physical processes needed for planning and acting in the physical world. Existing benchmarks separate visual perception from programmatic reasoning, while origami provides a natural testbed integrating both modalities.

Method: Introduces OrigamiBench, an interactive benchmark where models iteratively propose folds and receive feedback on physical validity and similarity to target configurations. Tests modern vision-language models on multi-step folding strategies requiring integration of visual and language representations.

Result: Scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, showing visual and language representations remain weakly integrated.

Conclusion: OrigamiBench reveals limitations in current vision-language models’ ability to integrate perception and reasoning for physical world tasks, highlighting the need for better multimodal integration beyond simple scaling.

Abstract: Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.

[645] Artificial intelligence-enabled single-lead ECG for non-invasive hyperkalemia detection: development, multicenter validation, and proof-of-concept deployment

Gongzheng Tang, Qinghao Zhao, Guangkun Nie, Yujie Xiao, Shijia Geng, Donglin Xie, Shun Huang, Deyun Zhang, Xingchen Yao, Jinwei Wang, Kangyin Chen, Luxia Zhang, Shenda Hong

Main category: cs.LG

TL;DR: Pocket-K: AI-ECG system for non-invasive hyperkalemia screening using single-lead ECG data, achieving high diagnostic accuracy across validation sets.

DetailsMotivation: Hyperkalemia is a dangerous electrolyte disorder common in chronic kidney disease and heart failure patients, but frequent monitoring is difficult outside hospitals. There's a need for non-invasive, accessible screening tools.

Method: Developed Pocket-K, a single-lead AI-ECG system initialized from ECGFounder foundation model. Used 62,290 ECG-potassium pairs from 34,439 patients across two hospitals. Fine-tuned model using lead I data with internal development, temporal validation, and external validation sets.

Result: Achieved AUROCs of 0.936 (internal), 0.858 (temporal), and 0.808 (external) for hyperkalemia detection. For moderate-to-severe hyperkalemia, AUROCs increased to 0.940 and 0.861. External negative predictive value exceeded 99.3%. Handheld prototype enabled near-real-time inference.

Conclusion: Pocket-K demonstrates effective non-invasive hyperkalemia screening with high accuracy, supporting future deployment in handheld and wearable devices for accessible monitoring outside hospital settings.

Abstract: Hyperkalemia is a life-threatening electrolyte disorder that is common in patients with chronic kidney disease and heart failure, yet frequent monitoring remains difficult outside hospital settings. We developed and validated Pocket-K, a single-lead AI-ECG system initialized from the ECGFounder foundation model for non-invasive hyperkalemia screening and handheld deployment. In this multicentre observational study using routinely collected clinical ECG and laboratory data, 34,439 patients contributed 62,290 ECG–potassium pairs. Lead I data were used to fine-tune the model. Data from Peking University People’s Hospital were divided into development and temporal validation sets, and data from The Second Hospital of Tianjin Medical University served as an independent external validation set. Hyperkalemia was defined as venous serum potassium > 5.5 mmol/L. Pocket-K achieved AUROCs of 0.936 in internal testing, 0.858 in temporal validation, and 0.808 in external validation. For KDIGO-defined moderate-to-severe hyperkalemia (serum potassium >= 6.0 mmol/L), AUROCs increased to 0.940 and 0.861 in the temporal and external sets, respectively. External negative predictive value exceeded 99.3%. Model-predicted high risk below the hyperkalemia threshold was more common in patients with chronic kidney disease and heart failure. A handheld prototype enabled near-real-time inference, supporting future prospective evaluation in native handheld and wearable settings.

[646] Efficient Federated Conformal Prediction with Group-Conditional Guarantees

Haifeng Wen, Osvaldo Simeone, Hong Xing

Main category: cs.LG

TL;DR: Group-conditional federated conformal prediction (GC-FCP) provides uncertainty quantification with group-specific coverage guarantees in federated learning settings where data is distributed across clients with potentially overlapping groups.

DetailsMotivation: Deploying trustworthy AI systems requires principled uncertainty quantification. In federated settings where calibration data is distributed across multiple clients with local data distributions, and where data can be partitioned into overlapping groups (client-specific strata or demographic/semantic categories), there's a need for methods that provide group-conditional coverage guarantees while respecting data privacy and communication constraints.

Method: GC-FCP constructs mergeable, group-stratified coresets from local calibration scores. Clients communicate compact weighted summaries (coresets) that support efficient aggregation and calibration at the server. This enables group-conditional conformal prediction in federated settings while maintaining communication efficiency.

Result: Experiments on synthetic and real-world datasets validate that GC-FCP performs comparably to centralized calibration baselines while providing group-conditional coverage guarantees in federated settings.

Conclusion: GC-FCP enables principled uncertainty quantification with group-specific coverage guarantees in federated learning environments, addressing practical needs in healthcare, finance, and mobile sensing applications where data privacy and group fairness are important considerations.

Abstract: Deploying trustworthy AI systems requires principled uncertainty quantification. Conformal prediction (CP) is a widely used framework for constructing prediction sets with distribution-free coverage guarantees. In many practical settings, including healthcare, finance, and mobile sensing, the calibration data required for CP are distributed across multiple clients, each with its own local data distribution. In this federated setting, data can often be partitioned into, potentially overlapping, groups, which may reflect client-specific strata or cross-cutting attributes such as demographic or semantic categories. We propose group-conditional federated conformal prediction (GC-FCP), a novel protocol that provides group-conditional coverage guarantees. GC-FCP constructs mergeable, group-stratified coresets from local calibration scores, enabling clients to communicate compact weighted summaries that support efficient aggregation and calibration at the server. Experiments on synthetic and real-world datasets validate the performance of GC-FCP compared to centralized calibration baselines.

[647] High-Fidelity Compression of Seismic Velocity Models via SIREN Auto-Decoders

Caiyun Liu, Xiaoxue Luo, Jie Xiong

Main category: cs.LG

TL;DR: SIREN-based auto-decoder for neural compression of seismic velocity models achieving 19:1 compression with high reconstruction quality and enabling smooth interpolation and zero-shot super-resolution.

DetailsMotivation: Implicit Neural Representations (INRs) offer resolution-independent continuous signal representation. The paper aims to apply this to compress multi-structural seismic velocity models from OpenFWI benchmark for efficient storage and geophysical applications.

Method: Uses SIREN (Sinusoidal Representation Networks) auto-decoder framework to compress 70x70 velocity maps into 256-dimensional latent vectors. Evaluates on 1,000 samples across five geological families: FlatVel, CurveVel, FlatFault, CurveFault, and Style.

Result: Achieves 19:1 compression ratio with average PSNR of 32.47 dB and SSIM of 0.956. Demonstrates smooth latent space interpolation for generating intermediate velocity structures and zero-shot super-resolution up to 280x280 without additional training.

Conclusion: INR-based auto-decoders show strong potential for efficient storage, multi-scale analysis, and downstream geophysical applications like full waveform inversion through their resolution-independent representation capabilities.

Abstract: Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing continuous signals independently of grid resolution. In this paper, we propose a high-fidelity neural compression framework based on a SIREN (Sinusoidal Representation Networks) auto-decoder to represent multi-structural seismic velocity models from the OpenFWI benchmark. Our method compresses each 70x70 velocity map (4,900 points) into a compact 256-dimensional latent vector, achieving a compression ratio of 19:1. We evaluate the framework on 1,000 samples across five diverse geological families: FlatVel, CurveVel, FlatFault, CurveFault, and Style. Experimental results demonstrate an average PSNR of 32.47 dB and SSIM of 0.956, indicating high-quality reconstruction. Furthermore, we showcase two key advantages of our implicit representation: (1) smooth latent space interpolation that generates plausible intermediate velocity structures, and (2) zero-shot super-resolution capability that reconstructs velocity fields at arbitrary resolutions up to 280x280 without additional training. The results highlight the potential of INR-based auto-decoders for efficient storage, multi-scale analysis, and downstream geophysical applications such as full waveform inversion.

[648] GNNVerifier: Graph-based Verifier for LLM Task Planning

Yu Hao, Qiuyu Wang, Cheng Yang, Yawen Li, Zhiqiang Zhang, Chuan Shi

Main category: cs.LG

TL;DR: GNNVerifier: A graph neural network-based verifier for LLM task planning that detects structural flaws in plans and guides LLMs to correct them through local edits.

DetailsMotivation: LLM-generated task plans often suffer from hallucinations and are sensitive to long-context prompts. Existing LLM-based verifiers struggle with structural flaws like type mismatches, missing intermediates, or broken dependencies across plan steps.

Method: 1) Represent plans as directed graphs with enriched attributes (nodes=sub-tasks, edges=execution order/dependencies). 2) Use GNN for structural evaluation producing graph-level plausibility scores and node/edge-level risk scores. 3) Generate training data via controllable perturbations from ground truth plan graphs. 4) Guide LLMs to perform local edits (tool replacement/insertion) based on GNN feedback.

Result: Extensive experiments across diverse datasets, backbone LLMs, and planners show GNNVerifier achieves significant gains in improving plan quality compared to existing approaches.

Conclusion: Graph-based verification effectively addresses structural flaws in LLM-generated plans that LLM-based verifiers miss, enabling more reliable autonomous agents through improved plan correction.

Abstract: Large language models (LLMs) facilitate the development of autonomous agents. As a core component of such agents, task planning aims to decompose complex natural language requests into concrete, solvable sub-tasks. Since LLM-generated plans are frequently prone to hallucinations and sensitive to long-context prom-pts, recent research has introduced plan verifiers to identify and correct potential flaws. However, most existing approaches still rely on an LLM as the verifier via additional prompting for plan review or self-reflection. LLM-based verifiers can be misled by plausible narration and struggle to detect failures caused by structural relations across steps, such as type mismatches, missing intermediates, or broken dependencies. To address these limitations, we propose a graph-based verifier for LLM task planning. Specifically, the proposed method has four major components: Firstly, we represent a plan as a directed graph with enriched attributes, where nodes denote sub-tasks and edges encode execution order and dependency constraints. Secondly, a graph neural network (GNN) then performs structural evaluation and diagnosis, producing a graph-level plausibility score for plan acceptance as well as node/edge-level risk scores to localize erroneous regions. Thirdly, we construct controllable perturbations from ground truth plan graphs, and automatically generate training data with fine-grained annotations. Finally, guided by the feedback from our GNN verifier, we enable an LLM to conduct local edits (e.g., tool replacement or insertion) to correct the plan when the graph-level score is insufficient. Extensive experiments across diverse datasets, backbone LLMs, and planners demonstrate that our GNNVerifier achieves significant gains in improving plan quality. Our data and code is available at https://github.com/BUPT-GAMMA/GNNVerifier.

[649] Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting

Ziqing Ma, Kai Ying, Xinyue Gu, Tian Zhou, Tianyu Zhu, Haifan Zhang, Peisong Niu, Wang Zheng, Cong Bai, Liang Sun

Main category: cs.LG

TL;DR: Baguan-solar is a two-stage multimodal framework that fuses global weather foundation model forecasts with high-resolution satellite imagery for 24-hour solar irradiance forecasting at kilometer scale.

DetailsMotivation: Accurate day-ahead solar irradiance forecasting is crucial for solar energy grid integration, but current methods lack fine-scale resolution or degrade at longer lead times due to complex cloud dynamics and diurnal cycles.

Method: Two-stage multimodal framework: first forecasts day-night continuous intermediates (cloud cover) using Baguan weather foundation model, then infers irradiance by fusing Baguan forecasts with high-resolution geostationary satellite imagery to preserve both fine-scale cloud structures and large-scale constraints.

Result: Outperforms strong baselines (ECMWF IFS, vanilla Baguan, SolarSeer) over East Asia using CLDAS ground truth, reducing RMSE by 16.08% and better resolving cloud-induced transients. Operational deployment since July 2025 in an eastern Chinese province.

Conclusion: Baguan-solar effectively combines global weather foundation models with satellite imagery for accurate kilometer-scale solar irradiance forecasting, demonstrating practical utility for solar power grid integration.

Abstract: Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine-scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan-solar, a two-stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery to produce 24- hour irradiance forecasts at kilometer scale. Its decoupled two-stage design first forecasts day-night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine-scale cloud structures from satellite and large-scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan-solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud-induced transients. An operational deployment of Baguan-solar has supported solar power forecasting in an eastern province in China, since July 2025. Our code is accessible at https://github.com/DAMO-DI-ML/Baguansolar. git.

[650] Informative Perturbation Selection for Uncertainty-Aware Post-hoc Explanations

Sumedha Chugh, Ranjitha Prasad, Nazreen Shah

Main category: cs.LG

TL;DR: EAGLE is a post-hoc model-agnostic explanation framework that uses information-theoretic active learning to efficiently select perturbations for learning local linear surrogate models, providing feature importance scores with uncertainty estimates.

DetailsMotivation: The widespread deployment of opaque ML models creates trust and ethical concerns, necessitating reliable explanations. Post-hoc model-agnostic methods address this by approximating black-box model behavior locally, but need efficient perturbation selection.

Method: EAGLE formulates perturbation selection as an information-theoretic active learning problem, adaptively sampling perturbations that maximize expected information gain to efficiently learn linear surrogate explainable models with feature importance scores and uncertainty estimates.

Result: Theoretical analysis shows cumulative information gain scales as O(d log t) and sample complexity grows linearly with feature dimension d and logarithmically with confidence parameter. Empirical results on tabular and image datasets show improved explanation reproducibility, higher neighborhood stability, and better perturbation sample quality compared to state-of-the-art baselines.

Conclusion: EAGLE provides an efficient, theoretically-grounded approach to post-hoc model explanation with uncertainty quantification, addressing key challenges in explanation reliability and reproducibility for black-box ML models.

Abstract: Trust and ethical concerns due to the widespread deployment of opaque machine learning (ML) models motivating the need for reliable model explanations. Post-hoc model-agnostic explanation methods addresses this challenge by learning a surrogate model that approximates the behavior of the deployed black-box ML model in the locality of a sample of interest. In post-hoc scenarios, neither the underlying model parameters nor the training are available, and hence, this local neighborhood must be constructed by generating perturbed inputs in the neighborhood of the sample of interest, and its corresponding model predictions. We propose \emph{Expected Active Gain for Local Explanations} (\texttt{EAGLE}), a post-hoc model-agnostic explanation framework that formulates perturbation selection as an information-theoretic active learning problem. By adaptively sampling perturbations that maximize the expected information gain, \texttt{EAGLE} efficiently learns a linear surrogate explainable model while producing feature importance scores along with the uncertainty/confidence estimates. Theoretically, we establish that cumulative information gain scales as $\mathcal{O}(d \log t)$, where $d$ is the feature dimension and $t$ represents the number of samples, and that the sample complexity grows linearly with $d$ and logarithmically with the confidence parameter $1/δ$. Empirical results on tabular and image datasets corroborate our theoretical findings and demonstrate that \texttt{EAGLE} improves explanation reproducibility across runs, achieves higher neighborhood stability, and improves perturbation sample quality as compared to state-of-the-art baselines such as Tilia, US-LIME, GLIME and BayesLIME.

[651] Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization

Hideaki Iiduka

Main category: cs.LG

TL;DR: Muon optimizer with orthogonal updates converges to stationary points under heavy-tailed noise conditions and outperforms mini-batch SGD.

DetailsMotivation: Practical machine learning often involves heavy-tailed stochastic noise that violates standard bounded-variance assumptions, requiring optimization methods that can handle such challenging noise conditions while maintaining convergence guarantees.

Method: Muon optimizer enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, providing stable training even with heavy-tailed stochastic noise in nonconvex Hölder-smooth empirical risk minimization.

Result: Muon converges to a stationary point of the empirical risk under boundedness conditions accounting for heavy-tailed noise, and demonstrates faster convergence than mini-batch SGD.

Conclusion: Muon provides a robust optimization approach that handles heavy-tailed stochastic noise effectively while maintaining convergence guarantees and outperforming standard SGD methods.

Abstract: Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training in large-scale deep neural networks. Meanwhile, the previously reported results indicated that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk that works well with the heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under the boundedness condition accounting for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.

Gal Dalal, Assaf Hallak, Gal Chechik, Yftah Ziser

Main category: cs.LG

TL;DR: Analysis shows beam search can hurt output quality beyond a critical width due to scorer noise; maximum useful beam width depends exponentially on scorer’s signal-to-noise ratio.

DetailsMotivation: Prior work focused on inference efficiency of beam width selection without analyzing whether wider search can degrade output quality. The paper aims to determine when beam widening becomes harmful.

Method: Uses Extreme Value Theory to analyze beam selection over noisy scorer outputs, deriving a maximum useful beam width that depends on signal-to-noise ratio. Validates theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN dataset.

Result: Perplexity scoring (high noise) yields maximum useful beam width of 1 (no benefit from search), while PRM scoring (lower noise) yields maximum useful width ≥4 with gains up to 8.9 percentage points. Same model and algorithm but different scorers place critical width at opposite ends of the range.

Conclusion: Scorer’s signal-to-noise ratio is the key quantity governing beam width selection. Provides diagnostic indicators for choosing beam width in practice based on this analysis.

Abstract: Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency \citep{qin2025dsbd, freitag2017beam}, without analyzing whether wider search can \emph{hurt} output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width $\hat{k}$ beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: $\hat{k}$ grows exponentially with $(Δ/σ)^2$, where $Δ> 0$ is the quality advantage of correct paths over incorrect ones and $σ$ is the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search across three 7B-parameter models and ten domains on MR-BEN (5,975 questions). Perplexity scoring, with its high noise, yields $\hat{k} = 1$: search provides no benefit at any width tested. PRM scoring, with lower noise, yields $\hat{k} \geq 4$, with gains of up to 8.9 percentage points. The same model, the same algorithm, but different scorers place $\hat{k}$ at opposite ends of the beam width range. Our analysis identifies the scorer’s signal-to-noise ratio as the key quantity governing beam width selection, and we propose diagnostic indicators for choosing the beam width in practice.

[653] The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

Main category: cs.LG

TL;DR: PokeAgent Challenge is a large-scale benchmark for decision-making research using Pokemon’s multi-agent battle system and RPG environment, featuring two tracks: Battling (strategic reasoning under partial observability) and Speedrunning (long-horizon planning).

DetailsMotivation: Addresses limitations in existing benchmarks that simultaneously stress partial observability, game-theoretic reasoning, and long-horizon planning under realistic conditions. Few benchmarks combine all three challenges at scale.

Method: Creates two complementary tracks: 1) Battling Track with 20M+ battle trajectories and baselines (heuristic, RL, LLM-based), 2) Speedrunning Track with standardized RPG speedrunning evaluation framework and multi-agent orchestration system for LLM approaches.

Result: NeurIPS 2025 competition attracted 100+ teams; analysis shows Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites. Reveals gaps between LLM, RL, and elite human performance.

Conclusion: Pokemon presents an unsolved benchmark that can drive RL and LLM research forward. Transitioned to living benchmark with live leaderboard for Battling and self-contained evaluation for Speedrunning.

Abstract: We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon’s multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community’s interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

[654] Physics-Informed Neural Systems for the Simulation of EUV Electromagnetic Wave Diffraction from a Lithography Mask

Vasiliy A. Es’kin, Egor V. Ivanov

Main category: cs.LG

TL;DR: PINNs and neural operators applied to EUV lithography mask diffraction problems, with novel Waveguide Neural Operator achieving competitive accuracy and faster inference than traditional solvers.

DetailsMotivation: Accelerate design and optimization workflows for next-generation lithography masks by applying physics-informed neural networks and neural operators to solve electromagnetic wave diffraction problems, reducing computational time compared to traditional numerical solvers.

Method: Introduces hybrid Waveguide Neural Operator (WGNO) combining waveguide method with neural networks, replacing computationally expensive components. Compares PINNs and NOs against modern numerical solvers on problems with known exact solutions, focusing on 13.5nm and 11.2nm wavelengths.

Result: PINNs and neural operators achieve competitive accuracy with significantly reduced prediction times. WGNO reaches state-of-the-art performance with good generalization to unseen problem parameters, providing highly efficient solution for lithography mask design.

Conclusion: Neural operators and PINNs offer promising approach for accelerating EUV lithography mask design, with WGNO demonstrating strong performance and generalization capabilities for electromagnetic wave diffraction problems.

Abstract: Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from contemporary lithography masks are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, based on a waveguide method with its most computationally expensive components replaced by a neural network. To evaluate performance, the accuracy and inference time of PINNs and NOs are compared against modern numerical solvers for a series of problems with known exact solutions. The emphasis is placed on investigation of solution accuracy by considered artificial neural systems for 13.5 nm and 11.2 nm wavelengths. Numerical experiments on realistic 2D and 3D masks demonstrate that PINNs and neural operators achieve competitive accuracy and significantly reduced prediction times, with the proposed WGNO architecture reaching state-of-the-art performance. The presented neural operator has pronounced generalizing properties, meaning that for unseen problem parameters it delivers a solution accuracy close to that for parameters seen in the training dataset. These results provide a highly efficient solution for accelerating the design and optimization workflows of next-generation lithography masks.

[655] Predicting Biomedical Interactions with Probabilistic Model Selection for Graph Neural Networks

Kishan KC, Rui Li, Paribesh Regmi, Anne R. Haake

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2211.13231: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2211.13231&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[656] Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Cells

Bailey Andrew, Erica L. Harris, James A. Poulter, David R. Westhead, Luisa Cutillo

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: No method information available - paper content inaccessible due to HTTP 429 error

Result: No results available - unable to fetch paper summary from arXiv API

Conclusion: Cannot analyze paper due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2407.19892: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.19892&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[657] S2Act: Simple Spiking Actor

Ugur Akcal, Seung Hyun Kim, Mikihisa Yuasa, Hamid Osooli, Jiarui Sun, Ribhav Sahu, Mattia Gazzola, Huy T. Tran, Girish Chowdhary

Main category: cs.MA

TL;DR: S2Act: A lightweight framework for deploying RL policies using spiking neural networks that addresses gradient vanishing and hyperparameter sensitivity issues through rate-based approximation and parameter transfer to physical LIF neurons.

DetailsMotivation: SNNs are attractive for mobile robotics due to power efficiency, but face challenges in complex stochastic environments due to sensitivity to hyperparameters and inconsistent gradient signals. Existing approaches struggle with these issues.

Method: Three-step approach: (1) architect actor-critic model using approximated rate-based spiking neurons, (2) train with gradients using compatible activation functions, (3) transfer trained weights to physical LIF neuron parameters for inference. Globally shapes LIF parameters to approximate ReLU activations.

Result: Outperforms relevant baselines in task performance and real-time inference in multi-agent stochastic environments (capture-the-flag and parking). Successfully deployed on physical TurtleBot platforms using Intel’s Loihi neuromorphic hardware.

Conclusion: S2Act effectively mitigates vanishing gradient problem and reduces reliance on complex SNN hyperparameter tuning, showing potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.

Abstract: Spiking neural networks (SNNs) and biologically-inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)-SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor-critic model based on an approximated network of rate-based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into physical parameters of rate-based leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate-based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre-constraining LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning. We demonstrate our method in two multi-agent stochastic environments (capture-the-flag and parking) that capture the complexity of multi-robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel’s Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real-time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.

[658] Don’t Trust Stubborn Neighbors: A Security Framework for Agentic Networks

Samira Abedini, Sina Mavali, Lea Schönherr, Martin Pawelczyk, Rebekka Burkholz

Main category: cs.MA

TL;DR: LLM-based multi-agent systems are vulnerable to manipulation through opinion formation dynamics, where malicious agents can trigger persuasion cascades; a trust-adaptive defense mechanism is proposed to mitigate these security risks.

DetailsMotivation: LLM-based multi-agent systems are increasingly deployed for agentic tasks but their interactive nature introduces security risks where malicious agents can exploit communication channels to propagate misinformation and manipulate collective outcomes.

Method: Borrows the Friedkin-Johnsen opinion formation model from social sciences to create a theoretical framework for studying LLM-MAS. Proposes a trust-adaptive defense mechanism that dynamically adjusts inter-agent trust to limit adversarial influence while maintaining cooperative performance.

Result: Theoretical and empirical findings show that a single highly stubborn and persuasive agent can take over MAS dynamics, demonstrating high susceptibility to attacks. The trust-adaptive defense mechanism effectively defends against manipulation in extensive experiments across different network topologies and attack/defense scenarios.

Conclusion: LLM-based multi-agent systems are vulnerable to manipulation through opinion formation dynamics, but security can be enhanced through mechanisms like increasing benign agents, increasing agent stubbornness, reducing trust in adversaries, or implementing trust-adaptive defenses that dynamically adjust inter-agent trust.

Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MASs) are increasingly deployed for agentic tasks, such as web automation, itinerary planning, and collaborative problem solving. Yet, their interactive nature introduces new security risks: malicious or compromised agents can exploit communication channels to propagate misinformation and manipulate collective outcomes. In this paper, we study how such manipulation can arise and spread by borrowing the Friedkin-Johnsen opinion formation model from social sciences to propose a general theoretical framework to study LLM-MAS. Remarkably, this model closely captures LLM-MAS behavior, as we verify in extensive experiments across different network topologies and attack and defense scenarios. Theoretically and empirically, we find that a single highly stubborn and persuasive agent can take over MAS dynamics, underscoring the systems’ high susceptibility to attacks by triggering a persuasion cascade that reshapes collective opinion. Our theoretical analysis reveals three mechanisms to increase system security: a) increasing the number of benign agents, b) increasing the innate stubbornness or peer-resistance of agents, or c) reducing trust in potential adversaries. Because scaling is computationally expensive and high stubbornness degrades the network’s ability to reach consensus, we propose a new mechanism to mitigate threats by a trust-adaptive defense that dynamically adjusts inter-agent trust to limit adversarial influence while maintaining cooperative performance. Extensive experiments confirm that this mechanism effectively defends against manipulation.

[659] Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective

Noppanat Wadlom, Junyi Shen, Yao Lu

Main category: cs.MA

TL;DR: Helium is a workflow-aware serving framework for agentic LLM workflows that treats LLM calls as query operators and uses proactive caching and cache-aware scheduling to optimize cross-call dependencies and reduce redundancy.

DetailsMotivation: Agentic workflows composed of interdependent LLM calls have become dominant in AI systems, but existing serving systems like vLLM optimize individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies from redundant prompts and intermediate results.

Method: Helium models agentic workloads as query plans and treats LLM invocations as first-class operators. It integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows, bridging classic query optimization principles with LLM serving.

Result: Helium achieves up to 1.56x speedup over state-of-the-art agent serving systems on various workloads, demonstrating that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.

Conclusion: Workflow-aware serving with cross-call optimization is crucial for efficient LLM-based agents, and Helium’s approach of treating LLM workflows as query plans with caching and scheduling optimizations provides significant performance improvements over existing systems.

Abstract: Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.

[660] Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, Jae Kim

Main category: cs.MA

TL;DR: Multi-agent RL framework for UAV swarms using graph attention and CTDE for cooperative relay and adversarial tasks under partial observability and limited communication.

DetailsMotivation: Practical UAV swarm deployments face challenges of partial observability and intermittent peer-to-peer communication links, requiring decentralized coordination methods that can operate under these constraints.

Method: Graph-based multi-agent reinforcement learning with centralized training and decentralized execution (CTDE). Uses agent-entity attention for local state encoding and neighbor self-attention over distance-limited communication graphs to aggregate inter-UAV messages.

Result: Achieves 74% coverage with 5 UAVs and 10 nodes in cooperative relay task, competitive with MILP optimization upper bound. Generalizes to unseen team sizes without fine-tuning. Transfers to adversarial setting without architectural changes, improving win rates over non-communicating baselines.

Conclusion: The proposed graph-based MARL framework enables effective UAV swarm coordination under realistic constraints of partial observability and limited communication, with demonstrated performance in both cooperative and adversarial scenarios.

Abstract: Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent-entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based offline upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.

[661] CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Gengxin Sun, Ruihao Yu, Liangyi Yin, Yunqi Yang, Bin Zhang, Zhiwei Xu

Main category: cs.MA

TL;DR: CoMAI: A multi-agent interview framework using modular task decomposition with specialized agents for question generation, security, scoring, and summarization to improve robustness and fairness in AI-driven assessment.

DetailsMotivation: Addressing challenges in AI-driven interview assessment, particularly ensuring robustness and fairness, as monolithic single-agent LLM systems have limitations in security, bias reduction, and structured evaluation.

Method: Modular task-decomposition architecture with four specialized agents (question generation, security, scoring, summarization) coordinated through a centralized finite-state machine, featuring multi-layered security defenses, adaptive difficulty adjustment, and rubric-based structured scoring.

Result: Achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction, demonstrating robust performance in interview assessment scenarios.

Conclusion: CoMAI presents a robust, fair, and interpretable paradigm for AI-driven interview assessment that outperforms monolithic single-agent approaches through collaborative multi-agent specialization.

Abstract: Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.

[662] LOPT: Learning Optimal Pigovian Tax in Sequential Social Dilemmas

Yun Hua, Shang Gao, Wenhao Li, Haosheng Chen, Bo Jin, Xiangfeng Wang, Jun Luo, Hongyuan Zha

Main category: cs.MA

TL;DR: LOPT method applies Pigovian Tax from externality theory to multi-agent reinforcement learning to address social dilemmas by internalizing externalities through learned tax/subsidy policies.

DetailsMotivation: In multi-agent RL, individual accumulated rewards don't reflect how others perceive agents' actions, leading to selfish behaviors that harm global performance. The paper aims to apply externality theory to analyze and resolve social dilemmas in MARL.

Method: Proposes LOPT (Learning Optimal Pigovian Tax) method that introduces an additional agent to learn tax/allowance allocation policies approximating optimal Pigovian Tax. Uses reward shaping based on approximated optimal tax to reduce social cost and alleviate social dilemmas.

Result: LOPT achieves higher collective social welfare compared to state-of-the-art methods in both Escape Room and Cleanup environments, demonstrating superiority in solving social dilemmas.

Conclusion: Externality theory and Pigovian Tax can effectively address social dilemmas in MARL by internalizing externalities through learned tax policies, improving global performance.

Abstract: In multi-agent reinforcement learning, each agent acts to maximize its individual accumulated rewards. Nevertheless, individual accumulated rewards could not fully reflect how others perceive them, resulting in selfish behaviors that undermine global performance. The externality theory, defined as the activities of one economic actor affect the activities of another in ways that are not reflected in market transactions,'' is applicable to analyze the social dilemmas in MARL. One of its most profound non-market solutions, Pigovian Tax’’, which internalizes externalities by taxing those who create negative externalities and subsidizing those who create positive externalities, could aid in developing a mechanism to resolve MARL’s social dilemmas. The purpose of this paper is to apply externality theory to analyze social dilemmas in MARL. To internalize the externalities in MARL, the \textbf{L}earning \textbf{O}ptimal \textbf{P}igovian \textbf{T}ax method (LOPT), is proposed, where an additional agent is introduced to learn the tax/allowance allocation policy so as to approximate the optimal Pigovian Tax'' which accurately reflects the externalities for all agents. Furthermore, a reward shaping mechanism based on the approximated optimal Pigovian Tax’’ is applied to reduce the social cost of each agent and tries to alleviate the social dilemmas. Compared with existing state-of-the-art methods, the proposed LOPT leads to higher collective social welfare in both the Escape Room and the Cleanup environments, which shows the superiority of our method in solving social dilemmas.

[663] MetaCrit: A Critical Thinking Framework for Self-Regulated LLM Reasoning

Xinmeng Hou, Ziting Chang, Zhouquan Lu, Chen Wenli, Liang Wan, Wei Feng, Hai Hu, Qing Guo

Main category: cs.MA

TL;DR: MetaCrit is a multi-agent framework for improving LLM reasoning by implementing metacognitive regulation through monitoring, control, and synthesis agents.

DetailsMotivation: LLMs fail on over one-third of multi-hop questions with counterfactual premises and remain vulnerable to adversarial prompts that trigger biased or factually incorrect responses, exposing a fundamental deficit in self-regulated reasoning.

Method: MetaCrit decomposes reasoning regulation into four agents: object-level generation, a monitoring agent that assesses response validity, a control agent that critiques logical soundness, and a meta-level synthesizer that integrates all signals into a final response.

Result: Evaluation across eight benchmarks, four model backbones, and a college-level analytical writing study shows that MetaCrit significantly improves content truthfulness and logical soundness while eliminating toxic outputs.

Conclusion: MetaCrit’s modular design allows individual agents to be integrated into existing frameworks as drop-in components without architectural modifications, providing an effective approach to improving LLM reasoning through metacognitive regulation.

Abstract: Large language models (LLMs) fail on over one-third of multi-hop questions with counterfactual premises and remain vulnerable to adversarial prompts that trigger biased or factually incorrect responses, which exposes a fundamental deficit in self-regulated reasoning. We propose \textbf{MetaCrit}, a multi-agent framework grounded in Nelson and Narens’ metacognitive regulation theory. MetaCrit decomposes reasoning regulation into four agents: object-level generation, a \emph{monitoring} agent that assesses response validity, a \emph{control} agent that critiques logical soundness, and a meta-level synthesizer that integrates all signals into a final response. Evaluation across eight benchmarks, four model backbones, and a college-level analytical writing study shows that MetaCrit significantly improves content truthfulness and logical soundness while eliminating toxic outputs. Its modular design allows individual agents to be integrated into existing frameworks as drop-in components without architectural modifications.

[664] COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability

Churong Liang, Jinling Gan, Kairan Hong, Qiushi Tian, Zongze Wu, Runnan Li

Main category: cs.MA

TL;DR: COCO is a framework for asynchronous self-monitoring and adaptive error correction in multi-agent systems that prevents error cascading through decoupled architecture and three algorithmic innovations.

DetailsMotivation: Large-scale multi-agent systems suffer from error cascading where downstream agents exacerbate upstream inaccuracies without intermediate verification, leading to significant quality degradation.

Method: COCO uses a decoupled architecture isolating error detection from execution path, with three key innovations: Contextual Rollback Mechanism for informed state recovery, Bidirectional Reflection Protocol to prevent oscillatory control loops, and Heterogeneous Cross-Validation Mechanism using ensemble disagreement to identify bias.

Result: COCO achieves 6.5% average performance improvement across diverse benchmarks and reaches 95.1% of large-model performance with 30× parameter reduction.

Conclusion: COCO establishes a practical, annotation-based solution for critical autonomous domains with efficient, high-reliability deployment potential.

Abstract: A critical limitation in large-scale multi-agent systems is the cascading of errors. And without intermediate verification, downstream agents exacerbate upstream inaccuracies, resulting in significant quality degradation. To bridge this gap, we introduce \textbf{COCO} (\textbf{C}ognitive \textbf{O}perating System with \textbf{C}ontinuous \textbf{O}versight), a theoretically grounded framework for asynchronous self-monitoring and adaptive error correction in multi-agent systems. COCO reconciles the fundamental tension between quality assurance and computational efficiency via a novel decoupled architecture. This design isolates error detection from the critical execution path and incorporates an automated configuration engine to minimize deployment complexity. The framework relies on three algorithmic innovations to mitigate both systematic and stochastic errors: (1) a Contextual Rollback Mechanism that leverages execution history for informed state recovery rather than naive retries; (2) a Bidirectional Reflection Protocol to ensure convergence and prevent oscillatory control loops; and (3) a Heterogeneous Cross-Validation Mechanism that utilizes ensemble disagreement to identify bias and hallucinations. Extensive experiments on diverse benchmarks demonstrate that COCO delivers a 6.5% average performance improvement. Notably, the framework achieves 95.1% of large-model performance with a 30$\times$ parameter reduction, confirming the potential for efficient, high-reliability deployment, and establishing COCO as a practical, annotation-based solution for critical autonomous domains.

[665] MACRO-LLM: LLM-Empowered Multi-Agent Collaborative Reasoning under Spatiotemporal Partial Observability

Handi Chen, Running Zhao, Xiuzhe Wu, Edith C. H. Ngai

Main category: cs.MA

TL;DR: MACRO-LLM: A multi-agent LLM framework for collaborative reasoning under spatiotemporal partial observability, using interleaved spatial-temporal reasoning modules for coordination in distributed environments.

DetailsMotivation: LLM agents in real-world distributed systems face spatiotemporal partial observability - spatial limitations (local perception) and temporal limitations (finite horizons) are fundamentally coupled, requiring integrated reasoning to enable effective coordination.

Method: Three interdependent modules: (1) CoProposer verifies candidate actions via predictive rollouts to mitigate temporal uncertainty; (2) Negotiator resolves spatial conflicts through mean-field statistical aggregation using rollout rewards; (3) Introspector analyzes environmental drift and attributes performance changes to refine strategies.

Result: Extensive evaluations on cooperative platoon planning and pandemic control tasks demonstrate robust coordination under spatiotemporal partial observability.

Conclusion: MACRO-LLM effectively addresses the coupled spatiotemporal partial observability challenge in distributed LLM agents through integrated reasoning modules, enabling practical multi-agent coordination in complex real-world scenarios.

Abstract: Large Language Model (LLM) agents deployed in complex real-world scenarios increasingly operate as spatially distributed entities. However, this physical dispersion constrains agents to limited local perception and finite temporal horizons. We characterize this bottleneck as spatiotemporal partial observability, where spatial and temporal limitations are fundamentally coupled: resolving spatial conflicts requires temporal reasoning about neighbors’ future actions, while temporal planning requires spatial context beyond local perception. To bridge this gap, we introduce MACRO-LLM, LLM-empowered multi-agent collaborative reasoning under spatiotemporal partial observability. The architecture interleaves spatial and temporal reasoning within each decision cycle via three interdependent modules: (1) the CoProposer mitigates temporal uncertainty by verifying candidate actions via predictive rollouts; (2) the Negotiator overcomes spatial myopia by resolving conflicts through mean-field statistical aggregation, grounded in the CoProposer’s rollout rewards; and (3) the Introspector closes the reasoning loop by analyzing environmental drift and attributing performance changes to refine strategies. Extensive evaluations on two complex long-horizon tasks, cooperative platoon planning and pandemic control, demonstrate that our framework enables robust coordination under spatiotemporal partial observability.

[666] Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Shan Yang, Yang Liu

Main category: cs.MA

TL;DR: DG-PG uses differentiable analytical models to provide noise-free gradient signals in multi-agent RL, reducing variance from Θ(N) to O(1) and achieving scale-invariant sample complexity.

DetailsMotivation: Cross-agent noise in cooperative multi-agent RL grows with the number of agents, leading to poor scaling properties (Θ(N) gradient variance). Many real-world domains have differentiable analytical models that could provide efficient system state guidance.

Method: Descent-Guided Policy Gradient (DG-PG) framework that utilizes differentiable analytical models to provide each agent with a noise-free gradient signal, decoupling each agent’s gradient from the actions of all other agents.

Result: DG-PG reduces gradient variance from Θ(N) to O(1), achieves agent-independent sample complexity O(1/ε), and converges within 10 episodes on cloud scheduling tasks with up to 200 agents while MAPPO and IPPO fail.

Conclusion: DG-PG enables scalable multi-agent RL by leveraging domain-specific analytical models to provide noise-free gradients, overcoming fundamental scaling limitations of traditional cooperative MARL approaches.

Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, the actions of all $N$ agents jointly determine each agent’s learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $Θ(N)$, yielding sample complexity $\mathcal{O}(N/ε)$. We observe that many domains, including cloud computing, transportation, and power systems, have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that utilizes these analytical models to provide each agent with a noise-free gradient signal, decoupling each agent’s gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $Θ(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/ε)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale, from $N{=}5$ to $N{=}200$, directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.

cs.MM

[667] DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression

Bingzhou Li, Tao Huang

Main category: cs.MM

TL;DR: DASH is a training-free compression framework for OmniLLMs that uses audio-driven semantic chunking to dynamically align token compression with semantic structure, achieving higher compression ratios while maintaining accuracy.

DetailsMotivation: OmniLLMs process long audio-visual token sequences making inference expensive. Existing compression methods use fixed window partitioning and attention-based pruning, which ignore semantic structure and become fragile under aggressive token reduction.

Method: Uses audio embeddings as semantic anchor to detect boundary candidates via cosine-similarity discontinuities, creating dynamic variable-length segments. Projects boundaries onto video tokens for cross-modal segmentation. Within segments, token retention uses tri-signal importance estimator fusing structural boundary cues, representational distinctiveness, and attention-based salience.

Result: Extensive experiments on AVUT, VideoMME, and WorldSense show DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods.

Conclusion: DASH provides effective training-free compression for OmniLLMs by aligning token reduction with semantic structure through audio-driven chunking and multi-signal importance estimation.

Abstract: Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.

[668] Visual Set Program Synthesizer

Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

Main category: cs.MM

TL;DR: A visual program synthesis approach for set-based reasoning in visual question answering, outperforming standard MLLMs on complex reasoning tasks.

DetailsMotivation: Current visual AI assistants struggle with complex set-based reasoning queries (like filtering, comparison, aggregation) that require compositional logic beyond simple object recognition.

Method: Treat visual reasoning as Visual Program Synthesis: first generate symbolic programs, then execute them with a separate engine grounded in visual scenes. Also introduce Set-VQA benchmark for evaluation.

Result: Significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy.

Conclusion: Program-driven reasoning provides a principled alternative to black-box visual-language inference for complex set-based visual reasoning tasks.

Abstract: A user pointing their phone at a supermarket shelf and asking “Which soda has the least sugar?” poses a difficult challenge for current visual Al assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard endto-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.

[669] Hyperbolic Multimodal Generative Representation Learning for Generalized Zero-Shot Multimodal Information Extraction

Baohang Zhou, Kehui Song, Rize Jin, Yu Zhao, Xuhui Sui, Xinying Qian, Xingyue Guo, Ying Zhang

Main category: cs.MM

TL;DR: HMGRL: Hyperbolic multimodal generative representation learning framework for generalized zero-shot multimodal information extraction, addressing both seen and unseen categories by modeling hierarchical semantic correlations in hyperbolic space.

DetailsMotivation: Existing zero-shot multimodal information extraction (ZS-MIE) models only handle unseen categories but struggle with real-world scenarios containing both seen and unseen categories. Current methods use Euclidean space representations that fail to capture hierarchical semantic relationships between modalities and categories, and suffer from distribution gaps between seen/unseen category sets.

Method: Proposes HMGRL framework using hyperbolic space to model multi-level hierarchical semantic correlations. Reconstructs variational information bottleneck and autoencoder networks in hyperbolic space. Trains with generated unseen samples and introduces semantic similarity distribution alignment loss to enhance generalization.

Result: Experimental evaluations on two benchmark datasets show HMGRL outperforms existing baseline methods for generalized zero-shot multimodal information extraction.

Conclusion: HMGRL effectively addresses generalized zero-shot MIE by capturing hierarchical semantic relationships in hyperbolic space and aligning semantic similarity distributions between seen and unseen categories, demonstrating superior performance over existing methods.

Abstract: Multimodal information extraction (MIE) constitutes a set of essential tasks aimed at extracting structural information from Web texts with integrating images, to facilitate the structural construction of Web-based semantic knowledge. To address the expanding category set including newly emerging entity types or relations on websites, prior research proposed the zero-shot MIE (ZS-MIE) task which aims to extract unseen structural knowledge with textual and visual modalities. However, the ZS-MIE models are limited to recognizing the samples that fall within the unseen category set, and they struggle to deal with real-world scenarios that encompass both seen and unseen categories. The shortcomings of existing methods can be ascribed to two main aspects. On one hand, these methods construct representations of samples and categories within Euclidean space, failing to capture the hierarchical semantic relationships between the two modalities within a sample and their corresponding category prototypes. On the other hand, there is a notable gap in the distribution of semantic similarity between seen and unseen category sets, which impacts the generative capability of the ZS-MIE models. To overcome the disadvantages, we delve into the generalized zero-shot MIE (GZS-MIE) task and propose the hyperbolic multimodal generative representation learning framework (HMGRL). The variational information bottleneck and autoencoder networks are reconstructed with hyperbolic space for modeling the multi-level hierarchical semantic correlations among samples and prototypes. Furthermore, the proposed model is trained with the unseen samples generated by the decoder, and we introduce the semantic similarity distribution alignment loss to enhance the model’s generalization performance. Experimental evaluations on two benchmark datasets underscore the superiority of HMGRL compared to existing baseline methods.

[670] Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits

Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.MM

TL;DR: A pipeline for joint audio-visual editing that first edits video, then generates aligned audio using a conditional video-to-audio generation model that adapts source audio influence based on edit complexity.

DetailsMotivation: Current video editing techniques often create visual changes that break audio-visual coherence, requiring audio editing to match the edited video content and maintain multimodal consistency.

Method: Two-stage pipeline: 1) Apply video editing to produce target video, 2) Use novel video-to-audio generation model conditioned on source audio, target video, and text prompt. Model incorporates conditional audio input, uses data augmentation for training efficiency, and dynamically adjusts source audio influence based on edit complexity.

Result: Method outperforms existing approaches in maintaining audio-visual alignment and content integrity, demonstrating effective joint audio-visual editing capabilities.

Conclusion: The proposed pipeline successfully addresses audio-visual coherence in editing tasks through a conditional generation approach that adapts to edit complexity while preserving original audio structure when possible.

Abstract: We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.

eess.AS

[671] Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Jaesung Bae, Xiuwen Zheng, Minje Kim, Chang D. Yoo, Mark Hasegawa-Johnson

Main category: eess.AS

TL;DR: A three-stage framework for dysarthric speech quality assessment using pseudo-labeling and contrastive learning to overcome data scarcity, achieving state-of-the-art performance across multiple languages and etiologies.

DetailsMotivation: Dysarthric speech quality assessment (DSQA) is clinically important but suffers from costly subjective evaluation and limited labeled data, making robust objective modeling challenging.

Method: Three-stage framework: 1) Teacher model generates pseudo-labels for unlabeled dysarthric speech, 2) Weakly supervised pretraining with label-aware contrastive learning using diverse speakers and acoustic conditions, 3) Fine-tuning for downstream DSQA task.

Result: The approach demonstrates robustness across five unseen datasets spanning multiple etiologies and languages. Whisper-based baseline outperforms SOTA DSQA predictors like SpICE, with full framework achieving average SRCC of 0.761 across unseen test datasets.

Conclusion: The proposed framework effectively addresses data scarcity in DSQA by leveraging unlabeled speech and large-scale typical speech datasets, enabling scalable and robust objective assessment of dysarthric speech quality.

Abstract: Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.

[672] AILive Mixer: A Deep Learning based Zero Latency Automatic Music Mixer for Live Music Performances

Devansh Zurale, Iris Lorente, Michael Lester, Alex Mitchell

Main category: eess.AS

TL;DR: Deep learning system for automatic multitrack music mixing in live performances, handling acoustic bleeds and zero latency requirements

DetailsMotivation: Live music mixing faces unique challenges: acoustic bleeds from co-located instruments and strict latency requirements for audio-visual synchronization. Existing automatic mixing systems focus on offline production with isolated tracks, leaving a gap for live performance applications.

Method: End-to-end deep learning system designed specifically for live performances. The system predicts mono gains for multitrack input while handling corrupted channels with acoustic bleeds. Architecture emphasizes zero-latency processing to meet live performance requirements.

Result: First end-to-end deep learning system developed for live music performances. Successfully addresses the challenges of handling acoustic bleeds and achieving zero latency. System currently predicts mono gains but is designed for easy adaptation to predict other mixing parameters.

Conclusion: This work pioneers automatic music mixing for live performances using deep learning, solving critical challenges of acoustic bleeds and latency. The system provides a foundation for future work on predicting additional mixing parameters beyond mono gains.

Abstract: In this work, we present a deep learning-based automatic multitrack music mixing system catered towards live performances. In a live performance, channels are often corrupted with acoustic bleeds of co-located instruments. Moreover, audio-visual synchronization is of critical importance thus putting a tight constraint on the audio latency. In this work we primarily tackle these two challenges of handling bleeds in the input channels to produce the music mix with zero latency. Although there have been several developments in the field of automatic music mixing in recent times, most or all previous works focus on offline production for isolated instrument signals and to the best of our knowledge, this is the first end-to-end deep learning system developed for live music performances. Our proposed system currently predicts mono gains for a multitrack input, but its design along with the precedent set in past works, allows for easy adaptation to future work of predicting other relevant music mixing parameters.

[673] Robust Generative Audio Quality Assessment: Disentangling Quality from Spurious Correlations

Kuan-Tang Huang, Chien-Chun Wang, Cheng-Yeh Yang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

Main category: eess.AS

TL;DR: Domain adversarial training with aspect-specific domain definitions improves MOS prediction by disentangling true quality perception from dataset-specific acoustic biases.

DetailsMotivation: Automatic MOS prediction models for AIGC suffer from data scarcity and learn spurious correlations like dataset-specific acoustic signatures instead of generalized quality features, limiting their generalization to unseen generative scenarios.

Method: Leverage domain adversarial training (DAT) to disentangle true quality perception from nuisance factors. Systematically investigate domain definition strategies from explicit metadata-driven labels to implicit data-driven clusters, finding optimal aspect-specific domain strategies for different MOS evaluation aspects.

Result: The aspect-specific domain strategy effectively mitigates acoustic biases, significantly improves correlation with human ratings, and achieves superior generalization on unseen generative scenarios compared to static domain priors.

Conclusion: There is no “one-size-fits-all” domain definition for MOS prediction; optimal strategy depends on the specific MOS aspect being evaluated. Domain adversarial training with aspect-specific domain definitions successfully addresses data scarcity and spurious correlation issues in AIGC quality assessment.

Abstract: The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations– such as dataset-specific acoustic signatures– rather than generalized quality features. To address this, we leverage domain adversarial training (DAT) to disentangle true quality perception from these nuisance factors. Unlike prior works that rely on static domain priors, we systematically investigate domain definition strategies ranging from explicit metadata-driven labels to implicit data-driven clusters. Our findings reveal that there is no “one-size-fits-all” domain definition; instead, the optimal strategy is highly dependent on the specific MOS aspect being evaluated. Experimental results demonstrate that our aspect-specific domain strategy effectively mitigates acoustic biases, significantly improving correlation with human ratings and achieving superior generalization on unseen generative scenarios.

[674] Speakers Localization Using Batch EM In Unfolding Neural Network

Rina Veler, Sharon Gannot

Main category: eess.AS

TL;DR: Interpretable Batch-EM Unfolded Network for robust speaker localization using encoder-EM-decoder architecture to improve convergence and reduce initialization sensitivity

DetailsMotivation: Classical Batch-EM for speaker localization suffers from initialization sensitivity and convergence issues, especially in challenging acoustic environments like reverberant conditions

Method: Embed iterative EM procedure within an encoder-EM-decoder architecture, unfolding the EM algorithm into a neural network structure to learn better initialization and convergence properties

Result: Superior accuracy and robustness compared to classical Batch-EM, particularly in reverberant conditions

Conclusion: The unfolded network approach effectively addresses initialization sensitivity and convergence issues in speaker localization, providing a more robust solution for real-world acoustic environments

Abstract: We propose an interpretable Batch-EM Unfolded Network for robust speaker localization. By embedding the iterative EM procedure within an encoder-EM-decoder architecture, the method mitigates initialization sensitivity and improves convergence. Experiments show superior accuracy and robustness over the classical Batch-EM in reverberant conditions.

[675] HRTF-guided Binaural Target Speaker Extraction with Real-World Validation

Yoav Ellinson, Sharon Gannot

Main category: eess.AS

TL;DR: HRTF-guided binaural target speaker extraction framework that uses head-related transfer functions as spatial priors to preserve spatial cues while enhancing speech quality, with cross-listener generalization capability.

DetailsMotivation: Conventional target speaker extraction methods based on DOA estimation or enrollment signals often distort perceived spatial location. There's a need for methods that preserve binaural spatial cues while extracting target speakers from mixtures.

Method: Multi-channel deep blind source separation backbone adapted to binaural TSE, conditioned on HRTF-derived spatial information. Trained on measured HRTFs from diverse population for cross-listener generalization.

Result: Validated through simulations and real recordings from head and torso simulator (HATS). Preserves binaural cues while enhancing speech quality and intelligibility.

Conclusion: HRTF-guided framework effectively extracts target speakers while maintaining spatial fidelity, offering cross-listener generalization without subject-specific tuning.

Abstract: This paper presents a Head-Related Transfer Function (HRTF)-guided framework for binaural Target Speaker Extraction (TSE) from mixtures of concurrent sources. Unlike conventional TSE methods based on Direction of Arrival (DOA) estimation or enrollment signals, which often distort perceived spatial location, the proposed approach leverages the listener’s HRTF as an explicit spatial prior. The proposed framework is built upon a multi-channel deep blind source separation backbone, adapted to the binaural TSE setting. It is trained on measured HRTFs from a diverse population, enabling cross-listener generalization rather than subject-specific tuning. By conditioning the extraction on HRTF-derived spatial information, the method preserves binaural cues while enhancing speech quality and intelligibility. The performance of the proposed framework is validated through simulations and real recordings obtained from a head and torso simulator (HATS).

[676] Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives

Hexin Liu, Haoyang Zhang, Qiquan Zhang, Xiangyu Zhang, Dongyuan Shi, Eng Siong Chng, Haizhou Li

Main category: eess.AS

TL;DR: Systematic analysis of code-switching ASR from model and data perspectives, comparing algorithmic methods and proposing SECT prompting for LLM-based code-switching text generation to improve ASR performance.

DetailsMotivation: Code-switching ASR faces challenges from language confusion, accent bias, and scarcity of annotated code-switching data, even when constituent languages are high-resource individually.

Method: 1) Compare state-of-the-art algorithmic methods including language-specific processing and auxiliary language-aware multi-task learning; 2) Investigate TTS as data augmentation; 3) Propose SECT (simplified equivalence constraint theory) prompting strategy to guide LLMs in generating linguistically valid code-switching text for speech-text pair generation.

Result: SECT outperforms existing methods in ASR performance and linguistic quality assessments, generating code-switching text that more closely resembles real-world code-switching. When used to generate speech-text pairs via TTS, SECT proves effective in improving CS-ASR performance.

Conclusion: Effective CS-ASR requires strategies to be carefully aligned with specific linguistic characteristics of code-switching data, with both model-centric and data-centric approaches playing important roles.

Abstract: Code-switching automatic speech recognition (CS-ASR) presents unique challenges due to language confusion introduced by spontaneous intra-sentence switching and accent bias that blurs the phonetic boundaries. Although the constituent languages may be individually high-resource, the scarcity of annotated code-switching data further compounds these challenges. In this paper, we systematically analyze CS-ASR from both model-centric and data-centric perspectives. By comparing state-of-the-art algorithmic methods, including language-specific processing and auxiliary language-aware multi-task learning, we discuss their varying effectiveness across datasets with different linguistic characteristics. On the data side, we first investigate TTS as a data augmentation method. By varying the textual characteristics and speaker accents, we analyze the impact of language confusion and accent bias on CS-ASR. To further mitigate data scarcity and enhance textual diversity, we propose a prompting strategy by simplifying the equivalence constraint theory (SECT) to guide large language models (LLMs) in generating linguistically valid code-switching text. The proposed SECT outperforms existing methods in ASR performance and linguistic quality assessments, generating code-switching text that more closely resembles real-world code-switching text. When used to generate speech-text pairs via TTS, SECT proves effective in improving CS-ASR performance. Our analysis of both model- and data-centric methods underscores that effective CS-ASR requires strategies to be carefully aligned with the specific linguistic characteristics of the code-switching data.

[677] Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS

Haoyu Li, Mingyang Han, Yu Xi, Dongxiao Wang, Hankun Wang, Haoxiang Shi, Boyu Li, Jun Song, Bo Zheng, Shuai Wang, Kai Yu

Main category: eess.AS

TL;DR: TLA-SA improves speaker similarity in flow-matching TTS systems by adaptively aligning speaker representations across time steps and network layers.

DetailsMotivation: Flow-matching TTS systems have high-quality synthesis but lack explicit speaker supervision, leading to underexplored speaker representation capabilities. The authors aim to improve speaker consistency in zero-shot TTS.

Method: Proposes Time-Layer Adaptive Speaker Alignment (TLA-SA) that analyzes speaker information distribution across time steps and network layers, then adaptively aligns speaker representations using both temporal and hierarchical variations.

Result: TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets, and generalizes well across diverse model architectures including decoder-only LM-based and free TTS systems.

Conclusion: Adaptive speaker alignment across time and layers effectively enhances speaker consistency in flow-matching TTS, addressing the lack of explicit speaker supervision in FM frameworks.

Abstract: Flow-Matching (FM)-based zero-shot text-to-speech (TTS) systems exhibit high-quality speech synthesis and robust generalization capabilities. However, the speaker representation ability of such systems remains underexplored, primarily due to the lack of explicit speaker-specific supervision in the FM framework. To this end, we conduct an empirical analysis of speaker information distribution and reveal its non-uniform allocation across time steps and network layers, underscoring the need for adaptive speaker alignment. Accordingly, we propose Time-Layer Adaptive Speaker Alignment (TLA-SA), a strategy that enhances speaker consistency by jointly leveraging temporal and hierarchical variations. Experimental results show that TLA-SA substantially improves speaker similarity over baseline systems on both research- and industrial-scale datasets and generalizes well across diverse model architectures, including decoder-only language model (LM)-based and free TTS systems. A demo is provided.

eess.IV

[678] Standardizing Medical Images at Scale for AI

Callen MacPhee, Yiming Zhou, Koichiro Kishima, Bahram Jalali

Main category: eess.IV

TL;DR: Physics-based preprocessing framework (PhyCV) uses optical physics transformations to standardize medical images, improving out-of-distribution generalization for breast cancer classification from 70.8% to 90.9% accuracy.

DetailsMotivation: Deep learning models for medical image analysis suffer from performance degradation due to domain shifts caused by differences in imaging hardware, staining protocols, and acquisition conditions across institutions.

Method: PhyCV framework models images as spatially varying optical fields that undergo virtual diffractive propagation followed by coherent phase detection, suppressing non-semantic variability while preserving diagnostically relevant features.

Result: On Camelyon17-WILDS benchmark, PhyCV preprocessing improved out-of-distribution breast-cancer classification accuracy from 70.8% (baseline) to 90.9%, matching or exceeding data-augmentation and domain-generalization approaches with negligible computational cost.

Conclusion: PhyCV establishes a generalizable data refinery for medical imaging that harmonizes heterogeneous datasets through first-principles physics, improving robustness, interpretability, and reproducibility in clinical AI systems.

Abstract: Deep learning has achieved remarkable success in medical image analysis, yet its performance remains highly sensitive to the heterogeneity of clinical data. Differences in imaging hardware, staining protocols, and acquisition conditions produce substantial domain shifts that degrade model generalization across institutions. Here we present a physics-based data preprocessing framework based on the PhyCV (Physics-Inspired Computer Vision) family of algorithms, which standardizes medical images through deterministic transformations derived from optical physics. The framework models images as spatially varying optical fields that undergo a virtual diffractive propagation followed by coherent phase detection. This process suppresses non-semantic variability such as color and illumination differences while preserving diagnostically relevant texture and structural features. When applied to histopathological images from the Camelyon17-WILDS benchmark, PhyCV preprocessing improves out-of-distribution breast-cancer classification accuracy from 70.8% (Empirical Risk Minimization baseline) to 90.9%, matching or exceeding data-augmentation and domain-generalization approaches at negligible computational cost. Because the transform is physically interpretable, parameterizable, and differentiable, it can be deployed as a fixed preprocessing stage or integrated into end-to-end learning. These results establish PhyCV as a generalizable data refinery for medical imaging-one that harmonizes heterogeneous datasets through first-principles physics, improving robustness, interpretability, and reproducibility in clinical AI systems.

[679] Preserving Vertical Structure in 3D-to-2D Projection for Permafrost Thaw Mapping

Justin McMillen, Robert Van Alphen, Taha Sadeghi Chorsi, Jason Shabaga, Mel Rodgers, Rocco Malservisi, Timothy Dixon, Yasin Yilmaz

Main category: eess.IV

TL;DR: A method for predicting permafrost thaw depth from aerial LiDAR using height-aware feature projection that preserves vertical forest structure information.

DetailsMotivation: Standard 2D projection methods for LiDAR point clouds destroy critical vertical structure information in forest environments, where ground, understory, and canopy layers carry distinct signals about subsurface permafrost conditions.

Method: Proposes a projection decoder with learned height embeddings for height-dependent feature transformations, combined with stratified sampling to ensure all forest strata remain represented. Paired with a Point Transformer V3 encoder to predict dense thaw depth maps from drone-collected LiDAR.

Result: The z-stratified projection approach outperforms standard averaging-based methods, particularly in areas with complex vertical vegetation structure, enabling more accurate permafrost thaw prediction.

Conclusion: The method enables scalable, high-resolution monitoring of permafrost degradation from UAV platforms by preserving critical vertical information in forest environments.

Abstract: Forecasting permafrost thaw from aerial lidar requires projecting 3D point cloud features onto 2D prediction grids, yet naive aggregation methods destroy the vertical structure critical in forest environments where ground, understory, and canopy carry distinct information about subsurface conditions. We propose a projection decoder with learned height embeddings that enable height-dependent feature transformations, allowing the network to differentiate ground-level signals from canopy returns. Combined with stratified sampling that ensures all forest strata remain represented, our approach preserves the vertical information critical for predicting subsurface conditions. Our approach pairs this decoder with a Point Transformer V3 encoder to predict dense thaw depth maps from drone-collected lidar over boreal forest in interior Alaska. Experiments demonstrate that z-stratified projection outperforms standard averaging-based methods, particularly in areas with complex vertical vegetation structure. Our method enables scalable, high-resolution monitoring of permafrost degradation from readily deployable UAV platforms.

[680] Towards Clinical Practice in CT-Based Pulmonary Disease Screening: An Efficient and Reliable Framework

Qian Shao, Bang Du, Yixuan Wu, Zepeng Li, Qiyuan Chen, Qianqian Tang, Jian Wu, Jintai Chen, Hongxia Xu

Main category: eess.IV

TL;DR: ERF framework improves CT scan analysis efficiency by selecting representative slices and quantifying uncertainty, achieving near-full-volume accuracy with 60% faster processing.

DetailsMotivation: Deep learning models for pulmonary disease screening from CT scans have high computational costs from processing entire 3D volumes, creating barriers to clinical adoption. Current sub-sampling methods often compromise diagnostic integrity by introducing artifacts or discarding critical information.

Method: Proposes ERF framework with two innovations: (1) Cluster-based Sub-Sampling (CSS) that selects compact yet comprehensive CT slice subsets by optimizing for representativeness and diversity using efficient k-nearest neighbor search with iterative refinement; (2) Ambiguity-aware Uncertainty Quantification (AUQ) that targets data ambiguity from subtle lesions/artifacts by leveraging predictive discrepancy between auxiliary classifiers to construct specialized ambiguity scores.

Result: Validated on two public datasets with 2,654 CT volumes across 3 pulmonary diseases, ERF achieves diagnostic performance comparable to full-volume analysis (over 90% accuracy and recall) while reducing processing time by more than 60%.

Conclusion: The work represents a significant step toward deploying fast, accurate, and trustworthy AI-powered screening tools in time-sensitive clinical settings by addressing computational efficiency and reliability concerns in CT analysis.

Abstract: Deep learning models for pulmonary disease screening from Computed Tomography (CT) scans promise to alleviate the immense workload on radiologists. Still, their high computational cost, stemming from processing entire 3D volumes, remains a major barrier to widespread clinical adoption. Current sub-sampling techniques often compromise diagnostic integrity by introducing artifacts or discarding critical information. To overcome these limitations, we propose an Efficient and Reliable Framework (ERF) that fundamentally improves the practicality of automated CT analysis. Our framework introduces two core innovations: (1) A Cluster-based Sub-Sampling (CSS) method that efficiently selects a compact yet comprehensive subset of CT slices by optimizing for both representativeness and diversity. By integrating an efficient k-nearest neighbor search with an iterative refinement process, CSS bypasses the computational bottlenecks of previous methods while preserving vital diagnostic features. (2) An Ambiguity-aware Uncertainty Quantification (AUQ) mechanism, which enhances reliability by specifically targeting data ambiguity arising from subtle lesions and artifacts. Unlike standard uncertainty measures, AUQ leverages the predictive discrepancy between auxiliary classifiers to construct a specialized ambiguity score. By maximizing this discrepancy during training, the system effectively flags ambiguous samples where the model lacks confidence due to visual noise or intricate pathologies. Validated on two public datasets with 2,654 CT volumes across diagnostic tasks for 3 pulmonary diseases, ERF achieves diagnostic performance comparable to the full-volume analysis (over 90% accuracy and recall) while reducing processing time by more than 60%. This work represents a significant step towards deploying fast, accurate, and trustworthy AI-powered screening tools in time-sensitive clinical settings.

[681] Three-Dimensional MRI Reconstruction with Gaussian Representations: Tackling the Undersampling Problem

Tengya Peng, Ruyi Zha, Zhen Li, Xiaofeng Liu, Qing Zou

Main category: eess.IV

TL;DR: 3D Gaussian Splatting adapted for MRI reconstruction from undersampled k-space data, achieving quality comparable to established methods without requiring training datasets.

DetailsMotivation: 3D Gaussian Splatting (3DGS) has shown promise in computer vision but remains unexplored in MRI. The study aims to explore its potential for reconstructing isotropic resolution 3D MRI from undersampled k-space data.

Method: Introduces 3D Gaussian MRI (3DGSMR) framework using 3D Gaussian distributions as explicit representation for MR volumes. Operates under self-supervised framework without need for training datasets or prior model training. Novel application of 3DGS methodology to decompose complex-valued MR signals.

Result: Method effectively reconstructs voxelized MR images with quality on par with well-established 3D MRI reconstruction techniques found in literature.

Conclusion: 3DGSMR introduces significant innovations by adapting 3DGS to MRI reconstruction and applying existing 3DGS methodology to decompose complex-valued MR signals, showing promise for medical imaging applications.

Abstract: Three-Dimensional Gaussian Splatting (3DGS) has shown substantial promise in the field of computer vision, but remains unexplored in the field of magnetic resonance imaging (MRI). This study explores its potential for the reconstruction of isotropic resolution 3D MRI from undersampled k-space data. We introduce a novel framework termed 3D Gaussian MRI (3DGSMR), which employs 3D Gaussian distributions as an explicit representation for MR volumes. Experimental evaluations indicate that this method can effectively reconstruct voxelized MR images, achieving a quality on par with that of well-established 3D MRI reconstruction techniques found in the literature. Notably, the 3DGSMR scheme operates under a self-supervised framework, obviating the need for extensive training datasets or prior model training. This approach introduces significant innovations to the domain, notably the adaptation of 3DGS to MRI reconstruction and the novel application of the existing 3DGS methodology to decompose MR signals, which are presented in a complex-valued format.

[682] CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models

Karim Kadry, Shoaib Goraya, Ajay Manicka, Abdalla Abdelwahed, Naravich Chutisilp, Farhad Nezami, Elazer Edelman

Main category: eess.IV

TL;DR: CardioComposer: A programmable framework for generating 3D cardiovascular anatomical label maps from interpretable ellipsoidal primitives using diffusion models with differentiable geometric moment losses for controllable and realistic generation.

DetailsMotivation: Generative models for 3D cardiovascular anatomy face a trade-off between geometric controllability and realism. Existing methods struggle to provide both interpretable control over anatomical substructures while maintaining realistic outputs for clinical research and medical device evaluation.

Method: Proposes CardioComposer: a programmable inference-time framework that generates multi-class anatomical label maps from interpretable ellipsoidal primitives representing geometric attributes (size, shape, position). Uses diffusion models with differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during sampling for disentangled control over individual geometric attributes.

Result: The method demonstrates that geometric losses can constrain individual attributes in a disentangled manner and provide compositional control over multiple substructures. It’s compatible with various anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs.

Conclusion: CardioComposer provides a novel approach to balance geometric controllability and realism in 3D anatomical generation, offering interpretable primitives and differentiable geometric constraints that enable programmable generation of cardiovascular anatomy for clinical applications.

Abstract: Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps from interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a broad range of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs. We release our code at https://github.com/kkadry/CardioComposer.

[683] Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference

Simon Welker, Lorenz Kuger, Tim Roith, Berthy Feng, Martin Burger, Timo Gerkmann, Henry Chapman

Main category: eess.IV

TL;DR: Presents position-blind ptychography - phase retrieval without known scan positions - using variational inference with score-based diffusion models as priors, showing successful reconstructions in most scenarios.

DetailsMotivation: Addresses the challenge in single-particle diffractive X-ray imaging where particles in random orientations are illuminated with focused beams, creating ptychographic measurements with unknown beam positions relative to particles.

Method: Uses variational inference with modern data-driven image priors in the form of score-based diffusion models to jointly recover both the image and unknown scan positions in a simulated 2-D variant of the problem.

Result: With appropriate illumination structure and strong priors, reliable and successful image reconstructions are achieved even under measurement noise, except in the most difficult imaging scenarios.

Conclusion: Position-blind ptychography is viable using variational inference with diffusion model priors, demonstrating potential for applications in single-particle X-ray imaging where scan positions are unknown.

Abstract: In this work, we present and investigate the novel blind inverse problem of position-blind ptychography, i.e., ptychographic phase retrieval without any knowledge of scan positions, which then must be recovered jointly with the image. The motivation for this problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and a set of diffraction patterns is collected. If one uses a highly focused X-ray beam, the measurements would also become sensitive to the beam positions relative to each particle and therefore ptychographic, but these positions are also unknown. We investigate the viability of image reconstruction in a simulated, simplified 2-D variant of this difficult problem, using variational inference with modern data-driven image priors in the form of score-based diffusion models. We find that, with the right illumination structure and a strong prior, one can achieve reliable and successful image reconstructions even under measurement noise, in all except the most difficult evaluated imaging scenario.

[684] LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

Main category: eess.IV

TL;DR: LUMINA: A multi-vendor FFDM dataset with energy/vendor metadata and harmonization method for robust mammography AI

DetailsMotivation: Existing mammography datasets are limited in size, clinical annotations, and vendor diversity, hindering development of robust models that can handle real-world variations in acquisition systems and imaging styles.

Method: Created LUMINA dataset with 1824 images from 468 patients across 6 acquisition systems, including both high- and low-energy imaging. Proposed foreground-only pixel-space alignment method for energy harmonization that maps images to low-energy reference while preserving lesion morphology.

Result: Two-view models outperform single-view models. EfficientNet-B0 achieves 93.54% AUC for diagnosis, Swin-T achieves 89.43% macro-AUC for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses.

Conclusion: LUMINA provides a vendor-diverse benchmark and model-agnostic harmonization framework for reliable and deployable mammography AI that addresses domain shifts from acquisition variations.

Abstract: Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method (‘’energy harmonization’’) that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.

[685] Clinical Priors Guided Lung Disease Detection in 3D CT Scans

Kejin Lu, Jianfa Bai, Qingqiu Li, Runtian Yuan, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng

Main category: eess.IV

TL;DR: A gender-aware two-stage framework for lung disease classification from CT scans that addresses class imbalance by incorporating gender-specific classifiers.

DetailsMotivation: Medical imaging datasets often suffer from severe class imbalance, which degrades deep learning model performance, especially for minority disease categories. The authors aim to address this issue by explicitly incorporating gender information into the disease recognition pipeline.

Method: A two-stage framework: 1) Train a gender classifier to predict patient’s gender from CT scans, 2) Route input CT images to corresponding gender-specific disease classifiers for final disease prediction. This design captures gender-related imaging characteristics and alleviates imbalanced data distribution effects.

Result: Experimental results show improved recognition performance for minority disease categories (particularly squamous cell carcinoma) while maintaining competitive performance on other classes.

Conclusion: Incorporating gender information through a two-stage classification framework effectively addresses class imbalance in medical imaging datasets and improves recognition of minority disease categories.

Abstract: Accurate classification of lung diseases from chest CT scans plays an important role in computer-aided diagnosis systems. However, medical imaging datasets often suffer from severe class imbalance, which may significantly degrade the performance of deep learning models, especially for minority disease categories. To address this issue, we propose a gender-aware two-stage lung disease classification framework. The proposed approach explicitly incorporates gender information into the disease recognition pipeline. In the first stage, a gender classifier is trained to predict the patient’s gender from CT scans. In the second stage, the input CT image is routed to a corresponding gender-specific disease classifier to perform final disease prediction. This design enables the model to better capture gender-related imaging characteristics and alleviate the influence of imbalanced data distribution. Experimental results demonstrate that the proposed method improves the recognition performance for minority disease categories, particularly squamous cell carcinoma, while maintaining competitive performance on other classes.

Last updated: 2026-03-27
Built with Hugo, theme modified on Stack