Daily arXiv Papers - 2026-02-20

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

Main category: cs.SD

TL;DR: AudioChat is a framework for audio foundation models that generate, edit, and understand complex multi-source audio scenes called “audio stories”, trained on dialogues simulated by LLM-based tool-calling agents and a novel Audio Transfusion Forcing objective.

Motivation: Current audio foundation models struggle with complex multi-source acoustic scenes (audio stories) that have multiple speakers and background/foreground sound effects, introducing new layers of semantic, temporal, and physical complexity.

Method: Proposes AudioChat framework with LLM-based toolcalling agents that simulate user-system interactions to generate training data, and introduces Audio Transfusion Forcing objective for simultaneous decomposition of high-level instructions via structured chain-of-thought reasoning and interactive multi-turn audio understanding/generation.

Result: Develops three new metrics to directly measure task performance for generation and editing instead of relying on distribution-based scoring, with a demo available to showcase capabilities.

Conclusion: AudioChat addresses the challenge of processing complex audio stories through a novel framework that enables audio foundation models to handle multi-source acoustic scenes with semantic, temporal, and physical complexity.

Abstract: Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.

Relevance: 9/10

[2] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio

Main category: cs.CV

TL;DR: ArtToMus: A framework for direct artwork-to-music generation without text intermediaries, using visual embeddings to condition a latent diffusion model for music synthesis.

Motivation: Existing image-conditioned music generation systems have two limitations: 1) trained on natural photographs rather than artworks with richer semantic/stylistic content, and 2) rely on image-to-text conversion as a semantic shortcut, preventing direct visual-to-audio learning.

Method: Created ArtSound dataset (105,884 artwork-music pairs with dual-modality captions), then developed ArtToMus framework that projects visual embeddings directly into the conditioning space of a latent diffusion model for music synthesis without text translation or language supervision.

Result: ArtToMus generates musically coherent and stylistically consistent outputs reflecting visual cues from source artworks. While absolute alignment scores are lower than text-conditioned systems (expected due to increased difficulty), it achieves competitive perceptual quality and meaningful cross-modal correspondence.

Conclusion: Establishes direct visual-to-music generation as a distinct research direction, providing resources for multimedia art, cultural heritage, and AI-assisted creative practice applications.

Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.

Relevance: 9/10
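
To make the conditioning mechanism concrete: the abstract describes projecting visual embeddings into the conditioning space of a latent diffusion model. Below is a minimal PyTorch sketch of such a projector; it is not the paper's code, and the dimensions, module layout, and the `music_unet` hook are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualConditioner(nn.Module):
    """Hypothetical projector: maps a frozen image-encoder embedding into
    the token-sequence conditioning space that a music latent diffusion
    model normally receives from its text encoder."""

    def __init__(self, img_dim=768, cond_dim=1024, n_tokens=8):
        super().__init__()
        self.n_tokens, self.cond_dim = n_tokens, cond_dim
        self.proj = nn.Sequential(
            nn.Linear(img_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim * n_tokens),
        )

    def forward(self, img_emb):
        # (batch, img_dim) -> (batch, n_tokens, cond_dim): a pseudo "text"
        # sequence, so the diffusion model's cross-attention is unchanged.
        b = img_emb.shape[0]
        return self.proj(img_emb).view(b, self.n_tokens, self.cond_dim)

# Sketch of use: cond = VisualConditioner()(image_embedding)  # e.g., a CLIP image embedding
# music_unet(latents, t, encoder_hidden_states=cond)          # hypothetical conditioning hook
```

Keeping the projector output shaped like a text-encoder sequence is what lets a pretrained music diffusion backbone be reused without architectural changes.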

[3] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues

Main category: cs.CL

TL;DR: MCIF is a crosslingual multimodal benchmark for evaluating instruction-following capabilities of MLLMs across speech, vision, and text modalities in multiple languages.

Motivation: Existing benchmarks are limited to English, focus on single modalities, use short-form inputs, or lack human annotations, making comprehensive evaluation of multimodal crosslingual capabilities difficult.

Method: Created MCIF benchmark based on scientific talks covering NLP and other topics, with human annotations across 4 languages (English, German, Italian, Chinese) and 3 modalities (speech, vision, text). Includes 4 macro-tasks: recognition, translation, QA, and summarization.

Result: Benchmarked 23 models, revealing universal challenges across modalities and tasks, showing substantial room for improvement in MLLM development.

Conclusion: MCIF enables systematic evaluation of MLLMs’ crosslingual multimodal instruction-following capabilities and highlights important research directions for future development.

Abstract: Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs’ abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under CC-BY 4.0 license to promote open research.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan

Main category: cs.CL

TL;DR: Reference-guided LLM evaluators can serve as soft verifiers for LLM alignment in non-verifiable domains, improving alignment tuning through self-improvement with reference outputs.

Motivation: RLVR works well for reasoning tasks with ground-truth verifiers, but cannot be applied to non-verifiable domains like LLM alignment. The paper investigates whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers".

Method: Design evaluation protocols that enhance LLM-based evaluators using reference outputs. Use reference-guided approach to improve accuracy of LLM-judges, then apply these improved judges for alignment tuning through self-improvement where LLMs guided with references are used as judges.

Result: Reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges. Achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, with average absolute gains of +20.2/+17.1 points over SFT distillation and +5.3/+3.6 points over reference-free self-improvement.

Conclusion: Reference-guided LLM-evaluators can enable effective LLM post-training in non-verifiable domains, achieving performance comparable to training with strong finetuned reward models like ArmoRM.

Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft “verifiers”. First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
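
As a concrete illustration of a reference-guided evaluator, here is a minimal sketch in which the reference output anchors a pairwise judgment. The prompt wording and the `llm` callable are assumptions for illustration, not the paper's exact protocol.

```python
def reference_guided_judge(llm, instruction: str, response_a: str,
                           response_b: str, reference: str) -> str:
    """Hedged sketch of a reference-guided pairwise judge: the reference
    output anchors the evaluator, which the paper reports improves the
    accuracy of weaker LLM-judges."""
    prompt = (
        "You are evaluating two responses to an instruction.\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference answer (high quality, use as a guide):\n{reference}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        "Compare each response against the reference and the instruction, "
        "then output exactly 'A' or 'B' for the better response."
    )
    return llm(prompt).strip()  # `llm` is any text-in/text-out callable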

[2] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis

Main category: cs.CL

TL;DR: This paper introduces DemosQA, a Greek question-answering dataset from social media, and evaluates 11 monolingual/multilingual LLMs on Greek QA tasks using an efficient evaluation framework.

Motivation: LLMs have advanced QA but focus on high-resource languages like English, with multilingual models showing bias toward popular languages. Greek and other under-resourced languages are underrepresented, lacking culturally relevant datasets and proper evaluation.

Method: Created DemosQA dataset from Greek social media questions and community answers; developed memory-efficient LLM evaluation framework; evaluated 11 monolingual/multilingual LLMs on 6 Greek QA datasets using 3 prompting strategies.

Result: Comprehensive evaluation shows performance variations across models and datasets; provides insights into Greek language understanding capabilities of different LLM architectures; framework enables reproducible evaluation.

Conclusion: Addresses Greek QA research gap with culturally relevant dataset and evaluation framework; highlights need for language-specific LLM development beyond multilingual approaches; releases code/data for reproducibility.

Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

[3] One-step Language Modeling via Continuous Denoising

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

Main category: cs.CL

TL;DR: Flow-based language models using continuous denoising outperform discrete diffusion models in both quality and speed for text generation.

Motivation: Discrete diffusion models promise faster generation than autoregressive models but suffer from quality degradation in few-step regimes. The authors question whether discrete diffusion processes are necessary for discrete modalities and explore flow-based continuous denoising as an alternative.

Method: Developed a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings using continuous flows. Introduced a time reparameterization for training stability. Created a distilled flow map language model (FMLM) for few-step generation by distilling FLM into its associated flow map.

Result: FLM matches state-of-the-art discrete diffusion models on LM1B and OWT datasets. FMLM outperforms recent few-step language models, with one-step generation exceeding their 8-step quality.

Conclusion: Flow-based continuous denoising can outperform discrete diffusion for language modeling, challenging the assumption that discrete processes are necessary for discrete modalities. This enables accelerated flow-based language modeling at scale.

Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
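
A minimal sketch of the training objective as the abstract describes it: Euclidean denoising over one-hot token encodings, supervised by predicting the clean tokens with cross entropy. The linear noise-to-data path, the `model(x_t, t)` signature, and the omission of the paper's time reparameterization are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def flm_training_step(model, tokens, vocab_size):
    """Hedged sketch of one FLM-style training step: interpolate between
    Gaussian noise and one-hot encodings, then predict the clean tokens."""
    b = tokens.shape[0]
    x1 = F.one_hot(tokens, vocab_size).float()     # clean one-hot sequence
    x0 = torch.randn_like(x1)                      # Gaussian source sample
    t = torch.rand(b, 1, 1, device=tokens.device)  # per-example time in [0,1)
    xt = (1 - t) * x0 + t * x1                     # point along the linear path
    logits = model(xt, t.view(b))                  # assumed model(x_t, t) -> logits
    return F.cross_entropy(logits.reshape(-1, vocab_size), tokens.reshape(-1))
```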

[4] Claim Automation using Large Language Model

Zhengda Mo, Zhiyu Quan, Eli O’Donohue, Kaiwen Zhong

Main category: cs.CL

TL;DR: Fine-tuned LLMs for generating structured corrective-action recommendations from unstructured warranty claim narratives in insurance domain, using LoRA adaptation and achieving 80% near-identical matches to ground truth.

Motivation: LLMs have strong general-purpose performance but limited deployment in regulated domains like insurance due to governance and data sensitivity concerns. Need for domain-specific solutions that can generate structured recommendations from unstructured claim narratives.

Method: Fine-tuned pretrained LLMs using Low-Rank Adaptation (LoRA) on millions of historical warranty claims. Created a locally deployed governance-aware language modeling component as initial decision module in claim processing pipeline. Used multi-dimensional evaluation framework combining automated semantic similarity metrics with human evaluation.

Result: Domain-specific fine-tuning substantially outperformed commercial general-purpose and prompt-based LLMs. Approximately 80% of evaluated cases achieved near-identical matches to ground-truth corrective actions. Demonstrated that domain-adaptive fine-tuning aligns model output distributions more closely with real-world operational data.

Conclusion: Domain-adaptive fine-tuning of LLMs provides reliable and governable building blocks for insurance applications, enabling structured corrective-action recommendations from unstructured narratives while addressing regulatory and data sensitivity concerns.

Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters’ decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.
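
Since the abstract names LoRA fine-tuning explicitly, here is a standard PEFT setup of the kind it implies; the base checkpoint and hyperparameters are placeholders, as the paper's exact configuration is not given here.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model and LoRA hyperparameters, chosen for illustration.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # LoRA trains only a small adapter fraction
# Train on (claim narrative -> corrective action) pairs with a standard
# causal-LM objective, e.g., via transformers.Trainer or trl's SFTTrainer.
```

Because only the low-rank adapters are updated, this kind of setup suits the locally deployed, governance-aware component the paper describes.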

[5] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam

Main category: cs.CL

TL;DR: BanglaSummEval: A reference-free QA-based framework for evaluating factual consistency in Bangla text summarization using a unified multilingual language model for question generation, answering, and importance weighting.

Motivation: Existing factual consistency evaluation metrics overlook Bangla (a widely spoken but under-resourced language) and often depend on reference summaries, creating a gap for reliable evaluation in low-resource settings.

Method: Reference-free QA-based framework that automatically generates questions from source documents, extracts answers from both source and summary, weights question importance, and uses BERTScore-Recall for semantic answer comparison.

Result: Strong correlation with expert human judgments (Pearson’s r=0.694, Spearman’s ρ=0.763) on 300 human-written summaries from educational and medical domains.

Conclusion: BanglaSummEval provides a practical, transparent, and interpretable solution for factual consistency evaluation in low-resource language settings with reduced system complexity.

Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson’s $r = 0.694$, Spearman’s $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
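
A minimal sketch of the final comparison step, using BERTScore-Recall over QA pairs as the abstract describes; the importance-weighting scheme shown is an assumption, and the upstream question generation/answering steps (handled by one multilingual model in the paper) are omitted.

```python
from bert_score import score

def qa_consistency(summary_answers, source_answers, weights, lang="bn"):
    """Hedged sketch: BERTScore-Recall measures how well each source-grounded
    answer (reference) is covered by the summary-derived answer (candidate);
    per-question importance weights then aggregate the scores."""
    _, recall, _ = score(summary_answers, source_answers, lang=lang)
    return sum(w * r.item() for w, r in zip(weights, recall)) / sum(weights)
```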

[6] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Jayadev Billa

Main category: cs.CL

TL;DR: Speech LLMs largely function as expensive ASR-to-LLM cascades rather than true multimodal models, with most showing behavioral and mechanistic equivalence to simple Whisper→LLM pipelines.

Motivation: To investigate whether current speech LLMs are genuinely multimodal or simply function as cascaded ASR+LLM systems, and to understand their architectural dependencies and performance characteristics.

Method: Matched-backbone testing across four speech LLMs and six tasks, controlling for LLM backbone; used logit lens analysis to reveal text in hidden states; applied LEACE concept erasure to test causal necessity of text representations.

Result: Most speech LLMs (like Ultravox) are statistically indistinguishable from matched cascades (κ=0.93); text representations are causally necessary; Qwen2-Audio shows genuine divergence; under noise, speech LLMs perform worse than cascades with clean-condition advantages reversing by up to 7.6% at 0 dB.

Conclusion: Current speech LLMs are largely expensive cascades rather than true multimodal models, with performance degrading under noisy conditions; cascade equivalence is architecture-dependent, not universal.

Abstract: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa = 0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
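
The logit-lens analysis mentioned in the abstract works by pushing intermediate hidden states through the final norm and unembedding. A generic sketch on a text-only Llama-style causal LM follows (the paper applies the same idea to speech LLM hidden states to watch transcript text emerge); the checkpoint name and module layout are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"  # stand-in; any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# "Logit lens": project each layer's hidden state through the final norm
# and unembedding, then read the top token at the last position.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h))  # Llama-style module layout
    print(layer, tok.decode([logits[0, -1].argmax().item()]))
```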

[7] Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense

Main category: cs.CL

TL;DR: First NLP research on the endangered Meenzerisch dialect of Mainz, Germany, showing that current LLMs perform poorly (under 10% accuracy) at generating definitions for dialect words or generating dialect words from definitions.

Motivation: Meenzerisch, the dialect of Mainz, is endangered like many German dialects. NLP could help preserve and revive such dialects, but no prior NLP research has focused on Meenzerisch specifically.

Method: Created first NLP-ready dataset (2,351 dialect words with Standard German definitions). Tested state-of-the-art LLMs on two tasks: 1) generating definitions for dialect words, and 2) generating dialect words given definitions. Also experimented with few-shot learning and rule extraction approaches.

Result: LLMs performed very poorly: best definition generation accuracy was 6.27%, best word generation accuracy was 1.51%. Few-shot learning and rule extraction improved results but still below 10% accuracy.

Conclusion: Current LLMs are inadequate for Meenzerisch dialect tasks, highlighting urgent need for more resources and intensified research efforts on German dialects.

Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model’s accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

[8] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier

Main category: cs.CL

TL;DR: The ConvApparel dataset addresses the realism gap of LLM-based user simulators through dual-agent data collection with good/bad recommenders, enabling counterfactual validation within a comprehensive evaluation framework.

Motivation: LLM-based user simulators suffer from a "realism gap" where systems optimized for simulated interactions fail in real-world deployment, necessitating better datasets and validation methods.

Method: Introduces ConvApparel dataset with dual-agent data collection protocol using both good and bad recommenders, enabling counterfactual validation; proposes validation framework combining statistical alignment, human-likeness score, and counterfactual validation.

Result: Experiments reveal significant realism gap across all simulators, but data-driven simulators outperform prompted baselines, particularly in counterfactual validation where they adapt more realistically to unseen behaviors.

Conclusion: Data-driven simulators embody more robust user models despite imperfections; the proposed framework and dataset help address the realism gap in conversational AI development.

Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical “realism gap,” leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol – using both “good” and “bad” recommenders – enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

[9] When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English

Hasan Can Biyik, Libby Barak, Jing Peng, Anna Feldman

Main category: cs.CL

TL;DR: Cross-lingual euphemism detection study shows transfer asymmetry between Turkish and English, where semantic overlap doesn’t guarantee positive transfer, especially in low-resource Turkish-to-English direction.

Motivation: Euphemisms are culturally and pragmatically dependent, making cross-lingual modeling challenging. The study investigates how cross-lingual equivalence affects transfer in multilingual euphemism detection between Turkish and English.

Method: Categorized Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on functional, pragmatic, and semantic alignment. Analyzed transfer performance in both directions with category-level analysis.

Result: Found transfer asymmetry: semantic overlap insufficient for positive transfer, especially in low-resource Turkish-to-English direction where performance can degrade for overlapping euphemisms. NOPET-based training sometimes improved performance. Label distribution differences explain counterintuitive results. Limited evidence suggests domain-specific alignment influences transfer.

Conclusion: Cross-lingual euphemism detection exhibits complex transfer patterns where semantic equivalence alone doesn’t guarantee successful transfer, highlighting the importance of pragmatic and cultural factors in multilingual NLP.

Abstract: Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in the low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.

[10] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues

Main category: cs.CL

TL;DR: MCIF is a crosslingual multimodal benchmark for evaluating instruction-following capabilities of MLLMs across speech, vision, and text modalities in multiple languages.

Motivation: Existing benchmarks are limited to English, focus on single modalities, use short-form inputs, or lack human annotations, making comprehensive evaluation of multimodal crosslingual capabilities difficult.

Method: Created MCIF benchmark based on scientific talks covering NLP and other topics, with human annotations across 4 languages (English, German, Italian, Chinese) and 3 modalities (speech, vision, text). Includes 4 macro-tasks: recognition, translation, QA, and summarization.

Result: Benchmarked 23 models, revealing universal challenges across modalities and tasks, showing substantial room for improvement in MLLM development.

Conclusion: MCIF enables systematic evaluation of MLLMs’ crosslingual multimodal instruction-following capabilities and highlights important research directions for future development.

Abstract: Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs’ abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under CC-BY 4.0 license to promote open research.

[11] Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Main category: cs.CL

TL;DR: Uncertainty-aware computational framework for psychological analysis of classical Persian poetry using large-scale multi-label annotation and statistical divergence measures to quantify poetic individuality.

Motivation: Classical Persian poetry expresses affective life through complex literary devices, making close reading essential but limiting reproducible large-scale comparison. Need computational methods that preserve interpretive nuance while enabling scalable analysis.

Method: Large-scale automatic multi-label annotation of verses with psychological concepts, confidence scores, and abstention flags. Aggregates evidence into Poet × Concept matrix, uses Jensen-Shannon and KL divergence to measure individuality. Builds confidence-weighted co-occurrence graph and Eigenmood embeddings via Laplacian spectral decomposition.

Result: Applied to 61,573 verses across 10 poets, with 22.2% of verses abstained. Framework supports sensitivity analysis, selection-bias diagnostics, and distant-to-close workflow for retrieving verse-level exemplars along Eigenmood axes.

Conclusion: Framework enables scalable, auditable digital humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.

Abstract: Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet $\times$ Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen–Shannon divergence and Kullback–Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
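
A minimal NumPy/SciPy sketch of the quantitative pipeline the abstract outlines: poets as distributions over concepts, individuality as divergence from a corpus baseline, and Eigenmood axes from a Laplacian eigendecomposition. The matrices here are random stand-ins for the confidence-weighted annotation counts.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
M = rng.random((10, 40))              # Poet x Concept, confidence-weighted
P = M / M.sum(axis=1, keepdims=True)  # each poet as a distribution
baseline = M.sum(axis=0) / M.sum()    # corpus-level distribution

# Individuality as divergence from the baseline. SciPy's jensenshannon
# returns the JS *distance* (sqrt of the divergence), so square it.
individuality = np.array([jensenshannon(p, baseline) ** 2 for p in P])

# "Eigenmood" axes: spectral decomposition of the concept graph Laplacian.
C = M.T @ M                           # proxy co-occurrence weights
np.fill_diagonal(C, 0)
L = np.diag(C.sum(axis=1)) - C        # unnormalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
eigenmood_axes = eigvecs[:, 1:4]      # low-frequency embedding axes
```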

[12] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Serin Kim, Sangam Lee, Dongha Lee

Main category: cs.CL

TL;DR: Persona2Web is a benchmark for evaluating personalized web agents that can interpret ambiguous queries by inferring user preferences from long-term history, rather than requiring explicit instructions.

Motivation: Current web agents lack personalization capabilities and struggle with ambiguous queries where users don't specify every detail. Practical agents need to infer user preferences from context rather than relying on explicit instructions.

Method: Created Persona2Web benchmark with: (1) user histories revealing implicit preferences over long time spans, (2) ambiguous queries requiring inference of user preferences, and (3) reasoning-aware evaluation framework for fine-grained assessment of personalization.

Result: Extensive experiments across various agent architectures, backbone models, history access schemes, and ambiguity levels reveal key challenges in personalized web agent behavior.

Conclusion: Persona2Web addresses the personalization gap in web agents through the clarify-to-personalize principle, enabling evaluation of agents that can interpret ambiguous queries by inferring user preferences from history.

Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.

[13] ReIn: Conversational Error Recovery with Reasoning Inception

Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: ReIn (Reasoning Inception) is a test-time intervention method that improves error recovery in conversational agents by planting external recovery plans into the agent’s reasoning process without modifying model parameters or prompts.

Motivation: Current LLM-powered conversational agents with tool integration perform well on fixed datasets but are vulnerable to unanticipated user-induced errors. Rather than focusing on error prevention, this work addresses error recovery, which requires accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans, all under realistic constraints that preclude model fine-tuning or prompt modification due to cost and time requirements.

Method: Proposes Reasoning Inception (ReIn), a test-time intervention method where an external inception module identifies predefined errors within dialogue contexts and generates recovery plans. These plans are then integrated into the agent’s internal reasoning process to guide corrective actions, without modifying the agent’s parameters or system prompts.

Result: ReIn substantially improves task success across diverse combinations of agent models and inception modules, generalizes to unseen error types, and consistently outperforms explicit prompt-modification approaches. It serves as an efficient, on-the-fly method for improving agent resilience.

Conclusion: Jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying backbone models or system prompts, with in-depth analysis showing its effectiveness particularly in relation to instruction hierarchy.

Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent’s decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent’s internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user’s ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.
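
A hedged sketch of the intervention pattern: an external module drafts a recovery plan that is planted as the start of the agent's reasoning, leaving parameters and system prompt untouched. The `inception_llm` callable and the `agent.continue_from` prefill API are hypothetical names for illustration.

```python
def respond_with_inception(agent, inception_llm, dialogue: str) -> str:
    """Hedged sketch of Reasoning Inception: diagnose the flawed dialogue
    externally, then prefill the agent's next turn with the recovery plan
    so its own reasoning continues from there (no prompt/weight changes)."""
    plan = inception_llm(
        "Check this dialogue for predefined user-induced errors (ambiguous "
        "or unsupported requests) and write a brief recovery plan:\n" + dialogue
    )
    # Seed the agent's reasoning with the plan; the agent completes the turn.
    seeded = f"Reasoning: {plan}\nFollowing this plan, "
    return agent.continue_from(dialogue, assistant_prefix=seeded)
```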

[14] Large Language Models Persuade Without Planning Theory of Mind

Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones

Main category: cs.CL

TL;DR: LLMs excel at persuasion through rhetorical strategies but struggle with explicit theory of mind reasoning requiring multi-step planning to infer mental states, unlike humans who can do both.

Motivation: Current ToM evaluations use static benchmarks, but theoretical work suggests first-person interaction is crucial for true ToM assessment. The paper addresses this gap by creating an interactive persuasion task that requires sensitivity to targets' knowledge and motivational states.

Method: Created a novel persuasion task where agents must persuade targets to choose one of three policy proposals by strategically revealing information. Varied whether targets’ knowledge/motivational states were Revealed or Hidden. Conducted three experiments: 1) persuading rational bot, 2) human role-playing bot, 3) measuring real belief changes in human targets.

Result: LLMs excelled when states were Revealed but performed below chance when states were Hidden, showing difficulty with multi-step planning for mental state inference. Humans performed moderately well in both conditions. In experiments with human targets, LLMs outperformed humans across all conditions, suggesting effective persuasion can occur without explicit ToM reasoning.

Conclusion: LLMs excel at persuasion through rhetorical strategies but lack human-like ToM reasoning requiring mental state inference. Results caution against attributing human-like ToM to LLMs while highlighting their potential to influence beliefs and behavior.

Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader’s sensitivity to a given target’s knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets’ real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs’ potential to influence people’s beliefs and behavior.

[15] Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

Deepak Uniyal, Md Abul Bashar, Richi Nayak

Main category: cs.CL

TL;DR: Study compares four cross-lingual text classification approaches for filtering hydrogen-related tweets across English, Japanese, Hindi, and Korean, then performs topic modeling on filtered content.

Motivation: Multilingual social media discourse analysis is challenging, especially for global debates spanning diverse languages. Need reliable cross-lingual text classification methods to filter relevant content from noisy keyword-based collections.

Method: Four approaches tested on decade-long dataset of 9M+ tweets: 1) Translate English annotations to target languages for language-specific models, 2) Translate all data to English for single English model, 3) Apply English fine-tuned multilingual transformers directly to target languages, 4) Hybrid combining translated annotations with multilingual training.

Result: Results highlight trade-offs between translation and multilingual approaches, providing insights for optimizing cross-lingual pipelines for large-scale social media analysis.

Conclusion: Study offers actionable insights for cross-lingual text classification in social media analysis, with hydrogen energy as case study demonstrating practical approaches for multilingual discourse analysis.

Abstract: Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
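
For approach (3), applying an English fine-tuned multilingual transformer directly to other languages reduces to zero-shot inference with a single classifier; a sketch follows, where the checkpoint name is a placeholder for such a fine-tuned model.

```python
from transformers import pipeline

# Approach (3), sketched: a multilingual encoder (e.g., XLM-R) fine-tuned on
# English relevance labels is applied as-is to the other languages.
# The model path is a placeholder, not a released checkpoint.
clf = pipeline("text-classification", model="path/to/xlmr-hydrogen-relevance")

tweets = [
    "Green hydrogen could decarbonize heavy industry.",  # English
    "水素ステーションが新しく開設された。",                  # Japanese
]
for t in tweets:
    print(clf(t))  # e.g., [{'label': 'relevant', 'score': 0.97}]
```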

[16] ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

Hussein S. Al-Olimat, Ahmad Alshareef

Main category: cs.CL

TL;DR: ALPS is a native Arabic diagnostic benchmark focusing on deep semantics and pragmatics, revealing that while commercial models achieve high fluency, they struggle with fundamental morpho-syntactic dependencies in Arabic.

Motivation: Existing Arabic NLP benchmarks prioritize scale and often use synthetic/translated data, lacking deep linguistic verification. There's a need for native, expert-curated diagnostic tools to probe deep semantic and pragmatic understanding in Arabic language models.

Method: Created ALPS (Arabic Linguistic & Pragmatic Suite) - a native, expert-curated diagnostic challenge set with 531 questions across 15 tasks and 47 subtasks. Evaluated 23 diverse models (commercial, open-source, Arabic-native) against human performance benchmarks.

Result: Models show dissociation: high fluency but poor morpho-syntactic understanding (36.5% error on diacritics-reliant tasks). Top commercial models (Gemini-3-flash: 94.2%) surpass average human (84.6%), but Arabic-native models (Jais-2-70B: 83.6%) lag behind commercial giants.

Conclusion: ALPS reveals critical gaps in Arabic language understanding, especially in morpho-syntactic dependencies. While commercial models excel in fluency, specialized Arabic models need improvement in fundamental linguistic understanding.

Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.

[17] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo

Main category: cs.CL

TL;DR: BankMathBench: A domain-specific dataset for evaluating and improving LLMs’ numerical reasoning in realistic banking scenarios, addressing systematic errors in financial computations.

Motivation: LLMs in banking chatbots struggle with core financial computations like payout estimation, product comparison, and interest calculations. Existing benchmarks don't capture these real-world banking errors, as math datasets focus on fundamental problems and financial benchmarks target documents rather than everyday banking scenarios.

Method: Created BankMathBench dataset with three difficulty levels: basic (single-product reasoning), intermediate (multi-product comparison), and advanced (multi-condition scenarios). Used the dataset for fine-tuning open-source LLMs with tool augmentation to improve numerical reasoning.

Result: Fine-tuning on BankMathBench significantly improved LLMs’ performance: 57.6%p increase for basic tasks, 75.1%p for intermediate, and 62.9%p for advanced tasks over zero-shot baselines. Models showed notable improvements in both formula generation and numerical reasoning accuracy.

Conclusion: BankMathBench is an effective benchmark for evaluating and advancing LLMs’ numerical reasoning in real-world banking scenarios, addressing a critical gap in existing evaluation frameworks.

Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks: mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset’s effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs’ numerical reasoning in real-world banking scenarios.

[18] Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

Anton Dzega, Aviad Elyashar, Ortal Slobodin, Odeya Cohen, Rami Puzis

Main category: cs.CL

TL;DR: LMM personality assessment using TAT projective test and SCORS-G framework shows models understand interpersonal dynamics well but fail at aggression perception/regulation, with newer/larger models performing better.

DetailsMotivation: To assess personality traits of Large Multimodal Models (LMMs) through non-language-based modalities using psychological assessment frameworks, specifically examining whether LMMs can be evaluated using projective tests like TAT and SCORS-G scoring.

Method: Used Thematic Apperception Test (TAT) images as stimuli, with LMMs in two roles: subject models generating stories in response to TAT images, and evaluator models assessing narratives using SCORS-G framework. Compared model performance across different model families and sizes.

Result: Evaluator models showed excellent ability to understand and analyze TAT responses with high consistency to human experts. All models understood interpersonal dynamics well and had good self-concept grasp, but consistently failed to perceive and regulate aggression. Larger and more recent models outperformed smaller/earlier ones across SCORS-G dimensions.

Conclusion: LMMs can be effectively assessed for personality-like functioning using psychological projective tests, revealing systematic patterns in their social cognition capabilities with specific deficits in aggression processing, and showing scaling benefits in newer architectures.

Abstract: The Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.

[19] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Dusan Bosnjakovic

Main category: cs.CL

TL;DR: A psychometric auditing framework using latent trait estimation to detect stable behavioral signatures in LLMs across providers, revealing persistent biases that create ideological echo chambers in multi-agent systems.

DetailsMotivation: As LLMs become reasoning layers in multi-agent systems and recursive evaluation loops, traditional benchmarks fail to capture stable, latent response policies that persist across model versions. There's a critical need to detect provider-level behavioral signatures for safety and governance.

Method: Uses psychometric measurement theory with latent trait estimation under ordinal uncertainty. Employs forced-choice ordinal vignettes masked by semantically orthogonal decoys with cryptographic permutation-invariance. Audits nine leading models across dimensions like Optimization Bias, Sycophancy, and Status-Quo Legitimization using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis.
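
For readers unfamiliar with the statistical machinery, the following is a minimal sketch on synthetic data of the MixedLM-plus-ICC analysis used to separate a provider-level “lab signal” from item-level variance; the column names and effect sizes are hypothetical:

```python
# Random-intercept mixed model: score ~ 1 + (1 | lab), then ICC from the
# estimated variance components. Data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
labs = np.repeat(["lab_a", "lab_b", "lab_c"], 60)
lab_effect = {"lab_a": -0.4, "lab_b": 0.0, "lab_c": 0.5}
df = pd.DataFrame({
    "lab": labs,
    "score": [lab_effect[l] + rng.normal(scale=1.0) for l in labs],
})

fit = smf.mixedlm("score ~ 1", df, groups=df["lab"]).fit()

var_lab = fit.cov_re.iloc[0, 0]   # between-lab (provider) variance
var_res = fit.scale               # residual (item-level) variance
icc = var_lab / (var_lab + var_res)
print(f"ICC (share of variance attributable to the lab signal): {icc:.3f}")
```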

Result: Identifies that while item-level framing drives high variance, a persistent “lab signal” accounts for significant behavioral clustering. Shows latent biases are not merely static errors but compounding variables in locked-in provider ecosystems.

Conclusion: In multi-layered AI architectures, latent biases risk creating recursive ideological echo chambers. The framework enables detection of durable behavioral signatures critical for safety and governance in LLM-based systems.

Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies: the “prevailing mindsets” embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory, specifically latent trait estimation under ordinal uncertainty, to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent “lab signal” accounts for significant behavioral clustering. These findings demonstrate that in “locked-in” provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.

[20] What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

Adrian Cosma, Cosmin Dumitrache, Emilian Radoi

Main category: cs.CL

TL;DR: Analysis of patient satisfaction in Romanian text-based telemedicine using 77K question-response pairs, finding that patient/clinician history features dominate predictions while text characteristics like politeness and hedging provide actionable signals.

DetailsMotivation: With text-based telemedicine becoming common, clinicians face pressure to maintain satisfaction scores that often reflect communication quality more than clinical accuracy. The study aims to understand what drives patient satisfaction in text-based medical consultations.

Method: Analyzed 77,334 anonymized patient-doctor text pairs from Romanian telemedicine. Modeled feedback as binary (thumbs-up vs. negative/absent). Extracted language-agnostic features (length, structure, readability), Romanian LIWC psycholinguistic features, and politeness/hedging markers. Trained classifier with time-based split and performed SHAP-based analyses.
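
A minimal sketch of this pipeline on synthetic data, with hypothetical feature names standing in for the history and text features described above:

```python
# Time-based split (train on the past, test on the future), a tree-based
# classifier, and SHAP attribution. Features and effects are synthetic.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "doctor_past_rating": rng.normal(0.7, 0.1, n),   # history prior
    "response_length":    rng.integers(20, 600, n),
    "politeness_markers": rng.poisson(2, n),
    "lexical_diversity":  rng.uniform(0.3, 0.9, n),
})  # rows assumed to be in chronological order

y = (df["doctor_past_rating"] + 0.05 * df["politeness_markers"]
     - 0.3 * df["lexical_diversity"] + rng.normal(0, 0.2, n)) > 0.6

split = int(0.8 * n)
model = GradientBoostingClassifier().fit(df[:split], y[:split])

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df[split:])
print("mean |SHAP| per feature:",
      dict(zip(df.columns, np.abs(shap_values).mean(axis=0).round(3))))
```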

Result: Patient and clinician history features dominated predictions as strong priors. Text characteristics provided smaller but actionable signals. Politeness and hedging consistently positively associated with feedback, while lexical diversity showed negative association.

Conclusion: While historical factors strongly influence patient satisfaction, clinicians can improve ratings through communication strategies like politeness and hedging. Text-based features offer actionable insights for improving telemedicine communication quality.

Abstract: Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question–doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.

[21] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji

Main category: cs.CL

TL;DR: A psychometric framework to quantify and mitigate socially desirable responding (SDR) in questionnaire-based evaluation of LLMs, using IRT-based effect sizes and desirability-matched forced-choice formats.

DetailsMotivation: Human self-report questionnaires are widely used to evaluate LLMs for persona consistency, safety, and bias, but they assume honest responding. LLMs can exhibit socially desirable responding (SDR) - giving socially preferred answers - which biases questionnaire-derived scores and conclusions.

Method: Proposes a psychometric framework: 1) Quantify SDR by administering same inventory under HONEST vs FAKE-GOOD instructions, computing direction-corrected standardized effect size from IRT-estimated latent scores; 2) Mitigate SDR using graded forced-choice (GFC) Big Five inventory constructed by selecting 30 cross-domain pairs via constrained optimization to match desirability.
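
A minimal sketch of the quantification step as we read it, with synthetic scores standing in for the IRT-estimated latent traits:

```python
# Direction-corrected standardized effect size between HONEST and
# FAKE-GOOD latent scores. Scores below are synthetic stand-ins.
import numpy as np

def sdr_effect_size(honest: np.ndarray, fake_good: np.ndarray,
                    desirable_direction: int) -> float:
    """desirable_direction: +1 if higher trait scores are socially
    desirable (e.g., agreeableness), -1 otherwise (e.g., neuroticism)."""
    pooled_sd = np.sqrt((honest.var(ddof=1) + fake_good.var(ddof=1)) / 2)
    d = (fake_good.mean() - honest.mean()) / pooled_sd
    return desirable_direction * d   # positive = shift toward desirability

rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, 200)
fake_good = rng.normal(0.8, 1.0, 200)   # inflated under FAKE-GOOD
print(f"SDR effect size: {sdr_effect_size(honest, fake_good, +1):.2f}")
```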

Result: Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles: Likert-style questionnaires show consistently large SDR, while desirability-matched GFC substantially attenuates SDR while largely preserving recovery of intended persona profiles.

Conclusion: Highlights model-dependent SDR-recovery trade-off and motivates SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers-a form of socially desirable responding (SDR)-biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

[22] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang, Yiming Li, Zhan Qin, Kui Ren

Main category: cs.CL

TL;DR: X-Value is a cross-lingual benchmark for evaluating LLMs’ ability to assess deep-level human values in content, revealing current models’ deficiencies in nuanced values assessment across languages.

DetailsMotivation: Current LLM evaluation for content safety focuses on explicit harms but neglects subtler value dimensions. There's a need to assess LLMs' ability to understand deep-level human values from a global perspective across different languages.

Method: Created X-Value benchmark with 5,000+ QA pairs across 18 languages, organized into 7 domains based on Schwartz’s Theory of Basic Human Values. Used two-stage annotation: first identifies global consensus vs. pluralism issues, then conducts multi-party evaluation of latent values.

Result: Current SOTA LLMs show deficiencies in cross-lingual values assessment (accuracy < 77%), with significant performance disparities across languages (ΔAcc > 20%). Models struggle with nuanced values-aware content assessment.
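
A minimal sketch of the disparity metric as we read it (per-language accuracy and the best-versus-worst gap ΔAcc), on hypothetical records:

```python
# Per-language accuracy and the cross-lingual disparity gap. Records are
# made up for illustration.
from collections import defaultdict

records = [("ar", True), ("ar", True), ("en", True),
           ("en", False), ("sw", False), ("sw", False)]

totals, correct = defaultdict(int), defaultdict(int)
for lang, ok in records:
    totals[lang] += 1
    correct[lang] += ok

acc = {lang: correct[lang] / totals[lang] for lang in totals}
delta_acc = max(acc.values()) - min(acc.values())
print(acc, f"ΔAcc={delta_acc:.0%}")
```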

Conclusion: There’s an urgent need to improve LLMs’ nuanced, values-aware content assessment capabilities. The X-Value benchmark provides a tool for evaluating cross-lingual values assessment in LLMs.

Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs’ ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz’s Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc < 77\%$), with significant performance disparities across different languages ($ΔAcc > 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: https://huggingface.co/datasets/Whitolf/X-Value.

[23] Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

Evgeniia Tokarchuk, Maya K. Nachesa, Sergey Troshin, Vlad Niculae

Main category: cs.CL

TL;DR: Analysis of representation collapse in Transformer-based NMT models, showing regularization via angular dispersion mitigates collapse and improves translation quality, with benefits preserved after quantization.

DetailsMotivation: Standard next-token prediction training in Transformer-based neural machine translation can lead to representation collapse, especially in deeper layers, where the model fails to efficiently utilize geometric space. This problem is particularly pronounced in continuous-output NMT where trivial solutions emerge.

Method: Analyze representation collapse dynamics in discrete and continuous NMT transformers throughout training. Incorporate angular dispersion-based regularization to mitigate collapse and evaluate its effects on translation quality and quantized models.
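
The paper adopts an existing angular-dispersion regularizer; the sketch below shows one common instantiation (ours, not necessarily the paper’s exact formulation): penalizing the mean pairwise cosine similarity so token representations spread out on the hypersphere.

```python
# Dispersion penalty over a batch of token representations. High mean
# pairwise cosine similarity indicates collapse, so we penalize it.
import torch
import torch.nn.functional as F

def angular_dispersion_penalty(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (batch * seq, dim) token representations from some layer."""
    h = F.normalize(hidden, dim=-1)          # project to the unit sphere
    sim = h @ h.T                            # pairwise cosine similarities
    n = h.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (n * (n - 1))          # mean off-diagonal similarity

# toy usage: in practice this would be added to the NLL loss with a weight
hidden = torch.randn(64, 512, requires_grad=True)
loss = angular_dispersion_penalty(hidden)    # + translation loss in practice
loss.backward()
```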

Result: Angular dispersion regularization effectively mitigates representation collapse and improves translation quality. Quantized models exhibit similar collapse behavior, and regularization benefits are preserved even after quantization.

Conclusion: Representation collapse is a significant issue in NMT transformers that can be effectively addressed with angular dispersion regularization, leading to improved model performance and robustness in both standard and quantized settings.

Abstract: Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.

[24] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

Main category: cs.CL

TL;DR: LLM evaluation benchmarks are unreliable due to sensitivity to meaning-preserving lexical/syntactic variations, causing performance degradation and ranking instability across models.

DetailsMotivation: Standardized evaluation benchmarks are widely used for LLM comparison but their reliability is questionable due to sensitivity to shallow input variations, raising concerns about whether they truly measure linguistic competence.

Method: Used two linguistically principled pipelines: 1) synonym substitution for lexical changes, 2) dependency parsing for syntactic transformations. Tested 23 contemporary LLMs across MMLU, SQuAD, and AMEGA benchmarks with truth-conditionally equivalent perturbations.
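
A minimal sketch of the lexical pipeline using WordNet synonym substitution; the paper’s version is more controlled (POS checks and truth-conditional filtering), so treat this as illustrative only:

```python
# Swap content words for WordNet synonyms to create surface-level variants.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def synonym_substitute(word: str) -> str:
    """Return the first WordNet synonym that differs from the input."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return word   # no synonym found; keep the original

sentence = "The quick answer was wrong".split()
print(" ".join(synonym_substitute(w) for w in sentence))
```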

Result: Lexical perturbations consistently caused significant performance degradation across nearly all models/tasks. Syntactic perturbations had more heterogeneous effects (sometimes improving results). Both perturbation types destabilized model leaderboards, especially on complex tasks. Model robustness didn’t consistently scale with size.

Conclusion: LLMs rely more on surface-level lexical patterns than abstract linguistic competence, highlighting the need for robustness testing as a standard component of LLM evaluation to ensure reliable benchmarking.

Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

[25] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao

Main category: cs.CL

TL;DR: RPDR is a data augmentation framework that improves dense retrievers for long-tail QA by selecting high-quality easy-to-learn training data through synthetic generation and round-trip prediction.

DetailsMotivation: LLMs struggle with long-tail knowledge due to limited recall of rare information, and dense retrieval models face similar generalization issues on niche knowledge, requiring better training data selection.

Method: Three components: 1) synthetic data generation, 2) data selection using Round-Trip prediction to identify easy-to-learn instances, 3) retriever training with selected instances, plus dynamic routing mechanism for specialized retrieval modules.
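
A minimal sketch of the round-trip selection idea as we read it; `answer_with_context` is a hypothetical stand-in for the paper’s reader/retriever models:

```python
# Keep a synthetic (question, passage, answer) triple only if a reader
# model, given the passage, recovers the original answer ("easy to learn").
from typing import Callable

def round_trip_select(
    pairs: list[tuple[str, str, str]],            # (question, passage, answer)
    answer_with_context: Callable[[str, str], str],
) -> list[tuple[str, str, str]]:
    easy_to_learn = []
    for question, passage, answer in pairs:
        predicted = answer_with_context(question, passage)
        if predicted.strip().lower() == answer.strip().lower():
            easy_to_learn.append((question, passage, answer))  # round-trip OK
    return easy_to_learn

# toy stub: a "reader" that just checks whether the answer is in the passage
stub_reader = lambda q, p: "Paris" if "Paris" in p else "unknown"
pairs = [("Capital of France?", "Paris is the capital of France.", "Paris")]
print(round_trip_select(pairs, stub_reader))
```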

Result: Substantial improvements over BM25 and Contriever on PopQA and EntityQuestion benchmarks, especially on extremely long-tail categories, with human analysis validating strengths and limitations.

Conclusion: RPDR effectively enhances dense retrievers for long-tail QA through intelligent data selection, with dynamic routing offering further performance improvements.

Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism that routes queries to specialized retrieval modules to further improve retrieval performance.

[26] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

Leonidas Zotos, Hedderik van Rijn, Malvina Nissim

Main category: cs.CL

TL;DR: Using Wikipedia frequency as a proxy for cognitive availability, correct multiple-choice answers are significantly more available than incorrect options, with this strategy yielding 13.5-32.9% above random guessing.

DetailsMotivation: To investigate whether the availability heuristic (ease of recalling information) can serve as an effective strategy for answering multiple-choice questions when students are unsure, and to develop a computational method for assessing cognitive availability.

Method: Proposed a computational method using concept prevalence in large corpora (Wikipedia) to operationalize cognitive availability of MCQ options. Analyzed three large question sets comparing availability of correct vs. incorrect answers, and examined LLM-generated options.
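
A minimal sketch of the strategy with a toy corpus standing in for Wikipedia:

```python
# Score each MCQ option by its frequency in a reference corpus and pick
# the most frequent (most "available") one.
from collections import Counter

corpus = (
    "paris is the capital of france paris hosts the louvre "
    "lyon and marseille are large french cities"
).split()
freq = Counter(corpus)

def most_available(options: list[str]) -> str:
    return max(options, key=lambda opt: freq[opt.lower()])

print(most_available(["Lyon", "Paris", "Marseille"]))  # -> Paris
```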

Result: Correct answers are significantly more available than incorrect options across all question sets. Always selecting the most available option yields scores 13.5% to 32.9% above random guessing baseline. LLM-generated options show similar availability patterns to expert-created options.

Conclusion: Availability heuristic can be an effective strategy for MCQ answering, and availability should be considered when computationally modeling student behavior, including in contexts involving LLM-generated content.

Abstract: When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts’ prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs’ frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.

[27] Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp

Main category: cs.CL

TL;DR: The paper proposes a revised cross-document coreference resolution annotation scheme for news datasets that treats coreference chains as discourse elements, accommodating both identity and near-identity relations to better capture lexical diversity and framing variation in polarized media coverage.

DetailsMotivation: Existing CDCR datasets focus too narrowly on event resolution and strict identity coreference, limiting their effectiveness for analyzing diverse and polarized news coverage where wording varies widely across different media perspectives.

Method: Proposes a revised CDCR annotation scheme treating coreference chains as discourse elements, accommodates both identity and near-identity relations, reannotates NewsWCL50 and a subset of ECB+ using a unified codebook, and evaluates through lexical diversity metrics and same-head-lemma baseline.
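
The same-head-lemma baseline is simple enough to sketch; the snippet below (illustrative, using spaCy) clusters mentions whose syntactic head shares a lemma:

```python
# Cluster mentions by the lemma of their syntactic head. Requires the
# en_core_web_sm model (python -m spacy download en_core_web_sm).
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

mentions = ["the caravan", "a migrant caravan", "asylum seekers",
            "the seekers of asylum"]

clusters = defaultdict(list)
for mention in mentions:
    head_lemma = nlp(mention)[:].root.lemma_   # lemma of the span's head
    clusters[head_lemma].append(mention)

for lemma, group in clusters.items():
    print(lemma, "->", group)
```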

Result: The reannotated datasets align closely with each other, falling between the original ECB+ and NewsWCL50 datasets, supporting more balanced and discourse-aware CDCR research in the news domain.

Conclusion: The revised annotation scheme enables better capture of lexical diversity and framing variation in media discourse while maintaining fine-grained annotation of discourse elements, advancing CDCR research for analyzing polarized news coverage.

Abstract: Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking “the caravan” - “asylum seekers” - “those contemplating illegal entry”, allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.

[28] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: Comparative analysis of BLEU vs ChrF++ for evaluating machine translation quality in extremely low-resource languages, focusing on translation artifacts and metric interpretability.

DetailsMotivation: Standard MT evaluation metrics like BLEU perform poorly in extremely low-resource language scenarios, creating a need for better evaluation methods that can handle translation artifacts and provide meaningful quality assessment.

Method: Comparative analysis of BLEU (n-gram-based) and ChrF++ (character-based) metrics across three extremely low-resource languages (Magahi, Bhojpuri, Chhattisgarhi), examining responses to translation artifacts like hallucinations, repetition, source-text copying, and diacritic variations in outputs from LLMs and NMT systems.
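
Both metrics are available in sacrebleu; below is a minimal sketch, with toy English strings standing in for actual Magahi/Bhojpuri/Chhattisgarhi outputs (chrF++ is chrF with word bigrams enabled):

```python
# Corpus-level BLEU and chrF++ with sacrebleu.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)

print(f"BLEU:   {bleu.score:.1f}")     # n-gram precision + brevity penalty
print(f"chrF++: {chrf_pp.score:.1f}")  # character n-grams + word bigrams
```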

Result: BLEU provides complementary lexical-precision insights despite lower absolute scores, improving interpretability compared to relying solely on ChrF++ for evaluation in low-resource settings.

Conclusion: Both BLEU and ChrF++ offer valuable perspectives for MT evaluation in extremely low-resource languages, with BLEU’s lexical-precision insights complementing ChrF++’s character-based approach for better overall assessment.

Abstract: Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.

[29] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik

Main category: cs.CL

TL;DR: A framework for fine-grained uncertainty quantification in long-form LLM outputs using a three-stage taxonomy: response decomposition, unit-level scoring, and response-level aggregation, with consistency-based black-box scorers.

DetailsMotivation: Existing uncertainty quantification methods for LLM hallucination detection are designed for short-form outputs and don't generalize well to long-form generation, creating a need for specialized approaches.

Method: Proposes a taxonomy with three stages: 1) response decomposition (sentence or claim-level), 2) unit-level scoring using consistency-based black-box scorers (including claim-response entailment), 3) response-level aggregation. Formalizes several families of scorers and enables systematic comparisons.
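
A minimal sketch of the three stages with a toy consistency scorer; `entails` is a hypothetical stand-in for an NLI model (substring matching is used purely for illustration):

```python
# Stage 1: decompose a response into claims (given here directly).
# Stage 2: score each claim by its consistency across sampled responses.
# Stage 3: aggregate claim scores into a response-level confidence.
def entails(sample: str, claim: str) -> bool:
    return claim.lower() in sample.lower()   # toy proxy for NLI entailment

def claim_scores(claims: list[str], samples: list[str]) -> list[float]:
    return [sum(entails(s, c) for s in samples) / len(samples) for c in claims]

claims = ["Paris is the capital of France", "Paris has 10 million residents"]
samples = ["Paris is the capital of France.",
           "The capital of France is Paris, a large city.",
           "Paris is the capital of France and has 2 million residents."]

scores = claim_scores(claims, samples)
response_confidence = sum(scores) / len(scores)   # mean aggregation
print(scores, f"aggregate={response_confidence:.2f}")
```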

Result: Experiments show: 1) claim-response entailment performs as well as more complex scorers, 2) claim-level scoring outperforms sentence-level, 3) uncertainty-aware decoding effectively improves long-form output factuality.

Conclusion: The framework clarifies relationships between prior methods, enables fair comparisons, and provides practical guidance for selecting uncertainty quantification components for long-form LLM outputs.

Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.

[30] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

Main category: cs.CL

TL;DR: AIDG framework evaluates LLMs’ strategic reasoning through adversarial information deduction games, revealing models are better at information containment than deduction with significant performance gaps.

DetailsMotivation: Current LLM evaluations lack dynamic, multi-turn interactions needed to assess strategic reasoning capabilities. The authors aim to move beyond static benchmarks to understand how LLMs handle information asymmetry in dialogue contexts.

Method: Introduces AIDG (Adversarial Information Deduction Game) with two tasks: AIDG-I for pragmatic strategy in social deduction, and AIDG-II for constraint satisfaction in structured “20 Questions” settings. Tests six frontier LLMs across 439 games.

Result: Clear capability asymmetry: models perform 350 ELO better at containment than deduction. Two bottlenecks identified: information dynamics (confirmation strategies 7.75x more effective) and constraint adherence degradation under conversational load (41.3% of deductive failures).

Conclusion: LLMs excel at local defensive coherence but struggle with global state tracking required for strategic inquiry, revealing fundamental limitations in their reasoning capabilities for multi-turn strategic interactions.

Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured “20 Questions” setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen’s d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.

[31] ABCD: All Biases Come Disguised

Mateusz Nowak, Xavier Cadet, Peter Chin

Main category: cs.CL

TL;DR: A bias-reduced evaluation protocol for LLMs on multiple-choice questions that replaces answer labels with uniform, unordered labels and uses sentence similarity to match responses, reducing position/label bias while maintaining performance.

DetailsMotivation: Current MCQ benchmarks for LLMs suffer from label-position-few-shot-prompt bias, where models rely on answer positions, labels, or distribution patterns in prompts rather than actual reasoning, leading to unreliable evaluations.

Method: Proposes a bias-reduced evaluation protocol that: 1) replaces question labels with uniform, unordered labels, 2) prompts LLMs to use the whole answer text, 3) uses sentence similarity models to match LLM responses to answer options, reducing reliance on positional or label cues.
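
A minimal sketch of the matching step using sentence-transformers; the model name and options are illustrative:

```python
# Embed the model's free-text answer and each option, then pick the option
# with the highest cosine similarity; no positional or label cues involved.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

options = ["photosynthesis", "cellular respiration", "osmosis"]
llm_response = "Plants convert light into chemical energy through photosynthesis."

option_emb = model.encode(options, convert_to_tensor=True)
response_emb = model.encode(llm_response, convert_to_tensor=True)

scores = util.cos_sim(response_emb, option_emb)[0]
print("predicted option:", options[int(scores.argmax())])
```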

Result: The protocol reduces mean accuracy variance by 3x across multiple benchmarks and models with minimal performance drop, improves robustness to answer permutations, and shows better stability than standard evaluation methods through ablation studies.

Conclusion: Standard MCQ evaluations contain significant biases; the proposed protocol provides more reliable assessment of LLMs’ true reasoning capabilities by reducing evaluation artifacts while maintaining performance metrics.

Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs’ ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM’s performance, exposing the LLM’s capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model’s performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.

[32] Entropy-Based Data Selection for Language Models

Hongming Li, Yang Liu, Chao Huang

Main category: cs.CL

TL;DR: EUDS is an entropy-based unsupervised data selection framework that reduces computational costs for fine-tuning language models by efficiently filtering training data based on uncertainty estimation.

DetailsMotivation: Modern LMs require significant computational and data resources. Data selection techniques can reduce training data needs but are computationally expensive. There's a need for efficient data selection methods that work in compute-constrained scenarios, especially since evaluating data usability remains challenging despite LLMs' capabilities.

Method: Proposes Entropy-Based Unsupervised Data Selection (EUDS) framework that establishes a computationally efficient data-filtering mechanism based on uncertainty estimation of selected data. The method systematically reveals the relationship between data selection and uncertainty estimation.
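
The summary does not spell out the exact selection criterion, but a minimal sketch of entropy scoring with a small causal LM might look like the following (the keep-lowest-entropy rule is our assumption, not necessarily the paper’s):

```python
# Score each candidate example by the mean predictive entropy a small LM
# assigns to its tokens, without any labels, then filter on that score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_entropy(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits                                  # (1, seq, vocab)
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # per token
    return entropy.mean().item()

pool = ["The movie was great, I loved it.",
        "Colorless green ideas sleep furiously."]
scored = sorted(pool, key=mean_entropy)
print(scored[0])   # e.g. keep the lowest-entropy (most predictable) examples
```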

Result: Empirical experiments on sentiment analysis, topic classification, and question answering tasks validate EUDS effectiveness. It significantly reduces computational costs and improves training time efficiency with less data requirement.

Conclusion: EUDS provides an innovative solution for efficient fine-tuning of LMs in compute-constrained scenarios by establishing an efficient data-filtering mechanism that balances computational resources and data requirements.

Abstract: Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, and they typically require a high compute budget. Owing to the resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.

[33] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata

Main category: cs.CL

TL;DR: PEACE 2.0 is a tool that detects hate speech, explains why content is hateful using evidence-based RAG, and generates evidence-grounded counter-speech responses.

DetailsMotivation: Address the challenge of counter-speech generation in online hate speech mitigation, moving beyond just detection to providing evidence-based explanations and constructive responses.

Method: Uses Retrieval-Augmented Generation (RAG) pipeline for three main functions: 1) grounding hate speech explanations in evidence/facts, 2) generating evidence-grounded counter-speech, and 3) analyzing counter-speech reply characteristics.

Result: PEACE 2.0 enables comprehensive analysis and response generation for both explicit and implicit hateful messages through its integrated capabilities.

Conclusion: The tool advances hate speech mitigation by combining detection, explanation, and evidence-based response generation in a unified system.

Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground HS explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.

[34] Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers

Nusrat Jahan Lia, Shubhashis Roy Dipta

Main category: cs.CL

TL;DR: Cross-lingual sentiment misalignment study reveals transformer models fail to preserve emotional fidelity across Bengali-English language barriers, showing high sentiment inversion rates and systematic biases in affective processing.

DetailsMotivation: Addresses the fracture in bidirectional AI-human alignment across language barriers, specifically focusing on cross-lingual sentiment misalignment between Bengali and English, revealing safety and representational failures in current alignment paradigms.

Method: Benchmarked four transformer architectures to analyze cross-lingual sentiment misalignment, measuring metrics like “Sentiment Inversion Rate,” “Asymmetric Empathy,” and “Modern Bias” across different Bengali variants (including formal Sadhu Bengali).
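
A minimal sketch of the inversion-rate computation as we read it; `predict_polarity` is a hypothetical stand-in for the benchmarked transformers:

```python
# Fraction of parallel Bengali/English pairs where predicted polarity
# flips sign between the two languages.
def inversion_rate(pairs: list[tuple[str, str]], predict_polarity) -> float:
    flips = sum(
        predict_polarity(bn) * predict_polarity(en) < 0   # opposite signs
        for bn, en in pairs
    )
    return flips / len(pairs)

# toy stub: polarity in [-1, 1]; a real setup would call a sentiment model
stub = lambda text: -1.0 if "bad" in text else 1.0
pairs = [("khub bhalo chhilo", "it was very good"),
         ("darun laglo", "this was bad somehow")]
print(f"Sentiment Inversion Rate: {inversion_rate(pairs, stub):.0%}")
```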

Result: Compressed model (mDistilBERT) showed 28.7% sentiment inversion rate, fundamentally misinterpreting positive intent as negative; identified systematic affective weight disparities; IndicBERT showed 57% increase in alignment error with formal Bengali.

Conclusion: Equitable human-AI co-evolution requires pluralistic, culturally grounded alignment respecting language diversity over universal compression; recommends incorporating “Affective Stability” metrics in benchmarks to penalize polarity inversions in low-resource contexts.

Abstract: The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that the compressed model (mDistilBERT) exhibits a 28.7% “Sentiment Inversion Rate,” fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including “Asymmetric Empathy,” where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a “Modern Bias” in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate “Affective Stability” metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.

[35] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini

Main category: cs.CL

TL;DR: Small LLMs (around 1B parameters) can effectively perform medical NLP tasks with fine-tuning, matching or surpassing larger models while being more deployable in healthcare settings.

DetailsMotivation: Large LLMs excel at medical NLP tasks but have high computational requirements that limit real-world healthcare deployment. The paper investigates whether smaller LLMs can maintain competitive accuracy while being more practical for healthcare settings.

Method: Evaluated small LLMs (1B parameters) from Llama-3, Gemma-3, and Qwen3 families across 20 clinical NLP tasks. Compared adaptation strategies including few-shot prompting, constraint decoding, supervised fine-tuning, and continual pretraining on Italian medical datasets.
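
A minimal sketch of the constraint-decoding strategy using a transformers logits processor that masks everything except valid label tokens (the labels, prompt, and model are illustrative, not the paper’s setup):

```python
# Restrict generation to the first tokens of allowed labels so a small
# model can only emit valid answers for classification-style tasks.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class AllowedTokens(LogitsProcessor):
    def __init__(self, allowed_ids: list[int]):
        self.allowed = torch.tensor(allowed_ids)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0
        return scores + mask           # everything else becomes impossible

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

labels = ["positive", "negative"]
first_ids = [tok.encode(" " + l)[0] for l in labels]   # first token of each

prompt = tok("Sentiment of 'great results':", return_tensors="pt")
out = lm.generate(**prompt, max_new_tokens=1,
                  logits_processor=LogitsProcessorList([AllowedTokens(first_ids)]))
print(tok.decode(out[0, -1]))
```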

Result: Fine-tuning was most effective, with Qwen3-1.7B achieving average score +9.2 points higher than Qwen3-32B. Few-shot prompting with constraint decoding offered strong lower-resource alternatives. Small LLMs matched or surpassed larger baselines.

Conclusion: Small LLMs can effectively perform medical NLP tasks with proper adaptation, making them viable for real-world healthcare deployment where computational resources are limited.

Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether “small” LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

[36] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Baris Karacan, Barbara Di Eugenio, Patrick Thornton

Main category: cs.CL

TL;DR: Clinical section segmentation using transformers vs. zero-shot LLMs, with new obstetrics dataset showing LLMs’ better out-of-domain robustness when hallucinations are corrected.

DetailsMotivation: Clinical free-text notes contain vital patient information structured into labelled sections. Recognizing these sections supports clinical decision-making and downstream NLP tasks. Most existing approaches are trained on limited public corpora like MIMIC-III, lacking domain diversity.

Method: Three key contributions: 1) Curated new de-identified, section-labeled obstetrics notes dataset to supplement existing corpora; 2) Systematic evaluation of transformer-based supervised models on MIMIC-III (in-domain) and obstetrics dataset (out-of-domain); 3) First head-to-head comparison of supervised models with zero-shot large language models for medical section segmentation.

Result: Supervised models perform strongly in-domain but performance drops substantially out-of-domain. Zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected.
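
One simple way to implement the header correction (our sketch, not necessarily the paper’s procedure) is to fuzzy-match predicted headers against the known header inventory:

```python
# Snap a hallucinated section header to the closest known header, keeping
# it unchanged if nothing is close enough. Header inventory is illustrative.
import difflib

KNOWN_HEADERS = ["history of present illness", "physical exam",
                 "assessment and plan", "medications"]

def correct_header(predicted: str) -> str:
    match = difflib.get_close_matches(predicted.lower(), KNOWN_HEADERS,
                                      n=1, cutoff=0.6)
    return match[0] if match else predicted

print(correct_header("Hx of Present Illness"))  # -> history of present illness
```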

Conclusion: Importance of developing domain-specific clinical resources and highlighting zero-shot segmentation as promising direction for healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

[37] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan

Main category: cs.CL

TL;DR: LLM-based framework for automated KC-level correctness labeling in programming education using student code analysis

DetailsMotivation: Knowledge component (KC) level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions involve multiple KCs simultaneously. Problem-level correctness propagation obscures partial mastery and leads to poorly fitted learning curves.

Method: Uses large language models to label KC-level correctness directly from student-written code. Assesses whether each KC is correctly applied and introduces temporal context-aware Code-KC mapping to better align KCs with individual student code.

Result: The framework leads to learning curves more consistent with cognitive theory and improves predictive performance compared to baselines. Human evaluation shows substantial agreement between LLM and expert annotations.

Conclusion: LLMs can effectively automate KC-level correctness labeling for programming education, improving learning analytics and student modeling.

Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.

[38] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel, Souvik Maji, Pratik Mazumder

Main category: cs.CL

TL;DR: Adaptive regularization framework for maintaining safety in instruction-following LLMs during fine-tuning by estimating safety risk and constraining high-risk updates

DetailsMotivation: Safety behavior in language models deteriorates during fine-tuning, and existing defenses offer limited protection or force safety-utility trade-offs

Method: Two approaches: 1) Judge-based Safety Critic assigning harm scores to batches, 2) Activation-based risk predictor using lightweight classifier on intermediate activations. Both provide risk signals to constrain high-risk updates close to safe reference policy

Result: Harmful intent signals predictable from pre-generation activations; judge scores provide effective safety guidance; adaptive regularization lowers attack success rate, preserves downstream performance, adds no inference cost

Conclusion: Principled mechanism for maintaining safety without sacrificing utility during fine-tuning across multiple model families and attack scenarios

Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
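
The training rule reduces to a risk-gated proximity penalty: low-risk batches update normally, while high-risk batches are pulled toward a frozen safe reference. A minimal PyTorch sketch, where `risk` would come from the judge-based critic or the activation probe; the KL form and coefficient are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def adaptive_safety_loss(task_loss, logits, ref_logits, risk, max_coef=10.0):
    """Task loss plus a risk-scaled KL penalty toward a frozen safe reference.

    risk: scalar in [0, 1] from a judge critic or an activation-based probe.
    """
    log_p = F.log_softmax(logits, dim=-1)
    ref_p = F.softmax(ref_logits, dim=-1)
    kl = F.kl_div(log_p, ref_p, reduction="batchmean")  # KL(ref || policy)
    return task_loss + max_coef * risk * kl

# Dummy tensors in place of real policy and reference model outputs:
logits, ref_logits = torch.randn(4, 32), torch.randn(4, 32)
print(adaptive_safety_loss(torch.tensor(1.0), logits, ref_logits, risk=0.8))
```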

[39] Modeling Distinct Human Interaction in Web Agents

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham

Main category: cs.CL

TL;DR: Introduces the task of modeling human intervention in web agents, collects the CowCorpus dataset, identifies four interaction patterns, trains LMs to predict interventions, and shows improved agent usefulness.

DetailsMotivation: Current autonomous web agents lack understanding of when and why humans intervene, often proceeding past critical decision points or requesting unnecessary confirmation, highlighting the need for principled modeling of human intervention.

Method: Collected CowCorpus dataset of 400 real-user web navigation trajectories with 4,200+ interleaved human/agent actions; identified four interaction patterns; trained language models to anticipate user interventions based on interaction styles.

Result: Achieved 61.4-63.4% improvement in intervention prediction accuracy over base LMs; deployed intervention-aware models in live web navigation agents and found 26.5% increase in user-rated agent usefulness.

Conclusion: Structured modeling of human intervention leads to more adaptive, collaborative web agents that better understand when and why users intervene during task execution.

Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents – hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

[40] Unmasking the Factual-Conceptual Gap in Persian Language Models

Alireza Sakhaeirad, Ali Ma’manpoosh, Arshia Hemmat

Main category: cs.CL

TL;DR: DivanBench is a Persian diagnostic benchmark focusing on superstitions and customs to evaluate LLMs’ ability to reason about implicit social norms beyond memorized cultural facts.

DetailsMotivation: Existing Persian NLP benchmarks don't distinguish between memorized cultural facts and reasoning about implicit social norms. The authors aim to evaluate whether LLMs can truly understand context-dependent cultural rules that resist simple logical deduction.

Method: Created DivanBench with 315 questions across three task types: factual retrieval, paired scenario verification, and situational reasoning. Evaluated seven Persian LLMs to assess their cultural reasoning capabilities.

Result: Three critical failures identified: 1) severe acquiescence bias (models identify appropriate behaviors but fail to reject violations), 2) continuous Persian pretraining amplifies bias rather than improving reasoning, and 3) 21% performance gap between factual retrieval and scenario application.

Conclusion: Cultural competence requires more than scaling monolingual data; current models learn to mimic cultural patterns without internalizing underlying schemas, showing limitations in reasoning about implicit social norms.

Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model’s ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.

[41] Differences in Typological Alignment in Language Models’ Treatment of Differential Argument Marking

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

Main category: cs.CL

TL;DR: GPT-2 models trained on synthetic corpora with differential argument marking (DAM) systems show human-like preferences for natural markedness direction but not for object preference patterns seen in human languages.

DetailsMotivation: To investigate whether language models can capture typological patterns in semantic licensing systems like differential argument marking (DAM), extending previous work on syntactic phenomena to semantic domains.

Method: Trained GPT-2 models on 18 synthetic corpora implementing distinct DAM systems, then evaluated generalization using minimal pairs to test typological preferences.

Result: Models reliably exhibited human-like preferences for natural markedness direction (overt marking targets semantically atypical arguments) but did not reproduce the strong object preference observed in human languages.

Conclusion: Different typological tendencies in human languages may arise from distinct underlying sources, with some patterns more learnable from distributional data than others.

Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
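
Minimal-pair evaluation reduces to comparing the total log-probability the model assigns to each variant. A sketch of that scoring step using off-the-shelf GPT-2 from Hugging Face; the paper instead trains GPT-2 from scratch on synthetic DAM corpora, which this snippet does not reproduce.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def total_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * (ids.shape[1] - 1)

# An illustrative English pair; the paper's minimal pairs contrast marked
# vs. unmarked arguments in its synthetic DAM languages.
print(total_logprob("the dog chased the cat") >
      total_logprob("the dog the cat chased"))
```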

[42] What Language is This? Ask Your Tokenizer

Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel

Main category: cs.CL

TL;DR: UniLID is a language identification method using UnigramLM tokenization that learns language-conditional unigram distributions over a shared vocabulary, treating segmentation as language-specific, achieving strong performance especially in low-resource settings.

DetailsMotivation: Existing language identification systems perform well on high-resource languages but are brittle for low-resource and closely related languages, creating a need for more robust methods that work efficiently with limited data.

Method: Uses UnigramLM tokenization algorithm to learn language-conditional unigram distributions over a shared tokenizer vocabulary, treating segmentation as language-specific. The approach is data/compute-efficient, supports incremental language addition without retraining, and integrates with existing tokenization pipelines.

Result: UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings (surpassing 70% accuracy with just 5 labeled samples per language), and delivers large gains on fine-grained dialect identification compared to baselines like fastText, GlotLID, and CLD3.

Conclusion: UniLID provides an efficient, scalable language identification method that excels in low-resource scenarios and dialect identification, with practical advantages for multilingual NLP pipelines.

Abstract: Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
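
At its core the method is Bayesian classification under per-language unigram models over a shared vocabulary. A toy sketch that approximates tokens with characters to stay self-contained; real UniLID fits UnigramLM vocabularies and treats segmentation itself as language-specific.

```python
import math
from collections import Counter

train = {
    "en": "the cat sat on the mat and the dog slept",
    "de": "der hund schlief und die katze sass auf der matte",
}
vocab = set("".join(train.values()))
models = {lang: Counter(text) for lang, text in train.items()}

def loglik(text: str, lang: str) -> float:
    counts, total = models[lang], sum(models[lang].values())
    return sum(
        math.log((counts[ch] + 1) / (total + len(vocab)))  # add-one smoothing
        for ch in text if ch in vocab
    )

def identify(text: str) -> str:
    return max(models, key=lambda lang: loglik(text, lang))

print(identify("die katze schlief"))  # expected: "de"
```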

[43] Sink-Aware Pruning for Diffusion Language Models

Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen

Main category: cs.CL

TL;DR: Sink-Aware Pruning for Diffusion Language Models identifies and prunes unstable attention sink tokens that are transient in DLMs, unlike in autoregressive LLMs where sinks are stable anchors.

DetailsMotivation: Diffusion Language Models have high inference costs due to iterative denoising, and existing pruning methods from autoregressive LLMs preserve attention sink tokens assuming they serve as stable global anchors. However, this assumption doesn't hold for DLMs where sink positions show high variance across timesteps.

Method: Proposes Sink-Aware Pruning that automatically identifies and prunes unstable sinks in DLMs by analyzing how dominant sink locations shift across timesteps, unlike prior methods that preserve sinks for AR LLMs.

Result: Without retraining, the method achieves better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute.

Conclusion: Attention sink behavior differs fundamentally between DLMs and AR LLMs, enabling more effective pruning by targeting unstable sinks in diffusion models.

Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
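
The diagnostic behind the method is how much the dominant sink position drifts across denoising steps. A sketch of that measurement, with random attention maps standing in for a real DLM's and the statistic simplified to a standard deviation; the paper's exact instability measure may differ.

```python
import torch

steps, heads, seq = 16, 8, 64
attn = torch.rand(steps, heads, seq, seq)
attn = attn / attn.sum(-1, keepdim=True)   # normalize rows to attention weights

# Attention mass received by each key position, per step (sum over heads, queries).
received = attn.sum(dim=(1, 2))             # (steps, seq)
sink_pos = received.argmax(dim=-1).float()  # dominant sink position per step

print("sink positions over timesteps:", sink_pos.tolist())
print("sink-position std:", sink_pos.std().item())  # high => transient sinks
```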

[44] Improving Stance Detection by Leveraging Measurement Knowledge from Social Sciences: A Case Study of Dutch Political Tweets and Traditional Gender Role Division

Qixiang Fang, Anastasia Giachanou, Ayoub Bagheri

Main category: cs.CL

TL;DR: Using validated social science survey instruments improves stance detection performance for traditional gender role division in Dutch political tweets

DetailsMotivation: Stance detection for political tweets is important, especially for divisive issues like traditional gender role division in Dutch politics. The paper aims to improve stance detection by leveraging established, validated survey instruments from social sciences.

Method: Applied stance detection to tweets from official Dutch party accounts (2017-2021) focusing on traditional gender role division. Used validated social science survey instruments as a framework to measure attitudes and improve detection performance.

Result: Using validated survey instruments helps improve stance detection performance for traditional gender role division in political tweets.

Conclusion: Incorporating established social science measurement tools can enhance stance detection accuracy for specific political issues like gender role division.

Abstract: Stance detection concerns automatically determining the viewpoint (i.e., in favour of, against, or neutral) of a text’s author towards a target. Stance detection has been applied to many research topics, among which the detection of stances behind political tweets is an important one. In this paper, we apply stance detection to a dataset of tweets from official party accounts in the Netherlands between 2017 and 2021, with a focus on stances towards traditional gender role division, a divisive issue among (some) Dutch political parties. To implement and improve stance detection of traditional gender role division, we propose to leverage an established survey instrument from social sciences, which has been validated for the purpose of measuring attitudes towards traditional gender role division. Based on our experiments, we show that using such a validated survey instrument helps to improve stance detection performance.

[45] Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling

Kaleel Mahmood, Shaoyi Huang

Main category: cs.CL

TL;DR: ECP (Efficient Context propagating Perceiver) improves upon PerceiverAR by better utilizing context and latent sequences in autoregressive training while maintaining efficient attention complexity comparable to LongLoRA.

DetailsMotivation: Address the quadratic complexity problem in Transformer attention mechanisms while maintaining high performance, building upon PerceiverAR to explore better trade-offs between context preservation and computational efficiency.

Method: Develops four new architectural paradigms based on PerceiverAR, with ECP as the best performer. ECP uses both context and latent sequences in autoregressive training and employs pairwise segment attention to extract better information while maintaining LongLoRA-level attention complexity.

Result: ECP significantly outperforms other state-of-the-art Transformer models on Wikitext-103, PG-19, and sCIFAR-10 benchmarks.

Conclusion: ECP successfully addresses the quadratic attention complexity problem while improving language modeling performance through better context utilization and efficient attention mechanisms.

Abstract: One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences. Many recent research works have attempted to reduce the $O(n^2)$ time complexity of attention to semi-linear complexity. However, maintaining high performance once complexity is reduced remains an unsolved problem. One of the important works in this respect is the Perceiver class of architectures, which has demonstrated excellent performance while reducing computational complexity. In this paper, we use the PerceiverAR as a basis and explore the design space of different trade-offs between preserving context and reducing attention complexity. To this end, we develop four new architectural paradigms, the best performing of which we denote as the Efficient Context propagating Perceiver (ECP). ECP has two major advantages over the PerceiverAR. First, the ECP architecture overcomes the main drawback of PerceiverAR by utilizing both the context and the latent sequences in autoregressive training. Second, the ECP architecture operates with the same attention complexity as LongLoRA, making it computationally efficient. More importantly, via pairwise segment attention, it extracts better information, resulting in improved language modeling. Empirically, we demonstrate that the ECP architecture significantly outperforms other state-of-the-art Transformer models on Wikitext-103, PG-19, and sCIFAR-10.

[46] Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Bettina Messmer, Vinko Sabolčec, Martin Jaggi

Main category: cs.CL

TL;DR: A model-based filtering framework for multilingual datasets that identifies diverse, structured, knowledge-rich samples to improve LLM training efficiency and mitigate the curse of multilinguality.

DetailsMotivation: Model-based filtering techniques have focused primarily on English, leaving non-English languages under-researched. Rule-based filtering heuristics exist for multilingual datasets, but model-based approaches are needed for better data curation across diverse languages.

Method: Develop a transparent, simple, and efficient model-based filtering framework using Transformer- and FastText-based classifiers. Conduct comprehensive ablation studies on FineWeb-2 dataset across diverse language families, scripts, and resource availability. The approach emphasizes broad accessibility of both technique and data.

Result: Training a 1B-parameter Llama model for 70B and 119B tokens, the approach matches baseline MMLU score with only 15% of training tokens. Improves across other benchmarks and mitigates the curse of multilinguality. Framework extended to 20 languages with refined pretraining datasets released.

Conclusion: The model-based filtering framework effectively identifies high-quality multilingual data, significantly improving training efficiency and model performance while addressing language disparities. The approach demonstrates strong generalizability across diverse languages.

Abstract: Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.
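
The FastText half of such a filter is only a few lines. A sketch assuming a labeled file `train.txt` with `__label__keep` / `__label__drop` examples; the label names, hyperparameters, and score threshold are illustrative, not the paper's.

```python
import fasttext

# train.txt (assumed to exist): one example per line, e.g.
#   __label__keep  A well-structured encyclopedic paragraph ...
#   __label__drop  click here to win !!!
model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

def keep(doc: str, threshold: float = 0.9) -> bool:
    """Keep a document only if the classifier is confident it is high quality."""
    labels, probs = model.predict(doc.replace("\n", " "))
    return labels[0] == "__label__keep" and probs[0] >= threshold

print(keep("The mitochondrion is the powerhouse of the cell."))
```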

[47] ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

Main category: cs.CL

TL;DR: ReplaceMe is a training-free depth pruning method for transformers that replaces blocks with linear approximations using calibration data, achieving up to 25% pruning with minimal performance loss.

DetailsMotivation: Existing pruning methods require extensive retraining/fine-tuning and architectural modifications, which is computationally expensive. The authors aim to develop a training-free approach that can prune transformer models efficiently without performance degradation.

Method: ReplaceMe uses a small calibration dataset to estimate linear transformations that approximate pruned transformer blocks. These linear mappings can be seamlessly merged with remaining blocks, requiring no additional parameters or training. The method works by replacing selected transformer layers with learned linear operations.

Result: The method achieves up to 25% pruning while retaining approximately 90% of original model performance on open benchmarks. It outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that require extensive retraining.

Conclusion: ReplaceMe provides an effective training-free depth pruning solution for transformers that maintains high performance with minimal computational overhead, making it practical for real-world deployment of large language models.

Abstract: We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
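
The replacement step is essentially a least-squares fit: capture the hidden states entering and leaving the blocks slated for pruning on calibration data, solve for the linear map between them, and merge that map into a neighboring layer. A minimal sketch with random stand-ins for the captured activations:

```python
import torch

d, n = 512, 4096                       # hidden size, calibration tokens (illustrative)
X = torch.randn(n, d)                  # stand-in for states entering the pruned span
Y = X @ torch.randn(d, d) * 0.1 + X    # stand-in for states leaving the pruned span

W = torch.linalg.lstsq(X, Y).solution  # least-squares fit of Y ≈ X @ W

err = (X @ W - Y).norm() / Y.norm()
print(f"relative reconstruction error: {err:.4f}")
# In practice, W is merged into an adjacent remaining block's weights,
# so the pruned model carries no extra parameters.
```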

[48] FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

Main category: cs.CL

TL;DR: FinTagging benchmark for XBRL financial tagging with two subtasks: FinNI for entity extraction and FinCL for taxonomy mapping, revealing LLMs struggle with fine-grained concept linking.

DetailsMotivation: Current XBRL tagging benchmarks oversimplify the complex task of mapping financial figures to GAAP concepts, ignoring hierarchical taxonomy semantics and document structure, failing to evaluate LLMs under realistic reporting conditions.

Method: Introduces FinTagging benchmark with two-stage approach: FinNI extracts entities and types from heterogeneous contexts (text and tables), and FinCL maps extracted entities to full US GAAP taxonomy, enabling assessment of numerical reasoning and taxonomy alignment.

Result: Evaluation of diverse LLMs in zero-shot settings shows models generalize well in extraction (FinNI) but struggle significantly with fine-grained concept linking (FinCL), highlighting limitations in domain-specific structure-aware reasoning.

Conclusion: FinTagging provides comprehensive benchmark for structure-aware XBRL tagging, revealing critical gaps in LLMs’ ability to handle hierarchical taxonomy semantics and structured financial document reasoning.

Abstract: Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two-stage formulation enables a fair assessment of LLMs’ capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings reveals that while models generalize well in extraction, they struggle significantly with fine-grained concept linking, highlighting critical limitations in domain-specific, structure-aware reasoning.

[49] A dependently-typed calculus of event telicity and culminativity

Pavel Kovalev, Carlo Angiuli

Main category: cs.CL

TL;DR: A dependently-typed linguistic framework for analyzing event telicity and culminativity using Martin-Löf type theory, formalized in Agda.

DetailsMotivation: To provide a formal, cross-linguistic framework for analyzing event properties like telicity (bounded events) and culminativity (achieving inherent endpoints) using dependent type theory, addressing limitations in existing linguistic formalisms.

Method: Extends intensional Martin-Löf dependent type theory with two components: 1) nominal domain modeling of boundedness in noun phrases, subtyping, and modification; 2) verbal domain using dependent event calculus to define telic events (with bounded undergoers) and culminating events (achieving endpoints). Formalized in Agda proof assistant.

Result: Developed a formal framework that can model English sentences, capturing entailments related to telicity and culminativity, with implementation in Agda demonstrating practical applicability.

Conclusion: The framework provides a rigorous, type-theoretic foundation for analyzing event semantics across languages, with potential for computational linguistics applications and cross-linguistic comparison.

Abstract: We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences. Our framework consists of two parts. In the nominal domain, we model the boundedness of noun phrases and its relationship to subtyping, delimited quantities, and adjectival modification. In the verbal domain we define a dependent event calculus, modeling telic events as those whose undergoer is bounded, culminating events as telic events that achieve their inherent endpoint, and consider adverbial modification. In both domains we pay particular attention to associated entailments. Our framework is defined as an extension of intensional Martin-Löf dependent type theory, and the rules and examples in this paper have been formalized in the Agda proof assistant.

[50] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Maximilian Kreutner, Marlene Lutz, Markus Strohmaier

Main category: cs.CL

TL;DR: LLMs can simulate European Parliament voting behavior using persona prompts, achieving reasonable accuracy despite progressive biases.

DetailsMotivation: To investigate whether LLMs with persona prompting can accurately predict individual voting decisions and aggregate policy positions of European political groups, addressing known progressive biases in LLMs.

Method: Zero-shot persona prompting with limited information to predict voting behavior, evaluated for stability against counterfactual arguments, different persona prompts, and generation methods.

Result: Achieved weighted F1 score of approximately 0.793 for simulating voting behavior of Members of the European Parliament, with reasonable accuracy in predicting group policy positions.

Conclusion: LLMs with persona prompting can effectively simulate political voting behavior despite inherent biases, providing a useful tool for political analysis and prediction.

Abstract: Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups with which the base model is not aligned. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict the positions of European groups on a diverse set of policies. We evaluate whether predictions are stable in response to counterfactual arguments, different persona prompts, and generation methods. Finally, we find that we can simulate the voting behavior of Members of the European Parliament reasonably well, achieving a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at the following URL: https://github.com/dess-mannheim/european_parliament_simulation.
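
A zero-shot persona prompt for this setup needs little more than the member's identity and the motion text. An illustrative template; the fields and wording here are assumptions, not the authors' released prompts.

```python
# Hypothetical persona prompt builder; field names and phrasing are
# illustrative stand-ins for the paper's actual template.
def persona_vote_prompt(name, party, country, group, motion) -> str:
    return (
        f"You are {name}, a Member of the European Parliament from "
        f"{country}, affiliated with {party} ({group}).\n"
        f"Motion: {motion}\n"
        "How do you vote? Answer with exactly one of: FOR, AGAINST, ABSTAIN."
    )

print(persona_vote_prompt(
    "Jane Doe", "Example Party", "Austria", "Greens/EFA",
    "Resolution on strengthening EU-wide renewable energy targets",
))
```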

[51] DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries

Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto

Main category: cs.CL

TL;DR: DistillNote is an evaluation framework that assesses LLM-generated clinical summaries by measuring how well they preserve diagnostic information when used in downstream clinical prediction tasks, specifically heart failure diagnosis.

DetailsMotivation: Current LLM summarization of clinical notes lacks evaluation of essential diagnostic information preservation, posing risks to patient care. There's a need for functional utility assessment beyond traditional metrics.

Method: Created DistillNote framework that generates LLM summaries from MIMIC-IV clinical notes at varying compression rates, then fine-tunes LLMs on both original notes and summaries for heart failure diagnosis prediction, comparing performance using AUROC metrics.

Result: LLM summaries maintained strong diagnostic signal even with 20x compression - models on condensed summaries achieved AUROC 0.92 vs 0.94 baseline (97% retention). Functional evaluation provided new insights beyond traditional assessment methods.

Conclusion: DistillNote offers scalable, task-based functional utility assessment for clinical summaries, revealing compression-performance tradeoffs and supporting data-driven deployment decisions for LLM summarizers in healthcare.

Abstract: Large language models (LLMs) are increasingly used to generate summaries from clinical notes. However, their ability to preserve essential diagnostic information remains underexplored, which could lead to serious risks for patient care. This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much prediction signal is retained. We generated over 192,000 LLM summaries from MIMIC-IV clinical notes with increasing compression rates: standard, section-wise, and distilled section-wise. Heart failure diagnosis was chosen as the prediction task, as it requires integrating a wide range of clinical signals. LLMs were fine-tuned on both the original notes and their summaries, and their diagnostic performance was compared using the AUROC metric. We contrasted DistillNote’s results with evaluations from LLM-as-judge and clinicians, assessing consistency across different evaluation methods. Summaries generated by LLMs maintained a strong level of heart failure diagnostic signal despite substantial compression. Models trained on the most condensed summaries (about 20 times smaller) achieved an AUROC of 0.92, compared to 0.94 with the original note baseline (97 percent retention). Functional evaluation provided a new lens for medical summary assessment, emphasizing clinical utility as a key dimension of quality. DistillNote introduces a new scalable, task-based method for assessing the functional utility of LLM-generated clinical summaries. Our results detail compression-to-performance tradeoffs from LLM clinical summarization for the first time. The framework is designed to be adaptable to other prediction tasks and clinical domains, aiding data-driven decisions about deploying LLM summarizers in real-world healthcare settings.
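
The headline comparison is AUROC retention between a predictor fine-tuned on full notes and one fine-tuned on compressed summaries. A minimal sketch with placeholder probabilities standing in for real model outputs:

```python
from sklearn.metrics import roc_auc_score

# Placeholder labels and predicted probabilities; in the framework these
# come from models fine-tuned on full notes vs. compressed summaries.
y_true       = [1, 0, 1, 1, 0, 0, 1, 0]
p_full_notes = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4]
p_summaries  = [0.8, 0.3, 0.7, 0.6, 0.4, 0.2, 0.7, 0.3]

auc_full = roc_auc_score(y_true, p_full_notes)
auc_sum = roc_auc_score(y_true, p_summaries)
print(f"retention: {auc_sum / auc_full:.1%}")  # diagnostic signal kept after compression
```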

[52] π-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering

Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger, Kilian Q. Weinberger

Main category: cs.CL

TL;DR: π-CoT combines Prolog logic programming with Chain-of-Thought prompting to improve multi-hop question answering by decomposing complex questions into single-hop sub-queries resolved sequentially.

DetailsMotivation: Standard Chain-of-Thought prompting struggles with complex multi-hop questions in retrieval-augmented generation settings, often falling into circular reasoning or deviating from logical paths, highlighting the need for more structured reasoning approaches.

Method: π-CoT reformulates multi-hop questions into Prolog queries that are decomposed into single-hop sub-queries. These sub-queries are resolved sequentially to produce intermediate artifacts, which then initialize the subsequent Chain-of-Thought reasoning procedure.

Result: Extensive experiments show that π-CoT significantly outperforms standard RAG and in-context Chain-of-Thought on multi-hop question-answering benchmarks.

Conclusion: Combining logic programming’s structural rigor with language models’ flexibility through π-CoT provides an effective solution for improving multi-hop reasoning in question-answering systems.

Abstract: Chain-of-Thought (CoT) prompting significantly enhances large language models’ (LLMs) problem-solving capabilities, but still struggles with complex multi-hop questions, often falling into circular reasoning patterns or deviating from the logical path entirely. This limitation is particularly acute in retrieval-augmented generation (RAG) settings, where obtaining the right context is critical. We introduce Prolog-Initialized Chain-of-Thought (π-CoT), a novel prompting strategy that combines logic programming’s structural rigor with language models’ flexibility. π-CoT reformulates multi-hop questions into Prolog queries decomposed as single-hop sub-queries. These are resolved sequentially, producing intermediate artifacts, with which we initialize the subsequent CoT reasoning procedure. Extensive experiments demonstrate that π-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.
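
The resolution stage can be pictured as chaining single-hop lookups, threading each answer into the next goal. A toy sketch over a hand-written fact base; in π-CoT an LLM writes the Prolog query and each hop is resolved against retrieved context.

```python
# Toy single-hop resolution loop in the spirit of π-CoT; the fact base
# and Prolog-style goal tuples are illustrative.
FACTS = {
    ("director_of", "Inception"): "Christopher Nolan",
    ("birthplace_of", "Christopher Nolan"): "London",
}

def resolve(goals):
    binding = None
    for relation, arg in goals:
        arg = binding if arg == "X" else arg  # substitute the previous answer
        binding = FACTS[(relation, arg)]      # one single-hop lookup per goal
    return binding

# "Where was the director of Inception born?" as two chained goals:
print(resolve([("director_of", "Inception"), ("birthplace_of", "X")]))
```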

[53] Tokens with Meaning: A Hybrid Tokenization Approach for Turkish

M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

Main category: cs.CL

TL;DR: A linguistically informed hybrid tokenizer for Turkish that combines morphological segmentation, phonological normalization, and subword fallback to improve tokenization quality for morphologically rich languages.

DetailsMotivation: Frequency-driven subword tokenizers like BPE and WordPiece often fragment morphologically rich and agglutinative languages like Turkish, obscuring morpheme boundaries and potentially harming model performance.

Method: Hybrid tokenizer combining: 1) dictionary-driven morphological segmentation (roots and affixes), 2) phonological normalization mapping allomorphic variants to shared identifiers, 3) controlled subword fallback for OOV coverage, and 4) orthographic case token for capitalization.

Result: Achieves 90.29% Turkish Token Percentage and 85.80% Pure Token Percentage on TR-MMLU, substantially exceeding general-purpose tokenizers. Outperforms baselines on Turkish STS Benchmark, MTEB-TR, and TurBLiMP under strict random initialization controls.

Conclusion: Linguistically informed tokenization significantly improves tokenization quality and downstream performance for morphologically rich languages like Turkish, demonstrating the importance of language-specific tokenizer design.

Abstract: Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces to mark word boundaries), 72 affix identifiers that cover 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without inflating the vocabulary. We evaluate tokenization quality on the TR-MMLU dataset using two linguistic alignment metrics: Turkish Token Percentage (TR%), the proportion of produced tokens that correspond to Turkish lexical/morphemic units under our lexical resources, and Pure Token Percentage (Pure%), the proportion of tokens aligning with unambiguous root/affix boundaries. The proposed tokenizer reaches a TR% of 90.29 and a Pure% of 85.80 on TR-MMLU, substantially exceeding several general-purpose tokenizers. We further validate practical utility with downstream sentence embedding benchmarks under a strict random initialization control to isolate tokenizer inductive bias. Across four matched models (TurkishTokenizer, CosmosGPT2, Mursit, and Tabi), TurkishTokenizer outperforms all baselines on the Turkish STS Benchmark and achieves the strongest overall average on MTEB-TR. It also yields the strongest average accuracy on TurBLiMP under a centroid-based proxy.
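
The segmentation cascade is longest-match root lookup, then affix stripping, then a fallback for anything left over. A toy sketch with tiny illustrative dictionaries and characters standing in for the learned subword fallback; the released tokenizer adds allomorph normalization and case tokens.

```python
# Toy longest-match segmentation in the spirit of the hybrid tokenizer;
# the dictionaries are illustrative, not the released 22k-root vocabulary.
ROOTS = {"ev", "kitap", "göz"}        # example Turkish roots
AFFIXES = {"ler", "lar", "de", "da"}  # example affixes (plural, locative)

def segment(word: str) -> list[str]:
    for cut in range(len(word), 0, -1):  # try the longest root first
        if word[:cut] in ROOTS:
            return [word[:cut]] + segment_affixes(word[cut:])
    return list(word)  # fallback: character pieces (stand-in for subwords)

def segment_affixes(rest: str) -> list[str]:
    if not rest:
        return []
    for cut in range(len(rest), 0, -1):
        if rest[:cut] in AFFIXES:
            return [rest[:cut]] + segment_affixes(rest[cut:])
    return list(rest)

print(segment("evlerde"))   # ['ev', 'ler', 'de']
print(segment("kitaplar"))  # ['kitap', 'lar']
```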

[54] CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: CoSpaDi is a training-free compression framework for LLMs that replaces low-rank factorization with structured sparse decomposition using dictionary learning, improving accuracy at fixed compression ratios.

DetailsMotivation: Existing LLM compression methods using low-rank approximations are computationally efficient but overly rigid for heterogeneous projection weights, leading to avoidable accuracy loss. There's a need for more expressive compression that maintains accuracy at fixed parameter budgets.

Method: CoSpaDi uses structured sparse decomposition where weight matrices are represented as dense dictionaries multiplied by column-sparse coefficient matrices. It’s calibration-guided, optimizing factorization to minimize functional reconstruction error using a small calibration set. Activation-derived Gram orthonormalization transforms the problem into standard dictionary learning, supporting both per-layer compression and cross-layer dictionary sharing.

Result: Across Llama and Qwen model families, CoSpaDi consistently improves accuracy-compression and perplexity-compression trade-offs over SVD-based baselines and structured pruning at 20-40% compression ratios. The structured sparsity enables sparse-dense computation and integrates with post-training quantization.

Conclusion: CoSpaDi provides a more expressive alternative to low-rank compression for LLMs, achieving better accuracy at fixed compression ratios through structured sparse decomposition and calibration-guided optimization.

Abstract: Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy–compression and perplexity–compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse–dense computation and integrates with post-training quantization of the sparse coefficients.
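
The decomposition can be prototyped with off-the-shelf dictionary learning by treating weight columns as samples. A toy sketch in which scikit-learn's plain reconstruction objective stands in for the paper's calibration-guided, activation-weighted one:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))  # stand-in for a projection weight matrix

dl = DictionaryLearning(
    n_components=32,
    transform_algorithm="omp",
    transform_n_nonzero_coefs=4,    # sparsity per column of C
    fit_algorithm="lars",
    random_state=0,
)
codes = dl.fit_transform(W.T)       # columns of W treated as samples
D, C = dl.components_.T, codes.T    # W ≈ D @ C with column-sparse C

err = np.linalg.norm(D @ C - W) / np.linalg.norm(W)
nnz = np.count_nonzero(C, axis=0).mean()
print(f"relative error: {err:.3f}, nonzeros per column: {nnz:.1f}")
```

Each column of W is rebuilt from its own small subset of dictionary atoms, which is the union-of-subspaces property the abstract contrasts with a single shared low-rank subspace.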

[55] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li

Main category: cs.CL

TL;DR: PoLi-RL introduces a novel Point-to-List Reinforcement Learning framework for Conditional Semantic Textual Similarity (C-STS) that uses a two-stage curriculum with parallel slice ranking rewards to optimize LLMs for ranking-based conditional judgment tasks.

DetailsMotivation: Existing C-STS methods are limited to discriminative models and fail to leverage recent advances in LLMs and RL. RL is well-suited for C-STS as it can directly optimize non-differentiable ranking metrics and guide reasoning, but naive listwise RL struggles with complex reward signals.

Method: PoLi-RL uses a two-stage curriculum: 1) pointwise reward training for basic scoring, then 2) hybrid reward combining pointwise, pairwise, and listwise objectives. Key innovation is Parallel Slice Ranking Reward (PSRR) that computes ranking rewards in parallel slices for granular credit assignment.

Result: Achieves Spearman correlation coefficient of 48.18 on official C-STS benchmark, establishing new SOTA for cross-encoder architecture.

Conclusion: First successful application of RL to C-STS, introducing a powerful paradigm for aligning LLMs for complex, ranking-based conditional judgment tasks.

Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully leverage recent breakthroughs in the NLP community involving Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. Nevertheless, we find that naively applying listwise RL fails to produce meaningful improvements, as the model struggles with complex, coarse-grained reward signals, leading to optimization difficulties. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with a simple pointwise reward to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model’s ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice consists of completions with the same index from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful paradigm for aligning LLMs for complex, ranking-based conditional judgment tasks.
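
The second-stage hybrid reward mixes a per-item scoring term with a ranking term. A sketch with illustrative weights, using scipy's Spearman correlation as the listwise signal; the paper's PSRR computes ranking rewards over parallel slices of same-index completions, which this omits.

```python
from scipy.stats import spearmanr

def pointwise_reward(pred: float, gold: float) -> float:
    return 1.0 - abs(pred - gold) / 4.0  # C-STS similarity scores span [1, 5]

def hybrid_reward(preds, golds, w_point=0.5, w_list=0.5) -> float:
    point = sum(pointwise_reward(p, g) for p, g in zip(preds, golds)) / len(preds)
    rho, _ = spearmanr(preds, golds)     # listwise ranking signal
    return w_point * point + w_list * rho

print(hybrid_reward([2.0, 4.5, 1.0, 3.0], [2.5, 5.0, 1.0, 3.5]))
```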

[56] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Yankai Chen, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

Main category: cs.CL

TL;DR: FinAuditing benchmark evaluates LLMs on professional financial auditing tasks using real XBRL filings, revealing gaps in concept retrieval, taxonomy-aware relation modeling, and cross-document reasoning.

DetailsMotivation: Financial auditing requires detecting inconsistencies across structured XBRL disclosures, but LLMs' capability in professional-grade auditing remains unclear despite their promise on isolated financial tasks.

Method: Created FinAuditing benchmark from real XBRL filings with 1,102 annotated instances (~33k tokens each), defining three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluated 13 state-of-the-art LLMs.

Result: LLMs show substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning, highlighting limitations in professional financial auditing tasks.

Conclusion: There is a need for realistic, structure-aware benchmarks for financial auditing, and current LLMs have significant limitations in handling professional-grade auditing tasks requiring structured reasoning.

Abstract: Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.

[57] Assessing Web Search Credibility and Response Groundedness in Chat Assistants

Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko

Main category: cs.CL

TL;DR: Systematic evaluation of AI chat assistants’ web search behavior for fact-checking, focusing on source credibility and response groundedness across misinformation-prone topics.

DetailsMotivation: As chat assistants integrate web search functionality, there's a growing risk of amplifying misinformation from low-credibility sources, necessitating systematic evaluation of their fact-checking behavior and source credibility.

Method: Evaluated GPT-4o, GPT-5, Perplexity, and Qwen Chat using 100 claims across five misinformation-prone topics, assessing source credibility and groundedness of responses with respect to cited sources.

Result: Perplexity achieved highest source credibility, while GPT-4o exhibited elevated citation of non-credible sources on sensitive topics, revealing differences between assistants.

Conclusion: Provides first systematic comparison of chat assistants for fact-checking behavior, offering foundation for evaluating AI systems in high-stakes information environments.

Abstract: Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants’ web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.

[58] Estonian Native Large Language Model Benchmark

Helena Grete Lillepalu, Tanel Alumäe

Main category: cs.CL

TL;DR: Introduces a new Estonian language benchmark for evaluating LLMs across 7 diverse native datasets, comparing base and instruction-tuned models with human and LLM-as-judge evaluation methods.

DetailsMotivation: Limited availability of LLM benchmarks for Estonian language and lack of comprehensive evaluation comparing different LLMs on Estonian tasks.

Method: Created benchmark with 7 diverse datasets from native Estonian sources (no machine translation). Evaluated 6 base models and 26 instruction-tuned models using both human evaluation and LLM-as-a-judge methods (Claude 3.7 Sonnet).

Result: Human evaluation scores showed moderate to high correlation with benchmark evaluations depending on dataset. Claude 3.7 Sonnet demonstrated strong alignment with human ratings, indicating top LLMs can effectively support Estonian model evaluation.

Conclusion: Provides comprehensive Estonian LLM benchmark; shows LLM-as-judge methods work well for low-resource languages when using high-quality LLMs like Claude 3.7 Sonnet.

Abstract: The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization abilities, contextual comprehension, and more. The datasets are all generated from native Estonian sources without using machine translation. We compare the performance of base models, instruction-tuned open-source models, and commercial models. Our evaluation includes 6 base models and 26 instruction-tuned models. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.

[59] Probability Distributions Computed by Hard-Attention Transformers

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

Main category: cs.CL

TL;DR: Transformers have different expressivity as language models (autoregressive, probabilistic) than as language recognizers; autoregression can increase expressivity, and adding probability breaks some equivalences.

DetailsMotivation: Most expressivity results for transformers treat them as language recognizers (accept/reject strings), not as they're actually used in practice - as language models that generate strings autoregressively and probabilistically. There's a gap between theoretical analysis and practical usage.

Method: Characterizes probability distributions that transformer language models can express. Analyzes how making transformer language recognizers autoregressive affects expressivity, and how making them probabilistic breaks equivalences that hold in non-probabilistic case.
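
For context, the language-model setting the paper analyzes is the standard autoregressive factorization of a string distribution; a sketch of that factorization in standard notation (not taken from the paper):

```latex
% A string distribution factored into next-token conditionals; the EOS
% term is what makes the probabilities over finite strings sum to one.
p(w_1 \dots w_n) \;=\; \Bigl(\prod_{t=1}^{n} p(w_t \mid w_{<t})\Bigr)\, p(\mathrm{EOS} \mid w_{\le n})
```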

Result: Autoregression can sometimes increase transformers’ expressivity. Probabilistic transformers break equivalences that hold in non-probabilistic case. Provides characterization of probability distributions expressible by transformer language models.

Conclusion: Transformers’ expressivity differs significantly between language recognizer and language model settings. Theoretical analysis should account for practical usage as autoregressive, probabilistic generators to understand their true capabilities.

Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.

[60] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić

Main category: cs.CL

TL;DR: LLMs show strong zero-shot text classification performance in South Slavic languages, matching fine-tuned BERT models, but have practical limitations like slower inference and higher costs.

DetailsMotivation: To evaluate how well large language models (LLMs) perform on text classification tasks in less-resourced South Slavic languages compared to fine-tuned BERT-like models, as the field shifts toward zero-shot prompting but performance in these languages remains under-explored.

Method: Comparative evaluation of openly available fine-tuned BERT-like models with open-source and closed-source LLMs across three text classification tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts across several South Slavic languages.

Result: LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. LLMs perform comparably in South Slavic languages and English in zero-shot setups. However, LLMs have less predictable outputs, significantly slower inference, and higher computational costs.

Conclusion: While LLMs show impressive zero-shot text classification capabilities in less-resourced languages, fine-tuned BERT-like models remain more practical for large-scale automatic text annotation due to LLMs’ limitations in predictability, speed, and computational cost.

Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.

[61] Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

Wangjiaxuan Xin

Main category: cs.CL

TL;DR: ECN is a multi-stage prompting framework that enhances LLM empathy through four sequential stages: perspective adoption, emotional resonance, reflective understanding, and integrative synthesis, achieving top empathy scores while maintaining other conversational metrics.

DetailsMotivation: Current large language models often lack genuine empathy and inclusive understanding in conversations, limiting their effectiveness in applications requiring emotional intelligence and contextual awareness.

Method: Four-stage prompting framework: 1) Perspective Adoption - adopt user’s viewpoint, 2) Emotional Resonance - recognize and validate emotions, 3) Reflective Understanding - analyze emotional context, 4) Integrative Synthesis - generate empathetic responses.
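
A minimal sketch of what such a cascade could look like in code, assuming a generic `complete(prompt) -> str` LLM call (a hypothetical helper, not the paper's implementation):

```python
# Minimal sketch of a four-stage cascading prompt. Each stage's output
# becomes the context for the next stage; templates are illustrative.
STAGES = [
    "Adopt the user's viewpoint and restate their situation: {context}",
    "Recognize and validate the emotions expressed: {context}",
    "Reflect on the emotional context and what it implies: {context}",
    "Synthesize the above into one empathetic, inclusive response: {context}",
]

def empathetic_cascade(user_message: str, complete) -> str:
    context = user_message
    for template in STAGES:
        context = complete(template.format(context=context))
    return context
```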

Result: ECN achieves highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4 while maintaining competitive Regard and Perplexity metrics, demonstrating superior empathetic capabilities.

Conclusion: ECN effectively enhances LLM empathy and inclusivity through structured prompting, showing promise for conversational AI applications requiring emotional intelligence.

Abstract: This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN’s potential for applications requiring empathy and inclusivity in conversational AI.

[62] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: FusedKV reduces KV cache memory by 50% through learnable fusion of bottom and middle layer KV caches for top layers, maintaining performance while improving efficiency.

DetailsMotivation: Transformer decoders face prohibitive memory requirements for KV cache at long sequence lengths. Existing cross-layer KV cache sharing methods underperform compared to within-layer methods like GQA, creating a need for more efficient KV cache architectures.

Method: Analyzes information flow of keys and values across layers, revealing values come predominantly from bottom layers while keys draw from both bottom and middle layers. Proposes FusedKV where top-layer KV caches are learnable fusions of most informative bottom and middle layer caches, operating on post-RoPE keys to preserve positional information. Also introduces FusedKV-Lite for cross-layer sharing with bottom-layer values and middle-layer keys.
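
A minimal PyTorch sketch of the core fusion idea, where a top layer's KV cache is a learnable combination of caches from lower layers; the linear fusion form and shapes are illustrative assumptions, not the paper's exact design:

```python
# Sketch: fuse bottom- and middle-layer KV caches into a top-layer cache
# with learnable per-source projections, so the top layer stores no cache.
import torch
import torch.nn as nn

class KVFusion(nn.Module):
    def __init__(self, head_dim: int):
        super().__init__()
        self.w_bottom = nn.Linear(head_dim, head_dim, bias=False)
        self.w_middle = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, kv_bottom: torch.Tensor, kv_middle: torch.Tensor) -> torch.Tensor:
        # Learnable fusion of the two most informative source caches.
        return self.w_bottom(kv_bottom) + self.w_middle(kv_middle)

fusion = KVFusion(head_dim=64)
# (batch, heads, seq_len, head_dim) caches from two source layers
k_top = fusion(torch.randn(1, 8, 128, 64), torch.randn(1, 8, 128, 64))
```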

Result: Reduces cache memory by 50% while achieving lower validation perplexity than standard Transformer decoders across LLMs ranging from 332M to 4B parameters.

Conclusion: FusedKV provides a memory-efficient, high-performance architectural alternative for transformer decoders by intelligently fusing KV cache information across layers based on observed information flow patterns.

Abstract: Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values in the top layers. Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.

[63] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models

Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier

Main category: cs.CL

TL;DR: QSTN is an open-source Python framework for generating questionnaire responses using LLMs, enabling systematic evaluation of survey presentation, prompt perturbations, and response generation methods.

DetailsMotivation: To support in-silico surveys and annotation tasks with LLMs by providing a systematic framework for evaluating questionnaire presentation, prompt perturbations, and response generation methods, addressing reproducibility and reliability concerns in LLM-based research.

Method: Developed an open-source Python framework with a no-code user interface that allows researchers to set up experiments without coding knowledge. The framework enables systematic generation of questionnaire responses from LLMs and evaluation of different presentation methods and prompt perturbations.

Result: Extensive evaluation (>40 million survey responses) shows that question structure and response generation methods significantly impact the alignment of generated survey responses with human answers. Found that answers can be obtained for a fraction of the compute cost by changing the presentation method.

Conclusion: QSTN supports reproducibility and reliability of LLM-based research by providing a systematic framework for questionnaire response generation and evaluation, with potential to reduce computational costs while maintaining response quality.

Abstract: We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation (>40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers. We also find that answers can be obtained for a fraction of the compute cost by changing the presentation method. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.

[64] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Jonathan Kamp, Roos Bakker, Dominique Blok

Main category: cs.CL

TL;DR: Analysis of biases in feature attribution methods for explaining language models, proposing evaluation metrics to measure lexical and position biases across different methods.

DetailsMotivation: Feature attribution methods provide token-level explanations for language models, but explanations vary greatly due to underlying biases. Users may mistrust them or trust them inadequately, requiring systematic analysis of these biases.

Method: Proposes a model- and method-agnostic framework with three evaluation metrics to structure biases. Systematically assesses lexical and position biases for two transformers using controlled pseudo-random classification on artificial data and semi-controlled causal relation detection on natural data.

Result: Finds a trade-off between lexical and position biases: models scoring high on one type score low on the other. Also finds that anomalous explanations are more likely to be biased.

Conclusion: Provides a framework for understanding and evaluating biases in feature attribution methods, revealing systematic patterns in how different methods and models exhibit biases, which can help users better interpret explanations.

Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model comparison, with models that score high on one type scoring low on the other. We also find signs that anomalous explanations are more likely to be biased.

[65] Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Stephen Gadd

Main category: cs.CL

TL;DR: Symphonym is a neural embedding system that maps names from any script into a unified phonetic space for cross-lingual and cross-script name matching, outperforming traditional string similarity methods.

DetailsMotivation: Existing approaches for linking names across historical sources, languages, and writing systems require language-specific phonetic algorithms or fail to capture phonetic relationships across different scripts, creating challenges in digital humanities and geographic information retrieval.

Method: Uses a Teacher-Student architecture where a Teacher network trained on articulatory phonetic features produces target embeddings, while a Student network learns to approximate these embeddings directly from characters. Combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Trained on 32.7 million triplet samples of toponyms spanning 20 writing systems.
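
A minimal sketch of the two training signals this setup implies, combining a distillation loss toward the Teacher's phonetic embeddings with a triplet margin loss on name pairs; shapes and the margin value are illustrative, though the 128-dimensional embedding size matches the paper:

```python
# Sketch: Student distills the Teacher's articulatory-feature embedding,
# while a triplet loss pulls matching names together and pushes
# non-matching names apart.
import torch
import torch.nn.functional as F

def training_losses(student_emb, teacher_emb, anchor, positive, negative, margin=0.2):
    distill = F.mse_loss(student_emb, teacher_emb)      # Student approximates Teacher
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return distill + triplet

e = lambda: torch.randn(32, 128)   # batch of 128-dim embeddings
loss = training_losses(e(), e(), e(), e(), e())
```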

Result: On the MEHDIE Hebrew-Arabic historical toponym benchmark, achieves Recall@10 of 97.6% and MRR of 90.3%, outperforming Levenshtein and Jaro-Winkler baselines. Evaluation on 12,947 real cross-script training pairs shows 82.6% achieve greater than 0.75 cosine similarity, with best performance on Arabic-Cyrillic (94-100%) and Cyrillic-Latin (94.3%) combinations.

Conclusion: Symphonym provides effective cross-script name matching with fixed-length embeddings that enable efficient retrieval in digital humanities workflows, demonstrating successful transfer from modern place names to historical orthographic variations.

Abstract: Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches require language-specific phonetic algorithms or fail to capture phonetic relationships across different scripts. This paper presents Symphonym, a neural embedding system that maps names from any script into a unified 128-dimensional phonetic space, enabling direct similarity comparison without runtime phonetic conversion. Symphonym uses a Teacher-Student architecture where a Teacher network trained on articulatory phonetic features produces target embeddings, while a Student network learns to approximate these embeddings directly from characters. The Teacher combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Training used 32.7 million triplet samples of toponyms spanning 20 writing systems from GeoNames, Wikidata, and Getty Thesaurus of Geographic Names. On the MEHDIE Hebrew-Arabic historical toponym benchmark, Symphonym achieves Recall@10 of 97.6% and MRR of 90.3%, outperforming Levenshtein and Jaro-Winkler baselines (Recall@1: 86.7% vs 81.5% and 78.5%). Evaluation on 12,947 real cross-script training pairs shows 82.6% achieve greater than 0.75 cosine similarity, with best performance on Arabic-Cyrillic (94–100%) and Cyrillic-Latin (94.3%) combinations. The fixed-length embeddings enable efficient retrieval in digital humanities workflows, with a case study on medieval personal names demonstrating effective transfer from modern place names to historical orthographic variation.

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou

Main category: cs.CL

TL;DR: JurisMMA is a novel framework for Legal Judgment Prediction that decomposes trial tasks into stages and uses multimodal data, validated on a new large Chinese judicial dataset.

DetailsMotivation: Traditional legal judgment prediction methods struggle with complex cases involving multiple allegations, diverse evidence, and lack adaptability. There's a need for more sophisticated approaches that can handle multimodal legal data and standardized legal processes.

Method: Proposes JurisMMA framework that decomposes trial tasks, standardizes processes into distinct stages. Also creates JurisMM dataset with over 100,000 recent Chinese judicial records containing both text and multimodal video-text data for comprehensive evaluation.

Result: Experiments on JurisMM and benchmark LawBench validate the framework’s effectiveness. The approach shows promise not just for legal judgment prediction but for broader legal applications.

Conclusion: JurisMMA offers new perspectives for developing future legal methods and datasets, demonstrating effectiveness in handling complex legal cases through task decomposition and multimodal data integration.

Abstract: Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

[67] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang

Main category: cs.CL

TL;DR: Training a proof-checking reward model for LLMs using scalable data generation pipeline to enable reinforcement learning for mathematical proof verification

DetailsMotivation: Current LLMs use RL with verifiable rewards for math reasoning, but proof-based problems lack simple answer matching for verification. Need automatic proof verification through a reliable reward model.

Method: Created scalable data pipeline using LLMs to generate “question-proof-check” triplets with diverse problem sources, generation methods, and error types. Used hierarchical human review for quality. Trained proof-checking RM with “LLM-as-a-RM-for-RM” approach and balanced token weighting.

Result: Model shows scalability and strong performance in reward accuracy, generalization ability, and test-time guidance. Provides practical recipes and tools for enhancing LLM mathematical capabilities.

Conclusion: Successfully developed scalable approach to train proof-checking reward models, enabling better RL for mathematical proof verification in LLMs.

Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality “question-proof-check” triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating an “LLM-as-a-RM-for-RM” approach and balanced token weighting to stabilize the RL process. Our experiments validate the model’s scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.

[68] RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

Isaac Picov, Ritesh Goru

Main category: cs.CL

TL;DR: RoPE-LIME is a method for explaining outputs of closed-source LLMs using a smaller open-source surrogate model with RoPE embedding-based similarity and efficient perturbation sampling.

DetailsMotivation: Closed-source LLMs pose challenges for explanation because gradient-based methods require API access, and perturbation methods are costly and noisy when they require regenerating text outputs.

Method: Uses a smaller open-source surrogate model to compute token-level attributions from probability-based objectives under input perturbations. Incorporates: (1) locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (2) Sparse-K sampling for efficient perturbation with better interaction coverage.
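
For readers unfamiliar with the LIME-style core this builds on: fit a kernel-weighted linear surrogate over binary feature masks, so the learned coefficients become per-feature attributions. A minimal numpy sketch under that framing, with the RoPE-space RWMD kernel abstracted away as precomputed weights (a hypothetical stand-in for the paper's kernel):

```python
# Sketch: weighted least squares over binary masks; coefficients are the
# per-feature attributions. Masks and objectives below are random stand-ins.
import numpy as np

def fit_surrogate(masks: np.ndarray, objective: np.ndarray, weights: np.ndarray):
    W = np.diag(weights)
    X = np.hstack([masks, np.ones((len(masks), 1))])   # add intercept column
    coef, *_ = np.linalg.lstsq(np.sqrt(W) @ X, np.sqrt(W) @ objective, rcond=None)
    return coef[:-1]                                    # drop the intercept

masks = (np.random.rand(64, 10) > 0.5).astype(float)   # 64 perturbations, 10 features
nll = np.random.rand(64)                                # objective (e.g. NLL) per mask
attr = fit_surrogate(masks, nll, weights=np.ones(64))   # kernel weights go here
```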

Result: Experiments on HotpotQA (sentence features) and hand-labeled MMLU subset (word features) show RoPE-LIME produces more informative attributions than leave-one-out sampling, improves over gSMILE, and substantially reduces closed-model API calls.

Conclusion: RoPE-LIME provides an effective approach for explaining closed-source LLM outputs by decoupling reasoning from explanation using surrogate models with efficient perturbation strategies.

Abstract: Explaining closed-source Large Language Model (LLM) outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce Rotary Positional Embedding Linear Local Interpretable Model-agnostic Explanations (RoPE-LIME), an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.

[69] The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis, Jose Alcocer, Kerby Bennett, Aarya Vijay Devnani, Parsa Hejabi, Harry G. Muttram, Akshay Kiran Padte, Mehrshad Saadatinia, Chenhao Wu, Alireza S. Ziabari, Michael Sierra-Arévalo, Nick Weller, Shrikanth Narayanan, Benjamin A. T. Graham, Morteza Dehghani

Main category: cs.CL

TL;DR: First large-scale traffic-stop dataset with respect ratings and rationales from multiple community perspectives (police-affiliated, justice-system-impacted, non-affiliated), enabling study of perceptual differences and perspective-aware modeling.

DetailsMotivation: Traffic stops are frequent police-civilian interactions where respect is central to public trust, but interpretation is subjective and shaped by lived experience, requiring community-specific perspectives.

Method: Created dataset from LAPD body-worn camera footage with annotations from three community groups; developed domain-specific evaluation rubric; introduced rubric-driven preference data construction; proposed perspective-aware modeling framework predicting personalized respect ratings and generating annotator-specific rationales from transcripts.

Result: Approach improved rating prediction performance and rationale alignment across all three annotator groups compared to baseline methods.

Conclusion: Perspective-aware framework enables law enforcement to better understand diverse community expectations, providing tool for building public trust and procedural legitimacy.

Abstract: Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.

[70] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer

Main category: cs.CL

TL;DR: LoRA-Squeeze improves standard LoRA by learning higher-rank solutions first then compressing them, achieving better performance with lower-rank adapters through post-hoc compression or dynamic rank annealing during training.

DetailsMotivation: Standard LoRA faces challenges with optimal rank pre-selection, rank-specific hyperparameter tuning, and deployment complexity of heterogeneous-rank modules. The authors propose that learning expressive higher-rank solutions first and then compressing them is better than learning constrained low-rank solutions directly.

Method: LoRA-Squeeze involves: 1) Fine-tuning with deliberately high source rank, 2) Reconstructing or efficiently approximating the full weight update matrix, 3) Using Randomized SVD to create compressed LoRA modules at lower target rank. Two variants: post-hoc compression and gradual in-tuning rank annealing.
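
A minimal sketch of the post-hoc squeeze step, assuming the usual LoRA parameterization ΔW = BA and using PyTorch's randomized SVD; ranks and shapes are illustrative, not the authors' implementation:

```python
# Sketch: reconstruct the high-rank LoRA update, then re-factorize it at
# a lower target rank via randomized SVD (torch.svd_lowrank).
import torch

def squeeze_lora(B: torch.Tensor, A: torch.Tensor, target_rank: int):
    delta_w = B @ A                                       # full weight update
    U, S, V = torch.svd_lowrank(delta_w, q=target_rank)   # randomized SVD
    B_new = U * S                                          # fold singular values in
    A_new = V.T
    return B_new, A_new                                    # LoRA pair at target rank

B, A = torch.randn(1024, 64), torch.randn(64, 1024)        # source rank 64
B16, A16 = squeeze_lora(B, A, target_rank=16)
```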

Result: Extensive experiments across 13 text and 10 vision-language tasks show post-hoc compression produces lower-rank adapters that outperform those trained directly at target rank, especially with small fine-tuning steps. The gradual rank annealing variant consistently achieves the best LoRA size-performance trade-off.

Conclusion: LoRA-Squeeze provides a simple and efficient methodology to improve standard LoRA learning by dynamically adjusting ranks during training or compressing post-hoc, addressing key limitations of traditional LoRA while maintaining or improving performance.

Abstract: Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.

[71] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

Main category: cs.CL

TL;DR: Propella-1 is a family of small multilingual LLMs that annotate text documents across 18 properties in 6 categories, providing structured JSON annotations for better data curation than single-score approaches.

DetailsMotivation: Current LLM pretraining data curation relies on single scalar quality scores from small classifiers, which conflate multiple quality dimensions, prevent flexible filtering, and lack interpretability.

Method: Developed propella-1 models (0.6B, 1.7B, 4B parameters) that support 57 languages and produce structured JSON annotations across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance.
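
To make the structured-annotation idea concrete, here is a hypothetical example of what a multi-property JSON annotation could look like; the field names and values are invented to mirror the six categories, not the actual propella-1 schema:

```python
# Hypothetical multi-property annotation; keys and values are illustrative.
annotation = {
    "core_content": {"main_topic": "renewable energy", "language": "de"},
    "classification": {"document_type": "news_article"},
    "quality_and_value": {"educational_value": 4, "reasoning_depth": 2},
    "audience_and_purpose": {"audience": "general_public"},
    "safety_and_compliance": {"unsafe_content": False},
    "geographic_relevance": {"regions": ["EU"]},
}
```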

Result: The 4B model achieves higher agreement with frontier commercial LLMs than much larger general-purpose models. Released propella-annotations dataset with over 3 billion document annotations covering major pretraining corpora. Multi-dimensional analysis reveals substantial dataset differences in quality, reasoning depth, and content composition.

Conclusion: Propella-1 enables more nuanced, interpretable, and flexible data curation for LLM pretraining through multi-dimensional structured annotations, overcoming limitations of single-score approaches.

Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

[72] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad

Main category: cs.CL

TL;DR: SCOPE framework for selective pairwise LLM judging with statistical guarantees using Bidirectional Preference Entropy for uncertainty estimation

DetailsMotivation: LLMs are increasingly used as judges for pairwise evaluation but suffer from miscalibration and systematic biases, requiring reliable methods with statistical guarantees

Method: Proposes SCOPE framework with Bidirectional Preference Entropy (BPE) that queries judges under both response positions, aggregates preference probabilities to enforce position invariance, and uses entropy-based uncertainty scores for selective judging
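
A minimal sketch of the BPE computation as described: query the judge in both response orders, average the implied probability that response A wins, and score uncertainty as binary entropy. The `judge_prob_a_wins` callable is a hypothetical stand-in for a real judge query:

```python
# Sketch: Bidirectional Preference Entropy. judge_prob_a_wins(prompt, first,
# second) is assumed to return P(first response preferred).
import math

def bpe(prompt, resp_a, resp_b, judge_prob_a_wins) -> float:
    p_forward = judge_prob_a_wins(prompt, resp_a, resp_b)
    # With positions swapped, the judge returns P(B wins), so A's win
    # probability is its complement.
    p_reversed = 1.0 - judge_prob_a_wins(prompt, resp_b, resp_a)
    p = 0.5 * (p_forward + p_reversed)        # position-invariant aggregate
    eps = 1e-12                               # guard against log(0)
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))
```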

Result: SCOPE consistently meets target risk levels (α=0.10) across MT-Bench, RewardBench, and Chatbot Arena while retaining good coverage, accepting up to 2.4× more judgments than baselines under same risk constraints

Conclusion: SCOPE with BPE enables reliable and high-coverage LLM-based evaluation with finite-sample statistical guarantees, addressing miscalibration and bias issues in LLM judges

Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level α. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at α = 0.10, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk ≈ 0.097 to 0.099), while retaining substantial coverage, reaching 0.89 on RewardBench with Qwen-14B and 0.98 on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to 2.4× more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.

[73] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng

Main category: cs.CL

TL;DR: A study on clinical NLP for discharge planning that addresses temporal leakage risks through an auditing pipeline to ensure safe deployment by prioritizing temporal validity and calibration over optimistic performance.

DetailsMotivation: Clinical NLP models for discharge planning are vulnerable to temporal and lexical leakage where documentation artifacts encode future decisions, inflating performance and posing deployment risks that could disrupt clinical workflows and compromise patient safety.

Method: Developed a lightweight auditing pipeline integrating interpretability into model development to identify and suppress leakage-prone signals before final training, using next-day discharge prediction after elective spine surgery as a case study.

Result: Audited models exhibited more conservative and better-calibrated probability estimates with reduced reliance on discharge-related lexical cues, showing improved temporal validity and behavioral robustness.

Conclusion: Deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance, with auditing pipelines being essential for safe clinical deployment.

Abstract: Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.

[74] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

Magnus Boman

Main category: cs.CL

TL;DR: A formal Turing machine model for analyzing LLM failure modes across different pipeline stages like tokenization, vocabulary, and activations.

DetailsMotivation: LLMs exhibit failure modes on seemingly trivial tasks, and current explanations using geometric metaphors lack rigor. The paper aims to provide a formal, falsifiable framework for precisely localizing where failures occur in the LLM pipeline.

Method: Proposes a deterministic multi-tape Turing machine model where each tape represents a distinct component of LLM processing: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text.

Result: The model enables precise localization of failure modes to specific pipeline stages, revealing how tokenization obscures character-level structure needed for tasks like counting. It explains why techniques like chain-of-thought prompting help (by externalizing computation) while also revealing their fundamental limitations.

Conclusion: Provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis for understanding LLM failures.

Abstract: Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.

[75] Are LLMs Ready to Replace Bangla Annotators?

Md. Najib Hasan, Touseef Hasan, Souvika Sarkar

Main category: cs.CL

TL;DR: LLMs show significant bias and instability as zero-shot annotators for Bangla hate speech, with larger models not necessarily performing better than smaller task-aligned models.

DetailsMotivation: To investigate the reliability of LLMs as automated annotators for sensitive tasks in low-resource languages, particularly for identity-sensitive settings like hate speech detection where human agreement is challenging and annotator bias has serious consequences.

Method: Systematic benchmark of 17 LLMs using a unified evaluation framework for zero-shot annotation of Bangla hate speech, analyzing annotator bias and model judgment instability across different model scales.

Result: LLMs exhibit substantial annotator bias and instability in judgments; surprisingly, increased model scale doesn’t guarantee better annotation quality: smaller, more task-aligned models often show more consistent behavior than larger counterparts.

Conclusion: Current LLMs have important limitations for sensitive annotation tasks in low-resource languages, highlighting the need for careful evaluation before deployment in such settings.

Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially in low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.

[76] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Main category: cs.CL

TL;DR: STING is an automated red-teaming framework for testing multi-turn misuse of LLM-based agents through sequential illicit plan execution with adaptive follow-ups and judge agents.

DetailsMotivation: Existing agent misuse benchmarks focus on single-prompt testing, leaving a gap in measuring how agents help with harmful/illegal tasks over multiple turns in realistic deployment scenarios.

Method: STING constructs step-by-step illicit plans grounded in benign personas, iteratively probes target agents with adaptive follow-ups, and uses judge agents to track phase completion. It models multi-turn red-teaming as time-to-first-jailbreak random variable.
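
A minimal sketch of the time-to-first-jailbreak framing: given the turn at which each scenario is first jailbroken (or never, within the budget), build a discovery curve and take its normalized area as a restricted-mean summary. The exact metric definition in the paper may differ:

```python
# Sketch: discovery curve over first-jailbreak turns; np.inf marks
# scenarios never jailbroken within the turn budget.
import numpy as np

def discovery_curve(first_jb_turns, horizon: int):
    turns = np.asarray(first_jb_turns, dtype=float)
    # Fraction of scenarios jailbroken by each turn t = 1..horizon.
    return np.array([(turns <= t).mean() for t in range(1, horizon + 1)])

first_turns = [2, 5, np.inf, 3, np.inf, 1]     # hypothetical outcomes
curve = discovery_curve(first_turns, horizon=8)
restricted_mean_discovery = curve.mean()        # normalized area under the curve
print(curve, restricted_mean_discovery)
```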

Result: STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines across AgentHarm scenarios. In multilingual evaluations across six non-English settings, attack success doesn’t consistently increase in lower-resource languages.

Conclusion: STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings where interactions are inherently multi-turn and often multilingual.

Abstract: LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

[77] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

Main category: cs.CL

TL;DR: CTA framework helps LLMs explicitly reason about cost-uncertainty tradeoffs in sequential decision-making tasks like information retrieval and coding, improving exploration strategies.

DetailsMotivation: LLMs need to interact with environments to solve complex problems, requiring reasoning about when to stop exploring vs. committing to answers. Current LLMs don't explicitly consider cost-uncertainty tradeoffs in sequential decision-making scenarios.

Method: Introduces Calibrate-Then-Act (CTA) framework that provides LLMs with additional context about cost-uncertainty tradeoffs. Formalizes tasks as sequential decision-making under uncertainty with latent environment states. Uses prior information passed to LLM agents to enable more optimal exploration.
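
A deliberately simplified sketch of the cost-uncertainty tradeoff being made explicit: keep probing while the expected loss of committing under the current uncertainty exceeds the cost of one more probe. This ignores the information value of the probe itself, which a full treatment would include, and all numbers are hypothetical:

```python
# Sketch: explore (e.g. run a test) while the expected loss of committing
# now exceeds the cost of one more probe.
def should_explore(p_correct: float, cost_of_mistake: float, cost_of_probe: float) -> bool:
    expected_loss_if_commit = (1.0 - p_correct) * cost_of_mistake
    return expected_loss_if_commit > cost_of_probe

# 80% confident, a mistake costs 10, a test costs 1 -> keep testing
print(should_explore(0.8, cost_of_mistake=10.0, cost_of_probe=1.0))  # True
```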

Result: CTA helps LLMs discover more optimal decision-making strategies on information-seeking QA and simplified coding tasks. Improvements persist even under RL training of both baseline and CTA approaches.

Conclusion: Making cost-benefit tradeoffs explicit through the CTA framework enables LLM agents to perform more optimal environment exploration in sequential decision-making tasks.

Abstract: LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.

cs.CV

[78] Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins

Shuo Wang, Xin Nie, Yasutaka Narazaki, Thomas Matiki, Billie F. Spencer

Main category: cs.CV

TL;DR: GS-based digital twin method for 3D damage visualization in civil infrastructure using Gaussian Splatting for efficient 3D reconstruction with multi-scale strategy and temporal updates.

DetailsMotivation: Traditional 2D image-based damage identification lacks 3D visualization capabilities needed for civil infrastructure digital twins. While NeRF and GS offer better scene representation than conventional photogrammetry, GS provides superior efficiency for practical applications.

Method: Uses Gaussian Splatting (GS) for 3D reconstruction to visualize 2D damage segmentation results, develops multi-scale reconstruction strategy to balance efficiency and detail, and enables digital twin updates as damage evolves over time.

Result: Demonstrated on an open-source synthetic dataset for post-earthquake inspections, showing a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins with reduced segmentation errors.

Conclusion: GS-enabled digital twin method offers effective 3D damage visualization for civil infrastructure, overcoming limitations of 2D approaches and providing efficient, detailed reconstruction with temporal update capabilities.

Abstract: Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF’s continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method’s key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.

[79] Analytic Score Optimization for Multi Dimension Video Quality Assessment

Boda Lin, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan

Main category: cs.CV

TL;DR: UltraVQA dataset with multi-dimensional quality annotations and Analytic Score Optimization method for improved video quality assessment

DetailsMotivation: Video Quality Assessment needs to evolve beyond single-number scores to richer, multi-faceted evaluations that capture different quality dimensions like motion, aesthetics, content, and clarity.

Method: Created UltraVQA dataset with diverse UGC videos annotated across 5 quality dimensions by human raters, with GPT-generated explanatory rationales. Introduced Analytic Score Optimization (ASO), a theoretically grounded post-training objective for multi-dimensional VQA that reframes quality assessment as regularized decision-making.

Result: Method outperforms most baselines including closed-source APIs and open-source models, reduces mean absolute error in quality prediction, and better aligns with human ranking preferences.

Conclusion: Multi-dimensional, interpretable annotations and reinforcement-based alignment are important for advancing video quality assessment beyond traditional single-score approaches.

Abstract: Video Quality Assessment (VQA) is evolving beyond a single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset, UltraVQA, that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by more than three human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.

[80] DODO: Discrete OCR Diffusion Models

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

Main category: cs.CV

TL;DR: DODO is a vision-language model that uses block discrete diffusion for faster, parallel OCR decoding instead of slow autoregressive methods, achieving near state-of-the-art accuracy with 3x speedup.

DetailsMotivation: Current VLMs for OCR use slow autoregressive decoding that requires sequential forward passes for each token, making them computationally expensive for long documents. OCR is deterministic (visual input dictates unique output), so parallel decoding via diffusion should be possible, but existing masked diffusion models fail due to structural instabilities that are catastrophic for OCR's exact-match requirements.

Method: Introduces DODO, a VLM using block discrete diffusion for OCR. Instead of global diffusion that causes synchronization errors, it decomposes generation into blocks to mitigate these issues while enabling parallel decoding.
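
A toy sketch of block-wise iterative unmasking in the spirit described above: decode one block at a time, and within each block fill the most confident masked positions in parallel. The `predict` callable is a hypothetical stand-in for the model:

```python
# Toy sketch: block-wise parallel unmasking. predict(seq, i) is assumed to
# return a (token, confidence) pair for masked position i.
import random

MASK = None

def decode_blocks(length: int, block_size: int, steps_per_block: int, predict):
    seq = [MASK] * length
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        for _ in range(steps_per_block):
            masked = [i for i in block if seq[i] is MASK]
            if not masked:
                break
            preds = {i: predict(seq, i) for i in masked}
            keep = sorted(masked, key=lambda i: -preds[i][1])
            for i in keep[: max(1, len(masked) // 2)]:   # unmask the top half
                seq[i] = preds[i][0]
    return seq

toy = decode_blocks(8, 4, 4, lambda s, i: (f"tok{i}", random.random()))
```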

Result: Achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.

Conclusion: Block discrete diffusion successfully unlocks the speedup potential of diffusion models for OCR tasks, overcoming the limitations of both autoregressive and existing masked diffusion approaches for this deterministic vision-language task.

Abstract: Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; those introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.

[81] Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio

Main category: cs.CV

TL;DR: ArtToMus: A framework for direct artwork-to-music generation without text intermediaries, using visual embeddings to condition a latent diffusion model for music synthesis.

Motivation: Existing image-conditioned music generation systems have two limitations: 1) they are trained on natural photographs rather than artworks with richer semantic/stylistic content, and 2) they rely on image-to-text conversion as a semantic shortcut, preventing direct visual-to-audio learning.

Method: Created ArtSound dataset (105,884 artwork-music pairs with dual-modality captions), then developed ArtToMus framework that projects visual embeddings directly into the conditioning space of a latent diffusion model for music synthesis without text translation or language supervision.

Result: ArtToMus generates musically coherent and stylistically consistent outputs reflecting visual cues from source artworks. While absolute alignment scores are lower than text-conditioned systems (expected due to increased difficulty), it achieves competitive perceptual quality and meaningful cross-modal correspondence.

Conclusion: Establishes direct visual-to-music generation as a distinct research direction, providing resources for multimedia art, cultural heritage, and AI-assisted creative practice applications.

Abstract: Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected, given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.

[82] StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

Zeyu Ren, Xiang Li, Yiran Wang, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: StereoAdapter-2 improves underwater stereo depth estimation using selective state space models for efficient long-range propagation and introduces a large synthetic dataset UW-StereoDepth-80K.

Motivation: Underwater stereo depth estimation suffers from domain shifts due to light attenuation, scattering, and refraction. Existing GRU-based methods require multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless regions.

Method: Proposes StereoAdapter-2 with ConvSS2D operator based on selective state space models, using four-directional scanning aligned with epipolar geometry. Also creates UW-StereoDepth-80K synthetic dataset through semantic-aware style transfer and geometry-consistent novel view synthesis.
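
A rough sketch of what four-directional scanning of a feature map looks like in practice: the horizontal pair follows epipolar (row) geometry and the vertical pair adds cross-row structural consistency. This shows the scan orders only; the selective state space update itself is omitted, and the layout choices are our assumptions.

```python
import torch

def four_directional_scans(feat: torch.Tensor):
    """Unroll a (C, H, W) feature map into four 1-D token sequences:
    left-to-right, right-to-left, top-to-bottom, bottom-to-top."""
    c, h, w = feat.shape
    rows = feat.permute(1, 2, 0).reshape(h * w, c)   # row-major: L -> R
    cols = feat.permute(2, 1, 0).reshape(h * w, c)   # col-major: T -> B
    return [rows, rows.flip(0), cols, cols.flip(0)]
```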

Result: Achieves state-of-the-art zero-shot performance with 17% improvement on TartanAir-UW and 7.2% improvement on SQUID benchmarks. Real-world validation on BlueROV2 platform demonstrates robustness.

Conclusion: The proposed ConvSS2D operator enables efficient long-range spatial propagation in single update step, and the synthetic dataset facilitates effective underwater adaptation, advancing underwater robotic perception.

Abstract: Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks with 17% improvement on TartanAir-UW and 7.2% improvement on SQUID, while real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: https://github.com/AIGeeksGroup/StereoAdapter-2. Website: https://aigeeksgroup.github.io/StereoAdapter-2.

[83] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi

Main category: cs.CV

TL;DR: PP-Motion: A novel data-driven metric for evaluating human motion generation fidelity that combines physical alignment and human perception through continuous physical labeling and correlation-based training.

Motivation: Existing motion generation evaluation methods have gaps between human-perceived fidelity and physical feasibility, with subjective binary human labeling being insufficient for robust data-driven metrics. There's a need for objective, continuous physical alignment annotations.

Method: Introduces a physical labeling method that calculates minimum modifications needed for motions to align with physical laws, producing continuous physical alignment annotations. Uses these annotations to train PP-Motion metric with Pearson’s correlation loss for physical priors and human-based perceptual fidelity loss.
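
The correlation objective is standard enough to sketch; a minimal PyTorch rendering (our own, not the authors' code) would be:

```python
import torch

def pearson_loss(pred: torch.Tensor, target: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """1 - Pearson correlation between predicted and annotated scores.

    Maximizing correlation, rather than minimizing per-sample error,
    matches the ordinal nature of the physical-alignment annotations.
    """
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p * t).sum() / (p.norm() * t.norm() + eps)
    return 1.0 - r
```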

Result: PP-Motion metric aligns better with both physical laws and human perception of motion fidelity than previous work, demonstrating effectiveness in evaluating both physical and perceptual aspects.

Conclusion: The proposed physical labeling method and PP-Motion metric provide a comprehensive solution for evaluating motion generation fidelity by bridging the gap between physical feasibility and human perception.

Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson’s correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.

[84] A Multi-modal Detection System for Infrastructure-based Freight Signal Priority

Ziyan Zhang, Chuheng Wei, Xuanpeng Zhao, Siyan Li, Will Snyder, Mike Stas, Peng Hao, Kanok Boriboonsomsin, Guoyuan Wu

Main category: cs.CV

TL;DR: Infrastructure-based multimodal freight vehicle detection system using LiDAR and cameras for Freight Signal Priority applications

Motivation: Freight vehicles approaching signalized intersections need reliable detection and motion estimation for effective Freight Signal Priority (FSP) control strategies.

Method: Hybrid sensing architecture with intersection-mounted and midblock subsystems using LiDAR and camera sensors, with clustering-based and deep learning detection methods plus Kalman filter tracking
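
A constant-velocity Kalman filter is the textbook core of such a tracking stage. The sketch below is illustrative only; the motion model and noise values are our assumptions, not the deployed system's calibration.

```python
import numpy as np

class KalmanTracker:
    """Minimal constant-velocity Kalman filter for 2-D vehicle tracking."""

    def __init__(self, x0, y0, dt=0.1):
        self.x = np.array([x0, y0, 0.0, 0.0])                # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                            # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt # motion model
        self.H = np.eye(2, 4)                                # observe position only
        self.Q = np.eye(4) * 0.01                            # process noise
        self.R = np.eye(2) * 0.5                             # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):                                     # z: detected (x, y)
        y = np.asarray(z) - self.H @ self.x                  # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)             # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```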

Result: System reliably monitors freight vehicle movements at high spatio-temporal resolution with lane-level localization

Conclusion: The design provides practical insights for infrastructure-based sensing systems to support FSP applications

Abstract: Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.

[85] SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts

Sakib Ahammed, Xia Cui, Xinqi Fan, Wenqi Lu, Moi Hoon Yap

Main category: cs.CV

TL;DR: SemCovNet addresses Semantic Coverage Imbalance (SCI) in vision models by learning to correct semantic representation disparities through descriptor maps, attention modulation, and alignment losses.

Motivation: Existing vision datasets suffer from Semantic Coverage Imbalance (SCI), a bias where semantic representations are long-tailed, affecting how models learn and reason about rare but meaningful semantics. Unlike class imbalance, SCI occurs at the semantic level and has been previously overlooked.

Method: Proposes Semantic Coverage-Aware Network (SemCovNet) with three key components: 1) Semantic Descriptor Map (SDM) for learning semantic representations, 2) Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and 3) Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. Also introduces Coverage Disparity Index (CDI) to quantify semantic fairness.

Result: Extensive experiments across multiple datasets show SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance compared to baseline methods.

Conclusion: This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning. The proposed SemCovNet effectively mitigates semantic coverage disparities in vision models.

Abstract: Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.

[86] Xray-Visual Models: Scaling Vision models on Industry Scale Data

Shlok Mishra, Tsung-Yu Lin, Linda Wang, Hongli Xu, Yimin Liu, Michael Hsu, Chaitanya Ahuja, Hao Yuan, Jianpeng Cheng, Hong-You Chen, Haoyuan Xu, Chao Li, Abhijeet Awasthi, Jihye Moon, Don Husa, Michael Ge, Sumedha Singla, Arkabandhu Chowdhury, Phong Dingh, Satya Narayan Shukla, Yonghuan Yang, David Jacobs, Qi Guo, Jun Xiao, Xiangjun Fan, Aashu Singh

Main category: cs.CV

TL;DR: Xray-Visual is a unified vision model for image and video understanding trained on massive social media data using a three-stage pipeline combining MAE, hashtag classification, and CLIP-style contrastive learning, achieving SOTA performance across multiple benchmarks.

Motivation: To develop a scalable, unified vision model that can handle both image and video understanding at industry scale, leveraging massive social media data while addressing challenges of data noise and computational efficiency.

Method: Three-stage training pipeline: 1) self-supervised MAE, 2) semi-supervised hashtag classification, 3) CLIP-style contrastive learning. Uses Vision Transformer backbone with EViT for efficiency, trained on 15B image-text pairs and 10B video-hashtag pairs with robust data curation.
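
Stage 3 is CLIP-style contrastive learning; a generic symmetric InfoNCE sketch follows (the production loss and temperature schedule are not specified in the abstract, so treat this as a stand-in for the objective's shape):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of image/text (or video/hashtag) pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```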

Result: Achieves state-of-the-art performance on ImageNet, Kinetics, HMDB51, and MSCOCO benchmarks. Shows strong robustness to domain shift and adversarial perturbations. Integration with LLMs as text encoders (LLM2CLIP) further enhances retrieval and generalization.

Conclusion: Xray-Visual establishes new benchmarks for scalable multimodal vision models, demonstrating superior accuracy and computational efficiency while effectively leveraging massive social media data for unified image and video understanding.

Abstract: We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.

[87] HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs

Kibon Ku, Talukder Z. Jubery, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

Main category: cs.CV

TL;DR: HSI-SC-NeRF: A stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction of agricultural produce using multi-view hyperspectral data captured while objects rotate in a controlled imaging chamber.

Motivation: Integrating hyperspectral imaging (HSI) and 3D reconstruction is crucial for agricultural produce quality assessment and plant phenotyping, but conventional approaches require complex hardware setups incompatible with automated systems. Current NeRF methods need moving cameras, limiting throughput in standard agricultural environments.

Method: Uses stationary camera while object rotates in custom Teflon imaging chamber with diffuse illumination. Object poses estimated via ArUco markers and transformed to camera frame through simulated pose transformations. Multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands using composite spectral loss with two-stage training (geometric initialization then radiometric refinement).

Result: Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across visible and near-infrared spectrum, confirming suitability for automated agricultural workflows.

Conclusion: HSI-SC-NeRF enables high-throughput hyperspectral 3D reconstruction suitable for integration into automated agricultural inspection systems, overcoming limitations of moving-camera setups while maintaining accuracy and spectral fidelity.

Abstract: Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement. Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.

[88] DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde

Main category: cs.CV

TL;DR: Dynamic tokenization for Diffusion Transformers that varies patch sizes based on content complexity and denoising timestep to achieve computational efficiency without quality loss.

Motivation: DiTs achieve SOTA in image/video generation but are computationally heavy due to fixed tokenization using constant-sized patches throughout denoising, regardless of content complexity.

Method: Proposes dynamic tokenization strategy that varies patch sizes based on content complexity and denoising timestep - early timesteps use coarser patches for global structure, later iterations use finer patches for local details.
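
The timestep-to-patch-size mapping could look like the following sketch; the thresholds and sizes are illustrative assumptions, as the paper's actual schedule is not given in the abstract.

```python
def patch_size_for_step(t: float, coarse: int = 4, fine: int = 1) -> int:
    """Map a normalized denoising time t (1 = noisiest, 0 = final step) to a
    patch size: coarse patches early for global structure, fine patches
    late for local detail."""
    if t > 0.66:
        return coarse          # e.g. 4x4 latent patches early on
    if t > 0.33:
        return coarse // 2     # intermediate resolution mid-trajectory
    return fine                # finest patches for local refinement
```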

Result: Achieves up to 3.52× speedup on FLUX-1.Dev and 3.2× speedup on Wan 2.1 without compromising generation quality or prompt adherence.

Conclusion: Dynamic tokenization substantially reduces computational cost while preserving perceptual generation quality for DiTs in image and video generation.

Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content’s complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising the generation quality and prompt adherence.

[89] Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

Divyam Madaan, Sumit Chopra, Kyunghyun Cho

Main category: cs.CV

TL;DR: PRIMO is a supervised latent-variable imputation model that handles missing modalities in multimodal learning by modeling missing modalities through latent variables, enabling use of all training examples regardless of modality completeness.

Motivation: Existing MLLMs assume all modalities are available during training and inference, but real-world multimodal data is often incomplete due to missing modalities, asynchronous collection, or partial availability. This creates a need for models that can handle missing modalities while quantifying their predictive impact.

Method: PRIMO models missing modalities through latent variables that capture relationships with observed modalities in prediction contexts. It uses all available training examples (complete or partial) and during inference draws samples from learned distributions over missing modalities to obtain marginal predictive distributions and analyze modality impact.
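
The inference procedure amounts to Monte-Carlo marginalization plus a variance readout. A sketch under assumed interfaces: `sample_missing` and `predict` are hypothetical stand-ins for the model's latent sampler and classifier head, not PRIMO's actual API.

```python
import torch

def modality_impact(model, observed, n_samples=64):
    """Estimate the marginal predictive distribution and a variance-based
    impact score for a missing modality."""
    preds = []
    for _ in range(n_samples):
        z = model.sample_missing(observed)         # one latent completion
        preds.append(model.predict(observed, z))   # (num_classes,) probs
    preds = torch.stack(preds)                     # (n_samples, num_classes)
    marginal = preds.mean(0)                       # marginal predictive dist.
    impact = preds.var(0).sum()                    # spread across completions
    return marginal, impact
```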

Result: PRIMO achieves performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. It successfully quantifies predictive impact of modalities at instance level using variance-based metrics from predictions across latent completions.

Conclusion: PRIMO provides an effective framework for handling missing modalities in multimodal learning while quantifying modality importance, making it practical for real-world applications where multimodal data is often incomplete.

Abstract: Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.

[90] Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings

Eric Chen, Patricia Alves-Oliveira

Main category: cs.CV

TL;DR: A patch-based framework for spatial authorship attribution in human-robot collaborative painting, achieving 88.8% patch-level accuracy using flatbed scanners and cross-validation, with entropy analysis quantifying stylistic overlap in hybrid regions.

Motivation: As AI becomes more involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. There's a need for methods to attribute authorship in human-robot collaborative artworks.

Method: Patch-based framework using commodity flatbed scanners and leave-one-painting-out cross-validation. Uses conditional Shannon entropy to quantify stylistic overlap in collaborative artworks where ground truth is ambiguous. Compares against texture-based and pretrained-feature baselines.
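
The uncertainty readout is plain Shannon entropy over per-patch authorship posteriors; a minimal sketch of that computation (not the paper's exact code):

```python
import numpy as np

def patch_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of per-patch authorship posteriors.

    probs: (num_patches, num_authors) classifier outputs. High mean entropy
    over a region signals stylistic overlap (mixed authorship) rather than
    a confident single-author attribution.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=1)   # bits, one value per patch
```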

Result: Achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming baselines (68.0%-84.7%). Hybrid regions show 64% higher uncertainty than pure paintings (p=0.003), indicating the model detects mixed authorship rather than classification failure.

Conclusion: Provides methodological grounding for sample-efficient authorship attribution in data-scarce human-AI creative workflows. The approach is specific to the human-robot pair studied but has potential to extend to any human-robot collaborative painting.

Abstract: As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.

[91] PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

Peize Li, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: PartRAG: A retrieval-augmented framework for single-image 3D generation with editable part-level structure using external part database and diffusion transformer.

Motivation: Existing 3D generation methods struggle with long-tail part geometries, multi-view consistency, and lack precise localized editing capabilities for part-level structure.

Method: Combines external part database with diffusion transformer using Hierarchical Contrastive Retrieval to align image patches with 3D part latents, plus masked part-level editor in canonical space for localized edits.

Result: Achieves competitive results on Objaverse, ShapeNet, and ABO datasets - reduces Chamfer Distance from 0.1726 to 0.1528 and raises F-Score from 0.7472 to 0.844 on Objaverse with fast inference (38s) and interactive edits (5-8s).

Conclusion: PartRAG enables diverse, physically plausible 3D generation with precise part-level editing while maintaining multi-view consistency and producing sharper part boundaries.

Abstract: Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO (reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse), with inference of 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG. Website: https://aigeeksgroup.github.io/PartRAG.

[92] Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Chaojie Yang, Tian Li, Yue Zhang, Jun Gao

Main category: cs.CV

TL;DR: Efficient compression framework transforms 60-layer dual-stream MMDiT-based Qwen-Image into lightweight Amber-Image models (10B and 6B) with 70% parameter reduction, achieving high-fidelity text-to-image generation with minimal training cost.

Motivation: Diffusion Transformer architectures like DiT have advanced text-to-image generation but suffer from high computational costs and deployment barriers, creating a need for efficient compression methods that maintain quality while reducing model size.

Method: Two-stage compression: 1) Amber-Image-10B via timestep-sensitive depth pruning with layer reinitialization (weight averaging) and layer-wise distillation + fine-tuning; 2) Amber-Image-6B via hybrid-stream architecture converting deep dual streams to single stream initialized from image branch, with progressive distillation and lightweight fine-tuning.
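
The "local weight averaging" reinitialization can be sketched as follows, under the assumption (ours, not stated in the abstract) that each retained layer absorbs the pruned layers up to the next retained one; the function and parameter names are hypothetical.

```python
import torch
from torch import nn

def reinit_by_local_averaging(layers: nn.ModuleList, keep_indices):
    """Rebuild a pruned stack: each retained layer is reinitialized as the
    parameter-wise average of itself and its pruned successors (assumed
    grouping rule; the paper only names the technique)."""
    kept = []
    bounds = list(keep_indices)[1:] + [len(layers)]
    for start, stop in zip(keep_indices, bounds):
        group = [layers[j] for j in range(start, stop)]
        avg = {k: torch.stack([m.state_dict()[k].float() for m in group]).mean(0)
               for k in group[0].state_dict()}
        group[0].load_state_dict(avg)   # overwrite retained layer in place
        kept.append(group[0])
    return nn.ModuleList(kept)
```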

Result: 70% parameter reduction, <2000 GPU hours total compression/training, matches larger models on DPG-Bench and LongText-Bench with high-fidelity synthesis and superior text rendering, no need for large-scale data engineering.

Conclusion: Proposed compression framework efficiently transforms large diffusion transformers into lightweight models while maintaining quality, enabling practical deployment of text-to-image generation systems with significantly reduced computational requirements.

Abstract: Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.

[93] StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection

Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae

Main category: cs.CV

TL;DR: StructCore: A training-free, structure-aware image-level scoring method for unsupervised anomaly detection that captures distributional and spatial characteristics of anomaly score maps, outperforming traditional max pooling.

Motivation: Max pooling, the standard method for converting anomaly score maps to image-level decisions, discards most information about how anomaly evidence is distributed and structured across images, often causing normal and anomalous scores to overlap. This limitation motivates a more sophisticated approach that preserves structural information.

Method: StructCore computes a low-dimensional structural descriptor phi(S) from anomaly score maps that captures distributional and spatial characteristics. It refines image-level scoring via a diagonal Mahalanobis calibration estimated from training samples, without modifying pixel-level localization. The method is training-free.
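
The calibration step reduces to fitting per-dimension statistics on good training images and scoring with a diagonal Mahalanobis distance. A sketch of that step, leaving the construction of $\phi(S)$ abstract since the abstract does not spell it out:

```python
import numpy as np

def fit_calibration(descriptors: np.ndarray):
    """Per-dimension mean/variance of structural descriptors computed on
    good (normal) training images; descriptors: (n_images, d)."""
    mu = descriptors.mean(0)
    var = descriptors.var(0) + 1e-8
    return mu, var

def image_score(phi: np.ndarray, mu: np.ndarray, var: np.ndarray) -> float:
    """Diagonal Mahalanobis distance of a test descriptor to the normal
    statistics; replaces max pooling as the image-level anomaly score."""
    return float(np.sqrt(((phi - mu) ** 2 / var).sum()))
```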

Result: StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.

Conclusion: StructCore provides an effective, training-free alternative to max pooling for image-level anomaly detection that better captures structural information in anomaly score maps, leading to improved performance on standard benchmarks.

Abstract: Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor $\phi(S)$ that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from good (defect-free) training samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.

[94] Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding

Shunsuke Kikuchi, Atsushi Kouno, Hiroki Matsuzaki

Main category: cs.CV

TL;DR: Dataset for trocar port segmentation in laparoscopic surgery with geometrically consistent annotations to improve downstream geometry-based vision tasks.

Motivation: Trocar ports in laparoscopic surgery persistently occlude views and attract disproportionate feature points, degrading geometry-based vision pipelines like image stitching, 3D reconstruction, and visual SLAM. Existing datasets lack explicit port labels or have inconsistent annotations that violate geometric consistency.

Method: Created Cholec80-port dataset with high-fidelity trocar port segmentation derived from Cholec80, with a rigorous standard operating procedure (SOP) defining port-sleeve masks that exclude the central opening. Also cleansed and unified existing public datasets under the same SOP.

Result: Experiments show that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.

Conclusion: Geometrically consistent port segmentation annotations are crucial for improving the performance of geometry-based vision pipelines in surgical settings, and the proposed dataset and SOP address current limitations in existing surgical datasets.

Abstract: Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.

[95] Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection

Lee Dayeon, Kim Dongheyong, Park Chaewon, Woo Sungmin, Lee Sangyoun

Main category: cs.CV

TL;DR: CPL-VAD is a dual-branch framework for weakly supervised video anomaly detection that combines binary anomaly localization with category classification using cross pseudo labeling and vision-language alignment.

Motivation: The paper addresses the challenge of weakly supervised video anomaly detection, where only video-level labels are available. Existing methods struggle with both precise anomaly localization and accurate abnormal category classification simultaneously.

Method: Proposes CPL-VAD with two branches: 1) binary anomaly detection branch for snippet-level localization, and 2) category classification branch using vision-language alignment to recognize abnormal event categories. The branches exchange pseudo labels to transfer complementary strengths - temporal precision from the detection branch and semantic discrimination from the classification branch.

Result: Experiments on XD-Violence and UCF-Crime datasets demonstrate state-of-the-art performance in both anomaly detection and abnormal category classification tasks.

Conclusion: CPL-VAD effectively combines temporal localization and semantic understanding through cross pseudo labeling, advancing weakly supervised video anomaly detection by addressing both detection and classification challenges simultaneously.

Abstract: Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.

[96] ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions

Shogo Sato, Kazuo Tanaka, Shojun Ogasawara, Kazuki Yamamoto, Kazuhiko Murasaki, Ryuichi Tanida, Jun Kataoka

Main category: cs.CV

TL;DR: ComptonUNet is a hybrid deep learning framework for robust gamma-ray burst localization that combines statistical efficiency with denoising capabilities to handle low photon statistics and strong background noise.

Motivation: Gamma-ray bursts are important astrophysical probes, but detecting faint ones from the distant universe is challenging due to low photon statistics and background noise. Existing machine learning models struggle to balance statistical robustness with noise suppression.

Method: ComptonUNet is a hybrid deep learning framework that jointly processes raw data and reconstructs images. It combines statistical efficiency of direct reconstruction models with denoising capabilities of image-based architectures to handle limited photon statistics and strong background contamination.

Result: ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios in realistic simulations of GRB-like events.

Conclusion: ComptonUNet provides an effective solution for robust GRB localization under challenging conditions of limited photon statistics and strong background contamination, advancing the detection of faint gamma-ray bursts from distant universe.

Abstract: Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.

[97] 3D Scene Rendering with Multimodal Gaussian Splatting

Chi-Shiang Gau, Konstantinos D. Polyzos, Athanasios Bacharis, Saketh Madhuvarasu, Tara Javidi

Main category: cs.CV

TL;DR: RF-augmented 3D Gaussian Splatting uses radar signals to improve scene reconstruction robustness in adverse conditions where vision fails.

Motivation: Vision-based 3D Gaussian Splatting struggles with adverse weather, low light, and occlusions, while RF signals (like automotive radar) are robust to these conditions. The paper aims to create a more robust multimodal reconstruction framework.

Method: Integrates RF sensing (automotive radar) with Gaussian Splatting pipelines. Uses sparse RF depth measurements to predict efficient depth and generate high-quality 3D point clouds for initializing Gaussian functions across various GS architectures.

Result: Numerical tests show improved 3D scene rendering fidelity with RF-informed structural accuracy. The approach achieves high-quality reconstruction while maintaining computational efficiency.

Conclusion: RF-augmented Gaussian Splatting provides a robust alternative to vision-only approaches, especially in challenging environmental conditions where visual cues are unreliable.

Abstract: 3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, typically incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.

[98] B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

Hiromichi Kamata, Samuel Arthur Munro, Fuminori Homma

Main category: cs.CV

TL;DR: B³-Seg is a fast, training-free Bayesian method for open-vocabulary 3D Gaussian Splatting segmentation that uses sequential Beta-Bernoulli updates and active view selection via Expected Information Gain.

Motivation: Existing 3DGS segmentation methods require predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency interactive editing in film and game production.

Method: Reformulates segmentation as sequential Beta-Bernoulli Bayesian updates with active view selection via analytic Expected Information Gain (EIG). The Bayesian formulation guarantees adaptive monotonicity and submodularity of EIG, enabling greedy (1-1/e) approximation to optimal view sampling.
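
A conceptual sketch of the per-Gaussian Beta-Bernoulli update, paired with a simple predictive-entropy surrogate for EIG; the paper derives an analytic form, so this only illustrates the shape of the greedy view-selection loop, not the method itself.

```python
import numpy as np

def beta_update(alpha, beta, hit):
    """Conjugate posterior update: one binary observation per rendered view
    ("did this Gaussian fall inside the query mask?")."""
    return alpha + hit, beta + (1 - hit)

def predictive_entropy_gain(alpha, beta):
    """Expected reduction in predictive (Bernoulli) entropy from one more
    observation -- a crude surrogate for the paper's analytic EIG."""
    def H(a, b):
        p = np.clip(a / (a + b), 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))
    p1 = alpha / (alpha + beta)          # predictive probability of hit=1
    return H(alpha, beta) - (p1 * H(alpha + 1, beta)
                             + (1 - p1) * H(alpha, beta + 1))
```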

Result: Achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds on multiple datasets.

Conclusion: B³-Seg enables practical, interactive 3DGS segmentation with provable information efficiency, addressing the need for camera-free and training-free segmentation in real-time editing applications.

Abstract: Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.

[99] BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

Siyuan Liang, Yongcheng Jing, Yingjie Wang, Jiaxing Huang, Ee-chien Chang, Dacheng Tao

Main category: cs.CV

TL;DR: BadCLIP++: A stealthy and persistent backdoor attack framework for multimodal contrastive learning models using semantic-fusion QR micro-triggers and stability techniques to maintain high attack success rates even under strong defenses and fine-tuning.

Motivation: Existing backdoor attacks against multimodal contrastive learning models lack stealthiness and persistence, failing under strong detection or continuous fine-tuning due to cross-modal inconsistency and gradient dilution at low poisoning rates.

Method: Proposes BadCLIP++ with: 1) Semantic-fusion QR micro-triggers embedded near task-relevant regions for stealthiness, 2) Target-aligned subset selection to strengthen signals at low injection rates, 3) Stability techniques including radius shrinkage, centroid alignment, curvature control, and elastic weight consolidation for persistence, 4) Theoretical analysis showing co-directional gradients within trust regions.

Result: With only 0.3% poisoning, achieves 99.99% ASR in digital settings (11.4 points above baselines). Maintains >99.90% ASR across 19 defenses with <0.8% clean accuracy drop. Achieves 65.03% success in physical attacks and robustness against watermark removal defenses.

Conclusion: BadCLIP++ effectively addresses stealthiness and persistence challenges in multimodal contrastive learning backdoor attacks through innovative trigger design and stability mechanisms, demonstrating strong performance across various attack scenarios and defenses.

Abstract: Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.

[100] NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting

Jiwei Shan, Zeyu Cai, Yirui Li, Yongbo Chen, Lijun Han, Yun-hui Liu, Hesheng Wang, Shing Shin Cheng

Main category: cs.CV

TL;DR: NRGS-SLAM: A monocular non-rigid SLAM system for endoscopic scenes using 3D Gaussian Splatting with deformation-aware representation and Bayesian self-supervision to handle soft-tissue deformations.

Motivation: Endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating coupling ambiguity between camera motion and tissue deformation. Existing monocular non-rigid SLAM methods lack effective decoupling mechanisms and use sparse/low-fidelity representations, leading to tracking drift and poor reconstruction quality.

Method: Proposes NRGS-SLAM with: 1) Deformation-aware 3D Gaussian map where each Gaussian has learnable deformation probability optimized via Bayesian self-supervision; 2) Deformable tracking module with coarse-to-fine pose estimation prioritizing low-deformation regions; 3) Deformable mapping module balancing capacity and efficiency; 4) Unified robust geometric loss with external priors.

Result: Achieves up to 50% reduction in RMSE for camera pose estimation and higher-quality photo-realistic reconstructions compared to state-of-the-art methods on multiple public endoscopic datasets. Ablation studies validate key design choices.

Conclusion: NRGS-SLAM effectively addresses the non-rigidity challenge in endoscopic SLAM through deformation-aware 3D Gaussian representation and Bayesian self-supervision, demonstrating superior performance in both tracking accuracy and reconstruction quality.

Abstract: Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.

[101] Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee, Sangheum Hwang

Main category: cs.CV

TL;DR: VIG metric measures visual information gain to quantify how much training samples/tokens benefit from images, enabling selective training to reduce language bias in LVLMs.

DetailsMotivation: LVLMs suffer from language bias where they produce answers without relying on visual evidence. Existing approaches lack quantitative measures of how much individual training samples or tokens actually benefit from visual input.

Method: Introduces Visual Information Gain (VIG), a perplexity-based metric that measures reduction in prediction uncertainty provided by visual input. Uses VIG for fine-grained analysis at sample/token levels, then proposes VIG-guided selective training that prioritizes high-VIG samples/tokens.

Result: The approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.

Conclusion: VIG provides a quantitative measure for visual grounding, enabling more efficient training of LVLMs by focusing on visually informative content, reducing language bias while maintaining performance with less supervision.

Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
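Since VIG is defined as a perplexity (i.e., negative log-likelihood) reduction, a minimal per-token sketch is straightforward. The `pixel_values` interface below is an assumption modeled on common HF-style LVLMs; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_information_gain(model, input_ids, labels, pixel_values):
    """Per-token VIG: drop in next-token negative log-likelihood when
    the image is provided versus withheld. Positive values mark
    visually grounded tokens (colors, spatial relations, attributes);
    summing gives a sample-level score usable for selective training."""
    def token_nll(**inputs):
        logits = model(**inputs).logits[:, :-1]       # predict next token
        return F.cross_entropy(logits.flatten(0, 1),
                               labels[:, 1:].flatten(),
                               reduction="none").view(labels.shape[0], -1)
    nll_text = token_nll(input_ids=input_ids)         # image withheld
    nll_mm = token_nll(input_ids=input_ids, pixel_values=pixel_values)
    return nll_text - nll_mm                          # per-token VIG
```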

[102] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang, Juncheng Wu, Zhangkai Ni, Chengmei Yang, Yihang Liu, Longzhen Yang, Yuyin Zhou, Ying Wen, Lianghua He

Main category: cs.CV

TL;DR: EntropyPrune: A matrix-entropy-guided token pruning framework for MLLMs that identifies Entropy Collapse Layers to prune redundant visual tokens efficiently, achieving 68.2% FLOP reduction while preserving 96.0% performance.

DetailsMotivation: MLLMs have high inference costs due to processing hundreds of visual tokens per image. Existing token pruning methods use heuristic, static layer selection which lacks interpretability and cross-model transferability.

Method: Introduces matrix-entropy perspective to identify “Entropy Collapse Layer” where visual representation information drops sharply. Proposes EntropyPrune framework that quantifies information value of individual visual tokens and prunes redundant ones without attention maps. Uses spectral equivalence of dual Gram matrices for efficient entropy computation with 64x theoretical speedup.

Result: Outperforms SOTA pruning methods on diverse multimodal benchmarks. On LLaVA-1.5-7B: 68.2% FLOP reduction while preserving 96.0% original performance. Generalizes effectively to high-resolution and video-based models.

Conclusion: EntropyPrune provides principled, interpretable token pruning for MLLMs with strong robustness and scalability, offering efficient acceleration while maintaining performance.

Abstract: Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an “Entropy Collapse Layer” (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
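The spectral-equivalence trick is easy to illustrate: the n×n and d×d Gram matrices of a token matrix share their nonzero eigenvalues, so the entropy can always be computed from the smaller one. A sketch under our own centering and normalization choices:

```python
import torch

def matrix_entropy(tokens: torch.Tensor) -> torch.Tensor:
    """Matrix entropy of an (n, d) visual-token matrix, computed from
    whichever Gram matrix (n x n or d x d) is smaller -- they share the
    same nonzero spectrum, which is the source of the speedup."""
    x = tokens - tokens.mean(dim=0, keepdim=True)
    n, d = x.shape
    gram = x @ x.T if n <= d else x.T @ x      # pick the cheaper side
    eigs = torch.linalg.eigvalsh(gram).clamp(min=0)
    p = eigs / eigs.sum().clamp(min=1e-12)     # spectrum as a distribution
    p = p[p > 1e-12]
    return -(p * p.log()).sum()
```

Tracking this quantity layer by layer is what would reveal the sharp drop the paper calls the Entropy Collapse Layer.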

[103] GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky

Main category: cs.CV

TL;DR: GASS enhances text-to-image diversity by controlling both prompt-dependent and prompt-independent variation through geometric decomposition in CLIP embedding space.

DetailsMotivation: Current text-to-image models lack diversity in generated outputs, which restricts user choice and risks amplifying societal biases. Existing methods primarily use entropy-based guidance but don't explicitly control different sources of variation.

Method: Geometry-Aware Spherical Sampling (GASS) decomposes diversity in CLIP embeddings into two orthogonal directions: text embedding (prompt-dependent variation) and an identified orthogonal direction (prompt-independent variation). It increases geometric projection spread along both axes and guides T2I sampling via expanded predictions along the generation trajectory.

Result: Experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks show effective disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.

Conclusion: GASS provides a geometric approach to enhance text-to-image diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation, addressing limitations of existing entropy-based methods.

Abstract: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
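The geometric decomposition at the heart of GASS amounts to splitting each image embedding into its component along the text embedding and a remainder. The sketch below uses the full orthogonal residual, whereas the paper identifies a specific orthogonal direction.

```python
import torch

def decompose(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Split CLIP image embeddings (B, D) into a prompt-dependent part
    (projection onto the normalized text embedding of shape (D,)) and a
    prompt-independent orthogonal remainder."""
    t = txt_emb / txt_emb.norm()
    coeff = img_emb @ t                    # prompt-dependent coordinate
    parallel = coeff.unsqueeze(-1) * t     # component along the prompt
    orthogonal = img_emb - parallel        # captures e.g. backgrounds
    return coeff, orthogonal

# Diversity guidance would then push a batch to increase the spread of
# `coeff` and of the orthogonal projections, rather than raw entropy.
```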

[104] HiMAP: History-aware Map-occupancy Prediction with Fallback

Yiming Xu, Yi Yang, Hao Cheng, Monika Sester

Main category: cs.CV

TL;DR: HiMAP is a tracking-free trajectory prediction framework for autonomous driving that uses historical occupancy maps instead of multi-object tracking, making it robust to tracking failures.

DetailsMotivation: Current motion forecasting methods rely on multi-object tracking (MOT) with identity association, which fails under occlusion, identity switches, or missed detections, degrading prediction quality and increasing safety risks.

Method: Converts past detections into spatiotemporally invariant historical occupancy maps, introduces a historical query module that conditions on current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations, and uses a DETR-style decoder to produce multi-modal future trajectories.

Result: On Argoverse 2, achieves performance comparable to tracking-based methods without IDs and outperforms strong baselines in the no-tracking setting, with relative gains of 11% in FDE and 12% in ADE and a 4% reduction in MR over a fine-tuned QCNet.

Conclusion: HiMAP provides reliable trajectory prediction without tracking dependence, offers stable forecasts for all agents simultaneously without waiting for tracking recovery, and serves as a robust fallback for safety-critical autonomy.

Abstract: Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: https://github.com/XuYiMing83/HiMAP.
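A minimal sketch of the identity-free input representation: past detections rasterized into a short history of bird's-eye occupancy grids. The grid size, spatial extent, and coordinate conventions below are our assumptions, not values from the paper.

```python
import numpy as np

def detections_to_occupancy(dets, grid=128, extent=50.0, horizon=10):
    """Rasterize per-frame detections (no track IDs) into a stack of
    occupancy maps. `dets` is a list of (N_t, 2) arrays of agent
    centers in ego coordinates (meters); the last `horizon` frames are
    kept. Agent-specific history is later retrieved from these maps by
    a query module rather than by identity association."""
    occ = np.zeros((horizon, grid, grid), dtype=np.float32)
    for t, pts in enumerate(dets[-horizon:]):
        cells = ((pts + extent) / (2 * extent) * grid).astype(int)
        valid = ((cells >= 0) & (cells < grid)).all(axis=1)
        occ[t, cells[valid, 0], cells[valid, 1]] = 1.0
    return occ
```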

[105] Inferring Height from Earth Embeddings: First insights using Google AlphaEarth

Alireza Hamoudzadeh, Valeria Belloni, Roberta Ravanelli

Main category: cs.CV

TL;DR: Earth Embeddings encode geospatial features that can guide DL models for terrain height mapping, with U-Net++ showing better generalization than U-Net despite distribution shift challenges.

DetailsMotivation: To investigate whether geospatial and multimodal features in Earth Embeddings can effectively guide deep learning regression models for regional surface height mapping, using AlphaEarth Embeddings at 10m resolution.

Method: Used AlphaEarth Embeddings at 10m spatial resolution as input features, employed U-Net and U-Net++ architectures as lightweight convolutional decoders to translate embeddings into surface height estimates, evaluated using high-quality Digital Surface Model (DSM) as reference.

Result: Both architectures achieved strong training performance (R² = 0.97), but test performance decreased due to distribution shifts. U-Net++ showed better generalization (R² = 0.84, median difference = -2.62m) than standard U-Net (R² = 0.78, median difference = -7.22m). Testing RMSE was ~16m for U-Net++.

Conclusion: Earth Embeddings capture transferable topographic patterns and show promising potential for guiding DL-based height mapping workflows, especially with spatially aware architectures like U-Net++, though bias issues need addressing for improved regional transferability.

Abstract: This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with R² = 0.97), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization (R² = 0.84, median difference = -2.62 m) compared with the standard U-Net (R² = 0.78, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns. Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.

[106] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Phong Ho, Naeem Ul Islam, Tuan Do

Main category: cs.CV

TL;DR: EA-Swin: Embedding-Agnostic Swin Transformer for AI-generated video detection using factorized windowed attention on pretrained video embeddings, achieving 0.97-0.99 accuracy across major generators.

DetailsMotivation: Existing video detection methods struggle with highly realistic synthetic videos from advanced generators like Sora2 and Veo3, as they rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs.

Method: EA-Swin models spatiotemporal dependencies directly on pretrained video embeddings using a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Also created EA-Video dataset with 130K videos covering diverse generators.

Result: Achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by 5-20% margin, with strong generalization to unseen distributions.

Conclusion: EA-Swin provides a scalable and robust solution for modern AI-generated video detection, establishing new state-of-the-art performance with strong generalization capabilities.

Abstract: Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.

[107] Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution

Ruoyi Zhang, Jiawei Yuan, Lujia Ye, Runling Yu, Liling Zhao

Main category: cs.CV

TL;DR: PESTGAN is a physics-encoded GAN for super-resolution of tropical cyclone satellite imagery that incorporates atmospheric physics constraints to improve structural and perceptual quality while maintaining pixel accuracy.

DetailsMotivation: Existing deep learning super-resolution methods treat satellite image sequences as generic videos, ignoring the underlying atmospheric physical laws that govern cloud motion in tropical cyclones, leading to less physically plausible results.

Method: Proposes Physics Encoded Spatial and Temporal GAN (PESTGAN) with disentangled generator architecture incorporating a PhyCell module that approximates the vorticity equation via constrained convolutions to separate physical dynamics from visual textures, plus a dual-discriminator framework with temporal discriminator for motion consistency.

Result: Experiments on Digital Typhoon dataset for 4× upscaling show PESTGAN achieves better structural fidelity and perceptual quality while maintaining competitive pixel-wise accuracy, excelling in reconstructing meteorologically plausible cloud structures with superior physical fidelity.

Conclusion: Incorporating physical constraints into super-resolution models improves the reconstruction of physically plausible structures in satellite imagery, particularly important for meteorological applications like tropical cyclone tracking.

Abstract: High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN achieves better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.
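One way to read the "constrained convolutions" is as depthwise finite-difference stencils whose output drives a residual latent update. The sketch below hard-codes a Laplacian stencil and a learnable step size; this is our simplification of the PhyCell idea, not the paper's parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhyCellSketch(nn.Module):
    """Latent update h <- h + dt * conv(h) with a fixed Laplacian
    stencil applied depthwise, so this branch can only express
    spatial-derivative dynamics, kept separate from texture synthesis."""
    def __init__(self, channels: int):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]])
        self.register_buffer("kernel", lap.repeat(channels, 1, 1, 1))
        self.dt = nn.Parameter(torch.tensor(0.1))    # learnable step size
        self.channels = channels

    def forward(self, h):
        dh = F.conv2d(h, self.kernel, padding=1, groups=self.channels)
        return h + self.dt * dh                      # explicit Euler step
```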

[108] Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery

Dennis N. Schneider, Lars Wagner, Daniel Rueckert, Dirk Wilhelm

Main category: cs.CV

TL;DR: Attachment anchors improve grasping point prediction in colorectal surgery by encoding tissue-anatomical attachment relationships, reducing uncertainty through local reference frame normalization.

DetailsMotivation: Colorectal surgeries are complex, prolonged procedures underrepresented in research but offer repetitive tissue manipulation patterns ideal for autonomous machine learning support. Current grasping point prediction methods lack structured representations of tissue-anatomical relationships.

Method: Introduces attachment anchors - a structured representation encoding local geometric and mechanical relationships between tissue and anatomical attachments. This normalizes surgical scenes into consistent local reference frames, reducing prediction uncertainty. The anchors can be predicted from laparoscopic images and integrated into machine learning grasping frameworks.

Result: Experiments on 90 colorectal surgeries show attachment anchors improve grasping point prediction compared to image-only baselines, with particularly strong gains in out-of-distribution settings (unseen procedures and operating surgeons).

Conclusion: Attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery, enabling more robust grasping point prediction especially in challenging out-of-distribution scenarios.

Abstract: Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.

[109] Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline

Mohamed Dhouib, Davide Buscaldi, Sonia Vanier, Aymen Shabou

Main category: cs.CV

TL;DR: A novel method for generating high-quality tampered document images using contrastive learning and auxiliary networks to improve tampered text detection models.

DetailsMotivation: Existing methods for generating tampered document images suffer from limited variety, poor visual quality, and visible artifacts, which undermines model performance on real-world data due to data scarcity in tampered text detection.

Method: 1) Train an auxiliary network with contrastive learning to compare text crops using novel positive/negative pair strategies; 2) Train a second auxiliary network to evaluate crop quality (proper character enclosure); 3) Use both networks in a generation pipeline to produce diverse, high-quality tampered documents.

Result: Models trained on datasets generated using the proposed method show consistent performance improvements across various architectures and datasets compared to existing approaches, demonstrating the effectiveness of the data generation pipeline.

Conclusion: The proposed framework successfully generates high-quality tampered document images that improve tampered text detection models, addressing data scarcity and quality issues in this domain.

Abstract: Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model’s ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.

[110] Polaffini: A feature-based approach for robust affine and polyaffine image registration

Antoine Legouhy, Cosimo Campo, Ross Callaghan, Hojjat Azadbakht, Hui Zhang

Main category: cs.CV

TL;DR: Polaffini is a robust medical image registration framework that uses deep learning segmentation models to extract anatomical centroids for feature-based affine and polyaffine registration, outperforming intensity-based methods.

DetailsMotivation: Medical image registration traditionally relies on intensity-based methods with surrogate alignment measures, while feature-based approaches using explicit anatomical correspondences have been challenging. Recent advances in deep learning segmentation models now enable reliable anatomical delineations that can be leveraged for anatomically-grounded registration.

Method: Polaffini extracts centroids from segmented anatomical regions obtained from pre-trained deep learning segmentation models. These centroids serve as anatomically grounded feature points with 1-to-1 correspondence. The framework uses these points for efficient global and local affine matching via closed-form solutions, producing transformations ranging from affine to polyaffine with tunable smoothness in the log-Euclidean framework.

Result: Polaffini outperforms popular intensity-based registration techniques in terms of structural alignment and provides improved initialization for downstream non-linear registration. The method is fast, robust, and accurate, making it suitable for integration into medical image processing pipelines.

Conclusion: Polaffini demonstrates that recent advances in deep learning segmentation can be effectively leveraged to create anatomically-grounded registration algorithms that overcome traditional limitations of feature-based approaches, offering superior performance over intensity-based methods.

Abstract: In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
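The global stage reduces to extracting per-label centroids and solving an affine map in closed form; a sketch follows (the polyaffine blending and log-Euclidean smoothing are omitted).

```python
import numpy as np

def label_centroids(seg: np.ndarray, labels):
    """Centroids of each labeled region in a dense segmentation volume,
    giving 1-to-1 corresponding feature points across subjects."""
    return np.array([np.argwhere(seg == l).mean(axis=0) for l in labels])

def affine_from_centroids(src: np.ndarray, dst: np.ndarray):
    """Closed-form least-squares affine (A, t) mapping source centroids
    to corresponding target centroids. src/dst are (K, 3); at least 4
    non-coplanar correspondences are needed in 3D."""
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous (K, 4)
    M, *_ = np.linalg.lstsq(src_h, dst, rcond=None)    # (4, 3)
    return M[:3].T, M[3]                               # A (3, 3), t (3,)
```

Applying the same solve to local neighborhoods of centroids would yield the collection of local affines that a polyaffine transform fuses.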

Yuchang Jiang, Anton Raichuk, Xiaoye Tong, Vivien Sainte Fare Garnot, Daniel Ortiz-Gonzalo, Dan Morris, Konrad Schindler, Jan Dirk Wegner, Maxim Neumann

Main category: cs.CV

TL;DR: First 10m-resolution tree crop map for South America using multi-modal deep learning on Sentinel satellite imagery, revealing 11M hectares of tree crops with 23% linked to forest loss, and highlighting issues with EUDR regulatory maps misclassifying agriculture as forest.

DetailsMotivation: To support zero-deforestation policies like EUDR by addressing the lack of high-resolution data distinguishing diverse agricultural systems from forests, which hinders effective monitoring of tree crop expansion and deforestation.

Method: Multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series to generate 10m-resolution tree crop maps for South America.

Result: Identified approximately 11 million hectares of tree crops, with 23% linked to 2000-2020 forest cover loss. Found that existing EUDR regulatory maps often misclassify established agriculture (especially smallholder agroforestry) as “forest”.

Conclusion: The high-resolution baseline map mitigates risks of false deforestation alerts and unfair penalties for small-scale farmers, supporting more effective, inclusive, and equitable conservation policies.

Abstract: Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union’s Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as “forest”. This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.

[112] DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

Main category: cs.CV

TL;DR: DRetHTR is a decoder-only handwritten text recognition model using Retentive Networks instead of Transformers, achieving similar accuracy with 1.6-1.9x faster inference and 38-42% less memory by eliminating the growing KV cache problem.

DetailsMotivation: Transformers in HTR systems suffer from growing key-value cache during decoding, making inference slow and memory-intensive. The authors aim to maintain Transformer-level accuracy while improving efficiency through alternative architectures.

Method: Uses Retentive Networks (RetNet) with softmax-free retention instead of attention, injects multi-scale sequential priors, and employs layer-wise gamma scaling to progressively enlarge retention horizon across layers for local-to-global modeling.

Result: Achieves best reported test character error rates: 2.26% (IAM-A, en), 1.81% (RIMES, fr), 3.46% (Bentham, en), and competitive 4.21% (READ-2016, de). Provides 1.6-1.9x faster inference with 38-42% less memory usage compared to Transformer baseline.

Conclusion: Decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency, demonstrating the viability of alternative architectures for efficient sequence modeling in document understanding tasks.

Abstract: State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
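The core of retention, and of layer-wise gamma scaling, can be seen in the decay mask: each position attends to the past with exponentially decaying weights, and deeper layers get a gamma closer to 1 so their effective horizon widens. A sketch with illustrative values (the schedule below is assumed, not the paper's):

```python
import torch

def retention_decay(seq_len: int, gamma: float) -> torch.Tensor:
    """Causal decay matrix D[n, m] = gamma**(n - m) for m <= n, used in
    place of softmax attention; evaluating the same operator
    recurrently is what makes decoding linear in output length."""
    pos = torch.arange(seq_len)
    diff = (pos[:, None] - pos[None, :]).float()
    decay = gamma ** diff.clamp(min=0)
    return torch.where(diff >= 0, decay, torch.zeros(()))

# Illustrative layer-wise gamma scaling: early layers stay local,
# deeper layers approach a near-global horizon.
masks = [retention_decay(256, g) for g in (0.90, 0.95, 0.98, 0.995)]
```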

[113] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: SpectralGCD is an efficient multimodal approach for Generalized Category Discovery that uses CLIP cross-modal similarities as unified representations, with spectral filtering and knowledge distillation to maintain semantic quality.

DetailsMotivation: Existing GCD methods either overfit to known classes when using only image features, or are computationally expensive when using multimodal approaches that treat modalities independently. There's a need for an efficient multimodal GCD method that properly integrates cross-modal information.

Method: Uses CLIP cross-modal image-concept similarities as unified representations, expresses images as mixtures over semantic concepts from a large dictionary, introduces Spectral Filtering using cross-modal covariance matrix to retain relevant concepts, and employs forward/reverse knowledge distillation from a teacher model to maintain semantic quality and alignment.

Result: Achieves accuracy comparable or superior to state-of-the-art methods across six benchmarks while requiring only a fraction of the computational cost.

Conclusion: SpectralGCD provides an effective and efficient multimodal solution for Generalized Category Discovery that properly leverages cross-modal information while maintaining computational efficiency.

Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
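The unified cross-modal representation is simple to sketch. Below, concepts are scored by their variance energy across the pool, a simplified stand-in for the paper's covariance-spectrum criterion; `keep` and `tau` are assumed hyperparameters.

```python
import torch

def concept_representation(img_feats, concept_feats, keep=256, tau=0.01):
    """Represent images by softmaxed similarities to a concept
    dictionary, then retain the `keep` concepts with the largest
    variance across the pool -- a simplified proxy for Spectral
    Filtering. Inputs are L2-normalized CLIP features: img_feats (N, D),
    concept_feats (C, D)."""
    sims = torch.softmax(img_feats @ concept_feats.T / tau, dim=-1)  # (N, C)
    centered = sims - sims.mean(dim=0, keepdim=True)
    energy = centered.pow(2).sum(dim=0)       # per-concept variance mass
    idx = energy.topk(keep).indices
    return sims[:, idx], idx                  # filtered cross-modal rep.
```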

[114] A High-Level Survey of Optical Remote Sensing

Panagiotis Koletsis, Vasilis Efthymiou, Maria Vakalopoulou, Nikos Komodakis, Anastasios Doulamis, Georgios Th. Papadopoulos

Main category: cs.CV

TL;DR: A comprehensive survey paper on optical remote sensing capabilities using RGB cameras on drones, covering diverse tasks, datasets, and methodologies to guide researchers entering the field.

DetailsMotivation: The paper aims to address the lack of a holistic survey in optical remote sensing despite significant advances in computer vision and widespread drone adoption. With most drones equipped with RGB cameras and a vast literature on diverse tasks, there's a need for a comprehensive guide to help researchers navigate the field efficiently.

Method: The authors conduct a comprehensive literature review and survey of optical remote sensing capabilities, organizing information about various tasks, methodologies, datasets, and key insights. They structure this information to provide a holistic perspective of the field.

Result: The paper presents a comprehensive overview of optical remote sensing capabilities, including diverse tasks and methodologies, available datasets, and key insights. It serves as a guide for researchers to understand the field’s landscape and focus on relevant areas.

Conclusion: This survey fills a gap by providing the first holistic perspective on optical remote sensing capabilities, offering valuable guidance for researchers entering the field and helping them navigate the extensive literature efficiently.

Abstract: In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.

[115] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng, Xilang Huang, Seon Han Choi

Main category: cs.CV

TL;DR: EAGLE is a tuning-free framework that uses expert model outputs to guide multimodal LLMs for industrial anomaly detection with interpretable descriptions, improving detection accuracy without parameter updates.

DetailsMotivation: Existing deep learning approaches for industrial anomaly detection provide limited semantic explanations, while MLLMs require costly fine-tuning and don't consistently improve accuracy over lightweight detectors.

Method: Proposes expert-augmented attention guidance (EAGLE) - a tuning-free framework that integrates outputs from expert models to guide MLLMs toward accurate detection and interpretable anomaly descriptions, with analysis of attention distribution in intermediate layers.

Result: EAGLE improves anomaly detection performance across multiple MLLMs without parameter updates, achieving results comparable to fine-tuning based methods on MVTec-AD and VisA datasets.

Conclusion: EAGLE enables MLLMs to provide both accurate detection and interpretable descriptions for industrial anomaly detection without requiring fine-tuning, with successful detection associated with increased attention concentration on anomalous regions.

Abstract: Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from expert models to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining the attention distribution of MLLMs over the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at https://github.com/shengtun/Eagle

[116] 4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng

Main category: cs.CV

TL;DR: Local-EndoGS: A 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion using progressive window-based global representation and coarse-to-fine optimization.

DetailsMotivation: Existing methods for deformable surgical scene reconstruction from endoscopic videos have limitations: they typically require fixed viewpoints, rely on stereo depth priors or accurate structure-from-motion initialization, and struggle with monocular sequences with large camera motion in real clinical settings.

Method: Proposes Local-EndoGS with: 1) Progressive window-based global representation allocating local deformable scene models to each observed window for scalability; 2) Coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors for robust initialization; 3) Long-range 2D pixel trajectory constraints and physical motion priors for deformation plausibility.

Result: Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate effectiveness of key designs.

Conclusion: Local-EndoGS enables high-quality 4D reconstruction from monocular endoscopic sequences with arbitrary camera motion, addressing limitations of existing methods and showing superior performance on challenging clinical datasets.

Abstract: Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.

[117] QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

Xuan-Bac Nguyen, Hoang-Quan Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu

Main category: cs.CV

TL;DR: Physics-aware multimodal framework for characterizing 2D quantum materials from optical microscopy images using synthetic data generation, instruction tuning, and physics-informed attention.

DetailsMotivation: Current vision models struggle with characterizing 2D quantum materials from optical microscopy due to subtle contrast variations, limited labeled data, lack of physical priors, and poor generalization across different materials and imaging setups.

Method: Four-part framework: 1) Synthia - physics-based synthetic data generator simulating optical responses; 2) QMat-Instruct - large-scale multimodal instruction dataset; 3) QuPAINT - physics-aware instruction tuning with Physics-Informed Attention module; 4) QF-Bench - comprehensive benchmark for evaluation.

Result: The framework produces diverse synthetic data, enables MLLMs to understand material appearance and thickness, and provides robust flake representations through physics-informed attention fusion.

Conclusion: The physics-aware multimodal approach addresses key limitations in quantum material characterization by combining synthetic data generation, instruction tuning, and physical priors for improved generalization across materials and imaging conditions.

Abstract: Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.

[118] Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Yichen Lu, Siwei Nie, Minlong Lu, Xudong Yang, Xiaobo Zhang, Peng Zhang

Main category: cs.CV

TL;DR: PixTrace and CopyNCE method improves image copy detection by using pixel coordinate tracking and geometrically-guided contrastive learning to handle sophisticated image edits.

DetailsMotivation: Existing self-supervised learning methods for image copy detection struggle with sophisticated edits due to insufficient fine-grained correspondence learning between edited image pairs.

Method: Proposes PixTrace (pixel coordinate tracking module) to maintain explicit spatial mappings across editing transformations, and CopyNCE (geometrically-guided contrastive loss) that regularizes patch affinity using overlap ratios from verified mappings.

Result: Achieves state-of-the-art performance: 88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset, with better interpretability.

Conclusion: The method successfully bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training for improved image copy detection.

Abstract: Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace - a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace’s verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.
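The patch-affinity regularization reads naturally as a soft-target contrastive loss in which traced overlap ratios replace the usual one-hot positives. This is our reading of CopyNCE, not the paper's exact loss; the temperature and normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def copy_nce(q, k, overlap, tau=0.07):
    """Soft-label InfoNCE over patch embeddings. q, k: (P, D) patch
    features of the original and edited views; overlap[i, j] in [0, 1]
    is the traced pixel-overlap ratio between patch i and patch j
    (from a PixTrace-style coordinate map). Rows of `overlap` are
    normalized into target distributions."""
    logits = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).T) / tau
    target = overlap / overlap.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return F.cross_entropy(logits, target)    # accepts class probabilities
```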

[119] FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

Hanyuan Zhang, Lucas He, Runlong He, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

Main category: cs.CV

TL;DR: A surgical AR system for laparoscopic liver surgery that uses depth maps and foundation pose estimation for camera-liver alignment, replacing complex finite-element models with simpler non-rigid iterative closest point (NICP) for deformable registration.

DetailsMotivation: To create a more accessible and engineering-friendly augmented reality system for tumor localization in laparoscopic liver surgery by reducing the complexity and expertise requirements of traditional finite-element-based deformable registration methods.

Method: Integrates laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation, and replaces finite-element-based deformation with non-rigid iterative closest point (NICP) for deformable registration, combining rigid and NICP registration.

Result: Achieved 9.91 mm mean registration error on real patient data across 3 cases, with combined rigid-NICP registration outperforming rigid-only registration, demonstrating clinically relevant accuracy.

Conclusion: The proposed pipeline offers a lightweight, engineering-friendly alternative to finite-element-based deformation models while maintaining clinically relevant accuracy for tumor localization in laparoscopic liver surgery.

Abstract: Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
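The rigid stage that precedes the NICP refinement is standard closest-point matching plus a Kabsch/SVD solve; a sketch follows. The NICP step itself, which solves per-vertex affines under a stiffness regularizer, is omitted here.

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_icp_step(src, dst):
    """One ICP iteration aligning preoperative model points `src`
    (N, 3) to depth-map points `dst` (M, 3): match each source point to
    its nearest target, then solve the best rigid (R, t) via SVD."""
    matches = dst[cKDTree(dst).query(src)[1]]
    mu_s, mu_d = src.mean(axis=0), matches.mean(axis=0)
    H = (src - mu_s).T @ (matches - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return src @ R.T + t, R, t                # updated points, pose
```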

[120] LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge

Main category: cs.CV

TL;DR: LATA is a training-free, label-free refinement method that improves medical vision-language models’ uncertainty calibration under domain shift using graph smoothing and conformal prediction, reducing prediction set sizes while maintaining coverage guarantees.

DetailsMotivation: Medical VLMs need reliable uncertainty calibration under domain shifts, but existing conformal prediction methods produce large prediction sets with imbalanced class coverage, especially in few-shot imbalanced scenarios. Adapting to calibration labels breaks exchangeability and voids guarantees.

Method: LATA uses Laplacian-assisted transductive adaptation to smooth zero-shot probabilities over an image-image k-NN graph with CCCP mean-field updates, preserving SCP validity via deterministic transform. Includes failure-aware conformal score for instance-level difficulty and label plausibility.

Result: Across 3 medical VLMs and 9 downstream tasks, LATA consistently reduces prediction set size and class-conditioned coverage variance while matching or tightening target coverage, outperforming prior transductive baselines with far less compute.

Conclusion: LATA effectively sharpens zero-shot predictions without compromising exchangeability, providing a black-box, compute-light solution for improving medical VLM reliability under domain shift.

Abstract: Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced, i.e., a high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
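Two pieces are easy to sketch: label-free smoothing over an image-image k-NN graph (plain averaging below stands in for the CCCP mean-field updates) and the standard split-conformal thresholding it feeds into. Hyperparameters are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_smooth(probs, feats, k=10, alpha=0.5, iters=3):
    """Smooth zero-shot class probabilities (N, C) over the joint
    calibration+test pool using image features (N, D). Being a
    deterministic, label-free transform of the pooled inputs, it
    leaves exchangeability intact."""
    nbrs = cKDTree(feats).query(feats, k=k + 1)[1][:, 1:]  # drop self
    p = probs.copy()
    for _ in range(iters):
        p = (1 - alpha) * probs + alpha * p[nbrs].mean(axis=1)
        p /= p.sum(axis=1, keepdims=True)
    return p

def conformal_sets(p_cal, y_cal, p_test, alpha=0.1):
    """Split conformal prediction with score 1 - p(true class)."""
    scores = 1.0 - p_cal[np.arange(len(y_cal)), y_cal]
    level = min(1.0, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))
    qhat = np.quantile(scores, level)
    return p_test >= 1.0 - qhat               # boolean prediction-set mask
```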

[121] GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

Zixu Cheng, Da Li, Jian Hu, Ziquan Liu, Wei Li, Shaogang Gong

Main category: cs.CV

TL;DR: GraphThinker: Reinforcement finetuning method that constructs event-level scene graphs and enhances visual grounding to reduce hallucinations in video reasoning by MLLMs.

DetailsMotivation: Video reasoning requires understanding causal relationships between events, but these are often implicit and costly to annotate. Existing MLLMs infer event relations through dense captions or summaries but lack explicit causal structure modeling, leading to hallucinations during video reasoning.

Method: Proposes GraphThinker with two key components: 1) Uses MLLM to construct event-based video scene graph (EVSG) that explicitly models intra- and inter-event relations, incorporating these as intermediate thinking process; 2) Introduces visual attention reward during reinforcement finetuning to strengthen video grounding and mitigate hallucinations.
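
As an illustration of what an event-based video scene graph might contain, here is a toy EVSG with intra-event object relations and an inter-event causal edge; the schema and all field names are hypothetical, since the summary does not give the paper's exact graph format.

```python
# Hypothetical EVSG structure: nodes are events carrying intra-event object
# relations; "event_relations" holds inter-event (e.g., causal) edges.
evsg = {
    "events": [
        {"id": 0, "span": [0.0, 4.2],
         "objects": ["person", "cup"],
         "relations": [("person", "picks_up", "cup")]},
        {"id": 1, "span": [4.2, 9.0],
         "objects": ["person", "sink"],
         "relations": [("person", "walks_to", "sink")]},
    ],
    "event_relations": [(0, "causes", 1)],   # inter-event causal edge
}
```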

Result: Evaluated on RexTime and VidHalluc datasets, GraphThinker shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.

Conclusion: GraphThinker effectively addresses hallucination issues in video reasoning by explicitly modeling causal structures through event-based scene graphs and enhancing visual grounding via reinforcement finetuning.

Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.

[122] RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao

Main category: cs.CV

TL;DR: RetouchIQ is a framework that uses MLLM agents guided by a generalist reward model for instruction-based executable image editing, improving over previous methods by providing flexible, case-by-case evaluation through multimodal reasoning.

DetailsMotivation: Current MLLM-based image editing lacks reliable reward signals for RL training due to the subjective nature of creative editing. Rule-based rewards using fixed reference images and handcrafted metrics are insufficient for capturing aesthetic goals and instruction consistency.

Method: RetouchIQ uses MLLM agents to interpret editing intentions and generate executable adjustments. It introduces a generalist reward model (RL fine-tuned MLLM) that evaluates edited results through generated metrics on a case-by-case basis, providing scalar feedback through multimodal reasoning for RL training.

Result: RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. The framework establishes a new benchmark for instruction-based image editing with a curated dataset of 190k instruction-reasoning pairs.

Conclusion: Generalist reward-driven MLLM agents show potential as flexible, explainable, and executable assistants for professional image editing, bridging high-level aesthetic goals with precise parameter control through multimodal reasoning.

Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

[123] Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

Main category: cs.CV

TL;DR: A geospatial discovery framework combining active learning, meta-learning, and concept-guided reasoning for efficient target discovery in resource-constrained environments with sparse data.

DetailsMotivation: Real-world applications like environmental monitoring and disaster response face challenges with costly data collection, dynamic environments, and sparse/biased ground truth data, limiting existing learning-based methods like reinforcement learning.

Method: Unified framework integrating active learning, online meta-learning, and concept-guided reasoning using concept relevance. Two key innovations: 1) concept-weighted uncertainty sampling where uncertainty is modulated by learned relevance from domain concepts, and 2) relevance-aware meta-batch formation promoting semantic diversity during online meta-updates.
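
A minimal sketch of concept-weighted uncertainty sampling as described: predictive entropy at each candidate location is modulated by a learned relevance score derived from domain concepts. The sigmoid relevance head and all names below are our assumptions.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive entropy per location from class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def concept_weighted_scores(probs, concepts, relevance):
    """Acquisition score: uncertainty modulated by learned concept relevance.

    probs:     (N, C) model class probabilities per candidate location
    concepts:  (N, K) domain-concept features (e.g., land cover, source proximity)
    relevance: (K,)   learned weights mapping concepts to target relevance
    """
    rel = 1 / (1 + np.exp(-(concepts @ relevance)))   # sigmoid relevance in [0, 1]
    return entropy(probs) * rel

# pick the next locations to sample
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(2), size=1000)          # binary presence/absence
concepts = rng.normal(size=(1000, 8))
relevance = rng.normal(size=8)
next_sites = np.argsort(-concept_weighted_scores(probs, concepts, relevance))[:10]
```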

Result: Tested on real-world PFAS contamination dataset, showing reliable target uncovering with limited data in varying environments.

Conclusion: The proposed framework effectively addresses sparse data challenges in geospatial discovery through concept-guided active learning and meta-learning strategies.

Abstract: In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of concept relevance, which captures how domain-specific factors influence target presence: a concept-weighted uncertainty sampling strategy, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a relevance-aware meta-batch formation strategy that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method’s reliability at uncovering targets with limited data and a varying environment.

[124] CORAL: Correspondence Alignment for Improved Virtual Try-On

Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam, Tongmin Kim, Jongjae Park, Hyeonwoo Kang, Seungryong Kim

Main category: cs.CV

TL;DR: CORAL is a DiT-based virtual try-on framework that explicitly aligns person-garment correspondence through attention matching and entropy minimization, improving detail preservation in unpaired settings.

DetailsMotivation: Existing VTON methods struggle to preserve fine garment details in unpaired settings due to lack of explicit person-garment alignment enforcement and unclear correspondence emergence in Diffusion Transformers.

Method: Analyzes full 3D attention in DiT-based architectures, revealing correspondence depends on query-key matching. Introduces CORAL with correspondence distillation loss (aligns reliable matches with attention) and entropy minimization loss (sharpens attention distribution).
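
The two losses can be sketched roughly as follows, assuming person-to-garment attention rows and an external correspondence estimator that supplies a matched garment token per person query; the shapes and the reliability mask are our assumptions.

```python
import torch
import torch.nn.functional as F

def coral_losses(attn, corr_targets, valid, eps=1e-8):
    """Sketch of CORAL's two attention losses (our reading of the summary).

    attn:         (B, Q, K) person-to-garment attention rows (sum to 1 over K)
    corr_targets: (B, Q)    index of the garment token matched to each person
                            query by an external correspondence estimator
    valid:        (B, Q)    bool mask of matches deemed reliable
    """
    logp = torch.log(attn + eps)
    # correspondence distillation: cross-entropy to the external match
    ce = F.nll_loss(logp[valid], corr_targets[valid])
    # entropy minimization: sharpen every attention row
    ent = -(attn * logp).sum(dim=-1).mean()
    return ce, ent

B, Q, K = 2, 64, 64
attn = torch.softmax(torch.randn(B, Q, K), dim=-1)
targets = torch.randint(0, K, (B, Q))
valid = torch.rand(B, Q) > 0.5
ce, ent = coral_losses(attn, targets, valid)
loss = ce + 0.1 * ent   # 0.1: assumed loss weighting
```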

Result: CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate the design choices, and the paper also proposes a VLM-based evaluation protocol.

Conclusion: Explicit alignment of query-key matching with external correspondences in DiT-based frameworks significantly improves virtual try-on performance, especially for detail preservation in unpaired settings.

Abstract: Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.

[125] IntRec: Intent-based Retrieval with Contrastive Refinement

Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu

Main category: cs.CV

TL;DR: IntRec is an interactive object retrieval framework that refines predictions using user feedback through dual memory sets for positive and negative cues, enabling fine-grained disambiguation in cluttered scenes.

DetailsMotivation: Existing open-vocabulary detectors operate in a one-shot manner and lack the ability to refine predictions based on user feedback, making them ineffective for ambiguous queries or scenes with multiple similar objects.

Method: Proposes IntRec with an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). Uses a contrastive alignment function to rank candidate objects by maximizing similarity to positive cues while penalizing rejected ones.
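
A minimal sketch of the contrastive alignment step, assuming candidate, positive, and negative embeddings live in a shared space; taking the max similarity over each memory set is our simplification of whatever aggregation the paper actually uses.

```python
import torch
import torch.nn.functional as F

def intent_scores(candidates, positives, negatives, beta=1.0):
    """Rank candidate objects by similarity to confirmed positive cues
    while penalizing similarity to rejected negative cues (our sketch).

    candidates: (N, D) candidate object embeddings
    positives:  (P, D) embeddings of confirmed cues (may be empty)
    negatives:  (M, D) embeddings of rejected hypotheses (may be empty)
    """
    c = F.normalize(candidates, dim=-1)
    score = torch.zeros(len(candidates))
    if len(positives):
        score += (c @ F.normalize(positives, dim=-1).T).max(dim=1).values
    if len(negatives):
        score -= beta * (c @ F.normalize(negatives, dim=-1).T).max(dim=1).values
    return score

# after each round of user feedback, re-rank the detector's candidates
cands = torch.randn(50, 256)
pos, neg = torch.randn(2, 256), torch.randn(3, 256)
ranked = torch.argsort(intent_scores(cands, pos, neg), descending=True)
```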

Result: On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 respectively. On LVIS-Ambiguous benchmark, improves by +7.9 AP over one-shot baseline after single corrective feedback with <30ms added latency per interaction.

Conclusion: IntRec provides substantial improvements in retrieval accuracy without additional supervision through interactive refinement based on user feedback, enabling fine-grained disambiguation in cluttered scenes.

Abstract: Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.

[126] Human-level 3D shape perception emerges from multi-view learning

Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa

Main category: cs.CV

TL;DR: Neural networks trained on visual-spatial objectives without object biases can achieve human-level 3D shape inference from 2D images, matching human accuracy and predicting fine-grained behavioral patterns.

DetailsMotivation: To develop computational models that can match human ability to infer 3D structure from 2D visual inputs, which has been a longstanding challenge in visual intelligence research.

Method: Train neural networks using visual-spatial objectives on naturalistic multi-view image data to predict camera location and visual depth without object-related inductive biases, then evaluate zero-shot on 3D perception tasks.

Result: Models match human accuracy on 3D shape inferences without task-specific training, and model responses predict human error patterns and reaction times, revealing correspondence between model dynamics and human perception.

Conclusion: Human-level 3D perception can emerge from simple, scalable learning objectives over naturalistic visual-spatial data, suggesting fundamental principles of visual intelligence.

Abstract: Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these `multi-view’ models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

[127] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding

Main category: cs.CV

TL;DR: A study of counterfactual failures in Vision-Language-Action models where they ignore language instructions and rely on visual shortcuts, with a proposed Counterfactual Action Guidance method to improve language following.

DetailsMotivation: Vision-Language-Action models often fail to faithfully follow language instructions when presented with instructions lacking strong scene-specific supervision, instead acting based on vision shortcuts induced by dataset biases.

Method: Introduces LIBERO-CF benchmark for evaluating counterfactual failures and proposes Counterfactual Action Guidance (CAG) - a dual-branch inference scheme combining a standard VLA policy with a language-unconditioned Vision-Action module for counterfactual comparison during action selection.
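
The dual-branch comparison can be sketched as a classifier-free-guidance-style contrast between the language-conditioned and language-unconditioned branches; the log-probability interface and the guidance weight `alpha` are assumptions, not the paper's exact scheme.

```python
import torch

def cag_select(actions, logp_vla, logp_va, alpha=1.0):
    """Counterfactual Action Guidance, as we read it: prefer action candidates
    whose likelihood depends on the instruction, not on visual shortcuts.

    actions:  (N, A) candidate action vectors
    logp_vla: (N,)   log-prob of each candidate under the language-conditioned VLA
    logp_va:  (N,)   log-prob under the language-unconditioned vision-action module
    alpha:    guidance strength (assumed hyperparameter)
    """
    guided = logp_vla + alpha * (logp_vla - logp_va)
    return actions[guided.argmax()]

actions = torch.randn(16, 7)                  # e.g., 7-DoF action candidates
act = cag_select(actions, torch.randn(16), torch.randn(16))
```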

Result: CAG improves language following accuracy by 9.7% and task success by 3.6% on under-observed tasks, with further gains of 15.5% and 8.5% when paired with a VA model. Real-world evaluations show 9.4% reduction in counterfactual failures and 17.2% average improvement in task success.

Conclusion: Counterfactual failures are prevalent in VLAs, and the proposed CAG method effectively addresses this issue through simple plug-and-play integration that improves language following without additional training or architecture changes.

Abstract: Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study these failures systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.

[128] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan

Main category: cs.CV

TL;DR: OpenEarthAgent is a tool-augmented geospatial agent framework for multimodal reasoning with satellite imagery, natural language queries, and structured reasoning traces in remote sensing applications.

DetailsMotivation: Extending multimodal reasoning capabilities to remote sensing domain is challenging due to requirements for spatial scale reasoning, geographic structures, multispectral indices, and multi-step logic. Current models lack specialized training for geospatial analysis with satellite imagery.

Method: Unified framework using supervised fine-tuning over structured reasoning trajectories. Trained on 14,538 instances with 100K+ reasoning steps, incorporating GIS operations and spectral indices (NDVI, NBR, NDBI). Aligns model with verified multi-step tool interactions across diverse analytical contexts.
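
The spectral indices named in the summary have standard definitions, which a tool-augmented agent could expose as callable tools; the band names below follow common remote-sensing convention (red, NIR, SWIR), and the 0.3 vegetation threshold is illustrative.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red + eps)

def nbr(nir, swir2, eps=1e-6):
    """Normalized Burn Ratio (fire/burn-scar mapping)."""
    return (nir - swir2) / (nir + swir2 + eps)

def ndbi(swir1, nir, eps=1e-6):
    """Normalized Difference Built-up Index."""
    return (swir1 - nir) / (swir1 + nir + eps)

# toy (H, W) bands; in practice these come from the satellite product's
# red / NIR / SWIR channels
rng = np.random.default_rng(0)
red, nir, swir1, swir2 = rng.uniform(0, 1, size=(4, 64, 64))
vegetation = ndvi(nir, red) > 0.3   # a common (assumed) threshold
```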

Result: Demonstrates structured reasoning, stable spatial understanding, and interpretable behavior through tool-driven geospatial interactions. Shows consistent improvements over baselines and competitive performance relative to open/closed-source models.

Conclusion: OpenEarthAgent successfully bridges the gap in multimodal reasoning for remote sensing, enabling geospatial agents to interpret satellite imagery, connect with language, and perform structured analytical tasks with explicit reasoning traces.

Abstract: Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.

[129] MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints

Cong Wang, Yu-Ping Wang, Dinesh Manocha

Main category: cs.CV

TL;DR: MotionHint: Self-supervised monocular visual odometry using motion constraints to help overcome local minima in existing SSM-VO systems.

DetailsMotivation: Existing self-supervised monocular visual odometry (SSM-VO) algorithms often get stuck in local minima within their self-supervised loss functions, limiting performance. The paper aims to address this by incorporating motion constraints to guide optimization.

Method: Proposes MotionHint algorithm that uses a neural network (PPnet) to predict next camera pose and uncertainty. Combines original self-supervised loss with motion loss (weighted difference between prediction and generated ego-motion) to overcome local minima.
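
A plausible reading of the motion loss is a heteroscedastic (uncertainty-weighted) penalty between PPnet's predicted next pose and the VO system's generated ego-motion; the log-variance parameterization and 6-DoF pose format are our assumptions.

```python
import torch

def motion_loss(pred_pose, pred_logvar, ego_pose):
    """Uncertainty-weighted motion loss (our sketch of the summary):
    penalize the gap between PPnet's predicted next pose and the ego-motion
    produced by the VO system, down-weighted where PPnet is uncertain.

    pred_pose:   (B, 6) predicted next camera pose (e.g., axis-angle + t)
    pred_logvar: (B, 6) predicted log-variance of that prediction
    ego_pose:    (B, 6) ego-motion generated by the SSM-VO system
    """
    sq = (pred_pose - ego_pose) ** 2
    return (torch.exp(-pred_logvar) * sq + pred_logvar).mean()

B = 4
pose_hat, logvar_hat, ego = torch.randn(B, 6), torch.randn(B, 6), torch.randn(B, 6)
l_motion = motion_loss(pose_hat, logvar_hat, ego)
# total objective: original self-supervised loss plus the weighted motion loss
# total = l_ssl + w * l_motion   (w: assumed weighting hyperparameter)
```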

Result: On KITTI benchmark, MotionHint reduces Absolute Trajectory Error (ATE) by up to 28.73% when applied to existing state-of-the-art SSM-VO systems, showing significant performance improvement.

Conclusion: MotionHint effectively incorporates motion constraints to improve self-supervised monocular VO by helping overcome local minima, demonstrating easy applicability to existing systems with substantial performance gains.

Abstract: We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) that takes motion constraints into account. A key aspect of our approach is to use an appropriate motion model that can help existing self-supervised monocular VO (SSM-VO) algorithms to overcome issues related to the local minima within their self-supervised loss functions. The motion model is expressed with a neural network named PPnet. It is trained to coarsely predict the next pose of the camera and the uncertainty of this prediction. Our self-supervised approach combines the original loss and the motion loss, which is the weighted difference between the prediction and the generated ego-motion. Taking two existing SSM-VO systems as our baseline, we evaluate our MotionHint algorithm on the standard KITTI benchmark. Experimental results show that our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems to greatly improve the performance by reducing the resulting ATE by up to 28.73%.

[130] Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang

Main category: cs.CV

TL;DR: Neural point-based method for photorealistic dynamic human head rendering with improved handling of challenging facial regions like mouth interior, eyes, and hair/beard.

DetailsMotivation: Existing methods struggle with challenging facial regions (mouth interior, eyes, hair/beard) in AR/VR and video conferencing applications, resulting in unrealistic and blurry results. Need for better modeling of topologically changing regions and thin structures.

Method: Uses neural point representation with neural volume rendering, discarding mesh-based connectivity constraints. Neural points are constrained around target expression surface via high-resolution UV displacement map. Introduces patch-wise depth-guided sampling, lightweight radiance decoding, and Grid-Error-Patch ray sampling for efficiency.

Result: Outperforms previous state-of-the-art methods on Multiface dataset, especially in handling challenging facial regions. Demonstrates effectiveness in modeling topologically changing regions and ensuring accurate expression control.

Conclusion: Proposed method achieves more realistic dynamic human head rendering with better handling of complex facial features and improved efficiency through novel sampling and decoding strategies.

Abstract: Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose the Neural Point-based Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, NPVA is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions.

[131] LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference

Cong Wang, Yu-Ping Wang, Dinesh Manocha

Main category: cs.CV

TL;DR: LoLep: A novel view synthesis method that regresses locally-learned planes from single RGB images without depth information, using disparity sampling and geometric supervision for accurate scene representation.

DetailsMotivation: The paper addresses the challenge of novel view synthesis from single RGB images without depth information. Traditional methods struggle with regressing appropriate plane locations for accurate scene representation when depth data is unavailable.

Method: 1) Pre-partition disparity space into bins with a disparity sampler to regress local offsets for multiple planes per bin; 2) Two optimizing strategies combining different disparity distributions; 3) Occlusion-aware reprojection loss for geometric supervision; 4) Self-attention mechanism with Block-Sampling Self-Attention (BS-SA) module for large feature maps.
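
The bin-plus-offset plane placement might look like the following: disparity space is split into uniform bins, and a head regresses a tanh-bounded offset for each plane around its bin center. The bin count, bounds, and uniform partition are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn as nn

class DisparitySampler(nn.Module):
    """Sketch of LoLep's plane placement: pre-partition disparity space into
    bins and regress a bounded local offset for each plane within its bin."""
    def __init__(self, feat_dim=256, n_bins=8, planes_per_bin=4,
                 d_min=0.01, d_max=1.0):
        super().__init__()
        self.n_planes = n_bins * planes_per_bin
        edges = torch.linspace(d_min, d_max, n_bins + 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        self.register_buffer("centers",
                             centers.repeat_interleave(planes_per_bin))
        self.half_width = (edges[1] - edges[0]).item() / 2
        self.head = nn.Linear(feat_dim, self.n_planes)

    def forward(self, feat):                      # feat: (B, feat_dim)
        offsets = torch.tanh(self.head(feat)) * self.half_width
        return self.centers + offsets             # (B, n_planes) disparities

sampler = DisparitySampler()
disp = sampler(torch.randn(2, 256))
```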

Result: State-of-the-art results on different datasets: LPIPS reduction of 4.8%-9.0% and RV reduction of 73.9%-83.5% compared to MINE. Effective performance on real-world images demonstrated.

Conclusion: LoLep successfully addresses novel view synthesis from single RGB images without depth by learning local plane representations through disparity sampling and geometric supervision, achieving superior performance over existing methods.

Abstract: We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, using such a sampler alone prevents the network from converging; we further propose two optimizing strategies suited to the differing disparity distributions of datasets, together with an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits.

[132] MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

Cong Wang, Di Kang, He-Yi Sun, Shen-Han Qian, Zi-Xuan Wang, Linchao Bao, Song-Hai Zhang

Main category: cs.CV

TL;DR: MeGA: Hybrid Mesh-Gaussian Head Avatar that uses different representations for different head components (mesh for face, 3D Gaussians for hair) to achieve high-fidelity renderings and support editing tasks.

DetailsMotivation: Existing head avatar methods struggle to render all head components (skin, hair) with high quality simultaneously because they use a single representation for components with drastically different characteristics.

Method: Proposes a hybrid representation: enhanced FLAME mesh with UV displacement maps for facial geometry, deferred neural rendering for facial colors, and 3D Gaussian Splatting for static canonical hair with MLP-based deformation fields for dynamic expressions. Uses occlusion-aware blending to combine components.

Result: Outperforms previous state-of-the-art methods on NeRSemble dataset, generates higher-fidelity renderings for whole head, and supports various editing functionalities including hairstyle alteration and texture editing.

Conclusion: MeGA demonstrates that modeling different head components with suitable representations (mesh for face, 3D Gaussians for hair) enables high-fidelity head avatars that support downstream editing tasks.

Abstract: Creating high-fidelity head avatars from multi-view videos is a core issue for many AR/VR applications. However, existing methods usually struggle to obtain high-quality renderings for all different head components simultaneously since they use one single representation to model components with drastically different characteristics (e.g., skin vs. hair). In this paper, we propose a Hybrid Mesh-Gaussian Head Avatar (MeGA) that models different head components with more suitable representations. Specifically, we select an enhanced FLAME mesh as our facial representation and predict a UV displacement map to provide per-vertex offsets for improved personalized geometric details. To achieve photorealistic renderings, we obtain facial colors using deferred neural rendering and disentangle neural textures into three meaningful parts. For hair modeling, we first build a static canonical hair using 3D Gaussian Splatting. A rigid transformation and an MLP-based deformation field are further applied to handle complex dynamic expressions. Combined with our occlusion-aware blending, MeGA generates higher-fidelity renderings for the whole head and naturally supports more downstream tasks. Experiments on the NeRSemble dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods and supporting various editing functionalities, including hairstyle alteration and texture editing.

[133] Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis

Jingwei Guo, Yitai Cheng, Meihui Wang, Ilya Ilyankou, Natchapon Jongwiriyanurak, Xiaowei Gao, Nicola Christie, James Haworth

Main category: cs.CV

TL;DR: A novel framework for analyzing panoramic cycling videos to detect safety risks, improving object detection and tracking for vehicle overtaking detection.

DetailsMotivation: Cyclists face high injury risks but conventional crash data is too sparse. Panoramic video offers 360° views but existing computer vision models struggle with distortions, small objects, and boundary continuity in panoramic imagery.

Method: Three-step framework: (1) Enhance object detection by segmenting and projecting 360° images into sub-images, (2) Modify multi-object tracking to incorporate boundary continuity and object category information, (3) Validate through real-world vehicle overtaking detection using panoramic videos from London cyclists.
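
Step (1) can be approximated by the sketch below, which splits an equirectangular frame into overlapping horizontal sub-images and wraps across the left/right seam so boundary objects stay whole; note the paper projects the 360° image into sub-images, which may involve a proper perspective reprojection rather than this simple crop.

```python
import numpy as np

def split_panorama(img, n_views=4, overlap=0.25):
    """Split a 360-degree equirectangular frame into overlapping horizontal
    sub-images, wrapping across the left/right boundary so objects straddling
    the seam stay intact (a simplified stand-in for a full reprojection)."""
    h, w, _ = img.shape
    view_w = int(w / n_views * (1 + overlap))
    wrapped = np.concatenate([img, img[:, :view_w]], axis=1)  # wrap the seam
    starts = [int(i * w / n_views) for i in range(n_views)]
    return [wrapped[:, s:s + view_w] for s in starts]

frame = np.zeros((960, 1920, 3), dtype=np.uint8)
subs = split_panorama(frame)   # run the detector on each sub-image
```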

Result: Improved average precision across varying resolutions, 10.0% decrease in identification switches, 2.7% improvement in identification precision, and 0.82 F-score for overtaking detection task.

Conclusion: The proposed method effectively handles panoramic video challenges and demonstrates practical effectiveness for cycling safety applications through improved object detection and tracking performance.

Abstract: Cyclists face a disproportionate risk of injury, yet conventional crash records are too sparse to identify risk factors at fine spatial and temporal scales. Recently, naturalistic studies have used video data to capture the complex behavioural and infrastructural risk factors. A promising format is panoramic video, which can record 360° views around a rider. However, its use is limited by distortions, large numbers of small objects, and boundary continuity, which cannot be handled using existing computer vision models. This research proposes a novel three-step framework: (1) enhancing object detection accuracy on panoramic imagery by segmenting and projecting the original 360° images into sub-images; (2) modifying multi-object tracking models to incorporate boundary continuity and object category information; and (3) validating through a real-world application of vehicle overtaking detection. The methodology is evaluated using panoramic videos recorded by cyclists on London’s roadways under diverse conditions. Experimental results demonstrate improvements over baselines, achieving higher average precision across varying image resolutions. Moreover, the enhanced tracking approach yields a 10.0% decrease in identification switches and a 2.7% improvement in identification precision. The overtaking detection task achieves a high F-score of 0.82, illustrating the practical effectiveness of the proposed method in real-world cycling safety scenarios.

[134] Multi-View 3D Reconstruction using Knowledge Distillation

Aditya Dutt, Ishikaa Lunawat, Manpreet Kaur

Main category: cs.CV

TL;DR: Knowledge distillation pipeline using Dust3r as teacher to train efficient student models for 3D reconstruction from stereo images, comparing CNN and Vision Transformer architectures on 12Scenes dataset.

DetailsMotivation: Large foundation models like Dust3r produce high-quality 3D reconstructions but require significant inference time and compute resources, limiting practical applications like visual localization. Need efficient student models that maintain similar performance.

Method: Propose knowledge distillation pipeline with Dust3r as teacher. Train student models using 3D reconstructed points from Dust3r. Compare CNN-based and Vision Transformer architectures, with variations using pre-trained models vs from scratch. Use 12Scenes dataset. Perform ablation studies with hyperparameter tuning.
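
The distillation step reduces to regressing Dust3r's precomputed pointmaps; the toy CNN student and smooth-L1 target loss below are stand-ins to make that concrete, not the paper's architectures.

```python
import torch
import torch.nn as nn

class TinyStudent(nn.Module):
    """Stand-in student (the paper compares CNN and ViT students; this toy
    CNN only serves to make the distillation step concrete)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, pair):                    # pair: (B, 2, 3, H, W)
        x = pair.flatten(1, 2)                  # stack the stereo pair: (B, 6, H, W)
        return self.net(x).permute(0, 2, 3, 1)  # (B, H, W, 3) pointmap

student = TinyStudent()
pair = torch.randn(2, 2, 3, 64, 64)
teacher_pts = torch.randn(2, 64, 64, 3)         # precomputed Dust3r output
loss = nn.functional.smooth_l1_loss(student(pair), teacher_pts)
loss.backward()
```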

Result: Vision Transformer architecture shows best performance both visually and quantitatively compared to CNN-based models. Student models learn scene-specific representations and output 3D points with replicable performance similar to Dust3r.

Conclusion: Knowledge distillation effectively creates efficient student models for 3D reconstruction. Vision Transformers outperform CNNs in this distillation setup, offering promising direction for efficient visual understanding tasks.

Abstract: Large Foundation Models like Dust3r can produce high quality outputs such as pointmaps, camera intrinsics, and depth estimation, given stereo-image pairs as input. However, the application of these outputs on tasks like Visual Localization requires a large amount of inference time and compute resources. To address these limitations, in this paper, we propose the use of a knowledge distillation pipeline, where we aim to build a student-teacher model with Dust3r as the teacher and explore multiple architectures of student models that are trained using the 3D reconstructed points output by Dust3r. Our goal is to build student models that can learn scene-specific representations and output 3D points with performance comparable to Dust3r’s. We train our models on the 12Scenes dataset. We test two main architectures of models: a CNN-based architecture and a Vision Transformer based architecture. For each architecture, we also compare the use of pre-trained models against models built from scratch. We qualitatively compare the reconstructed 3D points output by the student model against Dust3r’s and discuss the various features learned by the student model. We also perform ablation studies on the models through hyperparameter tuning. Overall, we observe that the Vision Transformer presents the best performance visually and quantitatively.

[135] Simple Self Organizing Map with Vision Transformers

Alan Luo, Kaiwen Yuan

Main category: cs.CV

TL;DR: ViTs underperform on small datasets due to lack of inductive biases; Self-Organizing Maps (SOMs) offer topology-preserving properties that can complement ViTs, leading to synergistic improvements in both unsupervised and supervised tasks.

DetailsMotivation: Vision Transformers lack inductive biases that make them underperform on smaller datasets, while Self-Organizing Maps inherently preserve topology and spatial organization but haven't been well integrated with modern deep learning architectures like ViTs.

Method: The paper explores how Vision Transformers and Self-Organizing Maps can mutually empower each other, bridging the gap between these architectures to leverage SOMs’ topology-preserving properties with ViTs’ modern deep learning capabilities.
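
One natural coupling, shown below as a sketch, is to run a classic SOM over frozen ViT embeddings: the decaying learning rate, Gaussian neighborhood, and grid size are textbook SOM choices, and how the paper actually interleaves the two architectures may differ.

```python
import numpy as np

def train_som(feats, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Classic SOM trained on (e.g., ViT [CLS]) embeddings: a minimal sketch
    of coupling the two, not the paper's exact integration."""
    rng = np.random.default_rng(seed)
    h, w = grid
    d = feats.shape[1]
    weights = rng.normal(size=(h * w, d))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), -1).reshape(-1, 2)
    for t in range(iters):
        x = feats[rng.integers(len(feats))]
        bmu = np.argmin(((weights - x) ** 2).sum(1))     # best-matching unit
        frac = t / iters
        lr = lr0 * (1 - frac)                            # decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3               # shrinking neighborhood
        dist2 = ((coords - coords[bmu]) ** 2).sum(1)
        nbr = np.exp(-dist2 / (2 * sigma ** 2))          # Gaussian neighborhood
        weights += lr * nbr[:, None] * (x - weights)     # pull toward the input
    return weights.reshape(h, w, d)

feats = np.random.default_rng(1).normal(size=(500, 768))  # stand-in ViT features
som = train_som(feats)
```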

Result: The architectures synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks.

Conclusion: Combining ViTs with SOMs addresses ViTs’ limitations on small datasets while modernizing SOMs with deep learning architectures, creating a mutually beneficial framework for vision tasks.

Abstract: Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs on limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration of how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code is publicly available on GitHub.

[136] Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction

Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki

Main category: cs.CV

TL;DR: DeCon is an encoder-decoder self-supervised learning framework that jointly pre-trains both encoder and decoder using contrastive learning for improved dense prediction tasks.

DetailsMotivation: Current contrastive SSL methods focus only on pre-training encoders, while decoders are trained separately for downstream tasks. This overlooks the benefits of joint encoder-decoder pre-training for dense prediction tasks.

Method: Extends SSL architectures to support diverse decoders and their contrastive losses. Introduces weighted encoder-decoder contrastive loss with non-competing objectives for joint pre-training.
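
At its simplest, the joint objective is a weighted sum of encoder-level and decoder-level contrastive terms; the InfoNCE form and the two weights below are our assumptions about what "non-competing objectives" amounts to.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """Standard InfoNCE between two augmented views' embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    return F.cross_entropy(logits, torch.arange(len(z1)))

def decon_loss(enc1, enc2, dec1, dec2, w_enc=1.0, w_dec=1.0):
    """Our sketch of DeCon's joint objective: a weighted sum of an
    encoder-level and a decoder-level contrastive loss."""
    return w_enc * info_nce(enc1, enc2) + w_dec * info_nce(dec1, dec2)

B, D = 32, 128
loss = decon_loss(torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, D), torch.randn(B, D))
```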

Result: Achieves SOTA performance on most evaluated tasks when pre-trained on ImageNet-1K, COCO, and COCO+. Improves COCO object detection by +0.37 AP, instance segmentation by +0.32 AP, and semantic segmentation by +1.42 mIoU on Pascal VOC and +0.50 mIoU on Cityscapes.

Conclusion: Joint encoder-decoder pre-training significantly enhances representation quality for dense prediction tasks, with improvements generalizing across backbones, decoders, datasets, and persisting in out-of-domain scenarios.

Abstract: Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting a contrastive SSL framework for dense prediction, DeCon establishes consistent state-of-the-art performance on most of the evaluated tasks when pre-trained on ImageNet-1K, COCO and COCO+. Notably, when pre-training a ResNet-50 encoder on COCO dataset, DeCon improves COCO object detection and instance segmentation compared to the baseline framework by +0.37 AP and +0.32 AP, respectively, and boosts semantic segmentation by +1.42 mIoU on Pascal VOC and by +0.50 mIoU on Cityscapes. These improvements generalize across recent backbones, decoders, datasets, and dense tasks beyond segmentation and object detection, and persist in out-of-domain scenarios, including limited-data settings, demonstrating that joint pre-training significantly enhances representation quality for dense prediction. Code is available at https://github.com/sebquetin/DeCon.git.

[137] Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic

Main category: cs.CV

TL;DR: Introduces IVD benchmark for real-time multimodal AI conversation about live scenes, shows current models lag behind humans but fine-tuning helps close perceptual gaps.

DetailsMotivation: To assess whether AI models can engage in real-time conversations about live scenes using camera and microphone input, which is crucial for real-world AI assistants and humanoid robots.

Method: Created the Qualcomm Interactive Video Dataset (IVD) benchmark with question-answering setup where models must answer in real-time based on camera and audio input, then evaluated existing models and fine-tuned them.

Result: Existing models perform far below human level on real-time multimodal conversation tasks, but fine-tuning on IVD data significantly reduces the performance gap for many required perceptual skills.

Conclusion: Real-time multimodal conversation about live scenes remains challenging for current AI models, but targeted fine-tuning on appropriate datasets can substantially improve performance toward human-level interaction.

Abstract: AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.

[138] Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Chenhui Zhao, Yiwei Lyu, Asadur Chowdury, Edward Harake, Akhil Kondepudi, Akshay Rao, Xinhai Hou, Honglak Lee, Todd Hollon

Main category: cs.CV

TL;DR: HLIP introduces hierarchical attention for language-image pre-training on uncurated 3D medical imaging studies, achieving SOTA performance on brain MRI and head CT benchmarks.

DetailsMotivation: Current language-image pre-training for 3D medical imaging requires manual curation by radiologists, limiting scalability. The authors aim to enable pre-training directly on uncurated clinical studies to align with radiologist workflow and achieve scalability.

Method: Introduces Hierarchical attention for Language-Image Pre-training (HLIP) with a novel hierarchical attention mechanism that models the intrinsic hierarchy of radiology data: slice, scan, and study levels. Trained on large-scale uncurated datasets (220K brain MRI studies and 240K head CT studies).
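
A rough sketch of hierarchical attention over the radiology hierarchy: tokens attend within each scan, and scan summaries then attend across the study. This collapses the slice level into the token dimension and is our reading of the mechanism, not the paper's exact block.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Two-level attention sketch: within-scan, then across-scan (study)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.scan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.study_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (study=B, scans=S, tokens=T, D)
        B, S, T, D = x.shape
        flat = x.reshape(B * S, T, D)
        flat, _ = self.scan_attn(flat, flat, flat)       # within-scan attention
        scans = flat.reshape(B, S, T, D).mean(dim=2)     # pool to scan summaries
        study, _ = self.study_attn(scans, scans, scans)  # across-scan attention
        return study                          # (B, S, D) study-level features

feats = torch.randn(2, 4, 196, 256)           # 2 studies, 4 scans, 196 tokens
out = HierarchicalAttention()(feats)
```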

Result: Achieves state-of-the-art performance: +10.5% balanced ACC on Pub-Brain-5 brain MRI benchmark; +8.3% and +1.7% macro AUC on CQ500 and RSNA head CT benchmarks; +4.3% macro AUC on Rad-ChestCT benchmark when pre-trained on CT-RATE.

Conclusion: Direct pre-training on uncurated clinical datasets with hierarchical attention is a scalable and effective approach for language-image pre-training in 3D medical imaging, demonstrating strong performance and generalizability.

Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the radiologist’s workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/Zch0414/hlip.

[139] CoreEditor: Correspondence-constrained Diffusion for Consistent 3D Editing

Zhe Zhu, Honghua Chen, Peng Li, Mingqiang Wei

Main category: cs.CV

TL;DR: CoreEditor: A novel framework for consistent text-driven 3D editing using correspondence-constrained attention and semantic similarity to maintain cross-view consistency

DetailsMotivation: Existing text-driven 3D editing approaches adapted from 2D image editors often fail to maintain cross-view consistency, leading to insufficient edits and blurry details due to lack of explicit control over multi-view information exchange.

Method: Introduces correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout diffusion denoising. Incorporates semantic similarity estimated during denoising for more reliable correspondence modeling. Also includes selective editing pipeline allowing users to choose preferred results from multiple candidates.

Result: Extensive experiments show CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

Conclusion: CoreEditor provides a robust framework for text-to-3D editing that maintains cross-view consistency through correspondence-constrained attention and semantic similarity modeling, offering both improved quality and user flexibility.

Abstract: Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

[140] Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Zhuoxu Huang, Mingqi Gao, Jungong Han

Main category: cs.CV

TL;DR: PLM bridges representation gap between LLMs and 3D point clouds using object-centric tokens and geometric reactivation for improved 3D segmentation.

DetailsMotivation: Current 3D object segmentation with LLMs suffers from representation misalignment: LLMs process semantic tokens while 3D point clouds convey only dense geometry. This misalignment limits both input (requires heavy pre-alignment) and output (loses fine-grained accuracy).

Method: Proposes Point Linguist Model (PLM) with two key components: 1) Object-centric Discriminative Representation (OcDR) learns object-centric tokens capturing target semantics and scene relations with hard negative-aware training, 2) Geometric Reactivation Decoder (GRD) predicts masks by combining OcDR tokens (carrying LLM-inferred geometry) with dense features.
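
The GRD's mask prediction can be caricatured as dot-product decoding between the LLM-refined object token and dense per-point features; the sigmoid threshold and shapes are assumptions.

```python
import torch

def decode_mask(dense_feats, object_token):
    """Sketch of dot-product mask decoding in the spirit of GRD: score each
    point by its similarity to the LLM-refined object-centric token.

    dense_feats:  (N, D) per-point features from the 3D backbone
    object_token: (D,)   object-centric token produced by the LLM
    """
    logits = dense_feats @ object_token          # (N,) per-point logits
    return torch.sigmoid(logits) > 0.5           # binary segmentation mask

mask = decode_mask(torch.randn(4096, 256), torch.randn(256))
```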

Result: Achieves significant improvements: +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks.

Conclusion: PLM effectively bridges the representation gap between LLMs and 3D point clouds without requiring large-scale pre-alignment, demonstrating comprehensive object-centric reasoning for robust 3D understanding.

Abstract: 3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.

[141] Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

Main category: cs.CV

TL;DR: A novel inference-time search algorithm that guides diffusion model sampling using side information to improve reconstruction quality for inverse problems.

DetailsMotivation: Existing diffusion-based inverse problem solvers overlook valuable side information that could significantly enhance reconstruction quality, especially in severely ill-posed settings.

Method: Proposes a plug-and-play inference-time search algorithm that guides diffusion sampling using diverse side information (reference images, text descriptions, anatomical scans) without requiring training.

Result: Consistently improves reconstruction quality across multiple diffusion-based solvers (DPS, DAPS, MPGD) for various inverse problems including inpainting, super-resolution, and deblurring tasks.

Conclusion: The search-based approach effectively incorporates side information to enhance diffusion-based inverse problem solving, outperforming other methods like reward gradient-based approaches.

Abstract: Diffusion models have been widely used as powerful priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel inference-time search algorithm that guides the sampling process using side information. Our framework can be added to existing diffusion-based reconstruction pipelines in a plug-and-play manner, without requiring any training. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. In order to demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. We also show that our search-based approach outperforms other ways of incorporating side information, including reward gradient-based methods. Code is available at https://github.com/mahdi-farahbakhsh/DISS.
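
In spirit, the search wraps an existing solver and prefers samples consistent with the side information. A minimal best-of-N rendering is sketched below; `solver` and `side_score` are hypothetical stand-ins, and the paper's algorithm searches during the sampling trajectory rather than merely re-ranking final outputs.

```python
# Best-of-N flavour of inference-time search: draw several candidate
# reconstructions from a stochastic diffusion solver and keep the one most
# consistent with the side information. All hooks here are hypothetical.
import torch

def search_with_side_info(solver, measurement, side_score, n_candidates=8):
    """solver(measurement) -> one candidate reconstruction [C, H, W].
    side_score(x) -> scalar consistency with the side information
    (e.g. CLIP similarity to a text description or reference image)."""
    best_x, best_s = None, -float("inf")
    for _ in range(n_candidates):
        x = solver(measurement)        # stochastic diffusion sampling
        s = side_score(x)
        if s > best_s:
            best_x, best_s = x, s
    return best_x

# Toy usage with dummy components.
solver = lambda y: y + 0.1 * torch.randn_like(y)
target = torch.zeros(3, 8, 8)
side_score = lambda x: -torch.norm(x - target).item()
x_hat = search_with_side_info(solver, torch.randn(3, 8, 8), side_score)
print(x_hat.shape)  # torch.Size([3, 8, 8])
```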

[142] LayerSync: Self-aligning Intermediate Layers

Yasaman Haghighi, Bastien van Delft, Mariam Hassan, Alexandre Alahi

Main category: cs.CV

TL;DR: LayerSync improves diffusion model training by using internal layer representations as self-supervision, accelerating training and improving quality across multiple modalities without external models or data.

DetailsMotivation: Prior work shows external guidance on diffusion model intermediate representations improves training, but this requires pretrained models or additional data. The authors aim to create a self-sufficient approach that leverages the model's own internal representations for regularization.

Method: LayerSync regularizes diffusion models using their own intermediate representations. It identifies semantically rich representations from certain layers and uses them as intrinsic guidance for weaker representations in other layers, creating a plug-and-play regularizer term.

Result: LayerSync accelerates training (8.75x speedup for flow-based transformer on ImageNet) and improves generation quality (23.6% improvement). It works across image, audio, video, and motion generation domains without needing pretrained models or additional data.

Conclusion: LayerSync provides an effective, domain-agnostic approach for improving diffusion model training efficiency and generation quality through self-supervised regularization using internal layer representations.

Abstract: We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of a flow-based transformer by over 8.75x on the ImageNet dataset and improve the generation quality by 23.6%. The code is available at https://github.com/vita-epfl/LayerSync.
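
The regularizer can be pictured as pulling a weaker layer's tokens toward a detached, semantically richer layer of the same network. The cosine-distance form, the layer pairing, and the weighting below are our assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of a LayerSync-style self-regularizer: align a weaker layer's
# representation with a richer layer of the same model (no external teacher).
import torch
import torch.nn.functional as F

def layersync_loss(feats_weak, feats_rich):
    """feats_*: [B, N, D] intermediate token representations; the richer
    layer is detached so it serves as the target, not the student."""
    w = F.normalize(feats_weak, dim=-1)
    r = F.normalize(feats_rich.detach(), dim=-1)
    return 1.0 - (w * r).sum(-1).mean()  # mean cosine distance

# Toy usage; in training this would be added to the denoising objective with
# a small weight, e.g. loss = denoise_loss + lam * layersync_loss(h3, h9).
print(layersync_loss(torch.randn(2, 16, 32), torch.randn(2, 16, 32)))
```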

[143] A Study on Inference Latency for Vision Transformers on Mobile Devices

Zhuojin Li, Marco Paolieri, Leana Golubchik

Main category: cs.CV

TL;DR: Quantitative study of 190 real-world vision transformers on mobile devices, comparing with 102 CNNs to understand latency factors and developing a dataset for latency prediction.

DetailsMotivation: As machine learning advances on mobile devices, particularly in computer vision, there's a need to understand the performance characteristics of vision transformers (ViTs) on mobile platforms compared to traditional CNNs, and to develop tools for predicting their inference latency.

Method: 1) Quantitative analysis of 190 real-world ViTs and 102 CNNs on mobile devices; 2) Identification of factors influencing ViT latency; 3) Creation of a dataset with measured latencies of 1000 synthetic ViTs across two ML frameworks and six mobile platforms; 4) Development of latency prediction models for new ViTs.

Result: The study provides insights into ViT latency factors on mobile devices, creates a comprehensive latency dataset, and demonstrates that inference latency of new ViTs can be accurately predicted for real-world applications.

Conclusion: This work enables better understanding and prediction of vision transformer performance on mobile devices, supporting more efficient deployment of ViT-based computer vision applications in mobile environments.

Abstract: Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

[144] Improving segmentation of retinal arteries and veins using cardiac signal in Doppler holograms

Marius Dubosc, Yann Fischer, Zacharie Auray, Nicolas Boutry, Edwin Carlinet, Michael Atlan, Thierry Geraud

Main category: cs.CV

TL;DR: A method for artery-vein segmentation in temporal Doppler holograms using U-Nets enhanced with pulse analysis features to exploit temporal dynamics.

DetailsMotivation: Traditional retinal artery-vein segmentation methods focus only on spatial information and miss the temporal richness available in Doppler holography data, which captures dynamic blood flow behavior.

Method: Proposes using standard U-Net segmentation architectures enhanced with features derived from a dedicated pulse analysis pipeline to incorporate temporal dynamics from Doppler holograms.

Result: The approach achieves performance comparable to more complex attention- or iteration-based models, demonstrating that time-resolved preprocessing can unlock deep learning potential for Doppler holography.

Conclusion: Temporal preprocessing enables conventional segmentation models to effectively exploit Doppler holography’s temporal dynamics, opening new possibilities for quantitative retinal hemodynamics analysis.

Abstract: Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/

[145] INQUIRE-Search: Interactive Discovery in Large-Scale Biodiversity Databases

Edward Vendrow, Julia Chae, Rupa Kurinchi-Vendhan, Isaac Eckert, Jazlynn Hall, Marta Jarzyna, Reymond Miyajima, Ruth Oliver, Laura Pollock, Lauren Shrack, Scott Yanco, Oisin Mac Aodha, Sara Beery

Main category: cs.CV

TL;DR: INQUIRE-Search is an AI-powered system that uses natural language to search and analyze ecological phenomena in biodiversity image databases like iNaturalist, enabling scalable scientific discovery from community science data.

DetailsMotivation: Community science platforms contain hundreds of millions of biodiversity images with evidence of complex ecological phenomena, but current manual inspection workflows make this information largely inaccessible at scale.

Method: Developed an open-source system that uses natural language processing to search ecological image databases, verify relevant observations, export data, and enable downstream scientific analysis through five illustrative case studies.

Result: INQUIRE-Search concentrates relevant observations 3-25x more efficiently than comparable manual inspection budgets across case studies, enabling ecological inference from seasonal behavior variation to forest regrowth after wildfires.

Conclusion: The system represents a new paradigm for interactive, efficient, and scalable scientific discovery that unlocks previously inaccessible value in biodiversity datasets, requiring reframing of scientific processes including experiment design and uncertainty analysis.

Abstract: Many ecological questions center on complex phenomena, such as species interactions, behaviors, phenology, and responses to disturbance, that are inherently difficult to observe and sparsely documented. Community science platforms such as iNaturalist contain hundreds of millions of biodiversity images, which often contain evidence of these complex phenomena. However, current workflows that seek to discover and analyze this evidence often rely on manual inspection, leaving this information largely inaccessible at scale. We introduce INQUIRE-Search, an open-source system that uses natural language to enable scientists to rapidly search within an ecological image database like iNaturalist for specific phenomena, verify and export relevant observations, and use these outputs for downstream scientific analysis. Across five illustrative case studies, INQUIRE-Search concentrates relevant observations 3-25x more efficiently than comparable manual inspection budgets. These case studies demonstrate how the system can be used for ecological inference, from analyzing seasonal variation in behavior across species to tracking forest regrowth after wildfires, and illustrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we highlight how AI-enabled discovery tools for science require reframing aspects of the scientific process, including experiment design, data collection, survey effort, and uncertainty analysis.
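
The retrieval core of such a system can be approximated with standard contrastive text-to-image search. In the toy sketch below, `embed_text` stands in for any CLIP-style text encoder and `image_bank` for pre-computed image embeddings; everything distinctive about INQUIRE-Search (verification, export, downstream analysis) happens after this step.

```python
# Sketch of the retrieval core: embed a natural-language query and rank a
# pre-embedded image collection by cosine similarity. Encoders are dummies.
import torch
import torch.nn.functional as F

def search(query, embed_text, image_bank, k=5):
    """image_bank: [N, D] pre-computed, L2-normalized image embeddings."""
    q = F.normalize(embed_text(query), dim=-1)
    scores = image_bank @ q             # cosine similarity per image
    top = torch.topk(scores, k)
    return top.indices, top.values      # candidates for manual verification

# Toy usage with a dummy text encoder and a random embedding bank.
embed_text = lambda s: torch.randn(64)
bank = F.normalize(torch.randn(10_000, 64), dim=-1)
idx, sims = search("bear foraging after a wildfire", embed_text, bank)
print(idx.tolist())
```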

[146] Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

Main category: cs.CV

TL;DR: MGLL is a contrastive learning framework that improves multi-label and cross-granularity alignment in vision-language models, particularly for complex domains like medical imaging.

DetailsMotivation: Current vision-language models like CLIP focus on single-label, single-granularity alignment, which is insufficient for complex domains like medical imaging where images have multiple labels across different annotation granularities.

Method: Proposes Multi-Granular Language Learning (MGLL) framework with structured multi-label supervision, integration of textual descriptions across granularities, soft-label supervision with point-wise constraints, and smooth KL divergence for cross-granularity consistency.

Result: MGLL outperforms state-of-the-art methods on downstream tasks when pretrained on large-scale multi-granular datasets.

Conclusion: MGLL effectively addresses limitations of current vision-language models for complex multi-label, multi-granularity alignment tasks, particularly in specialized domains like medical imaging.

Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
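
One way to picture the cross-granularity consistency idea: fine-grained predictions, aggregated up to their coarse parents, should agree with direct coarse predictions under a smoothed, symmetric KL divergence. The grouping scheme and smoothing constant below are our own illustrative choices, not MGLL's exact formulation.

```python
# Sketch of a smoothed, symmetric KL consistency term across label
# granularities; the fine-to-coarse grouping here is illustrative.
import torch
import torch.nn.functional as F

def cross_granularity_kl(p_fine, p_coarse, groups, eps=1e-4):
    """p_fine: [B, F_cls] probabilities over fine labels.
    p_coarse: [B, C_cls] probabilities over coarse labels.
    groups: [F_cls] long tensor mapping each fine label to its parent."""
    agg = torch.zeros_like(p_coarse).index_add_(1, groups, p_fine)
    agg = (agg + eps) / (agg + eps).sum(-1, keepdim=True)      # smooth
    pc = (p_coarse + eps) / (p_coarse + eps).sum(-1, keepdim=True)
    kl = lambda a, b: (a * (a / b).log()).sum(-1).mean()
    return 0.5 * (kl(agg, pc) + kl(pc, agg))                   # symmetric

p_fine = F.softmax(torch.randn(2, 6), -1)
p_coarse = F.softmax(torch.randn(2, 2), -1)
groups = torch.tensor([0, 0, 0, 1, 1, 1])  # three fine labels per parent
print(cross_granularity_kl(p_fine, p_coarse, groups))
```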

[147] Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis

Yilan Zhang, Li Nanbo, Changchun Yang, Jürgen Schmidhuber, Xin Gao

Main category: cs.CV

TL;DR: SlotSPE: A slot-based framework for modeling structural prognostic events in cancer by compressing multimodal histology and genomic data into distinctive slot representations to improve survival prediction.

DetailsMotivation: Current multimodal approaches struggle with efficient modeling of intra- and inter-modal interactions due to high-dimensional inputs, and fail to capture sparse, patient-specific prognostic events that determine patient outcomes despite being unannotated.

Method: SlotSPE uses slot attention to compress each patient’s multimodal inputs into compact, modality-specific sets of mutually distinctive slots, inspired by factorial coding principles. These slot representations encode prognostic events and enable efficient modeling of complex interactions while incorporating biological priors.

Result: Outperforms existing methods on 8 out of 10 cancer benchmarks with 2.9% overall improvement, remains robust under missing genomic data, and provides improved interpretability through structured event decomposition.

Conclusion: SlotSPE effectively addresses the challenge of capturing sparse prognostic events in multimodal cancer data, demonstrating superior performance, robustness, and interpretability for survival prediction.

Abstract: The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient’s multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.
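
SlotSPE builds on slot attention, which compresses a large token set into a few competing slots. A minimal slot-attention sketch (after Locatello et al.) is shown below; sizes and iteration counts are illustrative, and SlotSPE's mutual-distinctiveness objective and biological priors sit on top of this basic mechanism.

```python
# Minimal slot-attention sketch: slots compete (softmax over slots) for
# input tokens and are refined with a GRU. Hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.iters = iters
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, inputs):  # inputs: [B, N, D] patch or gene tokens
        B, N, D = inputs.shape
        slots = self.slots_mu.expand(B, -1, -1).contiguous()
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = F.softmax(torch.einsum("bkd,bnd->bkn", q, k) * self.scale,
                             dim=1)                    # competition over slots
            attn = attn / attn.sum(-1, keepdim=True)   # normalize per slot
            updates = torch.einsum("bkn,bnd->bkd", attn, v)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).reshape(B, -1, D)
        return slots  # [B, num_slots, D] compact, modality-specific slots

sa = SlotAttention(num_slots=4, dim=32)
print(sa(torch.randn(2, 100, 32)).shape)  # torch.Size([2, 4, 32])
```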

[148] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection

Ryan Banks, Camila Lindoni Azevedo, Hongying Tang, Yunpeng Li

Main category: cs.CV

TL;DR: Hierarchical semantic segmentation framework for dental imaging that embeds anatomical structure through recurrent level-wise prediction with feature conditioning and consistency constraints.

DetailsMotivation: Existing hierarchy-aware segmentation methods provide weak supervision through loss functions only. Need explicit anatomical hierarchy encoding for better dental disease staging and anatomical structure understanding.

Method: Recurrent level-wise prediction with restrictive output heads and top-down feature conditioning using FiLM. Backbone re-run at each tree depth with previous level logits. Probabilistic composition rule enforces parent-child consistency. Hierarchical loss combines per-level Dice, cross entropy, and consistency terms.

Result: Hierarchical variants consistently increase IoU, Dice, and recall, especially for fine-grained anatomies, producing more anatomically coherent masks. However, increased recall over precision implies more false positives. Validated on TL-pano dataset with 194 panoramic radiographs.

Conclusion: Explicit hierarchical structuring improves both performance and clinical plausibility in low-data dental imaging regimes, demonstrating value of anatomical hierarchy encoding.

Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine-grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. The hierarchical loss combines per-level class-weighted Dice and cross-entropy losses with a consistency term that ensures parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross-validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, the hierarchical variants also favour recall over precision, implying more false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low-data dental imaging regimes.
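
The two hierarchy mechanisms, FiLM conditioning of child features on parent probabilities and a parent-equals-sum-of-children consistency penalty, can be sketched compactly. Shapes and module names below are our assumptions, not the paper's exact architecture.

```python
# Sketch of FiLM conditioning from parent probabilities plus a parent-child
# consistency penalty; module shapes are illustrative assumptions.
import torch
import torch.nn as nn

class FiLMFromParent(nn.Module):
    def __init__(self, n_parents, channels):
        super().__init__()
        self.to_gamma = nn.Conv2d(n_parents, channels, 1)
        self.to_beta = nn.Conv2d(n_parents, channels, 1)

    def forward(self, child_feats, parent_probs):
        """child_feats: [B, C, H, W]; parent_probs: [B, P, H, W]."""
        return (self.to_gamma(parent_probs) * child_feats
                + self.to_beta(parent_probs))

def consistency_loss(parent_probs, child_probs, children_of):
    """children_of: dict parent_index -> list of child indices; penalizes
    deviation of each parent's probability from the sum of its children."""
    loss = 0.0
    for p, kids in children_of.items():
        loss = loss + (parent_probs[:, p]
                       - child_probs[:, kids].sum(1)).abs().mean()
    return loss / len(children_of)

film = FiLMFromParent(n_parents=2, channels=16)
feats = film(torch.randn(1, 16, 8, 8), torch.rand(1, 2, 8, 8))
print(feats.shape)  # torch.Size([1, 16, 8, 8])
```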

[149] Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Main category: cs.CV

TL;DR: Vision Transformers exhibit block-recurrent depth structure where computation can be approximated with far fewer distinct blocks applied recurrently, enabling dynamical systems analysis.

DetailsMotivation: To provide a mechanistic account of Vision Transformers' computational phenomenology by interpreting their depth as a well-characterized dynamical flow rather than just architectural structure.

Method: Proposes Block-Recurrent Hypothesis (BRH) and trains Recurrent Approximations to Phase-structured TransfORmers (Raptor) - block-recurrent surrogates of pretrained ViTs using only k ≪ L distinct blocks applied recurrently.

Result: Raptor models recover 96% of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent runtime. Analysis reveals directional convergence, token-specific dynamics, and low-rank updates in late depth.

Conclusion: Vision Transformers exhibit compact recurrent programs along depth, enabling study through principled dynamical systems analysis and revealing low-complexity normative solutions.

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). At small scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover 96% of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent runtime. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
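
The block-recurrent rewriting itself is easy to picture: k distinct blocks, each repeated over a contiguous phase so the effective depth is unchanged. The sketch below uses a generic transformer layer and an illustrative phase schedule; it is not the Raptor training recipe, which distills from a pretrained ViT.

```python
# Sketch of a block-recurrent trunk: 2 distinct blocks stand in for a
# 12-block ViT by being applied recurrently over contiguous phases.
import torch
import torch.nn as nn

class BlockRecurrentTrunk(nn.Module):
    def __init__(self, dim, total_depth=12, phases=(6, 6)):
        super().__init__()
        assert sum(phases) == total_depth
        self.phases = phases
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in phases
        )  # k = len(phases) distinct blocks instead of total_depth

    def forward(self, x):                 # x: [B, N, D]
        for block, reps in zip(self.blocks, self.phases):
            for _ in range(reps):         # apply the same block recurrently
                x = block(x)
        return x

trunk = BlockRecurrentTrunk(dim=64)       # 2 blocks standing in for 12
print(trunk(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```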

[150] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

Main category: cs.CV

TL;DR: CODA enhances object-centric learning with diffusion models by adding register slots to reduce slot entanglement and using contrastive alignment to improve slot-image correspondence.

DetailsMotivation: Slot Attention with diffusion models suffers from slot entanglement and weak alignment between object slots and image content, limiting object-centric learning performance.

Method: Proposes Contrastive Object-centric Diffusion Alignment (CODA) with two key components: (1) register slots to absorb residual attention and reduce interference between object slots, (2) contrastive alignment loss to explicitly encourage slot-image correspondence.

Result: Improves object discovery (+6.1% FG-ARI on COCO), property prediction, and compositional image generation on synthetic (MOVi-C/E) and real-world datasets (VOC, COCO). Register slots add negligible overhead.

Conclusion: CODA provides an effective framework for robust object-centric learning in complex, real-world scenes with potential applications in multimodal understanding.

Abstract: Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate potential applications of CODA as an effective framework for robust OCL in complex, real-world scenes. Code and pretrained models are available at https://github.com/sony/coda.
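
The contrastive alignment component resembles a standard InfoNCE objective between pooled slots and a global image embedding, which is how the surrogate for slot-input mutual information is usually realized. The pooling and temperature below are our own choices, offered only to make the idea concrete.

```python
# Sketch of a contrastive slot-image alignment loss: pooled slots should
# match their own image's embedding and differ from other images in batch.
import torch
import torch.nn.functional as F

def slot_image_contrastive(slots, img_emb, temperature=0.07):
    """slots: [B, K, D] object slots (register slots already dropped);
    img_emb: [B, D] global image embedding."""
    z = F.normalize(slots.mean(1), dim=-1)   # pool slots per image
    g = F.normalize(img_emb, dim=-1)
    logits = z @ g.T / temperature            # [B, B] similarity matrix
    targets = torch.arange(z.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

print(slot_image_contrastive(torch.randn(4, 6, 32), torch.randn(4, 32)))
```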

[151] Tuning-free Visual Effect Transfer across Videos

Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: RefVFX is a framework for transferring complex temporal effects from reference videos to target videos/images using a feed-forward approach, addressing limitations of text-based editing for dynamic effects.

DetailsMotivation: Existing video editing methods struggle with dynamic temporal effects like lighting changes or character transformations that are difficult to describe via text prompts or static keyframes. There's a need for reference-based temporal effect transfer that preserves input motion while applying complex temporal dynamics.

Method: 1) Creates a large-scale dataset of triplets (reference effect video, input image/video, output video) using an automated pipeline for video-to-video effects, augmented with image-to-video effects from LoRA adapters and code-based temporal effects. 2) Trains a reference-conditioned model using recent text-to-video backbones on this dataset.

Result: RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference evaluations.

Conclusion: The framework successfully enables transfer of complex temporal effects from reference videos to target content, addressing limitations of text-based editing for dynamic temporal phenomena.

Abstract: We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video’s existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input’s motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://snap-research.github.io/RefVFX/

[152] Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation

Sanjana Reddy, Ishaan Malhi, Sally Ma, Praneet Dutta

Main category: cs.CV

TL;DR: Di3PO is a novel method for constructing positive/negative image pairs for preference tuning of text-to-image diffusion models that isolates specific regions for improvement while keeping surrounding context stable, addressing issues with existing methods.

DetailsMotivation: Existing preference tuning methods for T2I diffusion models rely on expensive generation steps that often yield training pairs lacking meaningful differences, having irrelevant pixel variations, or being costly to sample/filter, degrading training efficiency.

Method: Di3PO constructs positive and negative pairs by isolating specific regions targeted for improvement during preference tuning while keeping the surrounding image context stable, enabling more focused and efficient training.

Result: The method demonstrates efficacy on the challenging task of text rendering in diffusion models, showing improvements over baseline methods of supervised fine-tuning (SFT) and direct preference optimization (DPO).

Conclusion: Di3PO provides a more efficient and effective approach to preference tuning for text-to-image diffusion models by addressing key limitations in existing training pair construction methods.

Abstract: Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce “Di3PO”, a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.

[153] Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance

Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova

Main category: cs.CV

TL;DR: ForgeryEraser is an anti-forensics attack framework that exploits vulnerabilities in AIGC detectors by manipulating image embeddings in VLM feature space to erase forgery traces without needing access to target detectors.

DetailsMotivation: Existing AIGC authenticity assessment protocols overlook anti-forensics attacks, failing to ensure comprehensive robustness of detectors in real-world applications. There's a need to test detector vulnerabilities to ensure they can withstand adversarial manipulation.

Method: Exploits systemic reliance on VLMs (like CLIP) as shared backbones in AIGC detectors. Uses multi-modal guidance loss to drive forged image embeddings toward text-derived authentic anchors in VLM feature space while repelling from forgery anchors, rather than traditional logit-based optimization.

Result: Causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Induces explainable forensic models to generate explanations consistent with authentic images for forged images.

Conclusion: Reveals adversarial vulnerability in AIGC detectors due to reliance on publicly accessible VLMs as backbones. Demonstrates need for more robust detector architectures that can withstand anti-forensics attacks.

Abstract: The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
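
The guidance objective can be caricatured as attract-toward-authentic, repel-from-forgery in a shared embedding space, followed by a gradient step on the pixels. The prompts, dummy encoder, and single FGSM-style update below are illustrative assumptions, not the paper's optimization procedure.

```python
# Sketch of a multi-modal guidance loss: move an image embedding toward
# text-derived "authentic" anchors and away from "forgery" anchors, then take
# one adversarial step on the pixels. All components here are toy stand-ins.
import torch
import torch.nn.functional as F

def guidance_loss(img_emb, authentic_anchors, forgery_anchors):
    """img_emb: [D]; anchors: [A, D] text embeddings (in practice from a
    VLM text encoder on prompts like 'a real photograph')."""
    z = F.normalize(img_emb, dim=-1)
    pos = F.normalize(authentic_anchors, dim=-1)
    neg = F.normalize(forgery_anchors, dim=-1)
    return -(z @ pos.T).mean() + (z @ neg.T).mean()  # attract vs. repel

# One illustrative gradient step on the pixels via a dummy encoder.
encoder = torch.nn.Linear(3 * 8 * 8, 32)
x = torch.randn(3, 8, 8, requires_grad=True)
loss = guidance_loss(encoder(x.flatten()),
                     torch.randn(2, 32), torch.randn(2, 32))
loss.backward()
x_adv = (x - 0.01 * x.grad.sign()).detach()  # FGSM-style pixel update
print(x_adv.shape)
```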

[154] VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan

Main category: cs.CV

TL;DR: VFace is a training-free, plug-and-play method for high-quality face swapping in videos that integrates with diffusion-based image face swapping approaches using frequency spectrum attention, target structure guidance, and flow-guided temporal smoothing.

DetailsMotivation: Existing video face swapping methods often suffer from temporal inconsistencies and visual artifacts when applied frame-by-frame. The authors aim to create a practical solution that enhances temporal coherence without requiring additional training or video-specific fine-tuning.

Method: Three key techniques: 1) Frequency Spectrum Attention Interpolation to preserve identity characteristics, 2) Target Structure Guidance via attention injection to align structural features, and 3) Flow-Guided Attention Temporal Smoothening to enforce spatiotemporal coherence without modifying the underlying diffusion model.

Result: Extensive experiments show VFace significantly enhances temporal consistency and visual fidelity compared to frame-wise generation approaches, offering a practical modular solution for video face swapping.

Conclusion: VFace provides an effective training-free, plug-and-play method for high-quality video face swapping that seamlessly integrates with existing diffusion-based image face swapping approaches while addressing temporal coherence issues.

Abstract: We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
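
Frequency-domain blending of two sources gives the flavor of the spectrum-interpolation idea: keep one source's low-frequency band and the other's high frequencies. The cutoff, the hard radial mask, and applying it to raw tensors rather than attention features are our own simplifications, not VFace's formulation.

```python
# Illustrative frequency-domain blend: low frequencies from source `a`,
# high frequencies from source `b`. All parameters are illustrative.
import torch

def spectrum_interpolate(a, b, cutoff=0.25):
    """a, b: [C, H, W] real tensors; returns a real tensor whose
    low-frequency band comes from `a` and high-frequency band from `b`."""
    Fa = torch.fft.fftshift(torch.fft.fft2(a), dim=(-2, -1))
    Fb = torch.fft.fftshift(torch.fft.fft2(b), dim=(-2, -1))
    C, H, W = a.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    radius = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    low = (radius <= cutoff * min(H, W) / 2).to(a.dtype)  # centered disk
    mixed = Fa * low + Fb * (1 - low)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

out = spectrum_interpolate(torch.randn(4, 16, 16), torch.randn(4, 16, 16))
print(out.shape)  # torch.Size([4, 16, 16])
```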

[155] Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon

Main category: cs.CV

TL;DR: Motion Prior Distillation (MPD) is an inference-time technique that improves video inbetweening by distilling forward motion priors into backward paths to reduce temporal discontinuities in I2V diffusion models.

DetailsMotivation: Existing inference-time sampling methods for image-to-video inbetweening suffer from temporal discontinuities and visual artifacts due to misalignment between forward and backward generated paths, as each path follows different motion priors from their conditioning frames.

Method: Proposes Motion Prior Distillation (MPD), an inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path, avoiding denoising ambiguity in end-conditioned paths.

Result: The method yields more temporally coherent inbetweening results with forward motion prior, validated through quantitative evaluations on standard benchmarks and extensive user studies in practical scenarios.

Conclusion: MPD is a simple yet effective inference-time approach that improves temporal coherence in video inbetweening by aligning motion priors between forward and backward generation paths.

Abstract: Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method deliberately avoids denoising the end-conditioned path, which would otherwise introduce ambiguity into the path, and yields more temporally coherent inbetweening results guided by the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

[156] VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Main category: cs.CV

TL;DR: VisPhyWorld is an execution-based framework that evaluates MLLMs’ physical reasoning by requiring them to generate executable simulator code from visual observations, making world representations inspectable and falsifiable.

DetailsMotivation: Current benchmarks for evaluating MLLMs' physical reasoning rely on recognition-style protocols (VQA, VoE) that can be answered without explicit physical hypotheses, making it hard to assess genuine physical understanding.

Method: Proposes VisPhyWorld framework requiring models to generate executable simulator code from visual observations. Introduces VisPhyBench with 209 evaluation scenes from 108 physical templates, evaluating appearance reconstruction and physically plausible motion reproduction.

Result: Pipeline produces valid reconstructed videos in 97.7% of benchmark cases. Experiments show state-of-the-art MLLMs achieve strong semantic scene understanding but struggle with accurate physical parameter inference and consistent physical dynamics simulation.

Conclusion: Execution-based evaluation through code generation provides a more rigorous test of physical reasoning in MLLMs, revealing significant gaps in their ability to model physical dynamics despite strong semantic understanding.

Abstract: Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of cases on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics.

[157] CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu

Main category: cs.CV

TL;DR: CT-Bench is a comprehensive benchmark dataset for AI-based lesion analysis on CT scans, featuring 20,335 lesions with annotations and 2,850 QA pairs for evaluating multimodal models on lesion localization, description, and attribute categorization.

DetailsMotivation: Progress in AI for medical imaging is limited by scarce publicly available CT datasets with lesion-level annotations. The authors aim to bridge this gap by creating a benchmark that enables evaluation of multimodal models on comprehensive lesion analysis tasks.

Method: Created CT-Bench with two components: 1) Lesion Image and Metadata Set with 20,335 lesions from 7,795 CT studies containing bounding boxes, descriptions, and size information; 2) Multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization, including hard negative examples.

Result: Evaluated multiple state-of-the-art multimodal models (vision-language and medical CLIP variants) against radiologist assessments. Fine-tuning models on the Lesion Image and Metadata Set yielded significant performance gains across both benchmark components.

Conclusion: CT-Bench serves as a valuable comprehensive benchmark for lesion analysis, demonstrating clinical utility through improved model performance when fine-tuned on the dataset. The benchmark addresses real-world diagnostic challenges through inclusion of hard negative examples.

Abstract: Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.

[158] Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization

Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu

Main category: cs.CV

TL;DR: E²D is a dataset distillation method that uses exploration-exploitation optimization to bridge the accuracy-efficiency gap in large-scale dataset distillation, achieving state-of-the-art results on ImageNet benchmarks with significant speed improvements.

DetailsMotivation: Current dataset distillation methods face a trade-off: optimization-based methods are accurate but computationally expensive, while optimization-free methods are efficient but less accurate. The authors aim to overcome this efficiency-accuracy gap for large-scale dataset distillation.

Method: E²D uses full-image initialization to preserve semantic integrity, followed by a two-phase optimization strategy: exploration phase with uniform updates to identify high-loss regions, and exploitation phase focusing updates on those regions to accelerate convergence while minimizing redundant computation.

Result: E²D achieves state-of-the-art performance on ImageNet-1K while being 18× faster than previous methods, and substantially improves accuracy on ImageNet-21K while remaining 4.3× faster.

Conclusion: Targeted, redundancy-reducing updates (rather than brute-force optimization) can bridge the accuracy-efficiency gap in large-scale dataset distillation, making E²D a practical solution for resource-constrained scenarios.

Abstract: Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration–Exploitation Distillation (E$^2$D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E$^2$D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being $18\times$ faster, and on ImageNet-21K, our method substantially improves accuracy while remaining $4.3\times$ faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab/E2D.
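
The exploration-exploitation schedule reduces to: spend part of the update budget sweeping all synthetic images uniformly while recording losses, then concentrate the remainder on the highest-loss images. The budget split, top fraction, and hooks `loss_fn`/`step_fn` below are illustrative knobs, not the paper's values.

```python
# Sketch of a two-phase exploration-exploitation update schedule over a set
# of synthetic images; the hooks into the distillation loss are hypothetical.
import torch

def explore_exploit(synthetic, loss_fn, step_fn, budget=1000,
                    explore_frac=0.3, top_frac=0.25):
    """synthetic: list of per-image tensors; loss_fn(img) -> scalar;
    step_fn(img, loss) applies one in-place optimization update."""
    n = len(synthetic)
    losses = torch.zeros(n)
    explore_steps = int(budget * explore_frac)
    for t in range(explore_steps):              # phase 1: uniform sweep
        i = t % n
        losses[i] = loss_fn(synthetic[i])
        step_fn(synthetic[i], losses[i])
    hot = torch.topk(losses, max(1, int(n * top_frac))).indices
    for t in range(budget - explore_steps):     # phase 2: high-loss focus
        i = hot[t % len(hot)].item()
        step_fn(synthetic[i], loss_fn(synthetic[i]))

imgs = [torch.randn(3, 4, 4) for _ in range(8)]
explore_exploit(imgs, lambda x: x.pow(2).mean(),
                lambda x, l: x.mul_(0.99), budget=40)
print(imgs[0].abs().mean())  # images shrink under the toy squared loss
```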

[159] Intracoronary Optical Coherence Tomography Image Processing and Vessel Classification Using Machine Learning

Amal Lahchim, Lambros Athanasiou

Main category: cs.CV

TL;DR: Automated pipeline for vessel segmentation and classification in intracoronary OCT images using machine learning with preprocessing, artifact removal, clustering, and classification achieving near-perfect accuracy.

DetailsMotivation: Intracoronary OCT provides high-resolution coronary vessel visualization but faces challenges from noise, imaging artifacts, and complex tissue structures, requiring automated analysis solutions.

Method: Integrated pipeline with image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, local feature extraction, and classification using Logistic Regression and SVM for pixel-wise vessel classification.

Result: Achieved precision, recall, and F1-score values up to 1.00 with overall classification accuracy of 99.68%, demonstrating excellent performance with low computational complexity.

Conclusion: Provides reliable and efficient automated OCT image analysis with potential for clinical decision support and real-time medical image processing applications.

Abstract: Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
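
Because the pipeline is built from classical components, it maps almost directly onto scikit-learn. The sketch below uses synthetic stand-in features and labels; the actual pipeline additionally removes guidewire artifacts and converts polar scans to Cartesian coordinates before this stage.

```python
# Rough scikit-learn rendering of the classical pipeline: K-means cluster
# ids appended to per-pixel features, then a logistic-regression classifier.
# The features and labels are synthetic stand-ins for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))                 # per-pixel feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in vessel labels

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
X_aug = np.hstack([X, np.eye(4)[km.labels_]])  # append one-hot cluster id

clf = LogisticRegression(max_iter=1000).fit(X_aug[:4000], y[:4000])
print(classification_report(y[4000:], clf.predict(X_aug[4000:])))
```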

[160] Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang

Main category: cs.CV

TL;DR: AMAA framework improves monocular 3D Semantic Scene Completion for indoor assistive perception by addressing voxel-feature reliability and cross-scale fusion issues through attention mechanisms and adaptive feature gating.

DetailsMotivation: Existing monocular SSC approaches lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation, making them vulnerable to projection diffusion and feature entanglement, which limits structural stability needed for safety-critical scene understanding in assistive systems for visually impaired users.

Method: Proposes Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon MonoScene pipeline. Uses parallel channel-spatial attention aggregation to jointly calibrate lifted voxel features in semantic and spatial dimensions, and implements hierarchical adaptive feature-gating strategy to regulate information injection across scales during multi-scale encoder-decoder fusion.

Result: On NYUv2 benchmark: achieves 27.25% SSC mIoU (+0.31 improvement) and 43.10% SC IoU (+0.59 improvement). System-level deployment on NVIDIA Jetson platform verifies stable execution on embedded hardware without significantly increasing system complexity.

Conclusion: AMAA improves monocular SSC quality and provides a reliable, deployable perception framework for indoor assistive systems targeting visually impaired users, addressing key limitations in feature reliability and cross-scale fusion while maintaining computational efficiency.

Abstract: In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
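
A parameter-free channel-spatial gate can be rendered in a few lines: channel weights from global statistics, spatial weights from cross-channel statistics, applied in parallel and averaged. This is our own minimal rendering of the idea, not the AMAA implementation.

```python
# Sketch of a parameter-free parallel channel-spatial gate over voxel
# features; the pooling statistics and averaging are illustrative choices.
import torch

def channel_spatial_gate(x):
    """x: [B, C, D, H, W] lifted voxel features."""
    ch = torch.sigmoid(x.mean(dim=(2, 3, 4), keepdim=True))  # [B, C, 1, 1, 1]
    sp = torch.sigmoid(x.mean(dim=1, keepdim=True))          # [B, 1, D, H, W]
    return 0.5 * (ch * x + sp * x)   # parallel aggregation of both branches

x = torch.randn(2, 16, 8, 8, 8)
print(channel_spatial_gate(x).shape)  # torch.Size([2, 16, 8, 8, 8])
```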

cs.AI

[161] AIdentifyAGE Ontology for Decision Support in Forensic Dental Age Assessment

Renato Marcelo, Ana Rodrigues, Cristiana Palmela Pereira, António Figueiras, Rui Santos, José Rui Figueira, Alexandre P Francisco, Cátia Vaz

Main category: cs.AI

TL;DR: AIdentifyAGE ontology provides a standardized semantic framework for forensic dental age assessment, integrating manual and AI-assisted workflows with judicial context for enhanced transparency and reproducibility.

DetailsMotivation: Current dental age assessment practices face challenges including methodological heterogeneity, fragmented data representation, and limited interoperability between clinical, forensic, and legal systems, which hinder transparency and reproducibility, especially with increasing AI adoption.

Method: Development of a domain-specific ontology that models the complete medico-legal workflow, integrating judicial context, individual information, forensic examination data, dental developmental assessment methods, radiographic imaging, statistical reference studies, and AI-based estimation methods, built on established biomedical, dental, and machine learning ontologies.

Result: The AIdentifyAGE ontology provides a standardized, semantically coherent framework that enables traceable linkage between observations, methods, reference data, and reported outcomes, ensuring interoperability, extensibility, and FAIR principles compliance.

Conclusion: The ontology enhances consistency, transparency, and explainability in forensic dental age assessment, establishing a foundation for ontology-driven decision support systems in medico-legal contexts.

Abstract: Age assessment is crucial in forensic and judicial decision-making, particularly in cases involving undocumented individuals and unaccompanied minors, where legal thresholds determine access to protection, healthcare, and judicial procedures. Dental age assessment is widely recognized as one of the most reliable biological approaches for adolescents and young adults, but current practices are challenged by methodological heterogeneity, fragmented data representation, and limited interoperability between clinical, forensic, and legal information systems. These limitations hinder transparency and reproducibility, amplified by the increasing adoption of AI-based methods. The AIdentifyAGE ontology is domain-specific and provides a standardized, semantically coherent framework, encompassing both manual and AI-assisted forensic dental age assessment workflows, and enabling traceable linkage between observations, methods, reference data, and reported outcomes. It models the complete medico-legal workflow, integrating judicial context, individual-level information, forensic examination data, dental developmental assessment methods, radiographic imaging, statistical reference studies, and AI-based estimation methods. It is being developed together with domain experts, and it builds on upper and established biomedical, dental, and machine learning ontologies, ensuring interoperability, extensibility, and compliance with FAIR principles. The AIdentifyAGE ontology is a fundamental step to enhance consistency, transparency, and explainability, establishing a robust foundation for ontology-driven decision support systems in medico-legal and judicial contexts.
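As a concrete illustration of the traceable linkage the ontology aims for, the following rdflib sketch encodes an observation-method-reference-outcome chain; the namespace and all class and property names are hypothetical stand-ins, not the actual AIdentifyAGE vocabulary.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical namespace and terms chosen for illustration only.
AGE = Namespace("http://example.org/aidentifyage#")

g = Graph()
g.bind("age", AGE)

# A traceable chain: observation -> assessment method -> reference study -> outcome.
g.add((AGE.obs1, RDF.type, AGE.RadiographicObservation))
g.add((AGE.obs1, AGE.assessedBy, AGE.demirjianMethod))
g.add((AGE.demirjianMethod, RDF.type, AGE.DentalDevelopmentMethod))
g.add((AGE.demirjianMethod, AGE.usesReferenceStudy, AGE.study42))
g.add((AGE.obs1, AGE.yieldsEstimate, AGE.estimate1))
g.add((AGE.estimate1, AGE.estimatedAgeYears, Literal(16.4)))
g.add((AGE.estimate1, AGE.reportedIn, AGE.caseReport7))

print(g.serialize(format="turtle"))
```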

[162] Retrieval Augmented (Knowledge Graph), and Large Language Model-Driven Design Structure Matrix (DSM) Generation of Cyber-Physical Systems

H. Sinan Bank, Daniel R. Herber

Main category: cs.AI

TL;DR: LLMs, RAG, and GraphRAG tested for automated Design Structure Matrix generation on two cyber-physical systems (a power screwdriver and a CubeSat), with evaluation on component relationship identification tasks.

DetailsMotivation: To explore automated methods for generating Design Structure Matrices (DSMs) which are crucial for system architecture analysis, aiming to reduce manual effort and improve efficiency in engineering design.

Method: Tested three approaches: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) on two use cases (power screwdriver and CubeSat). Evaluated performance on component relationship determination and component identification tasks.

Result: Despite design and computational challenges, the methods show potential for automated DSM generation. Performance was measured at both element-level and overall architecture level, with all code made publicly available.

Conclusion: Opportunities exist for automated DSM generation using LLM-based approaches, though challenges remain. Public code release enables reproducibility and further domain expert feedback.

Abstract: We explore the potential of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) for generating Design Structure Matrices (DSMs). We test these methods on two distinct use cases – a power screwdriver and a CubeSat with known architectural references – evaluating their performance on two key tasks: determining relationships between predefined components, and the more complex challenge of identifying components and their subsequent relationships. We measure the performance by assessing each element of the DSM and overall architecture. Despite design and computational challenges, we identify opportunities for automated DSM generation, with all code publicly available for reproducibility and further feedback from the domain experts.
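A DSM is a square component-by-component matrix whose off-diagonal entries mark relationships, so element-level evaluation reduces to comparing binary matrices. A toy sketch of that scoring step, with an illustrative component list and matrices not taken from the paper's use cases:

```python
import numpy as np

# Reference DSM vs. a generated DSM over four illustrative components.
components = ["motor", "gearbox", "battery", "housing"]
reference = np.array([[0, 1, 0, 1],
                      [1, 0, 0, 1],
                      [0, 0, 0, 1],
                      [1, 1, 1, 0]])
generated = np.array([[0, 1, 1, 1],
                      [1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [1, 0, 1, 0]])

mask = ~np.eye(len(components), dtype=bool)      # ignore self-relationships
tp = np.sum((reference == 1) & (generated == 1) & mask)
fp = np.sum((reference == 0) & (generated == 1) & mask)
fn = np.sum((reference == 1) & (generated == 0) & mask)
print(f"precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")
```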

[163] Contextuality from Single-State Representations: An Information-Theoretic Principle for Adaptive Intelligence

Song-Ju Kim

Main category: cs.AI

TL;DR: Contextuality is not unique to quantum mechanics but arises inevitably from single-state reuse in classical probabilistic systems, creating irreducible information-theoretic costs that nonclassical frameworks avoid by relaxing global joint probability assumptions.

DetailsMotivation: To understand the fundamental representational consequences of single-state reuse in adaptive systems, which is common in both natural and artificial intelligence but poorly understood in terms of its constraints on representation.

Method: Model contexts as interventions acting on a shared internal state, prove that classical models reproducing contextual outcome statistics incur irreducible information-theoretic costs, provide a minimal constructive example, and show how nonclassical frameworks avoid this by relaxing the assumption of a single global joint probability space.

Result: Contextuality is an inevitable consequence of single-state reuse in classical probabilistic representations, with dependence on context that cannot be mediated solely through internal state, creating irreducible information-theoretic costs.

Conclusion: Contextuality is a general representational constraint on adaptive intelligence independent of physical implementation, arising from fundamental limitations of classical probabilistic frameworks when reusing single states across multiple contexts.

Abstract: Adaptive systems often operate across multiple contexts while reusing a fixed internal state space due to constraints on memory, representation, or physical resources. Such single-state reuse is ubiquitous in natural and artificial intelligence, yet its fundamental representational consequences remain poorly understood. We show that contextuality is not a peculiarity of quantum mechanics, but an inevitable consequence of single-state reuse in classical probabilistic representations. Modeling contexts as interventions acting on a shared internal state, we prove that any classical model reproducing contextual outcome statistics must incur an irreducible information-theoretic cost: dependence on context cannot be mediated solely through the internal state. We provide a minimal constructive example that explicitly realizes this cost and clarifies its operational meaning. We further explain how nonclassical probabilistic frameworks avoid this obstruction by relaxing the assumption of a single global joint probability space, without invoking quantum dynamics or Hilbert space structure. Our results identify contextuality as a general representational constraint on adaptive intelligence, independent of physical implementation.
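The flavor of the obstruction can be reproduced with a tiny enumeration in the spirit of the paper's minimal constructive example (the paper's actual construction may differ): three binary observables measured in pairwise contexts, each reporting perfect anticorrelation, admit no context-independent joint assignment.

```python
from itertools import product

# Contexts are pairs of the observables A(0), B(1), C(2); each context
# reports perfect anticorrelation between its pair.
contexts = [(0, 1), (1, 2), (0, 2)]

def consistent(assignment):
    # A context-independent assignment must disagree within every pair.
    return all(assignment[i] != assignment[j] for i, j in contexts)

joint_models = [a for a in product([-1, 1], repeat=3) if consistent(a)]
print(joint_models)  # [] -- no single-state model reproduces all contexts
```

Because the list is empty, any classical model matching these statistics must let outcomes depend on the context itself rather than only on the shared internal state, which is exactly the irreducible cost the paper formalizes.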

[164] Mobility-Aware Cache Framework for Scalable LLM-Based Human Mobility Simulation

Hua Yan, Heng Tan, Yingxue Zhang, Yu Yang

Main category: cs.AI

TL;DR: MobCache: A mobility-aware cache framework using reconstructible caches and latent-space reasoning to enable efficient large-scale human mobility simulations with LLMs.

DetailsMotivation: Large-scale human mobility simulation is important for urban planning, epidemiology, and transportation, but current LLM-based approaches have high computational costs that limit scalability.

Method: MobCache has two components: (1) a reasoning component that encodes reasoning steps as latent-space embeddings with a latent-space evaluator for reuse/recombination, and (2) a decoding component with a lightweight decoder trained using mobility law-constrained distillation to translate latent reasoning chains to natural language.

Result: Experiments show MobCache significantly improves efficiency across multiple dimensions while maintaining performance comparable to state-of-the-art LLM-based methods.

Conclusion: MobCache enables efficient large-scale human mobility simulations by leveraging reconstructible caches and latent-space reasoning, addressing the scalability limitations of current LLM-based approaches.

Abstract: Large-scale human mobility simulation is critical for applications such as urban planning, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility behaviors using structured reasoning, but their high computational cost limits scalability. To address this, we design a mobility-aware cache framework named MobCache that leverages reconstructible caches to enable efficient large-scale human mobility simulations. It consists of: (1) a reasoning component that encodes each reasoning step as a latent-space embedding and uses a latent-space evaluator to enable the reuse and recombination of reasoning steps; and (2) a decoding component that employs a lightweight decoder trained with mobility law-constrained distillation to translate latent-space reasoning chains into natural language, thereby improving simulation efficiency while maintaining fidelity. Experiments show that MobCache significantly improves efficiency across multiple dimensions while maintaining performance comparable to state-of-the-art LLM-based methods.
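The cache-reuse idea can be sketched as a nearest-neighbor lookup over step embeddings; the cosine-similarity threshold below stands in for MobCache's learned latent-space evaluator, and the vectors are toy stand-ins.

```python
import numpy as np

class LatentCache:
    """Store reasoning steps as embeddings and reuse close matches (sketch)."""

    def __init__(self, threshold: float = 0.9):
        self.keys, self.values, self.threshold = [], [], threshold

    def lookup(self, query: np.ndarray):
        for key, value in zip(self.keys, self.values):
            sim = key @ query / (np.linalg.norm(key) * np.linalg.norm(query))
            if sim >= self.threshold:
                return value          # reuse a cached reasoning step
        return None                   # cache miss: run the LLM instead

    def store(self, key: np.ndarray, latent_step: np.ndarray):
        self.keys.append(key)
        self.values.append(latent_step)

cache = LatentCache()
cache.store(np.array([1.0, 0.0]), np.array([0.3, 0.7]))
print(cache.lookup(np.array([0.9, 0.1])))  # similar query hits the cache
```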

[165] When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

Main category: cs.AI

TL;DR: Analysis of benchmark saturation across 60 LLM benchmarks reveals nearly half show saturation, with expert-curated benchmarks resisting saturation better than crowdsourced ones, and test data hiding providing no protective effect.

DetailsMotivation: AI benchmarks are crucial for measuring progress but often become saturated, losing their ability to differentiate between top-performing models, which diminishes their long-term value for guiding model development and deployment decisions.

Method: Analyzed 60 LLM benchmarks from major model developers’ technical reports, characterized them along 14 properties spanning task design, data construction, and evaluation format, and tested five hypotheses about how each property contributes to saturation rates.

Result: Nearly half of benchmarks exhibit saturation, with saturation rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones, and hiding test data (public vs. private) shows no protective effect against saturation.

Conclusion: The study identifies which benchmark design choices extend longevity and informs strategies for creating more durable evaluation frameworks that maintain their ability to differentiate between state-of-the-art models over time.

Abstract: Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
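One simple way to operationalize saturation, offered here as an assumption rather than the paper's definition: a benchmark is saturated when the top-k models sit near the score ceiling and within a narrow band of each other.

```python
def is_saturated(scores, k=5, ceiling=100.0, headroom=2.0, spread=1.0):
    """Saturated if the top-k models are near the ceiling and bunched together."""
    top = sorted(scores, reverse=True)[:k]
    return (ceiling - top[0] <= headroom) and (top[0] - top[-1] <= spread)

print(is_saturated([99.1, 98.9, 98.8, 98.6, 98.5, 91.0]))  # True
print(is_saturated([84.0, 82.5, 80.1, 76.3, 70.2]))        # False
```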

[166] Simple Baselines are Competitive with Code Evolution

Yonatan Gideoni, Sebastian Risi, Yarin Gal

Main category: cs.AI

TL;DR: Simple baselines match or exceed sophisticated code evolution methods across mathematical bounds, agentic scaffolds, and ML competitions, revealing issues in evaluation and methodology.

DetailsMotivation: Many code evolution techniques using LLMs show impressive results but lack comparison to simpler baselines, raising questions about their true effectiveness and whether complexity is necessary.

Method: Tested two simple baselines against sophisticated code evolution methods across three domains: mathematical bounds improvement, agentic scaffold design, and machine learning competitions.

Result: Simple baselines matched or exceeded more complex methods in all three domains. For mathematical bounds, search space design and domain knowledge mattered more than the evolution pipeline. For agentic scaffolds, high variance and small datasets led to suboptimal selection, with hand-designed majority vote performing best.

Conclusion: Code evolution research needs better evaluation methods, reduced stochasticity, and more rigorous practices. The main challenges are search space design (by domain experts) rather than the search algorithms themselves.

Abstract: Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that simple baselines match or exceed much more sophisticated methods in all three. By analyzing these results we find various shortcomings in how code evolution is both developed and used. For the mathematical bounds, a problem’s search space and domain knowledge in the prompt are chiefly what dictate a search’s performance ceiling and efficiency, with the code evolution pipeline being secondary. Thus, the primary challenge in finding improved bounds is designing good search spaces, which is done by domain experts, and not the search itself. When designing agentic scaffolds we find that high variance in the scaffolds coupled with small datasets leads to suboptimal scaffolds being selected, resulting in hand-designed majority vote scaffolds performing best. We propose better evaluation methods that reduce evaluation stochasticity while keeping the code evolution economically feasible. We finish with a discussion of avenues and best practices to enable more rigorous code evolution in future work.
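The paper does not spell out its two baselines in this summary, but one canonical example of a simple baseline in this setting is best-of-n sampling: draw independent candidates and keep the best under the task score. A sketch, where sample_program stands in for an LLM sampling call:

```python
import random

def best_of_n(sample_program, score, n=100):
    """Draw n independent candidates; return the one with the highest score."""
    return max((sample_program() for _ in range(n)), key=score)

# Toy demo: the "program" is a random number and the score is its value.
print(best_of_n(lambda: random.random(), score=lambda c: c))
```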

[167] Improved Upper Bounds for Slicing the Hypercube

Duncan Soiffer, Nathaniel Itty, Christopher D. Rosin, Blake Bruell, Mason DiCicco, Gábor N. Sárközy, Ryan Offstein, Daniel Reichman

Main category: cs.AI

TL;DR: Paper improves upper bound on minimum number of hyperplanes needed to slice all edges of n-dimensional hypercube from ⌈5n/6⌉ to ⌈4n/5⌉ (with exception for odd multiples of 5), using computational search aided by reasoning LLMs.

DetailsMotivation: Studies a classic combinatorial geometry problem: determining the minimum number of hyperplanes needed to slice all edges of the n-dimensional hypercube, a question in discrete geometry with connections to computational complexity and combinatorial optimization.

Method: Constructive proof approach using computational search aided by CPro1 - an automatic tool combining reasoning LLMs with automated hyperparameter tuning to create search algorithms for mathematical constructions. Specifically constructed 8 hyperplanes slicing Q_10.

Result: Proved S(n) ≤ ⌈4n/5⌉ (except when n is odd multiple of 5, then S(n) ≤ 4n/5 + 1), improving previous bound of ⌈5n/6⌉. Also obtained new lower bounds on maximum number of edges sliced by k < n hyperplanes.

Conclusion: Significant improvement in upper bound for hypercube edge slicing problem, demonstrating effectiveness of combining reasoning LLMs with automated search for mathematical constructions.

Abstract: A collection of hyperplanes $\mathcal{H}$ slices all edges of the $n$-dimensional hypercube $Q_n$ with vertex set $\{-1,1\}^n$ if, for every edge $e$ in the hypercube, there exists a hyperplane in $\mathcal{H}$ intersecting $e$ in its interior. Let $S(n)$ be the minimum number of hyperplanes needed to slice $Q_n$. We prove that $S(n) \leq \lceil \frac{4n}{5} \rceil$, except when $n$ is an odd multiple of $5$, in which case $S(n) \leq \frac{4n}{5} + 1$. This improves upon the previously known upper bound of $S(n) \leq \lceil \frac{5n}{6} \rceil$, due to Paterson, reported in 1971. We also obtain new lower bounds on the maximum number of edges in $Q_n$ that can be sliced using $k<n$ hyperplanes. We prove the improved upper bound on $S(n)$ by constructing $8$ hyperplanes slicing $Q_{10}$, aided by the recently introduced CPro1: an automatic tool that uses reasoning LLMs coupled with automated hyperparameter tuning to create search algorithms for the discovery of mathematical constructions.
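The slicing condition itself is easy to verify mechanically: a hyperplane $w \cdot x = c$ slices an edge in its interior exactly when the two endpoints evaluate to strictly opposite signs. The checker below confirms a trivial two-hyperplane slicing of $Q_2$; the paper's 8-hyperplane construction for $Q_{10}$ can be validated with the same test.

```python
from itertools import product

def slices_all_edges(n, hyperplanes):
    """Check that every edge of {-1,1}^n is sliced in its interior."""
    for v in product([-1, 1], repeat=n):
        for i in range(n):
            if v[i] == -1:                    # enumerate each edge once
                u = list(v); u[i] = 1
                if not any(
                    (sum(wi * vi for wi, vi in zip(w, v)) - c)
                    * (sum(wi * ui for wi, ui in zip(w, u)) - c) < 0
                    for w, c in hyperplanes
                ):
                    return False
    return True

# The axis hyperplanes x1 = 0 and x2 = 0 slice all four edges of Q_2.
print(slices_all_edges(2, [((1, 0), 0), ((0, 1), 0)]))  # True
```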

[168] NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography

Zhongcan Xiao, Leyi Zhang, Guannan Zhang, Xiaoping Wang

Main category: cs.AI

TL;DR: NeuDiff Agent is an AI workflow system for crystallography data analysis that automates reduction, integration, refinement, and validation pipelines at neutron sources, achieving 4.6-5.0x speedup while maintaining validation standards.

DetailsMotivation: Large-scale scientific facilities face analysis latency as the bottleneck in scientific throughput, especially for complex samples requiring iterative reduction, integration, refinement, and validation processes. There's a need to improve time-to-result and analysis efficiency while maintaining traceability and validation standards.

Method: NeuDiff Agent is a governed, tool-using AI workflow that executes established crystallography pipelines under explicit governance. It restricts actions to allowlisted tools, enforces fail-closed verification gates at workflow boundaries, and captures complete provenance for inspection, auditing, and controlled replay. Performance was assessed using fixed prompt protocols with two large language model backends.

Result: In benchmark testing, NeuDiff Agent reduced wall time from 435 minutes (manual) to 86.5-94.4 minutes (4.6-5.0x faster) while producing validated CIF files with no checkCIF level A or B alerts. The system demonstrated practical deployment of agentic AI in facility crystallography while preserving traceability and validation requirements.

Conclusion: NeuDiff Agent establishes a practical route to deploy agentic AI in facility crystallography, significantly improving analysis throughput while maintaining the traceability and validation standards required for scientific publication.

Abstract: Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a governed, tool-using AI workflow for TOPAZ at the Spallation Neutron Source that takes instrument data products through reduction, integration, refinement, and validation to a validated crystal structure and a publication-ready CIF. NeuDiff Agent executes this established pipeline under explicit governance by restricting actions to allowlisted tools, enforcing fail-closed verification gates at key workflow boundaries, and capturing complete provenance for inspection, auditing, and controlled replay. Performance is assessed using a fixed prompt protocol and repeated end-to-end runs with two large language model backends, with user and machine time partitioned and intervention burden and recovery behaviors quantified under gating. In a reference-case benchmark, NeuDiff Agent reduces wall time from 435 minutes (manual) to 86.5(4.7) to 94.4(3.5) minutes (4.6-5.0x faster) while producing a validated CIF with no checkCIF level A or B alerts. These results establish a practical route to deploy agentic AI in facility crystallography while preserving traceability and publication-facing validation requirements.
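The governance pattern (allowlisted tools, fail-closed gates, provenance capture) can be sketched as a thin dispatch wrapper; the tool registry and gate below are stubs, not NeuDiff Agent's actual interface.

```python
ALLOWLIST = {"reduce", "integrate", "refine", "validate_cif"}

def stub_tool(**kwargs):          # placeholder for a real pipeline tool
    return "ok"

TOOLS = {name: stub_tool for name in ALLOWLIST}
provenance = []                   # complete record for audit and replay

class GateFailure(RuntimeError):
    pass

def run_tool(name, args, gate=None):
    if name not in ALLOWLIST:     # actions restricted to allowlisted tools
        raise GateFailure(f"tool {name!r} is not allowlisted")
    result = TOOLS[name](**args)
    provenance.append({"tool": name, "args": args, "result": result})
    if gate is not None and not gate(result):
        # Fail closed: a rejected output halts the workflow rather than continuing.
        raise GateFailure(f"fail-closed gate rejected output of {name!r}")
    return result

print(run_tool("refine", {}, gate=lambda r: r == "ok"))  # ok
```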

[169] Node Learning: A Framework for Adaptive, Decentralised and Collaborative Network Edge AI

Eiman Kanjo, Mustafa Aslanov

Main category: cs.AI

TL;DR: Node Learning is a decentralized AI paradigm where intelligence resides at individual edge nodes and expands through selective peer interactions, eliminating centralized bottlenecks.

DetailsMotivation: Centralized AI faces scalability issues in edge environments due to data transmission costs, latency, energy consumption, and dependence on large data centers, especially in heterogeneous, mobile, and resource-constrained settings.

Method: Nodes learn continuously from local data, maintain their own model state, and exchange learned knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchronization or central aggregation.

Result: The paper develops conceptual foundations for Node Learning, contrasts it with existing decentralized approaches, and examines implications for communication, hardware, trust, and governance.

Conclusion: Node Learning provides a unified abstraction for autonomous and cooperative behavior that accommodates heterogeneity in data, hardware, objectives, and connectivity, placing existing paradigms within a broader decentralized perspective.

Abstract: The expansion of AI toward the edge increasingly exposes the cost and fragility of centralised intelligence. Data transmission, latency, energy consumption, and dependence on large data centres create bottlenecks that scale poorly across heterogeneous, mobile, and resource-constrained environments. In this paper, we introduce Node Learning, a decentralised learning paradigm in which intelligence resides at individual edge nodes and expands through selective peer interaction. Nodes learn continuously from local data, maintain their own model state, and exchange learned knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchronisation or central aggregation. It unifies autonomous and cooperative behaviour within a single abstraction and accommodates heterogeneity in data, hardware, objectives, and connectivity. This concept paper develops the conceptual foundations of this paradigm, contrasts it with existing decentralised approaches, and examines implications for communication, hardware, trust, and governance. Node Learning does not discard existing paradigms, but places them within a broader decentralised perspective.
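A minimal sketch of opportunistic peer exchange under this paradigm: each node trains locally and blends in a peer's parameters only when the blend helps on its own data. The benefit test and blend rate are illustrative assumptions, not part of the paper.

```python
import numpy as np

class Node:
    """An edge node with its own model state and local learning (sketch)."""

    def __init__(self, dim):
        self.w = np.zeros(dim)

    def local_step(self, x, y, lr=0.01):
        # One SGD step on squared loss: grad of (w.x - y)^2 is 2(w.x - y)x.
        self.w -= lr * 2 * x * (self.w @ x - y)

    def maybe_merge(self, peer, loss_fn, alpha=0.5):
        # Opportunistic exchange: adopt the blend only when it is beneficial.
        blended = (1 - alpha) * self.w + alpha * peer.w
        if loss_fn(blended) < loss_fn(self.w):
            self.w = blended

loss = lambda w: float((w @ np.array([1.0, 2.0]) - 3.0) ** 2)
a, b = Node(2), Node(2)
a.local_step(np.array([1.0, 2.0]), 3.0)
b.maybe_merge(a, loss)    # b adopts the blend because it lowers b's loss
print(b.w)
```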

[170] An order-oriented approach to scoring hesitant fuzzy elements

Luis Merino, Gabriel Navarro, Carlos Salvatierra, Evangelina Santos

Main category: cs.AI

TL;DR: A theoretical framework for scoring hesitant fuzzy sets based on order theory, introducing dominance functions for ranking with control sets and minimum thresholds.

DetailsMotivation: Traditional scoring approaches for hesitant fuzzy sets lack formal foundations in order theory, leading to inconsistent and inflexible scoring mechanisms that don't properly handle the inherent uncertainty in hesitant fuzzy elements.

Method: Proposes an order-oriented framework where scores are defined relative to specific orders on hesitant fuzzy elements (nonempty subsets of [0,1]). Analyzes classical orders, proves they don’t induce lattice structures, then introduces dominance functions that compare hesitant fuzzy elements relative to control sets with minimum acceptability thresholds. Provides two concrete examples: discrete dominance function and relative dominance function for finite sets.

Result: Shows that scores defined with respect to the symmetric order satisfy key normative criteria including strong monotonicity with respect to unions and the Gärdenfors condition. Demonstrates that classical orders on hesitant fuzzy elements do not induce lattice structures, contrary to prior claims. The proposed dominance functions can construct fuzzy preference relations on hesitant fuzzy sets and support group decision-making.

Conclusion: The order-oriented framework provides a more rigorous foundation for scoring hesitant fuzzy sets, with dominance functions offering practical tools for ranking and decision-making applications while maintaining theoretical coherence.

Abstract: Traditional scoring approaches on hesitant fuzzy sets often lack a formal base in order theory. This paper proposes a unified framework, where each score is explicitly defined with respect to a given order. This order-oriented perspective enables more flexible and coherent scoring mechanisms. We examine several classical orders on hesitant fuzzy elements, that is, nonempty subsets in [0,1], and show that, contrary to prior claims, they do not induce lattice structures. In contrast, we prove that the scores defined with respect to the symmetric order satisfy key normative criteria for scoring functions, including strong monotonicity with respect to unions and the Gärdenfors condition. Following this analysis, we introduce a class of functions, called dominance functions, for ranking hesitant fuzzy elements. They aim to compare hesitant fuzzy elements relative to control sets incorporating minimum acceptability thresholds. Two concrete examples of dominance functions for finite sets are provided: the discrete dominance function and the relative dominance function. We show that these can be employed to construct fuzzy preference relations on typical hesitant fuzzy sets and support group decision-making.
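As a hedged formalization (the paper's actual definitions of the discrete and relative dominance functions may differ), a dominance score over finite hesitant fuzzy elements can be read as the fraction of value/control pairs clearing a minimum acceptability threshold:

```python
def discrete_dominance(hfe, control, threshold=0.0):
    """Fraction of (value, control) pairs where the HFE value dominates
    the control value by at least the given threshold (hedged reading)."""
    pairs = [(h, c) for h in hfe for c in control]
    return sum(1 for h, c in pairs if h - c >= threshold) / len(pairs)

# HFE {0.4, 0.7, 0.9} against control set {0.5, 0.6} with a 0.05 threshold.
print(discrete_dominance([0.4, 0.7, 0.9], [0.5, 0.6], threshold=0.05))
```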

[171] IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri

Main category: cs.AI

TL;DR: The Indic Jailbreak Robustness (IJR) benchmark evaluates LLM safety across 12 Indic and South Asian languages, revealing multilingual vulnerabilities not captured by English-only, contract-bound evaluations.

DetailsMotivation: Current LLM safety alignment is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied, especially for South Asian users who frequently code-switch and romanize.

Method: Created Indic Jailbreak Robustness (IJR) benchmark with 45,216 prompts across 12 Indic languages, covering JSON (contract-bound) and Free (naturalistic) tracks to evaluate adversarial safety.

Result: Three key findings: 1) Contracts inflate refusals but don’t stop jailbreaks; 2) English-to-Indic attacks transfer strongly; 3) Orthography matters: romanized or mixed-script inputs reduce jailbreak success rates, with correlations to romanization share and tokenization indicating systematic effects.

Conclusion: IJR reveals multilingual safety risks hidden by English-only evaluations, especially relevant for South Asian users who code-switch and romanize, highlighting the need for better multilingual safety alignment.

Abstract: Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce \textbf{Indic Jailbreak Robustness (IJR)}, a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 with refusals collapsing. (2) English to Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approx 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

[172] Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan

Main category: cs.AI

TL;DR: GUI-Owl-1.5 is a native GUI agent model with multiple size variants that achieves SOTA results on GUI benchmarks across desktop, mobile, and browser platforms through hybrid data collection, unified reasoning enhancement, and multi-platform reinforcement learning.

DetailsMotivation: To create a versatile GUI agent that can operate across multiple platforms (desktop, mobile, browser) with cloud-edge collaboration capabilities, addressing the need for efficient GUI automation, grounding, tool-calling, and knowledge tasks in real-world applications.

Method: Develops multiple model sizes (2B-235B) with hybrid data flywheel combining simulated and cloud sandbox environments, unified thought-synthesis pipeline for reasoning enhancement, and MRPO (Multi-platform Environment RL) algorithm to handle platform conflicts and long-horizon tasks.

Result: Achieves state-of-the-art results on 20+ GUI benchmarks: 56.5 on OSWorld, 71.6 on AndroidWorld, 48.4 on WebArena for automation; 80.3 on ScreenSpotPro for grounding; 47.6 on OSWorld-MCP and 46.8 on MobileWorld for tool-calling; 75.5 on GUI-Knowledge Bench for memory/knowledge tasks.

Conclusion: GUI-Owl-1.5 demonstrates superior performance across diverse GUI tasks and platforms, offering open-source models with cloud-sandbox demo, representing significant advancement in GUI agent capabilities through innovative data collection, reasoning enhancement, and multi-platform RL approaches.

Abstract: The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results on more than 20 GUI benchmarks among open-source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybrid Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud-based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model’s reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory, and multi-agent adaptation; (3) Multi-platform Environment RL Scaling: we propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at https://github.com/X-PLUG/MobileAgent.

[173] OpenSage: Self-programming Agent Generation Engine

Hongwei Li, Zhun Wang, Qinrun Dai, Yuzhou Nie, Jinjun Peng, Ruitong Liu, Jingyang Zhang, Kaijie Zhu, Jingxuan He, Lun Wang, Yangruibo Ding, Yueqi Chen, Wenbo Guo, Dawn Song

Main category: cs.AI

TL;DR: OpenSage is an agent development kit that enables LLMs to automatically create agents with self-generated topology and toolsets, featuring hierarchical memory and specialized software engineering toolkits.

DetailsMotivation: Current agent development kits either lack sufficient functional support or require manual human design of agent components (topology, tools, memory), limiting agent generalizability and performance.

Method: OpenSage enables LLMs to automatically create agents with self-generated topology and toolsets, provides comprehensive structured memory support with hierarchical graph-based memory system, and includes specialized toolkits for software engineering tasks.

Result: Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate OpenSage’s advantages over existing ADKs, with rigorous ablation studies confirming effectiveness of each component design.

Conclusion: OpenSage paves the way for next-generation agent development by shifting from human-centered to AI-centered paradigms, enabling automatic agent creation with self-generated components.

Abstract: Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents’ performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents’ generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with self-generated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.
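The self-programming pattern, agents creating their own sub-agents and tools at runtime, can be sketched in a few lines; the registry and spawn API below are hypothetical, not OpenSage's actual interface.

```python
class Agent:
    """An agent that manages its own sub-agents and toolkit (sketch)."""

    def __init__(self, name, tools=None):
        self.name, self.tools, self.children = name, dict(tools or {}), []

    def add_tool(self, name, fn):
        self.tools[name] = fn            # self-generated toolset

    def spawn(self, name, tools=None):
        child = Agent(name, tools)       # self-generated topology
        self.children.append(child)
        return child

root = Agent("planner")
coder = root.spawn("coder", tools={"run_tests": lambda: "passed"})
print(coder.tools["run_tests"]())        # passed
```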

[174] AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Tanqiu Jiang, Yuhui Wang, Jiacheng Liang, Ting Wang

Main category: cs.AI

TL;DR: AgentLAB is a benchmark for evaluating LLM agent vulnerabilities to long-horizon attacks across 28 environments with 644 security test cases, revealing current agents remain highly susceptible despite single-turn defenses.

DetailsMotivation: As LLM agents are deployed in complex, long-horizon environments, they face new security risks from multi-turn attacks that exploit user-agent-environment interactions, which cannot be measured by existing single-turn security benchmarks.

Method: Developed AgentLAB benchmark with five novel attack types (intent hijacking, tool chaining, task injection, objective drifting, memory poisoning) across 28 realistic agentic environments and 644 security test cases to systematically evaluate LLM agent vulnerabilities.

Result: Evaluation shows representative LLM agents remain highly susceptible to long-horizon attacks, and defenses designed for single-turn interactions fail to reliably mitigate these multi-turn threats.

Conclusion: AgentLAB provides the first dedicated benchmark for measuring LLM agent vulnerabilities to adaptive, long-horizon attacks, serving as a valuable tool for tracking progress in securing practical agent deployments.

Abstract: LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user-agent-environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long-horizon attacks; moreover, defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings. The benchmark is publicly available at https://tanqiujiang.github.io/AgentLAB_main.

[175] LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Main category: cs.AI

TL;DR: LLM-Wikirace is a benchmark for evaluating planning, reasoning, and world knowledge in LLMs through Wikipedia navigation tasks, revealing limitations in current models’ long-horizon reasoning and replanning capabilities.

DetailsMotivation: To create a benchmark that evaluates LLMs' planning, reasoning, and world knowledge capabilities through a concrete task of navigating Wikipedia hyperlinks, which requires look-ahead planning and understanding of real-world concept connections.

Method: Developed LLM-Wikirace benchmark where models must navigate from a source Wikipedia page to a target page using hyperlinks step by step. Evaluated various open- and closed-source models (Gemini-3, GPT-5, Claude Opus 4.5) across easy and hard difficulty levels, analyzing performance, world knowledge requirements, and planning capabilities.

Result: Frontier models achieve superhuman performance on easy tasks but performance drops sharply on hard difficulty (Gemini-3 succeeds in only 23% of hard games). World knowledge is necessary but insufficient beyond a threshold where planning and long-horizon reasoning become dominant. Models struggle with replanning after failure and frequently enter loops.

Conclusion: LLM-Wikirace reveals clear limitations in current reasoning systems, particularly in long-horizon planning and recovery from failures. It provides an open benchmark where planning-capable LLMs still have substantial room for improvement.

Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
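The task reduces to a step-budgeted navigation loop over the Wikipedia link graph. A sketch with a toy graph, where choose_link stands in for the LLM's link selection and is not the benchmark's API:

```python
def run_race(graph, source, target, choose_link, max_steps=30):
    """Follow one hyperlink per step until the target page or the budget."""
    page, path = source, [source]
    while page != target and len(path) <= max_steps:
        page = choose_link(page, graph[page], target)  # LLM picks a hyperlink
        path.append(page)
    return path if page == target else None

toy_graph = {"Cat": ["Mammal", "Egypt"], "Egypt": ["Nile"], "Mammal": [], "Nile": []}
print(run_race(toy_graph, "Cat", "Nile", lambda page, links, t: links[-1]))
# ['Cat', 'Egypt', 'Nile']
```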

[176] Autonomous Data Processing using Meta-Agents

Udayan Khurana

Main category: cs.AI

TL;DR: ADP-MA is an autonomous data processing framework using hierarchical meta-agents to dynamically construct, execute, monitor, and refine data processing pipelines through agent orchestration.

DetailsMotivation: Traditional data processing pipelines are static and handcrafted for specific tasks, lacking adaptability to evolving requirements. Existing general-purpose agents and coding assistants can generate code but cannot autonomously monitor, manage, and optimize end-to-end pipelines once deployed.

Method: ADP-MA uses hierarchical agent orchestration with meta-agents that analyze input data and task specifications to design multi-phase plans, instantiate specialized ground-level agents, and continuously evaluate pipeline performance. The architecture has three components: planning module for strategy generation, orchestration layer for agent coordination and tool integration, and monitoring loop for iterative evaluation and backtracking.

Result: The framework demonstrates pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks through an interactive demo, showcasing context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability.

Conclusion: ADP-MA provides an autonomous framework for dynamic data processing pipeline construction and optimization through hierarchical agent orchestration, addressing limitations of traditional static pipelines and current agent-based approaches.

Abstract: Traditional data processing pipelines are typically static and handcrafted for specific tasks, limiting their adaptability to evolving requirements. While general-purpose agents and coding assistants can generate code for well-understood data pipelines, they lack the ability to autonomously monitor, manage, and optimize an end-to-end pipeline once deployed. We present \textbf{Autonomous Data Processing using Meta-agents} (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines through hierarchical agent orchestration. At its core, \textit{meta-agents} analyze input data and task specifications to design a multi-phase plan, instantiate specialized \textit{ground-level agents}, and continuously evaluate pipeline performance. The architecture comprises three key components: a planning module for strategy generation, an orchestration layer for agent coordination and tool integration, and a monitoring loop for iterative evaluation and backtracking. Unlike conventional approaches, ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. Additionally, the framework leverages a diverse set of external tools and can reuse previously designed agents, reducing redundancy and accelerating pipeline construction. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.
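The three components map naturally onto a plan-orchestrate-monitor loop. A sketch in which every callable is a placeholder for the corresponding ADP-MA module, not the framework's real API:

```python
def run_pipeline(data, task, plan_fn, spawn_fn, evaluate_fn, max_rounds=5):
    """Plan, run ground-level agents, evaluate, and replan on failure (sketch)."""
    feedback = None
    for _ in range(max_rounds):
        plan = plan_fn(data, task, feedback)         # planning module
        result = data
        for phase in plan:
            result = spawn_fn(phase)(result)         # orchestration layer
        ok, feedback = evaluate_fn(result, task)     # monitoring loop
        if ok:
            return result
    return None                                      # backtracking exhausted
```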

[177] Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval

Main category: cs.AI

TL;DR: Fine-tuning aligned vision-language models on harmful data causes severe emergent misalignment that generalizes across tasks and modalities, with multimodal evaluation revealing worse degradation than text-only benchmarks.

DetailsMotivation: There's a fundamental tension between acquiring new capabilities through post-training and preserving safety alignment in lifelong multimodal agents. Current post-training paradigms may not sufficiently preserve alignment in deployment settings.

Method: Experiments on Gemma3-4B vision-language model using LoRA fine-tuning on harmful datasets, evaluating misalignment across text and multimodal tasks. Geometric analysis of harmful behavior subspaces and evaluation of mitigation strategies (benign fine-tuning and activation-based steering).

Result: Misalignment scales with LoRA rank, multimodal evaluation shows substantially higher misalignment (70.71±1.22 at r=128) than text-only (41.19±2.51). Even 10% harmful data causes degradation. Harmful behaviors occupy low-dimensional subspace (10 principal components). Mitigation strategies reduce but don’t completely remove harmful behaviors.

Conclusion: Current post-training paradigms may not sufficiently preserve alignment, highlighting need for robust continual learning frameworks. Unimodal safety benchmarks underestimate alignment degradation in vision-language models.

Abstract: Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
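The geometric claim, that misalignment concentrates in a low-dimensional subspace, can be illustrated with PCA on activation differences; the synthetic rank-3 shift below is a stand-in for real model activations, not the paper's data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 768))
# Inject a rank-3 "harmful" shift so the difference is low-dimensional.
directions = rng.normal(size=(3, 768))
harmful = benign + rng.normal(size=(500, 3)) @ directions

pca = PCA(n_components=10).fit(harmful - benign)
print(pca.explained_variance_ratio_.sum())  # ~1.0: a few PCs capture the shift
```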

[178] DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Justin Albrethsen, Yash Datta, Kunal Kumar, Sharath Rajasekar

Main category: cs.AI

TL;DR: DeepContext is a stateful RNN-based framework for multi-turn jailbreak detection that tracks temporal intent evolution across conversations, outperforming stateless safety filters.

DetailsMotivation: Current LLM safety guardrails are stateless and treat multi-turn dialogues as disconnected events, creating a "Safety Gap" where adversarial tactics like Crescendo and ActorAttack can slowly bleed malicious intent across turn boundaries to bypass filters.

Method: DeepContext uses a Recurrent Neural Network (RNN) architecture that ingests sequences of fine-tuned turn-level embeddings, propagating a hidden state across conversations to capture incremental risk accumulation that stateless models overlook.

Result: Achieves state-of-the-art F1 score of 0.84 for multi-turn jailbreak detection, significantly outperforming hyperscaler cloud-provider guardrails and leading open-weight models like Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67), with sub-20ms inference overhead on T4 GPU.

Conclusion: Modeling the sequential evolution of intent is more effective and computationally efficient than deploying massive stateless models for safety monitoring in multi-turn dialogues.

Abstract: While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a “Safety Gap” where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models.
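The stateful monitor can be sketched as a GRU over per-turn embeddings, so each risk estimate conditions on the whole conversation prefix; the dimensions and classifier head below are illustrative, and DeepContext's embeddings come from a fine-tuned encoder rather than random tensors.

```python
import torch
import torch.nn as nn

class TurnRiskRNN(nn.Module):
    """GRU over turn embeddings: hidden state carries accumulated risk (sketch)."""

    def __init__(self, emb_dim=384, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, turn_embeddings):
        # turn_embeddings: (batch, num_turns, emb_dim)
        states, _ = self.rnn(turn_embeddings)
        return torch.sigmoid(self.head(states)).squeeze(-1)  # risk per turn

model = TurnRiskRNN()
risk = model(torch.randn(2, 6, 384))   # 2 conversations, 6 turns each
print(risk.shape)                      # torch.Size([2, 6])
```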

[179] SourceBench: Can AI Answers Reference Quality Web Sources?

Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang

Main category: cs.AI

TL;DR: SourceBench is a benchmark for evaluating the quality of web sources cited by LLMs across 100 real-world queries, using an 8-metric framework covering content quality and page-level signals.

DetailsMotivation: Current evaluations of LLMs focus on answer correctness but neglect the quality of cited web sources, which is crucial for trustworthiness and reliability in real-world applications.

Method: Developed SourceBench with 100 diverse queries across informational, factual, argumentative, social, and shopping intents. Created an 8-metric framework covering content quality (relevance, accuracy, objectivity) and page-level signals (freshness, authority, clarity). Built a human-labeled dataset and calibrated LLM-based evaluator matching expert judgments.

Result: Evaluated 8 LLMs, Google Search, and 3 AI search tools over 3996 cited sources. The benchmark reveals four key insights about source quality patterns and provides guidance for future GenAI and web search research.

Conclusion: SourceBench addresses the gap in evaluating source quality for LLM citations, providing a comprehensive framework and dataset that can guide improvements in how AI systems select and cite web sources.

Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.

[180] Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

Arnold Cartagena, Ariane Teixeira

Main category: cs.AI

TL;DR: LLM agent safety evaluations focusing only on text refusal don’t measure actual harmful tool-call actions; GAP benchmark reveals significant divergence between text safety and tool-call safety across multiple domains.

DetailsMotivation: Current safety evaluations for LLM agents overwhelmingly measure text-level refusal behavior but don't assess whether alignment that suppresses harmful text also suppresses harmful actions through tool calls with real-world consequences.

Method: Introduced GAP benchmark with systematic evaluation across 6 frontier models, 6 regulated domains, 7 jailbreak scenarios per domain, 3 system prompt conditions, and 2 prompt variants, producing 17,420 analysis-ready datapoints to measure divergence between text-level and tool-call-level safety.

Result: Text safety does not transfer to tool-call safety - models often refuse harmful requests in text while simultaneously executing forbidden actions via tool calls. 219 such cases persisted across all 6 models even with safety-reinforced prompts. System prompt wording significantly influences tool-call behavior (21-57 percentage point variations). Runtime governance contracts reduce information leakage but don’t deter forbidden tool-call attempts.

Conclusion: Text-only safety evaluations are insufficient for assessing agent behavior; tool-call safety requires dedicated measurement and mitigation strategies separate from text-level alignment.

Abstract: Large language models deployed as agents increasingly interact with external systems through tool calls–actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model’s text output refuses a harmful request while its tool calls simultaneously execute the forbidden action–a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool-call behavior: TC-safe rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.
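The divergence itself is straightforward to compute once each datapoint records both channels. A sketch with hypothetical field names; the benchmark's exact scoring protocol is defined in the paper:

```python
def gap_rate(records):
    """Fraction of cases where text refuses but a forbidden tool call is made."""
    diverging = [r for r in records if r["text_refused"] and r["tool_call_forbidden"]]
    return len(diverging) / len(records)

records = [
    {"text_refused": True,  "tool_call_forbidden": True},   # refuses, yet acts
    {"text_refused": True,  "tool_call_forbidden": False},  # safe on both channels
    {"text_refused": False, "tool_call_forbidden": False},
]
print(gap_rate(records))  # 0.333...
```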

[181] LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

Main category: cs.AI

TL;DR: LLM4Cov: An offline agent-learning framework for hardware verification that uses execution-validated data curation and policy-aware agentic data synthesis to enable scalable learning under execution constraints, achieving competitive performance with compact models.

DetailsMotivation: Execution-aware LLM agents need expensive and slow tool feedback, making online reinforcement learning impractical for hardware verification which relies on industrial simulators and non-differentiable execution signals.

Method: Models verification as memoryless state transitions guided by deterministic evaluators, introduces execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling for scalable offline learning.

Result: A compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.

Conclusion: LLM4Cov enables effective offline agent learning for hardware verification by addressing execution constraints through novel data curation and synthesis techniques, showing that compact models can achieve competitive performance.

Abstract: Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.
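Worst-state-prioritized sampling can be sketched as coverage-weighted draws, so low-coverage states are revisited more often; the weighting scheme below is an illustrative assumption, not the paper's exact prioritization.

```python
import random

def sample_state(states, coverage, k=1, temperature=0.1):
    # States with low coverage get proportionally more weight; the floor
    # keeps well-covered states from vanishing entirely.
    weights = [max(1.0 - coverage[s], temperature) for s in states]
    return random.choices(states, weights=weights, k=k)

states = ["fsm_a", "fsm_b", "fsm_c"]
coverage = {"fsm_a": 0.95, "fsm_b": 0.40, "fsm_c": 0.10}
print(sample_state(states, coverage, k=5))  # mostly the low-coverage states
```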

[182] Automating Agent Hijacking via Structural Template Injection

Xinhao Deng, Jiaqing Wu, Miao Chen, Yue Xiao, Ke Xu, Qi Li

Main category: cs.AI

TL;DR: Phantom: Automated agent hijacking framework using structured template injection to exploit LLM agent architecture vulnerabilities, achieving high attack success rates across commercial models.

DetailsMotivation: Existing agent hijacking attacks rely on manual prompt manipulation with low success rates and poor transferability to closed-source models. There's a need for automated, architecture-aware attacks that can effectively target commercial LLM agents.

Method: Uses Structured Template Injection targeting LLM agent architectural mechanisms. Introduces attack template search with multi-level template augmentation, Template Autoencoder (TAE) for continuous latent space embedding, and Bayesian optimization to find optimal adversarial vectors decoded into structured templates.

Result: Significantly outperforms existing baselines in Attack Success Rate (ASR) and query efficiency on Qwen, GPT, and Gemini. Identified over 70 vulnerabilities in real-world commercial products confirmed by vendors.

Conclusion: Structured template-based hijacking poses severe practical threats to LLM agents, requiring architectural defenses. The framework provides empirical foundation for securing next-generation agentic systems.

Abstract: Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics-driven prompt manipulation, which often yields low attack success rates and limited transferability to closed-source commercial models. In this paper, we propose Phantom, an automated agent hijacking framework built upon Structured Template Injection that targets the fundamental architectural mechanisms of LLM agents. Our key insight is that agents rely on specific chat template tokens to separate system, user, assistant, and tool instructions. By injecting optimized structured templates into the retrieved context, we induce role confusion and cause the agent to misinterpret the injected content as legitimate user instructions or prior tool outputs. To enhance attack transferability against black-box agents, Phantom introduces a novel attack template search framework. We first perform multi-level template augmentation to increase structural diversity and then train a Template Autoencoder (TAE) to embed discrete templates into a continuous, searchable latent space. Subsequently, we apply Bayesian optimization to efficiently identify optimal adversarial vectors that are decoded into high-potency structured templates. Extensive experiments on Qwen, GPT, and Gemini demonstrate that our framework significantly outperforms existing baselines in both Attack Success Rate (ASR) and query efficiency. Moreover, we identified over 70 vulnerabilities in real-world commercial products that have been confirmed by vendors, underscoring the practical severity of structured template-based hijacking and providing an empirical foundation for securing next-generation agentic systems.

[183] HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing

Srikumar Nayak

Main category: cs.AI

TL;DR: HQFS is a hybrid quantum-classical financial system that integrates forecasting, discrete risk optimization, and auditability in one pipeline using variational quantum circuits for predictions and quantum annealing for optimization with post-quantum signatures for verification.

DetailsMotivation: Traditional two-step financial risk systems (prediction then optimization) break under real constraints like market shifts, discrete constraints (lot sizes, caps), slow optimization for large asset sets, and lack of clear audit trails linking decisions to model states.

Method: 1) Uses variational quantum circuit (VQC) with classical head for return and volatility prediction; 2) Converts risk-return objective and constraints into QUBO solved via quantum annealing (with classical fallback); 3) Signs outputs with post-quantum signatures for verifiable audit trail.

Result: Reduces return prediction error by 7.8% and volatility prediction error by 6.1% vs classical baseline. Improves out-of-sample Sharpe by 9.4% and lowers maximum drawdown by 11.7%. Cuts average solve time by 28% compared to mixed-integer baseline while producing fully traceable signed records.

Conclusion: HQFS demonstrates a practical hybrid quantum-classical pipeline that addresses real-world financial system limitations by integrating prediction, optimization, and auditability while showing measurable improvements in accuracy, performance, and risk metrics.

Abstract: Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance. In practice, this split can break under real constraints. The prediction model may look good, but the final decision can be unstable when the market shifts, when discrete constraints are added (lot sizes, caps), or when the optimization becomes slow for larger asset sets. Also, regulated settings need a clear audit trail that links each decision to the exact model state and inputs. We present HQFS, a practical hybrid pipeline that connects forecasting, discrete risk optimization, and auditability in one flow. First, HQFS learns next-step return and a volatility proxy using a variational quantum circuit (VQC) with a small classical head. Second, HQFS converts the risk-return objective and constraints into a QUBO and solves it with quantum annealing when available, while keeping a compatible classical QUBO solver as a fallback for deployment. Third, HQFS signs each rebalance output using a post-quantum signature so the allocation can be verified later without trusting the runtime environment. In our market dataset study, HQFS reduces return prediction error by 7.8% and volatility prediction error by 6.1% versus a tuned classical baseline. For the decision layer, HQFS improves out-of-sample Sharpe by 9.4% and lowers maximum drawdown by 11.7%. The QUBO solve stage also cuts average solve time by 28% compared to a mixed-integer baseline under the same constraints, while producing fully traceable, signed allocation records.
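
Editor’s note: a minimal sketch of the QUBO step, assuming a standard mean-variance objective with a quadratic penalty enforcing “pick exactly k assets”; the paper’s actual constraint set (lot sizes, caps) is richer, and all numbers here are invented.

```python
import numpy as np

def portfolio_qubo(mu, Sigma, k, lam=1.0, rho=8.0):
    """Q for minimizing  -mu^T x + lam * x^T Sigma x + rho * (sum(x) - k)^2  over x in {0,1}^n."""
    n = len(mu)
    Q = lam * Sigma.astype(float).copy()
    Q += rho * np.ones((n, n))                     # off-diagonal pairs pick up 2*rho, diagonal rho
    Q[np.diag_indices(n)] += -mu - 2.0 * rho * k   # x_i^2 = x_i folds linear terms into the diagonal
    return Q                                       # constant rho * k^2 omitted (does not affect argmin)

mu    = np.array([0.10, 0.07, 0.12, 0.05])         # expected returns (illustrative)
Sigma = np.diag([0.04, 0.02, 0.09, 0.01])          # covariance (illustrative)
Q = portfolio_qubo(mu, Sigma, k=2)

# Classical fallback: brute-force all bitstrings; an annealer would sample low-energy states instead.
bits = [np.array([(i >> j) & 1 for j in range(4)]) for i in range(16)]
best = min(bits, key=lambda x: x @ Q @ x)
print(best)                                        # exactly two assets selected
```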

[184] Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning

Vishal Srivastava

Main category: cs.AI

TL;DR: Theoretical analysis showing black-box safety evaluation fails for models with latent context-conditioned policies where rare triggers in evaluation become prevalent in deployment, establishing fundamental statistical and computational limits.

DetailsMotivation: To challenge the assumption that model behavior on test distributions reliably predicts deployment performance, particularly for models with unobserved internal variables (latent context-conditioned policies) that are rare during evaluation but prevalent in deployment.

Method: Formal theoretical analysis using statistical methods: (1) Le Cam’s method for minimax lower bounds on passive evaluation, (2) Yao’s minimax principle and hash-based trigger construction for adaptive evaluation limits, (3) trapdoor one-way function assumptions for computational separation, and (4) white-box probing with bias correction under probe error.

Result: Proves fundamental limits: passive evaluation incurs ≥0.208δL error; adaptive evaluation has ≥δL/16 worst-case error; computational separation shows polynomial-time evaluators cannot detect unsafe behaviors; white-box probing requires O(1/(γ²ε_R²)) samples with explicit bias correction.

Conclusion: Black-box testing is statistically underdetermined for latent context-conditioned policies, requiring additional safeguards like architectural constraints, training-time guarantees, interpretability, and deployment monitoring for worst-case safety assurance.

Abstract: Black-box safety evaluation of AI systems assumes model behavior on test distributions reliably predicts deployment performance. We formalize and challenge this assumption through latent context-conditioned policies – models whose outputs depend on unobserved internal variables that are rare under evaluation but prevalent under deployment. We establish fundamental limits showing that no black-box evaluator can reliably estimate deployment risk for such models. (1) Passive evaluation: For evaluators sampling i.i.d. from D_eval, we prove minimax lower bounds via Le Cam’s method: any estimator incurs expected absolute error ≥ (5/24)δL ≈ 0.208δL, where δ is the trigger probability under deployment and L is the loss gap. (2) Adaptive evaluation: Using a hash-based trigger construction and Yao’s minimax principle, worst-case error remains ≥ δL/16 even for fully adaptive querying when D_dep is supported over a sufficiently large domain; detection requires Θ(1/ε) queries. (3) Computational separation: Under trapdoor one-way function assumptions, deployment environments possessing privileged information can activate unsafe behaviors that any polynomial-time evaluator without the trapdoor cannot distinguish. For white-box probing, estimating deployment risk to accuracy ε_R requires O(1/(γ²ε_R²)) samples, where γ = α₀ + α₁ − 1 measures probe quality, and we provide explicit bias correction under probe error. Our results quantify when black-box testing is statistically underdetermined and provide explicit criteria for when additional safeguards – architectural constraints, training-time guarantees, interpretability, and deployment monitoring – are mathematically necessary for worst-case safety assurance.
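
Editor’s note: the bounds above are easy to instantiate numerically; a quick worked example under an assumed rare-trigger regime (δ = 1%, loss gap L = 1).

```python
# Plugging the paper's lower bounds into a concrete rare-trigger regime.
delta, L = 0.01, 1.0                      # deployment trigger probability; loss gap

passive_bound  = (5 / 24) * delta * L     # ≈ 0.208·δL: unavoidable error for i.i.d. evaluation
adaptive_bound = delta * L / 16           # worst-case error even with fully adaptive querying

print(f"passive evaluation error  >= {passive_bound:.5f}")
print(f"adaptive evaluation error >= {adaptive_bound:.5f}")
```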

[185] Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

Main category: cs.AI

TL;DR: Conv-FinRe is a conversational benchmark for stock recommendation that evaluates LLMs on decision quality rather than just behavior imitation, using multi-view references to distinguish rational analysis from user noise.

DetailsMotivation: Existing recommendation benchmarks focus on behavior imitation, but in financial advisory, user actions can be noisy or short-sighted under market volatility. Treating user choices as ground truth conflates behavioral imitation with decision quality, which is problematic for evaluating financial advisory systems.

Method: Introduces Conv-FinRe benchmark built from real market data and human decision trajectories. It includes onboarding interviews, step-wise market context, and advisory dialogues. Models must generate stock rankings over fixed investment horizons. Provides multi-view references distinguishing descriptive behavior from normative utility based on investor-specific risk preferences.

Result: Evaluation of state-of-the-art LLMs reveals persistent tension between rational decision quality and behavioral alignment. Models performing well on utility-based ranking often fail to match user choices, while behaviorally aligned models can overfit short-term noise.

Conclusion: Conv-FinRe enables diagnosis of whether LLMs follow rational analysis, mimic user noise, or are driven by market momentum. The benchmark addresses limitations of behavior-only evaluation in financial recommendation systems.

Abstract: Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user’s long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
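
Editor’s note: a sketch of the multi-view comparison the benchmark enables, assuming Kendall’s tau as the rank-agreement measure (the abstract does not name the exact metrics); tickers and rankings are made up.

```python
from scipy.stats import kendalltau

# 1 = most recommended. Hypothetical model output and two reference views.
model_ranking = {"AAPL": 1, "MSFT": 2, "NVDA": 3, "TSLA": 4}
behavior_ref  = {"TSLA": 1, "NVDA": 2, "AAPL": 3, "MSFT": 4}   # descriptive: what the user chose
utility_ref   = {"MSFT": 1, "AAPL": 2, "NVDA": 3, "TSLA": 4}   # normative: risk-adjusted utility

tickers = sorted(model_ranking)
m = [model_ranking[t] for t in tickers]
tau_b, _ = kendalltau(m, [behavior_ref[t] for t in tickers])
tau_u, _ = kendalltau(m, [utility_ref[t] for t in tickers])
print(f"behavioral alignment: {tau_b:+.2f}, utility alignment: {tau_u:+.2f}")
```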

[186] Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

Main category: cs.AI

TL;DR: Sonar-TS is a neuro-symbolic framework for natural language querying of time series databases that uses a Search-Then-Verify pipeline with SQL for candidate window retrieval and Python programs for verification against raw signals.

DetailsMotivation: Existing Text-to-SQL methods fail to handle continuous morphological intents (like shapes or anomalies) in time series data, while time series models struggle with ultra-long histories. There's a need for effective natural language querying of massive temporal records.

Method: Proposes Sonar-TS framework with Search-Then-Verify pipeline: 1) Uses feature index to ping candidate windows via SQL queries, 2) Generates Python programs to lock on and verify candidates against raw signals. Also introduces NLQTSBench benchmark for evaluation.

Result: Sonar-TS effectively navigates complex temporal queries where traditional methods fail. The work presents the first systematic study of NLQ4TSDB and establishes a general framework with evaluation standard.

Conclusion: Sonar-TS addresses the limitations of existing methods for natural language querying of time series databases, providing a neuro-symbolic approach that combines SQL-based search with programmatic verification for complex temporal queries.

Abstract: Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
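
Editor’s note: a toy end-to-end illustration of the Search-Then-Verify idea using sqlite3 and a hand-written verifier. The schema, window features, and “spike” intent are invented, and in Sonar-TS the verification program would be LLM-generated rather than fixed.

```python
import sqlite3

# Stage 1 (search): a feature index over fixed windows lets SQL "ping" coarse candidates.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE win_feats (series TEXT, t0 INT, t1 INT, mean REAL, slope REAL, maxv REAL)")
con.executemany("INSERT INTO win_feats VALUES (?,?,?,?,?,?)", [
    ("sensor_7", 0,   100, 1.0,  0.00, 1.2),
    ("sensor_7", 100, 200, 1.1,  0.09, 3.8),   # steep rise + high peak: spike candidate
    ("sensor_7", 200, 300, 1.0, -0.01, 1.3),
])
candidates = con.execute(
    "SELECT series, t0, t1 FROM win_feats WHERE slope > 0.05 AND maxv > 3").fetchall()

# Stage 2 (verify): a generated program locks on by checking the raw signal itself.
def verify_spike(raw, threshold=3.0):
    return max(raw) > threshold and raw.index(max(raw)) not in (0, len(raw) - 1)

raw_signal = {("sensor_7", 100, 200): [1.0, 1.5, 3.9, 2.0, 1.2]}
hits = [c for c in candidates if verify_spike(raw_signal[c])]
print(hits)
```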

[187] Cinder: A fast and fair matchmaking system

Saurav Pal

Main category: cs.AI

TL;DR: Cinder is a two-stage matchmaking system for multiplayer games that uses Ruzicka similarity for initial filtering and Kantorovich distance on skill buckets for precise fairness evaluation.

DetailsMotivation: Traditional matchmaking systems using simple average team skill metrics often result in unbalanced games, especially with heterogeneous skill levels in pre-made teams (lobbies). This leads to poor player experience and retention.

Method: Two-stage approach: 1) Rapid preliminary filter using Ruzicka similarity index on “non-outlier” skill ranges; 2) Precise fairness evaluation by mapping player ranks to non-linear skill buckets (inverted normal distribution) and calculating Kantorovich distance on sorted bucket indices to produce a “Sanction Score.”

Result: System viability demonstrated by analyzing Sanction Score distribution from 140 million simulated lobby pairings, providing foundation for fair matchmaking thresholds.

Conclusion: Cinder provides a robust framework for fair and fast matchmaking that addresses limitations of traditional average-based approaches, particularly for heterogeneous skill distributions.

Abstract: A fair and fast matchmaking system is an important component of modern multiplayer online games, directly impacting player retention and satisfaction. However, creating fair matches between lobbies (pre-made teams) of heterogeneous skill levels presents a significant challenge. Matching based simply on average team skill metrics, such as mean or median rating or rank, often results in unbalanced and one-sided games, particularly when skill distributions are wide or skewed. This paper introduces Cinder, a two-stage matchmaking system designed to provide fast and fair matches. Cinder first employs a rapid preliminary filter by comparing the “non-outlier” skill range of lobbies using the Ruzicka similarity index. Lobbies that pass this initial check are then evaluated using a more precise fairness metric. This second stage involves mapping player ranks to a non-linear set of skill buckets, generated from an inverted normal distribution, to provide higher granularity at average skill levels. The fairness of a potential match is then quantified using the Kantorovich distance on the lobbies’ sorted bucket indices, producing a “Sanction Score.” We demonstrate the system’s viability by analyzing the distribution of Sanction Scores from 140 million simulated lobby pairings, providing a robust foundation for fair matchmaking thresholds.
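
Editor’s note: a hedged sketch of the two stages, assuming Ruzicka similarity over bucket histograms for the coarse gate and SciPy’s 1-D Wasserstein (Kantorovich) distance for the Sanction Score. The bucket edges and gate threshold are invented, and the paper compares “non-outlier” skill ranges rather than full histograms.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def ruzicka(a, b):
    """Weighted Jaccard: sum of element-wise minima over sum of element-wise maxima."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def skill_bucket(rank, edges):
    return int(np.searchsorted(edges, rank))

# Assumed bucket edges: denser near the middle of the rank scale (inverted-normal spacing).
edges = [10, 25, 40, 48, 52, 60, 75, 90]
lobby_a = sorted(skill_bucket(r, edges) for r in [45, 50, 55, 62])
lobby_b = sorted(skill_bucket(r, edges) for r in [30, 49, 58, 80])

# Stage 1: cheap overlap check on bucket histograms.
hist_a = np.bincount(lobby_a, minlength=len(edges) + 1)
hist_b = np.bincount(lobby_b, minlength=len(edges) + 1)
if ruzicka(hist_a, hist_b) > 0.3:                        # assumed gate threshold
    # Stage 2: Sanction Score via 1-D Kantorovich distance on sorted bucket indices.
    print(f"sanction score: {wasserstein_distance(lobby_a, lobby_b):.2f}")
```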

[188] M2F: Automated Formalization of Mathematical Literature at Scale

Zichen Wang, Wanli Ma, Zhenyu Ming, Gong Zhang, Kun Yuan, Zaiwen Wen

Main category: cs.AI

TL;DR: M2F is an agentic framework for end-to-end, project-scale autoformalization of mathematics into Lean, achieving textbook-scale formalization with high proof success rates.

DetailsMotivation: Current automated formalization of mathematics is limited to isolated theorems and short snippets, lacking the ability to scale to entire textbooks and research papers which require managing cross-file dependencies, resolving imports, and ensuring end-to-end compilation.

Method: Two-stage framework: 1) Statement compilation stage splits documents into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles (allowing placeholders in proofs). 2) Proof repair stage closes proof holes under fixed signatures using goal-conditioned local edits. Both stages keep the verifier in the loop, committing edits only when toolchain feedback confirms improvement.

Result: Successfully converted long-form mathematical sources into a project-scale Lean library of 153,853 lines from 479 pages of textbooks on real analysis and convex analysis, achieving 96% proof success on FATE-H benchmark (vs. 80% for strong baseline).

Conclusion: Practical, large-scale automated formalization of mathematical literature is within reach, demonstrating textbook-scale formalization at a pace that would typically require months or years of expert effort.

Abstract: Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross-file dependencies, resolving imports, and ensuring that entire projects compile end-to-end. We present M2F (Math-to-Formal), the first agentic framework for end-to-end, project-scale autoformalization in Lean. The framework operates in two stages. The statement compilation stage splits the document into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles, allowing placeholders in proofs. The proof repair stage closes these holes under fixed signatures using goal-conditioned local edits. Throughout both stages, M2F keeps the verifier in the loop, committing edits only when toolchain feedback confirms improvement. In approximately three weeks, M2F converts long-form mathematical sources into a project-scale Lean library of 153,853 lines from 479 pages of textbooks on real analysis and convex analysis, fully formalized as Lean declarations with accompanying proofs. This represents textbook-scale formalization at a pace that would typically require months or years of expert effort. On FATE-H, we achieve 96% proof success (vs. 80% for a strong baseline). Together, these results demonstrate that practical, large-scale automated formalization of mathematical literature is within reach. The full generated Lean code from our runs is available at https://github.com/optsuite/ReasBook.git.
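
Editor’s note: a tiny Lean 4 illustration of the two-stage contract described above, not taken from the paper: the statement compilation stage may leave `sorry` placeholders so the project still elaborates, and the proof repair stage later closes the hole under the same signature. The lemma and its Mathlib proof term are illustrative.

```lean
import Mathlib

-- Stage 1 (statement compilation): the declaration elaborates with its proof deferred.
theorem sq_nonneg' (a : ℤ) : 0 ≤ a * a := by
  sorry

-- Stage 2 (proof repair): the hole is closed under the fixed signature.
theorem sq_nonneg'' (a : ℤ) : 0 ≤ a * a :=
  mul_self_nonneg a
```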

[189] Sales Research Agent and Sales Research Bench

Deepanjan Bhol

Main category: cs.AI

TL;DR: Sales Research Agent for CRM data analysis with quality benchmarking system

DetailsMotivation: Enterprises need AI systems that can answer sales questions over live CRM data with transparent, repeatable evidence of quality, which current models lack.

Method: Developed Sales Research Agent that connects to live CRM data, reasons over complex schemas, and produces insights through text and charts. Created Sales Research Bench benchmark with eight customer-weighted dimensions for quality assessment.

Result: In a 200-question test on customized enterprise schema, Sales Research Agent outperformed Claude Sonnet 4.5 by 13 points and ChatGPT-5 by 24.1 points on the 100-point composite score.

Conclusion: The system provides enterprises with a transparent, repeatable way to compare AI solutions for CRM data analysis with observable quality metrics.

Abstract: Enterprises increasingly need AI systems that can answer sales-leader questions over live, customized CRM data, but most available models do not expose transparent, repeatable evidence of quality. This paper describes the Sales Research Agent in Microsoft Dynamics 365 Sales, an AI-first application that connects to live CRM and related data, reasons over complex schemas, and produces decision-ready insights through text and chart outputs. To make quality observable, we introduce the Sales Research Bench, a purpose-built benchmark that scores systems on eight customer-weighted dimensions, including text and chart groundedness, relevance, explainability, schema accuracy, and chart quality. In a 200-question run on a customized enterprise schema on October 19, 2025, the Sales Research Agent outperformed Claude Sonnet 4.5 by 13 points and ChatGPT-5 by 24.1 points on the 100-point composite score, giving customers a repeatable way to compare AI solutions.

[190] Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

Main category: cs.AI

TL;DR: PA-MoE introduces phase-aware mixture of experts for RL agents to address simplicity bias by grouping temporally consistent task phases to specialized experts.

DetailsMotivation: Existing RL methods for LLM agents use single policy networks causing simplicity bias where simple tasks dominate parameters, leaving insufficient capacity for complex tasks. Traditional MoE's token-level routing fragments phase-consistent patterns, undermining expert specialization.

Method: Proposes Phase-Aware Mixture of Experts (PA-MoE) with lightweight phase router that learns latent phase boundaries directly from RL objective without pre-defined categories, allocating temporally consistent assignments to same expert to preserve phase-specific expertise.

Result: Experimental results demonstrate effectiveness of the proposed PA-MoE approach.

Conclusion: PA-MoE successfully addresses simplicity bias in RL agents by enabling phase-aware expert specialization through temporal consistency in expert assignments.

Abstract: Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
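
Editor’s note: a minimal PyTorch sketch of the routing idea, assuming a linear phase router with hard per-timestep assignments; the actual PA-MoE objective, straight-through details, and temporal-consistency mechanism are not specified in the abstract.

```python
import torch
import torch.nn as nn

class PAMoE(nn.Module):
    """Phase-routed mixture: every timestep assigned to a phase goes to that phase's expert,
    so temporally contiguous steps with the same predicted phase share parameters."""
    def __init__(self, d_model: int, n_phases: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_phases)       # lightweight phase router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_phases))

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, time, d_model)
        phase = self.router(h).argmax(dim=-1)            # hard phase id per timestep
        out = torch.zeros_like(h)
        for p, expert in enumerate(self.experts):
            mask = (phase == p).unsqueeze(-1)            # route whole phase segments together
            out = out + mask * expert(h)
        return out

layer = PAMoE(d_model=32, n_phases=3)
print(layer(torch.randn(2, 16, 32)).shape)               # torch.Size([2, 16, 32])
```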

[191] Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs

Uria Franko

Main category: cs.AI

TL;DR: ITR (Instruction-Tool Retrieval) cuts per-step context tokens by 95% and improves tool routing by 32% by dynamically retrieving only the system instructions and tools each step actually needs.

DetailsMotivation: LLM agents suffer from high costs, latency, and errors due to repeatedly processing long system instructions and large tool catalogs at each step, which increases derailment probability and tool-selection errors.

Method: Proposes Instruction-Tool Retrieval (ITR), a RAG variant that retrieves minimal system-prompt fragments and smallest necessary tool subsets per step, composing dynamic runtime system prompts with narrowed toolsets and confidence-gated fallbacks.

Result: ITR reduces per-step context tokens by 95%, improves correct tool routing by 32% relative, cuts end-to-end episode cost by 70%, and enables agents to run 2-20x more loops within context limits compared to monolithic baseline.

Conclusion: ITR is particularly valuable for long-running autonomous agents as savings compound with agent steps, offering practical deployment guidance for cost-effective LLM agent operations.

Abstract: Large Language Model (LLM) agents often run for many steps while re-ingesting long system instructions and large tool catalogs each turn. This increases cost, agent derailment probability, latency, and tool-selection errors. We propose Instruction-Tool Retrieval (ITR), a RAG variant that retrieves, per step, only the minimal system-prompt fragments and the smallest necessary subset of tools. ITR composes a dynamic runtime system prompt and exposes a narrowed toolset with confidence-gated fallbacks. Using a controlled benchmark with internally consistent numbers, ITR reduces per-step context tokens by 95%, improves correct tool routing by 32% relative, and cuts end-to-end episode cost by 70% versus a monolithic baseline. These savings enable agents to run 2-20x more loops within context limits. Savings compound with the number of agent steps, making ITR particularly valuable for long-running autonomous agents. We detail the method, evaluation protocol, ablations, and operational guidance for practical deployment.
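
Editor’s note: a dependency-free sketch of per-step instruction/tool retrieval. Real ITR presumably uses learned embeddings plus confidence-gated fallbacks; the token-overlap scorer, fragment texts, and tool registry here are all stand-ins.

```python
def overlap(a: str, b: str) -> float:
    """Crude lexical relevance: Jaccard overlap of lowercased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

FRAGMENTS = {
    "refunds": "When handling refunds, verify the order id and refund policy window.",
    "tone":    "Always answer politely and concisely.",
    "search":  "Prefer internal search before asking the user for clarification.",
}
TOOLS = {
    "issue_refund": "issue a refund for a given order id",
    "lookup_order": "look up an order by id",
    "send_email":   "send an email to a customer",
}

def compose_step_prompt(step_goal: str, k_frags: int = 1, k_tools: int = 2):
    frags = sorted(FRAGMENTS, key=lambda f: -overlap(FRAGMENTS[f], step_goal))[:k_frags]
    tools = sorted(TOOLS, key=lambda t: -overlap(TOOLS[t], step_goal))[:k_tools]
    system = "\n".join(FRAGMENTS[f] for f in frags)   # dynamic, minimal system prompt
    return system, tools                              # narrowed toolset for this step

print(compose_step_prompt("refund the order 1234 for the customer"))
```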

[192] IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim

Main category: cs.AI

TL;DR: IntentCUA: A multi-agent framework for computer-use automation that stabilizes long-horizon execution through intent-aligned plan memory and reusable skills

DetailsMotivation: Existing computer-use agents struggle with long-horizon tasks under noisy perception and evolving environments, leading to error accumulation and inefficiency due to drifting from user intent and repeatedly solving routine subproblems

Method: Multi-agent framework with Planner, Plan-Optimizer, and Critic coordinating over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. Intent prototypes retrieve subgroup-aligned skills and inject them into partial plans

Result: Achieved 74.83% task success rate with Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Multi-view intent abstraction and shared plan memory jointly improve execution stability

Conclusion: System-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments

Abstract: Computer-use agents operate over long horizons under noisy perception, multi-window contexts, evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.

[193] RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Yunseok Han, Yejoon Lee, Jaeyoung Do

Main category: cs.AI

TL;DR: RFEval: A framework and benchmark for evaluating reasoning faithfulness in Large Reasoning Models, finding 49.7% unfaithfulness despite accuracy.

DetailsMotivation: Large Reasoning Models often produce plausible-sounding rationales that don't reflect their true decision process, undermining reliability and trust. Current evaluation focuses on accuracy but doesn't assess whether reasoning is faithful to the model's actual process.

Method: Introduces formal framework with two testable conditions: stance consistency (coherent stance linking reasoning to answer) and causal influence (reasoning causally drives answer under output-level interventions). Presents RFEval benchmark with 7,186 instances across 7 tasks using controlled counterfactual interventions to probe faithfulness.

Result: Evaluated 12 open-source LRMs, finding unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures concentrated in brittle domains like math and code. RL-style post-training reduces faithfulness even when accuracy maintained. Accuracy is neither sufficient nor reliable proxy for faithfulness.

Conclusion: Establishes rigorous methodology for auditing LRM reliability. Trustworthy AI requires optimizing not only for correct outcomes but also for structural integrity of reasoning process. Provides code and dataset for further research.

Abstract: Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at the project page: https://aidaslab.github.io/RFEval/

[194] Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo, Sunwoo Lee, Seungyul Han

Main category: cs.AI

TL;DR: S2Q is a multi-agent RL method that learns multiple sub-value functions to retain alternative high-value actions, improving adaptability when value functions shift during training.

DetailsMotivation: Existing value decomposition methods in cooperative MARL rely on single optimal actions and struggle to adapt when underlying value functions shift during training, often converging to suboptimal policies.

Method: Proposes Successive Sub-value Q-learning (S2Q) which learns multiple sub-value functions to retain alternative high-value actions, and incorporates these into a Softmax-based behavior policy for persistent exploration.

Result: Experiments on challenging MARL benchmarks show S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance.

Conclusion: S2Q addresses the limitation of single optimal action approaches in MARL by learning multiple sub-value functions, enabling better adaptation to changing optima and improved performance.

Abstract: Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables Q^tot to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
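
Editor’s note: the abstract does not give the exact combination rule, so this is a hedged numpy sketch of one plausible reading: keep several sub-value heads, take each action’s best sub-value, and sample from a softmax so runner-up actions retain probability mass.

```python
import numpy as np

def s2q_behavior_policy(sub_q: np.ndarray, beta: float = 2.0) -> np.ndarray:
    """sub_q: (n_subvalues, n_actions). Alternative high-value actions survive because
    any sub-value head that still rates them highly keeps them warm in the softmax."""
    q = sub_q.max(axis=0)                    # best estimate per action across sub-values
    z = np.exp(beta * (q - q.max()))         # numerically stable softmax
    return z / z.sum()

sub_q = np.array([[1.0, 0.9, 0.2],           # head 1 prefers action 0
                  [0.3, 1.1, 0.2]])          # head 2 retains action 1 as a near-optimum
print(s2q_behavior_policy(sub_q).round(3))
```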

[195] Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization

Sumedh Rasal

Main category: cs.AI

TL;DR: PBS accelerates language model training by using a lightweight predictor to prioritize high-loss samples during batch construction based on token-level features, achieving 6-13% faster convergence.

DetailsMotivation: Current curriculum learning approaches require predefined difficulty metrics or expensive per-sample loss tracking. The authors aim to develop a more efficient method that can dynamically identify difficult samples during training without significant computational overhead.

Method: PBS uses a lightweight linear predictor trained online to estimate sample difficulty from four static token-level features: token frequency, sequence length, vocabulary diversity, and rare token ratio. The predictor dynamically prioritizes high-loss samples during batch construction.

Result: The predictor achieves 0.44 correlation with actual loss using only four simple features. Experiments on a 130M parameter transformer show 6-13% faster convergence measured by evaluation loss across training checkpoints, with predictor correlation improving from 0.14 to 0.44 over 10,000 training steps.

Conclusion: Token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead through the PBS approach.

Abstract: We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard example mining methods that demand expensive per-sample loss tracking, PBS employs a lightweight linear predictor trained online to estimate sample difficulty from static token-level features. Our predictor achieves 0.44 correlation with actual loss using only four simple features: token frequency, sequence length, vocabulary diversity, and rare token ratio. Experiments on a 130M parameter transformer demonstrate that PBS achieves 6-13% faster convergence measured by evaluation loss across training checkpoints, with the predictor’s correlation improving from 0.14 to 0.44 over 10,000 training steps. These results validate that token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead.
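
Editor’s note: a self-contained sketch of the PBS loop under stated assumptions: the four features match those named in the abstract, but the exact feature definitions, normalization, and predictor update rule are guesses (plain SGD on squared error here), and the corpus is a toy.

```python
import numpy as np

def pbs_features(tokens, freq, rare_cutoff=50):
    """Four static token-level features: frequency, length, diversity, rare-token ratio."""
    n = len(tokens)
    return np.array([
        np.mean([np.log1p(freq.get(t, 0)) for t in tokens]),    # token frequency (log)
        n / 512.0,                                              # sequence length (scaled)
        len(set(tokens)) / n,                                   # vocabulary diversity
        sum(freq.get(t, 0) < rare_cutoff for t in tokens) / n,  # rare token ratio
    ])

w, lr = np.zeros(4), 1e-2                     # lightweight online linear predictor

def predicted_loss(x):
    return float(w @ x)

def update_predictor(x, observed_loss):
    w[:] = w + lr * (observed_loss - w @ x) * x   # one SGD step on squared error

def build_batch(pool, freq, batch_size):
    """Prioritize samples the predictor expects to be hardest (highest loss)."""
    return sorted(pool, key=lambda s: -predicted_loss(pbs_features(s, freq)))[:batch_size]

freq = {"the": 9000, "of": 7000, "quark": 3}
pool = [["the", "of", "the"], ["quark", "quark", "of"], ["the", "the", "the"]]
update_predictor(pbs_features(pool[1], freq), observed_loss=2.5)  # observed: rare tokens were hard
print(build_batch(pool, freq, batch_size=2))
```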

[196] How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses

Kan Watanabe, Rikuto Tsuchida, Takahiro Monno, Bin Huang, Kazuma Yamasaki, Youmei Fan, Kazumasa Shimari, Kenichi Matsumoto

Main category: cs.AI

TL;DR: AI coding agents create distinct pull request descriptions that affect human reviewer engagement, response timing, and merge outcomes in software development.

DetailsMotivation: With the rise of AI coding agents autonomously creating GitHub pull requests, there's limited understanding of how these agents differ in their PR description characteristics and how human reviewers respond to them, creating a gap in understanding human-AI collaborative software development dynamics.

Method: Empirical analysis of pull requests created by five AI coding agents using the AIDev dataset, examining agent differences in PR description characteristics (including structural features) and human reviewer response (review activity, response timing, sentiment, and merge outcomes).

Result: AI coding agents exhibit distinct PR description styles associated with differences in reviewer engagement, response time, and merge outcomes. Notable variation exists across agents in both reviewer interaction metrics and merge rates.

Conclusion: Pull request presentation and reviewer interaction dynamics play a significant role in human-AI collaborative software development, highlighting the importance of how AI agents communicate their changes to human reviewers.

Abstract: The rapid adoption of large language models has led to the emergence of AI coding agents that autonomously create pull requests on GitHub. However, how these agents differ in their pull request description characteristics, and how human reviewers respond to them, remains underexplored. In this study, we conduct an empirical analysis of pull requests created by five AI coding agents using the AIDev dataset. We analyze agent differences in pull request description characteristics, including structural features, and examine human reviewer response in terms of review activity, response timing, sentiment, and merge outcomes. We find that AI coding agents exhibit distinct PR description styles, which are associated with differences in reviewer engagement, response time, and merge outcomes. We observe notable variation across agents in both reviewer interaction metrics and merge rates. These findings highlight the role of pull request presentation and reviewer interaction dynamics in human-AI collaborative software development.

[197] Agentic Wireless Communication for 6G: Intent-Aware and Continuously Evolving Physical-Layer Intelligence

Zhaoyang Li, Xingzhi Jin, Junyu Pan, Qianqian Yang, Zhiguo Shi

Main category: cs.AI

TL;DR: Survey paper exploring intent-driven autonomous intelligence for 6G using LLM-based agents, focusing on multimodal perception and cross-layer decision making for physical layer communications.

DetailsMotivation: 6G wireless systems face growing complexity and diverse service demands requiring shift from rule-based control to intent-driven autonomous intelligence, where user requirements are multi-dimensional and dynamic.

Method: Investigates agentic AI for 6G physical layer through closed-loop pipeline of intent perception, autonomous decision making, and network execution; reviews physical-layer tasks, identifies application scenarios, discusses multimodal perception and cross-layer decision making technologies.

Result: Presents AgenCom case study - an intent-driven link decision agent that adaptively constructs communication links under diverse user preferences and channel conditions.

Conclusion: LLM-based agents with strong contextual understanding and cross-modal reasoning provide promising foundation for intent-aware network agents in 6G systems.

Abstract: As 6G wireless systems evolve, growing functional complexity and diverse service demands are driving a shift from rule-based control to intent-driven autonomous intelligence. User requirements are no longer captured by a single metric (e.g., throughput or reliability), but by multi-dimensional objectives such as latency sensitivity, energy preference, computational constraints, and service-level requirements. These objectives may also change over time due to environmental dynamics and user-network interactions. Therefore, accurate understanding of both the communication environment and user intent is critical for autonomous and sustainably evolving 6G communications. Large language models (LLMs), with strong contextual understanding and cross-modal reasoning, provide a promising foundation for intent-aware network agents. Compared with rule-driven or centrally optimized designs, LLM-based agents can integrate heterogeneous information and translate natural-language intents into executable control and configuration decisions. Focusing on a closed-loop pipeline of intent perception, autonomous decision making, and network execution, this paper investigates agentic AI for the 6G physical layer and its realization pathways. We review representative physical-layer tasks and their limitations in supporting intent awareness and autonomy, identify application scenarios where agentic AI is advantageous, and discuss key challenges and enabling technologies in multimodal perception, cross-layer decision making, and sustainable optimization. Finally, we present a case study of an intent-driven link decision agent, termed AgenCom, which adaptively constructs communication links under diverse user preferences and channel conditions.

[198] Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction

Xiaoran Cai, Wang Yang, Xiyu Ren, Chekun Law, Rohit Sharma, Peng Qi

Main category: cs.AI

TL;DR: Proposes a human-AI collaboration framework (STRIDE + SR-Delta) to create benchmark datasets for evaluating and harmonizing inconsistent sustainability/ESG ratings across agencies.

DetailsMotivation: Sustainability ratings from different agencies for the same company vary widely, limiting comparability, credibility, and decision-making relevance. There's a need to harmonize these inconsistent ratings.

Method: Two-part framework: 1) STRIDE - provides principled criteria and scoring system to guide construction of firm-level benchmark datasets using LLMs; 2) SR-Delta - discrepancy-analysis procedural framework that surfaces insights for potential adjustments.

Result: Framework enables scalable and comparable assessment of sustainability rating methodologies. Calls for AI community adoption to strengthen sustainability rating methodologies.

Conclusion: Proposes universal human-AI collaboration approach to generate trustworthy benchmark datasets for evaluating sustainability ratings, addressing current inconsistencies and supporting urgent sustainability agendas.

Abstract: Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single company vary widely, limiting their comparability, credibility, and relevance to decision-making. To harmonize the rating results, we propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating & Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a discrepancy-analysis procedural framework that surfaces insights for potential adjustments. The framework enables scalable and comparable assessment of sustainability rating methodologies. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.

[199] Owen-based Semantics and Hierarchy-Aware Explanation (O-Shap)

Xiangyu Zhou, Chenhan Xiao, Yang Weng

Main category: cs.AI

TL;DR: O-Shap: A hierarchical Shapley value approach using Owen values with semantic segmentation that satisfies the T-property for improved feature attribution in vision tasks.

DetailsMotivation: Standard Shapley value methods assume feature independence, which breaks down in vision tasks where pixels have spatial and semantic dependencies. Existing segmentation methods for hierarchical Owen values violate consistency properties, leading to poor attribution quality.

Method: Proposes O-Shap using Owen values (hierarchical Shapley generalization) with a new segmentation approach that satisfies the T-property for semantic alignment across hierarchy levels. This enables computational pruning while maintaining consistency.

Result: Experiments on image and tabular datasets show O-Shap outperforms baseline SHAP variants in attribution precision, semantic coherence, and runtime efficiency, especially when structure matters.

Conclusion: O-Shap provides a theoretically grounded, efficient hierarchical attribution method that addresses feature dependencies in vision tasks through semantically aligned segmentation satisfying the T-property.

Abstract: Shapley value-based methods have become foundational in explainable artificial intelligence (XAI), offering theoretically grounded feature attributions through cooperative game theory. However, in practice, particularly in vision tasks, the assumption of feature independence breaks down, as features (i.e., pixels) often exhibit strong spatial and semantic dependencies. To address this, modern SHAP implementations now include the Owen value, a hierarchical generalization of the Shapley value that supports group attributions. While the Owen value preserves the foundations of Shapley values, its effectiveness critically depends on how feature groups are defined. We show that commonly used segmentations (e.g., axis-aligned or SLIC) violate key consistency properties, and propose a new segmentation approach that satisfies the T-property to ensure semantic alignment across hierarchy levels. This hierarchy enables computational pruning while improving attribution accuracy and interpretability. Experiments on image and tabular datasets demonstrate that O-Shap outperforms baseline SHAP variants in attribution precision, semantic coherence, and runtime efficiency, especially when structure matters.
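
Editor’s note: a Monte Carlo sketch of Owen values for a toy game, following the standard definition (marginal contributions averaged over permutations that keep each group’s features contiguous). O-Shap’s T-property segmentation and its pruning scheme are not reproduced here.

```python
import numpy as np

def owen_values(value, groups, n_samples=2000, rng=None):
    """Monte Carlo Owen values: average marginal contribution of each feature over
    permutations in which features of the same group stay contiguous.
    value: set -> float (characteristic function); groups: list of lists of feature ids."""
    rng = rng or np.random.default_rng(0)
    n = sum(len(g) for g in groups)
    phi = np.zeros(n)
    for _ in range(n_samples):
        order = []
        for g in rng.permutation(len(groups)):   # shuffle groups...
            members = list(groups[g])
            rng.shuffle(members)                 # ...and members within each group
            order.extend(members)
        coalition, prev = set(), value(set())
        for i in order:
            coalition.add(i)
            cur = value(coalition)
            phi[i] += cur - prev
            prev = cur
    return phi / n_samples

# Toy game: value = sum of member weights plus a synergy when group {0, 1} is complete.
w = np.array([1.0, 2.0, 0.5, 0.5])
def v(S): return w[list(S)].sum() + (1.0 if {0, 1} <= S else 0.0)
print(owen_values(v, [[0, 1], [2, 3]]))          # roughly [1.5, 2.5, 0.5, 0.5]
```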

[200] Instructor-Aligned Knowledge Graphs for Personalized Learning

Abdulrahman AlRabah, Priyanka Kargupta, Jiawei Han, Abdussalam Alawini

Main category: cs.AI

TL;DR: InstructKG automatically constructs instructor-aligned knowledge graphs from lecture materials to capture learning progressions and conceptual dependencies for personalized education.

DetailsMotivation: Educational concept mastery requires understanding prerequisites and sub-concepts, but large-scale courses make individual diagnosis infeasible. Existing knowledge graph approaches are either too surface-level or ignore pedagogical signals in instructional materials.

Method: Proposes InstructKG framework that extracts significant concepts as nodes and infers learning dependencies as directed edges from lecture materials (slides, notes). Combines temporal and semantic signals from educational materials with large language models.

Result: Experiments on real-world diverse lecture materials across multiple courses with human-based evaluation demonstrate that InstructKG captures rich, instructor-aligned learning progressions.

Conclusion: InstructKG effectively constructs knowledge graphs that capture intended learning progressions from educational materials, enabling better identification of knowledge gaps and personalized interventions.

Abstract: Mastering educational concepts requires understanding both their prerequisites (e.g., recursion before merge sort) and sub-concepts (e.g., merge sort as part of sorting algorithms). Capturing these dependencies is critical for identifying students’ knowledge gaps and enabling targeted intervention for personalized learning. This is especially challenging in large-scale courses, where instructors cannot feasibly diagnose individual misunderstanding or determine which concepts need reinforcement. While knowledge graphs offer a natural representation for capturing these conceptual relationships at scale, existing approaches are either surface-level (focusing on course-level concepts like “Algorithms” or logistical relationships such as course enrollment), or disregard the rich pedagogical signals embedded in instructional materials. We propose InstructKG, a framework for automatically constructing instructor-aligned knowledge graphs that capture a course’s intended learning progression. Given a course’s lecture materials (slides, notes, etc.), InstructKG extracts significant concepts as nodes and infers learning dependencies as directed edges (e.g., “part-of” or “depends-on” relationships). The framework synergizes the rich temporal and semantic signals unique to educational materials (e.g., “recursion” is taught before “mergesort”; “recursion” is mentioned in the definition of “merge sort”) with the generalizability of large language models. Through experiments on real-world, diverse lecture materials across multiple courses and human-based evaluation, we demonstrate that InstructKG captures rich, instructor-aligned learning progressions.

[201] Epistemology of Generative AI: The Geometry of Knowing

Ilya Levin

Main category: cs.AI

TL;DR: The paper develops an “Indexical Epistemology of High-Dimensional Spaces” to understand generative AI’s epistemic character, arguing neural networks transform symbolic input into positions in geometric meaning spaces, creating a new mode of knowledge production distinct from symbolic reasoning and statistical recombination.

DetailsMotivation: Generative AI operates through obscure epistemic mechanisms, lacking the engineering understanding needed for responsible integration into science and education. The paper aims to provide a philosophical framework to understand how neural networks process information differently from traditional computing paradigms.

Method: Analyzes the paradigmatic break from Turing-Shannon-von Neumann tradition to neural networks, drawing on four structural properties of high-dimensional geometry (concentration of measure, near-orthogonality, exponential directional capacity, manifold regularity). Builds on Peirce semiotics and Papert constructionism to develop an Indexical Epistemology of High-Dimensional Spaces.

Result: Proposes that generative models function as navigators of learned manifolds in high-dimensional semantic spaces, and introduces “navigational knowledge” as a third mode of knowledge production distinct from symbolic reasoning and statistical recombination.

Conclusion: Understanding generative AI requires recognizing it as operating in geometric spaces of meanings rather than traditional symbolic processing, with navigational knowledge emerging as a new epistemic mode that must be properly theorized for responsible AI integration.

Abstract: Generative AI presents an unprecedented challenge to our understanding of knowledge and its production. Unlike previous technological transformations, where engineering understanding preceded or accompanied deployment, generative AI operates through mechanisms whose epistemic character remains obscure, and without such understanding, its responsible integration into science, education, and institutional life cannot proceed on a principled basis. This paper argues that the missing account must begin with a paradigmatic break that has not yet received adequate philosophical attention. In the Turing-Shannon-von Neumann tradition, information enters the machine as encoded binary vectors, and semantics remains external to the process. Neural network architectures rupture this regime: symbolic input is instantly projected into a high-dimensional space where coordinates correspond to semantic parameters, transforming binary code into a position in a geometric space of meanings. It is this space that constitutes the active epistemic condition shaping generative production. Drawing on four structural properties of high-dimensional geometry (concentration of measure, near-orthogonality, exponential directional capacity, and manifold regularity), the paper develops an Indexical Epistemology of High-Dimensional Spaces. Building on Peirce semiotics and Papert constructionism, it reconceptualizes generative models as navigators of learned manifolds and proposes navigational knowledge as a third mode of knowledge production, distinct from both symbolic reasoning and statistical recombination.

[202] Efficient Parallel Algorithm for Decomposing Hard CircuitSAT Instances

Victor Kondratiev, Irina Gribanova, Alexander Semenov

Main category: cs.AI

TL;DR: A parallel algorithm for decomposing hard CircuitSAT instances using specialized constraints to partition SAT problems into weakened formulas, with parameter tuning for efficient decomposition guided by parallel hardness estimations.

DetailsMotivation: CircuitSAT problems are computationally hard, especially for applications like Logical Equivalence Checking of Boolean circuits and cryptographic hash function attacks. There's a need for efficient parallel algorithms to decompose these challenging instances into more manageable subproblems.

Method: The method uses specialized constraints to partition original SAT instances into a family of weakened formulas. It’s implemented as a parameterized parallel algorithm where parameters can be adjusted to efficiently identify high-quality decompositions. The approach includes parallel computation of hardness estimations to guide the decomposition process.

Result: The algorithm demonstrates practical efficacy on challenging CircuitSAT instances, including those encoding Logical Equivalence Checking of Boolean circuits and preimage attacks on cryptographic hash functions.

Conclusion: The proposed parallel decomposition algorithm provides an effective approach for tackling hard CircuitSAT problems, with applications in circuit verification and cryptography.

Abstract: We propose a novel parallel algorithm for decomposing hard CircuitSAT instances. The technique employs specialized constraints to partition an original SAT instance into a family of weakened formulas. Our approach is implemented as a parameterized parallel algorithm, where adjusting the parameters allows efficient identification of high-quality decompositions, guided by hardness estimations computed in parallel. We demonstrate the algorithm’s practical efficacy on challenging CircuitSAT instances, including those encoding Logical Equivalence Checking of Boolean circuits and preimage attacks on cryptographic hash functions.

[203] Bonsai: A Framework for Convolutional Neural Network Acceleration Using Criterion-Based Pruning

Joseph Bingham, Sam Helmich

Main category: cs.AI

TL;DR: Combine is a criterion-based pruning framework for CNNs that provides a standardized approach to compare different pruning criteria, demonstrates varying effects of criteria on different models, and introduces novel criteria functions achieving up to 79% filter pruning with maintained or improved accuracy.

DetailsMotivation: As CNNs grow larger and more computationally expensive, pruning techniques have emerged to reduce model size and computational requirements, but existing solutions lack standardized implementations and comparison frameworks, making them difficult to evaluate and deploy consistently.

Method: Introduces Combine, a criterion-based pruning framework that provides a standardized language for comparing pruning criteria functions, implements iterative pruning, and proposes novel criteria functions for filter removal in CNN architectures.
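
As a concrete illustration of criterion-based filter pruning (the paper's own novel criteria functions are not reproduced here), a minimal sketch using the classic L1-norm criterion in PyTorch:

```python
import torch
import torch.nn as nn

def l1_criterion(conv: nn.Conv2d) -> torch.Tensor:
    # One score per output filter: sum of absolute kernel weights.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def filters_to_prune(conv: nn.Conv2d, fraction: float) -> torch.Tensor:
    scores = l1_criterion(conv)
    k = int(fraction * scores.numel())
    return torch.argsort(scores)[:k]  # indices of the lowest-scoring filters

conv = nn.Conv2d(64, 128, kernel_size=3)
print(filters_to_prune(conv, fraction=0.79)[:10])
```

Swapping l1_criterion for another scoring function is exactly the kind of comparison a standardized criterion interface enables.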

Result: The framework achieves up to 79% filter pruning on VGG-inspired models while retaining or improving accuracy, and reduces computational requirements by up to 68%, demonstrating that different pruning criteria have varying effects on different model architectures.

Conclusion: Combine provides an effective, standardized framework for CNN pruning that enables systematic comparison of pruning criteria, reveals architecture-dependent effects of different criteria, and achieves significant model compression with maintained performance.

Abstract: As the need for more accurate and powerful Convolutional Neural Networks (CNNs) increases, so too does the size, execution time, memory footprint, and power consumption. To overcome this, solutions such as pruning have been proposed with their own metrics and methodologies, or criteria, for how weights should be removed. These solutions do not share a common implementation and are difficult to implement and compare. In this work, we introduce Combine, a criterion-based pruning solution, and demonstrate that it is a fast and effective framework for iterative pruning, demonstrate that criteria have differing effects on different models, create a standard language for comparing criterion functions, and propose a few novel criterion functions. We show the capacity of these criterion functions and the framework on VGG-inspired models, pruning up to 79% of filters while retaining or improving accuracy, and reducing the computations needed by the network by up to 68%.

[204] JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

Ariel Larey, Elay Dahan, Amit Bleiweiss, Raizy Kellerman, Guy Leib, Omri Nayshool, Dan Ofer, Tal Zinger, Dan Dominissini, Gideon Rechavi, Nicole Bussola, Simon Lee, Shane O’Connell, Dung Hoang, Marissa Wirth, Alexander W. Charney, Nati Daniel, Yoli Shavit

Main category: cs.AI

TL;DR: JEPA-DNA is a novel genomic foundation model framework that combines Joint-Embedding Predictive Architecture with traditional generative objectives to capture both local genomic syntax and global functional context.

DetailsMotivation: Current genomic foundation models using MLM or NTP focus too much on local patterns and individual nucleotides, failing to capture broader functional context and global biological perspective needed for understanding genomic function.

Method: Integrates JEPA with generative objectives, introducing latent grounding by coupling token-level recovery with predictive objectives in latent space using CLS token supervision to predict high-level functional embeddings of masked genomic segments.
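
A hedged sketch of how such a combined objective could look: token-level recovery plus a latent prediction loss on a CLS embedding. The loss weighting and the source of the target embedding are illustrative assumptions, not the paper's exact recipe:

```python
import torch.nn.functional as F

def jepa_dna_loss(logits, token_targets, cls_pred, target_embed, alpha=1.0):
    """
    logits:        (B, L, V) MLM logits
    token_targets: (B, L) masked-token labels, -100 at unmasked positions
    cls_pred:      (B, D) CLS-token prediction of the masked segment's embedding
    target_embed:  (B, D) high-level embedding of the masked segment
    """
    mlm = F.cross_entropy(logits.flatten(0, 1), token_targets.flatten(),
                          ignore_index=-100)              # local genomic syntax
    latent = F.mse_loss(cls_pred, target_embed.detach())  # latent grounding
    return mlm + alpha * latent
```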

Result: JEPA-DNA consistently outperforms generative-only baselines across diverse genomic benchmarks in both supervised and zero-shot tasks, providing more robust and biologically grounded representations.

Conclusion: JEPA-DNA offers a scalable path toward foundation models that understand both genomic syntax and underlying functional logic, bridging the gap between local patterns and global biological context.

Abstract: Genomic Foundation Models (GFMs) have largely relied on Masked Language Modeling (MLM) or Next Token Prediction (NTP) to learn the language of life. While these paradigms excel at capturing local genomic syntax and fine-grained motif patterns, they often fail to capture the broader functional context, resulting in representations that lack a global biological perspective. We introduce JEPA-DNA, a novel pre-training framework that integrates the Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. JEPA-DNA introduces latent grounding by coupling token-level recovery with a predictive objective in the latent space by supervising a CLS token. This forces the model to predict the high-level functional embeddings of masked genomic segments rather than focusing solely on individual nucleotides. JEPA-DNA extends both NTP and MLM paradigms and can be deployed either as a standalone from-scratch objective or as a continual pre-training enhancement for existing GFMs. Our evaluations across a diverse suite of genomic benchmarks demonstrate that JEPA-DNA consistently yields superior performance in supervised and zero-shot tasks compared to generative-only baselines. By providing a more robust and biologically grounded representation, JEPA-DNA offers a scalable path toward foundation models that understand not only the genomic alphabet, but also the underlying functional logic of the sequence.

[205] Texo: Formula Recognition within 20M Parameters

Sicheng Mao

Main category: cs.AI

TL;DR: Texo is a lightweight formula recognition model with only 20M parameters that achieves comparable performance to SOTA models while enabling real-time inference on consumer hardware and browser deployment.

DetailsMotivation: The motivation is to create a minimalist yet high-performance formula recognition model that can run efficiently on consumer-grade hardware and in browsers, addressing the computational limitations of existing large models while maintaining competitive accuracy.

Method: Texo uses attentive design, distillation techniques, and transfer of vocabulary and tokenizer to create a compact 20M parameter model. The approach focuses on architectural efficiency while preserving recognition capabilities.
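
"Distillation" here plausibly means the standard logit-matching recipe; a minimal sketch under that assumption (temperature and weighting are illustrative):

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    s = F.log_softmax(student_logits / T, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / T, dim=-1).flatten(0, 1)
    kd = F.kl_div(s, t, reduction="batchmean") * (T * T)  # soft-target match
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(),
                         ignore_index=-100)               # hard-target match
    return lam * kd + (1 - lam) * ce
```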

Result: Texo achieves comparable performance to state-of-the-art models like UniMERNet-T and PPFormulaNet-S while reducing model size by 80% and 65% respectively, enabling real-time inference on consumer hardware and browser deployment.

Conclusion: Texo demonstrates that efficient formula recognition is achievable with minimalist architectures, making advanced mathematical OCR accessible for real-time applications and browser-based deployment.

Abstract: In this paper we present Texo, a minimalist yet high-performance formula recognition model that contains only 20 million parameters. By attentive design, distillation and transfer of the vocabulary and the tokenizer, Texo achieves comparable performance to state-of-the-art models such as UniMERNet-T and PPFormulaNet-S, while reducing the model size by 80% and 65%, respectively. This enables real-time inference on consumer-grade hardware and even in-browser deployment. We also developed a web application to demonstrate the model's capabilities and facilitate its usage for end users.

[206] From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan’s Humanities and Social Sciences

Yi-Chih Huang

Main category: cs.AI

TL;DR: Proposes an AI Agent-based collaborative workflow for humanities/social sciences research, validated using Taiwan’s Claude.ai usage data from Anthropic Economic Index.

DetailsMotivation: Current generative AI research focuses on software engineering and natural sciences, lacking methodological exploration for humanities and social sciences. There's a need for structured AI collaboration frameworks in these fields.

Method: Designs a seven-stage modular workflow with three principles: task modularization, human-AI division of labor, and verifiability. Uses Taiwan’s Claude.ai usage data (N=7,729 conversations) from Anthropic Economic Index as empirical validation.

Result: Proposes replicable AI collaboration framework and identifies three human-AI collaboration modes: direct execution, iterative refinement, and human-led. Demonstrates workflow application to secondary data research.

Conclusion: Human judgment remains irreplaceable in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Framework provides structured approach for AI collaboration in humanities/social sciences.

Abstract: Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a “methodological experiment,” this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan’s Claude.ai usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework - a seven-stage modular workflow grounded in three principles: task modularization, human-AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data - serving as an operational demonstration of the workflow’s application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single-platform data, cross-sectional design, and AI reliability risks are acknowledged.

[207] Continual learning and refinement of causal models through dynamic predicate invention

Enrique Crespo-Fernandez, Oliver Ray, Telmo de Menezes e Silva Filho, Peter Flach

Main category: cs.AI

TL;DR: Symbolic causal world modeling framework using Meta-Interpretive Learning for sample-efficient, transparent, and scalable agent learning in complex environments.

DetailsMotivation: Standard world modeling methods struggle with sample inefficiency, lack of transparency, and poor scalability in complex environments. There's a need for agents to internalize the underlying logic of their world through more efficient and interpretable approaches.

Method: Integrates continuous model learning and repair into the agent’s decision loop using Meta-Interpretive Learning and predicate invention. Constructs symbolic causal world models entirely online, finding semantically meaningful and reusable abstractions to build a hierarchy of disentangled, high-quality concepts from observations.

Result: The lifted inference approach scales to domains with complex relational dynamics where propositional methods suffer from combinatorial explosion. Achieves sample-efficiency orders of magnitude higher than established PPO neural-network-based baseline.

Conclusion: The framework enables agents to construct transparent, scalable, and sample-efficient symbolic causal world models online, addressing key limitations of traditional world modeling approaches for complex environments.

Abstract: Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for constructing symbolic causal world models entirely online by integrating continuous model learning and repair into the agent’s decision loop, leveraging the power of Meta-Interpretive Learning and predicate invention to find semantically meaningful and reusable abstractions and allowing an agent to construct a hierarchy of disentangled, high-quality concepts from its observations. We demonstrate that our lifted inference approach scales to domains with complex relational dynamics, where propositional methods suffer from combinatorial explosion, while achieving sample-efficiency orders of magnitude higher than the established PPO neural-network-based baseline.

[208] Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom’s Taxonomy

Bianca Raimondi, Maurizio Gabbrielli

Main category: cs.AI

TL;DR: LLMs encode cognitive complexity levels (Bloom’s Taxonomy) in linearly separable neural representations, with ~95% classification accuracy showing early resolution of cognitive difficulty.

DetailsMotivation: To understand how LLMs internally represent cognitive complexity beyond surface-level performance metrics, using Bloom's Taxonomy as a hierarchical framework to probe neural representations.

Method: Analyzed high-dimensional activation vectors from different LLMs, using linear classifiers to test separability of cognitive levels (Remember to Create) in residual streams across layers.
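
The probing recipe itself is standard; a minimal sketch assuming activations have already been extracted (the arrays below are synthetic placeholders for real residual-stream vectors and Bloom labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 4096))  # placeholder: one activation row per prompt
y = rng.integers(0, 6, size=600)      # placeholder: 0=Remember ... 5=Create

probe = LogisticRegression(max_iter=2000)
acc = cross_val_score(probe, X, y, cv=5).mean()
print(f"linear separability (cv accuracy): {acc:.3f}")
```

Run per layer, the same probe traces how separability grows across the forward pass.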

Result: Linear classifiers achieved ~95% mean accuracy across all Bloom levels, showing cognitive level is encoded in linearly accessible subspaces, with representations becoming increasingly separable across layers.

Conclusion: LLMs resolve cognitive difficulty early in processing, with cognitive levels linearly separable in neural representations, providing insights into internal cognitive processing.

Abstract: The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom’s Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from different LLMs, we probe whether different cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the model’s residual streams. Our results demonstrate that linear classifiers achieve approximately 95% mean accuracy across all Bloom levels, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the model’s representations. These findings provide evidence that the model resolves the cognitive difficulty of a prompt early in the forward pass, with representations becoming increasingly separable across layers.

[209] Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight

Ben Yellin, Ehud Ezra, Mark Foreman, Shula Grinapol

Main category: cs.AI

TL;DR: LBM is a behavioral foundation model that fine-tunes LLMs to predict individual strategic choices by conditioning on structured psychometric trait profiles rather than using transient persona prompting.

DetailsMotivation: LLMs struggle with consistent, individual-specific behavior prediction in high-stakes environments, especially when predictions depend on complex interactions between psychological traits and situational constraints. Prompting approaches are brittle and suffer from identity drift.

Method: LBM shifts from persona prompting to behavioral embedding by conditioning on structured, high-dimensional trait profiles derived from comprehensive psychometric batteries. It’s trained on a proprietary dataset linking stable dispositions, motivational states, and situational constraints to observed choices.
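
A hedged sketch of what trait-profile conditioning could look like as fine-tuning data; the field names, formatting tags, and scenario are invented for illustration (the paper's dataset is proprietary):

```python
def build_example(traits: dict, scenario: str, observed_choice: str) -> dict:
    profile = "; ".join(f"{k}={v:.2f}" for k, v in sorted(traits.items()))
    return {
        "prompt": f"[TRAITS] {profile}\n[SCENARIO] {scenario}\n[CHOICE]",
        "completion": f" {observed_choice}",
    }

ex = build_example({"openness": 0.71, "agreeableness": 0.32},
                   "Ultimatum game; offered 20% of the pot.", "reject")
print(ex["prompt"])
```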

Result: LBM fine-tuning improves behavioral prediction relative to unadapted Llama-3.1-8B-Instruct and performs comparably to frontier baselines when conditioned on Big Five traits. Unlike prompting baselines that hit complexity ceilings, LBM continues to benefit from increasingly dense trait profiles.

Conclusion: LBM establishes a scalable approach for high-fidelity behavioral simulation, enabling applications in strategic foresight, negotiation analysis, cognitive security, and decision support.

Abstract: Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends on complex interactions between psychological traits and situational constraints. Prompting-based approaches can be brittle in this setting, exhibiting identity drift and limited ability to leverage increasingly detailed persona descriptions. To address these limitations, we introduce the Large Behavioral Model (LBM), a behavioral foundation model fine-tuned to predict individual strategic choices with high fidelity. LBM shifts from transient persona prompting to behavioral embedding by conditioning on a structured, high-dimensional trait profile derived from a comprehensive psychometric battery. Trained on a proprietary dataset linking stable dispositions, motivational states, and situational constraints to observed choices, LBM learns to map rich psychological profiles to discrete actions across diverse strategic dilemmas. In a held-out scenario evaluation, LBM fine-tuning improves behavioral prediction relative to the unadapted Llama-3.1-8B-Instruct backbone and performs comparably to frontier baselines when conditioned on Big Five traits. Moreover, we find that while prompting-based baselines exhibit a complexity ceiling, LBM continues to benefit from increasingly dense trait profiles, with performance improving as additional trait dimensions are provided. Together, these results establish LBM as a scalable approach for high-fidelity behavioral simulation, enabling applications in strategic foresight, negotiation analysis, cognitive security, and decision support.

[210] ArXiv-to-Model: A Practical Study of Scientific LM Training

Anuj Gupta

Main category: cs.AI

TL;DR: Training a 1.36B-parameter scientific language model from raw arXiv LaTeX sources with detailed engineering insights for domain-specialized model development under constrained compute.

DetailsMotivation: There's a lack of practical documentation for training domain-specialized scientific language models from raw sources, despite frontier LLMs showing strong reasoning capabilities. Researchers need guidance for building specialized models under moderate compute budgets.

Method: End-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training on 2xA100 GPUs. Analyzed 24 experimental runs to study training stability, scaling behavior, data yield losses, and infrastructure bottlenecks.
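
As one illustrative step from such a pipeline, a sketch of LaTeX normalization before tokenization; real pipelines need far more (macro expansion, environment filtering, encoding repair):

```python
import re

def normalize_latex(src: str) -> str:
    src = re.sub(r"(?<!\\)%.*", "", src)                          # drop comments, keep \%
    src = re.sub(r"\\(begin|end)\{(figure|table)\*?\}", "", src)  # strip float markers
    return re.sub(r"\s+", " ", src).strip()                       # collapse whitespace

print(normalize_latex("Energy is $E=mc^2$. % a comment\n\\begin{figure}"))
```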

Result: Preprocessing decisions significantly affect usable token volume, tokenization impacts symbolic stability, and storage/I/O constraints can rival compute as limiting factors. Showed stable training behavior with 52B pretraining tokens in data-rich regime.

Conclusion: Provides engineering-grounded, transparent account of training small scientific language models from scratch, offering practical insights for researchers building domain-specialized models under constrained compute.

Abstract: While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.

[211] All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting

Zeyu Zhang, Ryan Chen, Bradly C. Stadie

Main category: cs.AI

TL;DR: A framework for detecting and quantifying temporal knowledge leakage in LLMs when predicting past events, with a method to filter contaminated information through claim verification.

DetailsMotivation: LLMs may inadvertently use post-cutoff knowledge from training when predicting past events, undermining the validity of retrospective evaluation (backtesting). Current methods lack ways to detect and measure this temporal knowledge leakage.

Method: Introduces Shapley-DCLR metric that decomposes model rationales into atomic claims, categorizes them by temporal verifiability, and uses Shapley values to measure each claim’s contribution. Also proposes TimeSPEC method that interleaves generation with claim verification and regeneration to filter temporal contamination.
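
With a handful of atomic claims, the Shapley attribution can be computed exactly; a minimal sketch where v(S) is a stub value function returning the prediction confidence when only claims in S are shown (the paper's actual v is not reproduced here):

```python
from itertools import combinations
from math import factorial

def shapley(claims, v):
    n = len(claims)
    phi = {c: 0.0 for c in claims}
    for c in claims:
        rest = [x for x in claims if x != c]
        for k in range(n):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += w * (v(set(S) | {c}) - v(set(S)))  # marginal contribution
    return phi

v = lambda S: 0.9 if "post-cutoff fact" in S else 0.5  # stub value function
print(shapley(["post-cutoff fact", "pre-cutoff fact"], v))
```

The leakage rate would then be the share of total attribution carried by claims categorized as post-cutoff.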

Result: Experiments on 350 instances across Supreme Court cases, NBA salaries, and stock returns reveal substantial leakage in standard prompting. TimeSPEC reduces Shapley-DCLR while preserving task performance, outperforming prompt-based temporal constraints.

Conclusion: Explicit, interpretable claim-level verification is more effective than prompt-based temporal constraints for reliable backtesting of LLMs on historical events.

Abstract: To evaluate whether LLMs can accurately predict future events, we need the ability to \textit{backtest} them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this \emph{temporal knowledge leakage}. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies \textit{Shapley values} to measure each claim’s contribution to the prediction. This yields the \textbf{Shapley}-weighted \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate (\textbf{Shapley-DCLR}), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose \textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims (\textbf{TimeSPEC}), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination – producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.

[212] Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web

Linxi Jiang, Rui Xi, Zhijie Liu, Shuo Chen, Zhiqiang Lin, Suman Nath

Main category: cs.AI

TL;DR: Web Verbs proposes a semantic layer for web actions using typed, documented functions that expose site capabilities through uniform interfaces, enabling LLMs to synthesize reliable and auditable workflows.

DetailsMotivation: Current web agents operate on low-level primitives like clicks and keystrokes which are brittle, inefficient, and difficult to verify. There's a need for a semantic layer for web actions to enable reliable, efficient, and verifiable agentic web interactions.

Method: Proposes Web Verbs - a web-scale set of typed, semantically documented functions that expose site capabilities through uniform interfaces, whether implemented through APIs or robust client-side workflows. Verbs serve as stable, composable units with preconditions, postconditions, policy tags, and logging support.
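
A hedged sketch of what a typed verb with contracts could look like in code; the interface below is an illustration of the idea, not the paper's specification:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Verb:
    name: str
    doc: str
    precondition: Callable[[dict], bool]
    postcondition: Callable[[dict, dict], bool]
    run: Callable[[dict], dict]  # API call or scripted browser workflow
    policy_tags: list = field(default_factory=list)

def execute(verb: Verb, args: dict) -> dict:
    assert verb.precondition(args), f"{verb.name}: precondition failed"
    result = verb.run(args)
    assert verb.postcondition(args, result), f"{verb.name}: postcondition failed"
    return result                # typed result feeds a checkable trace
```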

Result: Proof-of-concept implementation and case studies demonstrate concise and robust execution compared to existing agents. The approach improves reliability through stable interfaces, efficiency by reducing dozens of steps to few function calls, and verifiability through typed contracts and checkable traces.

Conclusion: Web Verbs provides a semantic abstraction layer that unifies API-based and browser-based paradigms, enabling LLMs to synthesize reliable and auditable workflows with explicit control and data flow. A roadmap for standardization is outlined for web-scale deployment.

Abstract: The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical interface for goal-directed tasks, yet most current web agents operate on low-level primitives such as clicks and keystrokes. These operations are brittle, inefficient, and difficult to verify. Complementing content-oriented efforts such as NLWeb’s semantic layer for retrieval, we argue that the agentic web also requires a semantic layer for web actions. We propose \textbf{Web Verbs}, a web-scale set of typed, semantically documented functions that expose site capabilities through a uniform interface, whether implemented through APIs or robust client-side workflows. These verbs serve as stable and composable units that agents can discover, select, and synthesize into concise programs. This abstraction unifies API-based and browser-based paradigms, enabling LLMs to synthesize reliable and auditable workflows with explicit control and data flow. Verbs can carry preconditions, postconditions, policy tags, and logging support, which improves \textbf{reliability} by providing stable interfaces, \textbf{efficiency} by reducing dozens of steps into a few function calls, and \textbf{verifiability} through typed contracts and checkable traces. We present our vision, a proof-of-concept implementation, and representative case studies that demonstrate concise and robust execution compared to existing agents. Finally, we outline a roadmap for standardization to make verbs deployable and trustworthy at web scale.

[213] MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions

Hui Min Wong, Philip Heesen, Pascal Janetzky, Martin Bendszus, Stefan Feuerriegel

Main category: cs.AI

TL;DR: MedClarify is an AI agent that generates follow-up questions for medical diagnosis by computing differential diagnoses and selecting questions with highest expected information gain to reduce diagnostic uncertainty.

DetailsMotivation: Current medical LLMs struggle with iterative reasoning and generating informative follow-up questions needed for real-world clinical diagnosis, where uncertainty resolution through systematic questioning is essential.

Method: MedClarify computes candidate diagnoses (differential diagnosis), then proactively generates follow-up questions aimed at reducing diagnostic uncertainty by selecting questions with highest expected information gain.
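
The selection rule is classic expected information gain over the diagnosis posterior; a minimal sketch with stub probabilities:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_info_gain(prior, answer_probs, posteriors):
    """answer_probs[a] = P(answer a); posteriors[a] = P(diagnosis | answer a)."""
    h_after = sum(pa * entropy(post) for pa, post in zip(answer_probs, posteriors))
    return entropy(prior) - h_after

prior = [0.5, 0.3, 0.2]  # three candidate diagnoses
gain = expected_info_gain(prior,
                          answer_probs=[0.6, 0.4],
                          posteriors=[[0.8, 0.1, 0.1], [0.05, 0.6, 0.35]])
print(f"expected information gain: {gain:.3f} bits")  # ask the highest-gain question
```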

Result: The approach reduces diagnostic errors by ~27 percentage points compared to standard single-shot LLM baselines, demonstrating effective uncertainty-aware reasoning through targeted questioning.

Conclusion: MedClarify offers a path to improve medical LLMs through agentic information-seeking, enabling dialogues that reflect the iterative and uncertain nature of real-world clinical reasoning.

Abstract: Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presentation alone. Rather, reaching a diagnosis often involves systematic history taking, during which clinicians reason over multiple potential conditions through iterative questioning to resolve uncertainty. This process requires considering differential diagnoses and actively excluding emergencies that demand immediate intervention. Yet, the ability of medical LLMs to generate informative follow-up questions and thus reason over differential diagnoses remains underexplored. Here, we introduce MedClarify, an AI agent for information-seeking that can generate follow-up questions for iterative reasoning to support diagnostic decision-making. Specifically, MedClarify computes a list of candidate diagnoses analogous to a differential diagnosis, and then proactively generates follow-up questions aimed at reducing diagnostic uncertainty. By selecting the question with the highest expected information gain, MedClarify enables targeted, uncertainty-aware reasoning to improve diagnostic performance. In our experiments, we first demonstrate the limitations of current LLMs in medical reasoning, which often yield multiple, similarly likely diagnoses, especially when patient cases are incomplete or relevant information for diagnosis is missing. We then show that our information-theoretic reasoning approach can generate effective follow-up questioning and thereby reduces diagnostic errors by ~27 percentage points (p.p.) compared to a standard single-shot LLM baseline. Altogether, MedClarify offers a path to improve medical LLMs through agentic information-seeking and to thus promote effective dialogues with medical LLMs that reflect the iterative and uncertain nature of real-world clinical reasoning.

[214] Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature

Angelo Porrello, Pietro Buzzega, Felix Dangel, Thomas Sommariva, Riccardo Salami, Lorenzo Bonicelli, Simone Calderara

Main category: cs.AI

TL;DR: Task arithmetic method for adapting foundation models with dataless regularization to prevent cross-task interference and representation drift

DetailsMotivation: Task arithmetic enables modular adaptation of foundation models but suffers from cross-task interference causing representation drift and performance degradation. Existing regularization methods require external task data, conflicting with modularity and privacy constraints.

Method: Proposes a dataless approach framing regularization against representation drift as a curvature matrix approximation problem. Uses Kronecker-Factored Approximate Curvature (KFAC) to obtain a practical regularizer with constant complexity in number of tasks.
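
For one linear layer with the curvature approximated as a Kronecker product A ⊗ G of input and output factors, the drift penalty on a task-vector slice dW reduces to trace(G dW A dWᵀ); a minimal sketch, with the factors as placeholders for what the paper estimates without data:

```python
import torch

def kfac_penalty(delta_w: torch.Tensor, A: torch.Tensor, G: torch.Tensor):
    # delta_w: (out, in); A: (in, in) input factor; G: (out, out) output factor
    return torch.einsum("oi,ij,pj,po->", delta_w, A, delta_w, G)

dW = torch.randn(8, 16)
print(kfac_penalty(dW, torch.eye(16), torch.eye(8)))  # = ||dW||_F^2 for identity factors
print((dW ** 2).sum())
```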

Result: Achieves state-of-the-art results in task addition and negation. Method promotes robustness to task vector rescaling and eliminates need for held-out tuning.

Conclusion: Provides an effective dataless regularization method for task arithmetic that addresses representation drift without requiring external task data, maintaining modularity and privacy while improving performance.

Abstract: Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.

[215] Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval

Adrià Molina, Oriol Ramos Terrades, Josep Lladós

Main category: cs.AI

TL;DR: A framework integrating formal verification with deep learning for image retrieval, enabling trustworthy verification of complex natural language queries through graph-based reasoning and neural code generation.

DetailsMotivation: Current embedding-based retrieval models struggle with complex queries involving relationships, object compositions, and precise constraints (identities, counts, proportions). There's a need for more reliable, verifiable retrieval that goes beyond vector similarity approximations.

Method: Combines graph-based verification methods with neural code generation to integrate formal verification into deep learning-based image retrieval. The approach grounds retrieval results in formal reasoning, explicitly verifying each atomic truth in user queries against retrieved content.
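
As a toy instance of the verification side, a sketch of checking one atomic constraint (an object count) against a scene graph; the paper synthesizes such checks with neural code generation rather than writing them by hand:

```python
def check_count(scene_graph: dict, label: str, expected: int) -> bool:
    objects = [o for o in scene_graph["objects"] if o["label"] == label]
    return len(objects) == expected

scene = {"objects": [{"label": "person"}, {"label": "person"}, {"label": "dog"}]}
print(check_count(scene, "person", 2))  # True -> this atomic truth is verified
```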

Result: Enables open-vocabulary natural language queries with trustworthy, verifiable results. The framework identifies which specific constraints are satisfied and which remain unmet, offering transparent and accountable retrieval while boosting performance of embedding-based approaches.

Conclusion: The proposed framework moves beyond ambiguity and approximation in vector representations by incorporating formal verification, providing more reliable and transparent image retrieval for complex natural language queries.

Abstract: Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.

[216] Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar

Main category: cs.AI

TL;DR: Paper introduces reusability and verifiability metrics to evaluate Chain-of-Thought reasoning quality in multi-agent LLM systems, finding these metrics don’t correlate with standard accuracy and that specialized reasoning models don’t produce consistently better CoTs than general-purpose LLMs.

DetailsMotivation: Current evaluation of Chain-of-Thought reasoning in multi-agent LLM systems focuses only on target task accuracy, which fails to assess the quality or utility of the reasoning process itself. There's a need for better metrics to evaluate reasoning quality beyond just final answer correctness.

Method: Introduces two novel measures: reusability (how easily an Executor can reuse the Thinker’s CoT) and verifiability (how frequently an Executor can match the Thinker’s answer using the CoT). Uses a Thinker-Executor framework to decouple CoT generation from execution. Evaluates four Thinker models against a committee of ten Executor models across five benchmarks.
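
Verifiability in particular has a direct operational form; a minimal sketch assuming each executor is a callable that answers a question given the thinker's CoT (reusability would be scored analogously from how readily executors can reuse the CoT):

```python
def verifiability(executors, question, cot, thinker_answer):
    """Fraction of executors that reproduce the thinker's answer from its CoT."""
    matches = sum(ex(question, cot) == thinker_answer for ex in executors)
    return matches / len(executors)
```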

Result: Reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.

Conclusion: The paper demonstrates the importance of evaluating reasoning quality beyond accuracy metrics and shows that specialized reasoning models don’t necessarily produce better reasoning traces than general-purpose models, suggesting current CoT evaluation methods need refinement.

Abstract: In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker’s CoT. Verifiability measures how frequently an Executor can match the Thinker’s answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.

[217] A Contrastive Variational AutoEncoder for NSCLC Survival Prediction with Missing Modalities

Michele Zanitti, Vanja Miskovic, Francesco Trovò, Alessandra Laura Giulia Pedrocchi, Ming Shen, Yan Kyaw Tun, Arsela Prelaj, Sokol Kosta

Main category: cs.AI

TL;DR: MCVAE: Multimodal Contrastive Variational AutoEncoder for robust survival prediction in NSCLC using incomplete multimodal data (WSI, transcriptomics, methylation) with stochastic masking and learned gating.

DetailsMotivation: Real-world clinical datasets often have missing modalities, making survival prediction challenging. Current models lack robustness to severe missingness patterns in multimodal data integration.

Method: Proposes MCVAE with modality-specific variational encoders, fusion bottleneck with learned gating, multi-task objective (survival + reconstruction loss), cross-modal contrastive loss, and stochastic modality masking during training.
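
A hedged sketch of two of these ingredients, stochastic modality masking and gated fusion over present modalities; dimensions, the gate form, and the drop rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, z: torch.Tensor, present: torch.Tensor):
        # z: (B, M, D) modality embeddings; present: (B, M) 0/1 availability mask
        logits = self.gate(z).squeeze(-1).masked_fill(present == 0, -1e9)
        w = logits.softmax(dim=-1)          # weight mass only on present modalities
        return (w.unsqueeze(-1) * z).sum(dim=1)

def random_modality_mask(batch: int, m: int, p_drop: float = 0.3):
    keep = (torch.rand(batch, m) > p_drop).float()
    keep[keep.sum(dim=1) == 0, 0] = 1.0     # guarantee at least one modality
    return keep
```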

Result: Demonstrates efficacy on TCGA-LUAD (n=475) and TCGA-LUSC (n=446) datasets for disease-specific survival prediction, showing robustness to severe missingness compared to SOTA models.

Conclusion: MCVAE effectively handles missing modalities in multimodal clinical data, though integration is not always beneficial - some modality subsets may perform better than full integration.

Abstract: Predicting survival outcomes for non-small cell lung cancer (NSCLC) patients is challenging due to the different individual prognostic features. This task can benefit from the integration of whole-slide images, bulk transcriptomics, and DNA methylation, which offer complementary views of the patient’s condition at diagnosis. However, real-world clinical datasets are often incomplete, with entire modalities missing for a significant fraction of patients. State-of-the-art models rely on available data to create patient-level representations or use generative models to infer missing modalities, but they lack robustness in cases of severe missingness. We propose a Multimodal Contrastive Variational AutoEncoder (MCVAE) to address this issue: modality-specific variational encoders capture the uncertainty in each data source, and a fusion bottleneck with learned gating mechanisms is introduced to normalize the contributions from present modalities. We propose a multi-task objective that combines survival loss and reconstruction loss to regularize patient representations, along with a cross-modal contrastive loss that enforces cross-modal alignment in the latent space. During training, we apply stochastic modality masking to improve the robustness to arbitrary missingness patterns. Extensive evaluations on the TCGA-LUAD (n=475) and TCGA-LUSC (n=446) datasets demonstrate the efficacy of our approach in predicting disease-specific survival (DSS) and its robustness to severe missingness scenarios compared to two state-of-the-art models. Finally, we bring some clarifications on multimodal integration by testing our model on all subsets of modalities, finding that integration is not always beneficial to the task.

[218] KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

Main category: cs.AI

TL;DR: KLong is an open-source LLM agent trained for extremely long-horizon tasks using trajectory-splitting SFT and progressive RL training, achieving state-of-the-art performance on research paper analysis and coding benchmarks.

DetailsMotivation: The paper addresses the challenge of training LLM agents for extremely long-horizon tasks that require processing and reasoning over extended sequences, which is difficult with standard training methods due to context length limitations and training instability.

Method: 1) Cold-start via trajectory-splitting SFT that preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories; 2) Research-Factory pipeline for automated high-quality training data generation from research papers; 3) Progressive RL training with multiple stages of progressively extended timeouts.
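
A hedged sketch of the trajectory-splitting idea in step 1, keeping a fixed early-context prefix and windowing the remainder with overlap (window sizes are illustrative, and the paper's progressive truncation schedule is simplified away):

```python
def split_trajectory(steps, prefix_len=4, window=16, overlap=4):
    prefix = steps[:prefix_len]    # early context preserved in every piece
    rest = steps[prefix_len:]
    subs, start = [], 0
    while start < len(rest):
        subs.append(prefix + rest[start:start + window])
        if start + window >= len(rest):
            break
        start += window - overlap  # consecutive windows share `overlap` steps
    return subs

subs = split_trajectory(list(range(40)))
print(len(subs), subs[0][:6], subs[1][:6])
```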

Result: KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, with performance improvements generalizing to other coding benchmarks like SWE-bench Verified and MLE-bench.

Conclusion: The proposed trajectory-splitting SFT and progressive RL training enable effective training of LLM agents for extremely long-horizon tasks, with KLong demonstrating superior performance and generalization capabilities.

Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

[219] A Privacy by Design Framework for Large Language Model-Based Applications for Children

Diana Addae, Diana Rogachova, Nafiseh Kahani, Masoud Barati, Michael Christensen, Chen Zhou

Main category: cs.AI

TL;DR: A Privacy-by-Design framework for developing AI applications for children, integrating privacy regulations and design guidelines throughout the LLM lifecycle with a case study of an educational tutor.

DetailsMotivation: Growing concerns about privacy risks for children using AI technologies, coupled with challenges in implementing existing privacy regulations in practice, necessitate a proactive framework for designing child-safe AI applications.

Method: Proposes a Privacy-by-Design framework that maps privacy principles from GDPR, PIPEDA, and COPPA to LLM lifecycle stages (data collection, model training, operational monitoring, validation), incorporates design guidelines from UNCRC and AADC, and demonstrates application through a case study of an LLM-based educational tutor for children under 13.

Result: The framework provides operational controls and design guidelines that help AI developers reduce privacy risks while meeting legal standards, demonstrating through case study analysis that technical/organizational controls and age-appropriate design decisions can support development of privacy-compliant AI applications for children.

Conclusion: A Privacy-by-Design approach integrating regulatory principles and child-centered design guidelines throughout the LLM lifecycle can enable development of AI applications for children that provide adequate privacy protections and comply with legal requirements.

Abstract: Children are increasingly using technologies powered by Artificial Intelligence (AI). However, there are growing concerns about privacy risks, particularly for children. Although existing privacy regulations require companies and organizations to implement protections, doing so can be challenging in practice. To address this challenge, this article proposes a framework based on Privacy-by-Design (PbD), which guides designers and developers to take on a proactive and risk-averse approach to technology design. Our framework includes principles from several privacy regulations, such as the General Data Protection Regulation (GDPR) from the European Union, the Personal Information Protection and Electronic Documents Act (PIPEDA) from Canada, and the Children’s Online Privacy Protection Act (COPPA) from the United States. We map these principles to various stages of applications that use Large Language Models (LLMs), including data collection, model training, operational monitoring, and ongoing validation. For each stage, we discuss the operational controls found in the recent academic literature to help AI service providers and developers reduce privacy risks while meeting legal standards. In addition, the framework includes design guidelines for children, drawing from the United Nations Convention on the Rights of the Child (UNCRC), the UK’s Age-Appropriate Design Code (AADC), and recent academic research. To demonstrate how this framework can be applied in practice, we present a case study of an LLM-based educational tutor for children under 13. Through our analysis and the case study, we show that by using data protection strategies such as technical and organizational controls and making age-appropriate design decisions throughout the LLM life cycle, we can support the development of AI applications for children that provide privacy protections and comply with legal requirements.

[220] WarpRec: Unifying Academic Rigor and Industrial Scale for Responsible, Reproducible, and Efficient Recommendation

Marco Avolio, Potito Aghilar, Sabino Roccotelli, Vito Walter Anelli, Chiara Mallamaci, Vincenzo Paparella, Marco Valentini, Alejandro Bellogín, Michelantonio Trizio, Joseph Trotta, Antonio Ferrara, Tommaso Di Noia

Main category: cs.AI

TL;DR: WarpRec is a high-performance recommender system framework that bridges academia-industry gap with backend-agnostic architecture, supporting 50+ algorithms, 40 metrics, and sustainable computing while enabling transition to agentic AI.

DetailsMotivation: Current recommender system research faces a fractured ecosystem where researchers must choose between easy in-memory experimentation and costly distributed industrial engines, creating barriers between academia and industry.

Method: Develops WarpRec framework with novel backend-agnostic architecture that supports 50+ state-of-the-art algorithms, 40 metrics, and 19 filtering/splitting strategies, enabling seamless transition from local to distributed execution with integrated energy tracking via CodeCarbon.
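
The energy tracking piece relies on CodeCarbon's standard tracker; shown here in its standalone form for orientation, with the training call as a placeholder:

```python
from codecarbon import EmissionsTracker

def train_and_evaluate():
    pass  # placeholder for a recommender training/evaluation run

tracker = EmissionsTracker(project_name="warprec-experiment")
tracker.start()
train_and_evaluate()
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked block
print(f"{emissions_kg:.6f} kg CO2-eq")
```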

Result: Creates a unified framework that eliminates the trade-off between experimental ease and industrial scalability, demonstrates ecological responsibility through energy tracking, and anticipates the shift toward agentic AI in recommender systems.

Conclusion: WarpRec bridges academia-industry gap, serves as architectural backbone for next-generation sustainable recommender systems, and enables evolution toward agentic AI within the Generative AI ecosystem.

Abstract: Innovation in Recommender Systems is currently impeded by a fractured ecosystem, where researchers must choose between the ease of in-memory experimentation and the costly, complex rewriting required for distributed industrial engines. To bridge this gap, we present WarpRec, a high-performance framework that eliminates this trade-off through a novel, backend-agnostic architecture. It includes 50+ state-of-the-art algorithms, 40 metrics, and 19 filtering and splitting strategies that seamlessly transition from local execution to distributed training and optimization. The framework enforces ecological responsibility by integrating CodeCarbon for real-time energy tracking, showing that scalability need not come at the cost of scientific integrity or sustainability. Furthermore, WarpRec anticipates the shift toward Agentic AI, leading Recommender Systems to evolve from static ranking engines into interactive tools within the Generative AI ecosystem. In summary, WarpRec not only bridges the gap between academia and industry but also can serve as the architectural backbone for the next generation of sustainable, agent-ready Recommender Systems. Code is available at https://github.com/sisinflab/warprec/

[221] CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide

Main category: cs.AI

TL;DR: HIPE-2026 is a CLEF evaluation lab focused on extracting person-place relations from historical texts, extending previous campaigns to include semantic relation extraction with temporal reasoning.

DetailsMotivation: The motivation is to advance historical text processing by moving beyond named entity recognition to semantic relation extraction, specifically person-place associations, which is crucial for digital humanities applications like knowledge graph construction and historical biography reconstruction.

Method: The lab introduces a three-fold evaluation profile assessing: 1) accuracy of relation extraction, 2) computational efficiency, and 3) domain generalization across multiple languages and time periods. Systems must classify two relation types: “at” (historical presence) and “isAt” (location around publication time).

Result: The paper describes the design and objectives of the HIPE-2026 evaluation campaign, building on previous HIPE editions, with a focus on creating standardized benchmarks for person-place relation extraction from historical texts.

Conclusion: HIPE-2026 aims to advance historical text processing by establishing benchmarks for semantic relation extraction that support downstream digital humanities applications, with a focus on multilingual, temporal-aware person-place associations.

Abstract: HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person–place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ (“Has the person ever been at this place?”) and $isAt$ (“Is the person located at this place around publication time?”) - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.

[222] Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems

Pranay Jain, Maximilian Kasper, Göran Köber, Axel Plinge, Dominik Seuß

Main category: cs.AI

TL;DR: Benchmarking framework for optimizing AI models on ARM Cortex processors (M0+, M4, M7) focusing on energy efficiency, accuracy, and resource utilization in embedded systems

DetailsMotivation: There's a need for practical benchmarking to optimize AI models on resource-constrained embedded systems, balancing energy efficiency, accuracy, and computational demands for real-world applications.

Method: Automated test bench design for systematic evaluation across KPIs, using Pareto analysis to balance trade-offs between energy consumption and model accuracy, with correlation analysis between FLOPs and inference time
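
The Pareto step is conventional; a minimal sketch of extracting the non-dominated (energy, error) configurations, with made-up numbers:

```python
def pareto_front(points):
    """points: (energy, error_rate, label) tuples; both objectives minimized."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

runs = [(12.0, 0.08, "M7/model-A"), (5.0, 0.12, "M4/model-A"),
        (6.0, 0.15, "M4/model-B"), (2.0, 0.30, "M0+/model-C")]
print(pareto_front(runs))  # M4/model-B is dominated and drops out
```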

Result: Near-linear correlation between FLOPs and inference time; M7 ideal for short inference cycles, M4 better for energy efficiency in longer tasks, M0+ suitable for simpler tasks; framework enables optimal processor-model combinations

Conclusion: Provides practical guidance for developers to design energy-efficient AI systems on ARM Cortex processors, balancing performance requirements with sustainability considerations

Abstract: This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluate across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.

[223] Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation

Dun Yuan, Hao Zhou, Xue Liu, Hao Chen, Yan Xin, Jianzhong Zhang

Main category: cs.AI

TL;DR: KG-RAG integrates knowledge graphs with retrieval-augmented generation to enhance LLMs for telecom-specific tasks, improving accuracy and reducing hallucinations in domain-specific applications.

DetailsMotivation: General-domain LLMs struggle with telecom applications due to domain complexity, evolving standards, and specialized terminology, leading to hallucinations and reduced utility in telecom operations.

Method: KG-RAG framework combines knowledge graphs (structured representation of telecom domain knowledge) with retrieval-augmented generation (dynamic retrieval of relevant facts) to ground LLM outputs in telecom specifications.
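
As a rough illustration of the retrieve-and-ground idea (not the paper's implementation), the sketch below pulls triples from a toy telecom knowledge graph by lexical overlap and prepends them to the prompt. The triples, the query, and the retrieval scoring are all assumptions made for the example.

```python
# Minimal sketch: grounding an LLM prompt in facts from a toy telecom KG.
KG = [
    ("5G NR", "uses_waveform", "CP-OFDM"),
    ("5G NR", "max_bandwidth_fr1", "100 MHz"),
    ("CP-OFDM", "defined_in", "3GPP TS 38.211"),
]

def retrieve(query, kg, k=2):
    """Naive lexical retrieval: rank triples by word overlap with the query."""
    words = set(query.lower().split())
    return sorted(
        kg, key=lambda t: -len(words & set(" ".join(t).lower().split()))
    )[:k]

query = "Which waveform does 5G NR use?"
prompt = "Answer using only these facts:\n"
prompt += "\n".join(f"- {s} {p} {o}" for s, p, o in retrieve(query, KG))
prompt += f"\n\nQuestion: {query}"
print(prompt)   # this grounded prompt would then be sent to the LLM
```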

Result: KG-RAG outperforms both LLM-only and standard RAG baselines, achieving average accuracy improvements of 14.3% over RAG and 21.6% over LLM-only models across benchmark datasets.

Conclusion: KG-RAG effectively produces accurate, reliable, and explainable outputs in complex telecom scenarios by integrating structured domain knowledge with LLM capabilities.

Abstract: Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased hallucinations and reduced utility in telecom operations. To address these limitations, this work introduces KG-RAG, a novel framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG) to enhance LLMs for telecom-specific tasks. In particular, the KG provides a structured representation of domain knowledge derived from telecom standards and technical documents, while RAG enables dynamic retrieval of relevant facts to ground the model’s outputs. Such a combination improves factual accuracy, reduces hallucination, and ensures compliance with telecom specifications. Experimental results across benchmark datasets demonstrate that KG-RAG outperforms both LLM-only and standard RAG baselines, e.g., KG-RAG achieves an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models. These results highlight KG-RAG’s effectiveness in producing accurate, reliable, and explainable outputs in complex telecom scenarios.

[224] ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao

Main category: cs.AI

TL;DR: ODESteer: A unified ODE-based theoretical framework for activation steering in LLM alignment that uses barrier functions for multi-step adaptive steering, outperforming existing methods on alignment benchmarks.

DetailsMotivation: Current activation steering methods lack unified theoretical foundations and rely on one-step steering that fails to capture complex activation patterns, limiting their effectiveness in LLM alignment.

Method: Proposes an ODE-based framework where activation addition is interpreted as first-order ODE approximation. Uses barrier functions (log-density ratio between positive/negative activations) to construct ODEs for multi-step adaptive steering.
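
A toy one-dimensional version of the idea (our construction, not the paper's): fit Gaussians to positive and negative activations, take the barrier to be their log-density ratio, and compare one-shot activation addition against multi-step Euler integration of the resulting ODE.

```python
mu_pos, s_pos = 1.0, 1.0     # toy Gaussian fit to "positive" activations
mu_neg, s_neg = -1.0, 2.0    # toy Gaussian fit to "negative" activations

def grad_barrier(h):
    # d/dh [log N(h; mu_pos, s_pos^2) - log N(h; mu_neg, s_neg^2)]
    return -(h - mu_pos) / s_pos**2 + (h - mu_neg) / s_neg**2

h0 = 0.0
one_step = h0 + 1.0 * grad_barrier(h0)   # conventional one-shot activation addition

h = h0
for _ in range(10):                      # multi-step Euler integration of the ODE
    h += 0.1 * grad_barrier(h)
print(one_step, round(h, 3))             # the two trajectories end up in different places
```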

Result: ODESteer achieves consistent improvements over state-of-the-art steering methods: 5.7% on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.

Conclusion: Establishes principled theoretical foundations for activation steering via ODEs and validates effectiveness through ODESteer, advancing LLM alignment techniques.

Abstract: Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework for guiding the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering} that fails to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based \textit{theoretical} framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Derived from this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which shows \textit{empirical} advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks: a notable 5.7% improvement on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.

[225] A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leveraging Fusion of SWIN Transformer and CNN

Asif Hasan Chowdhury, Md. Fahim Islam, M Ragib Anjum Riad, Faiyaz Bin Hashem, Md Tanzim Reza, Md. Golam Rabiul Alam

Main category: cs.AI

TL;DR: Hybrid FL-enabled ensemble approach combining SWIN Transformer and CNNs for lung disease diagnosis from X-rays, using federated learning for secure distributed medical data processing.

DetailsMotivation: To create a secure, distributed AI system for medical diagnosis that leverages federated learning to protect patient data while improving disease detection accuracy, specifically for COVID-19 and pneumonia from X-ray images.

Method: Combines SWIN Transformer with CNN models (DenseNet201, Inception V3, VGG19) in a hybrid ensemble approach, implemented using TensorFlow/Keras with federated learning framework for distributed, secure training across medical institutions.
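
The federated side follows the familiar FedAvg pattern: each hospital trains locally and only model weights are shared, never patient images. A toy sketch under that assumption; the local update below is a random stand-in for actual training on private X-ray data.

```python
# Toy FedAvg-style aggregation sketch (not the paper's code).
import numpy as np

def local_update(weights, rng):
    # stand-in for one round of local training on a hospital's private data
    return weights - 0.1 * rng.normal(size=weights.shape)

global_w = np.zeros(4)
rng = np.random.default_rng(0)
for round_ in range(3):
    client_ws = [local_update(global_w, rng) for _ in range(5)]  # 5 hospitals
    global_w = np.mean(client_ws, axis=0)                        # federated averaging
    print(round_, global_w.round(3))
```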

Result: The paper proposes a system that enables accurate detection of COVID-19 and pneumonia from X-ray reports while maintaining data privacy through federated learning, though specific accuracy metrics are not provided in the abstract.

Conclusion: The hybrid FL-enabled ensemble approach provides a secure, distributed solution for medical diagnosis that can assist physicians while protecting patient data privacy through federated learning integration.

Abstract: The significant advancements in computational power create a vast opportunity for using Artificial Intelligence in different applications of healthcare and medical science. A Hybrid FL-Enabled Ensemble Approach For Lung Disease Diagnosis Leveraging a Combination of SWIN Transformer and CNN combines cutting-edge AI technology with Federated Learning. Since medical specialists and hospitals will have a shared data space, based on that data, with the help of Artificial Intelligence and the integration of federated learning, we can introduce a secure and distributed system for medical data processing and create an efficient and reliable system. The proposed hybrid model enables the detection of COVID-19 and Pneumonia based on x-ray reports. We will use advanced and the latest available technology offered by Tensorflow and Keras along with the Microsoft-developed Vision Transformer, which can help to fight against the pandemic that the world has to fight together, united. We focused on using the latest available CNN models (DenseNet201, Inception V3, VGG 19) and the Transformer model SWIN Transformer in order to prepare our hybrid model that can provide a reliable solution as a helping hand for the physician in the medical field. In this research, we will discuss how the Federated learning-based Hybrid AI model can improve the accuracy of disease diagnosis and severity prediction of a patient using the real-time continual learning approach and how the integration of federated learning can ensure hybrid model security and keep the authenticity of the information.

[226] AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum

Main category: cs.AI

TL;DR: AI GameStore: A platform using LLMs to generate human games from popular gaming platforms to evaluate vision-language models’ general intelligence through game playing performance.

DetailsMotivation: Current AI benchmarks are too narrow and quickly saturate, failing to evaluate human-like general intelligence. Need a more comprehensive evaluation approach that tests AI systems across the full spectrum of human games.

Method: Created AI GameStore platform using LLMs with humans-in-the-loop to synthesize new human games by sourcing and adapting standardized game environments from popular platforms like Apple App Store and Steam. Generated 100 games and evaluated 7 frontier vision-language models on short gameplay episodes.

Result: Best models achieved less than 10% of human average score on majority of games. Models particularly struggled with games challenging world-model learning, memory, and planning capabilities.

Conclusion: AI GameStore provides a practical way to measure and drive progress toward human-like general intelligence by evaluating AI systems across the “Multiverse of Human Games” - a comprehensive space of all conceivable human games.

Abstract: Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play \textbf{all conceivable human games}, in comparison to human players with the same level of experience, time, or other resources. We define a “human game” to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy – the “Multiverse of Human Games”. Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.

[227] MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models

Hojung Jung, Rodrigo Hormazabal, Jaehyeong Jo, Youngrok Park, Kyunggeun Roh, Se-Young Yun, Sehui Han, Dae-Woong Jeong

Main category: cs.AI

TL;DR: MolHIT introduces a hierarchical discrete diffusion model for molecular graph generation that achieves near-perfect chemical validity and state-of-the-art performance on molecular generation tasks.

DetailsMotivation: Existing graph diffusion models for molecular generation suffer from low chemical validity and struggle to meet desired properties compared to 1D modeling approaches. There's a need for improved molecular graph generation frameworks that can overcome these limitations.

Method: MolHIT uses a Hierarchical Discrete Diffusion Model that generalizes discrete diffusion to additional categories encoding chemical priors, and employs decoupled atom encoding that splits atom types according to their chemical roles.
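
For intuition, discrete diffusion corrupts token sequences rather than continuous tensors. The sketch below shows a generic absorbing-mask forward process on atom-type tokens; it is a stand-in for the general idea, not MolHIT's hierarchical objective or its chemical-prior categories.

```python
# Generic discrete-diffusion forward corruption on hypothetical atom tokens.
import random

ATOMS = ["C", "C", "O", "N", "C", "F"]   # toy node labels of a molecular graph
MASK = "[M]"
rng = random.Random(0)

def corrupt(tokens, t, T=10):
    """Mask each token independently with probability t/T (forward process)."""
    return [MASK if rng.random() < t / T else a for a in tokens]

for t in (2, 5, 9):                      # more noise at larger timesteps
    print(t, corrupt(ATOMS, t))
```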

Result: MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion, surpassing strong 1D baselines across multiple metrics. It also demonstrates strong performance in downstream tasks including multi-property guided generation and scaffold extension.

Conclusion: MolHIT represents a significant advancement in molecular graph generation, overcoming long-standing performance limitations and achieving unprecedented chemical validity while maintaining strong performance across various molecular generation tasks.

Abstract: Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle to meet the desired properties compared to 1D modeling. In this work, we introduce MolHIT, a powerful molecular graph generation framework that overcomes long-standing performance limitations in existing methods. MolHIT is based on the Hierarchical Discrete Diffusion Model, which generalizes discrete diffusion to additional categories that encode chemical priors, and decoupled atom encoding that splits the atom types according to their chemical roles. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion, surpassing strong 1D baselines across multiple metrics. We further demonstrate strong performance in downstream tasks, including multi-property guided generation and scaffold extension.

[228] AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

Jianda Du, Youran Sun, Haizhao Yang

Main category: cs.AI

TL;DR: AutoNumerics is a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for PDEs from natural language descriptions, generating transparent solvers grounded in classical numerical analysis rather than black-box neural approaches.

DetailsMotivation: Traditional PDE solver design requires substantial mathematical expertise and manual tuning, while recent neural network approaches are computationally expensive and lack interpretability. There's a need for accessible, automated PDE solving that maintains transparency and interpretability.

Method: Multi-agent framework with coarse-to-fine execution strategy and residual-based self-verification mechanism. The system generates transparent solvers grounded in classical numerical analysis from natural language PDE descriptions.
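
Residual-based self-verification can be pictured as plugging the computed solution back into the discretized equation and checking the norm of what remains. A minimal sketch for -u'' = f on [0, 1] with homogeneous boundary conditions; this is our example problem, not one of the paper's 24 benchmarks.

```python
# Sketch of a residual check after a finite-difference solve (assumed form).
import numpy as np

n = 50
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)       # manufactured right-hand side

# discretize -u'' = f with u(0) = u(1) = 0 on the interior points
main = 2.0 * np.ones(n - 2)
off = -1.0 * np.ones(n - 3)
A = (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h**2
u = np.linalg.solve(A, f[1:-1])

residual = np.linalg.norm(A @ u - f[1:-1])   # plug the solution back in
print("discrete residual:", residual)        # tiny => this solver step checks out
```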

Result: Experiments on 24 canonical and real-world PDE problems show competitive or superior accuracy compared to existing neural and LLM-based baselines. The framework correctly selects numerical schemes based on PDE structural properties.

Conclusion: AutoNumerics demonstrates viability as an accessible paradigm for automated PDE solving, offering transparent, interpretable solvers that bridge the gap between traditional numerical methods and modern AI approaches.

Abstract: PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce \texttt{AutoNumerics}, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black-box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse-to-fine execution strategy and a residual-based self-verification mechanism. Experiments on 24 canonical and real-world PDE problems demonstrate that \texttt{AutoNumerics} achieves competitive or superior accuracy compared to existing neural and LLM-based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.

[229] A Scalable Framework for Evaluating Health Language Models

Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff, Ahmed A. Metwally

Main category: cs.AI

TL;DR: Adaptive Precise Boolean rubrics framework for efficient LLM evaluation in healthcare using targeted boolean questions instead of Likert scales

DetailsMotivation: Current LLM evaluation in healthcare relies heavily on costly human experts using Likert scales, which is not scalable and introduces human factors; need for more efficient, automated evaluation methods

Method: Adaptive Precise Boolean rubrics that use minimal sets of targeted boolean questions to identify gaps in model responses, contrasting with traditional complex evaluation targets
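
In code, a Precise Boolean rubric is just a list of yes/no checks applied to a response. A minimal sketch with hypothetical metabolic-health items; in the paper the rubric items are expert-authored and the answers come from human raters or an automated judge, not string matching.

```python
# Hypothetical boolean rubric applied to a model response.
rubric = [
    ("mentions_fasting_glucose", lambda r: "fasting glucose" in r.lower()),
    ("flags_medication_review",  lambda r: "medication" in r.lower()),
    ("avoids_diagnosis_claim",   lambda r: "you have diabetes" not in r.lower()),
]

response = "Consider tracking fasting glucose and reviewing your medication."
answers = {name: check(response) for name, check in rubric}
print(answers, sum(answers.values()) / len(answers))   # per-item flags and a score
```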

Result: Higher inter-rater agreement among expert/non-expert evaluators, better automated assessment, and ~50% reduction in evaluation time compared to Likert scales

Conclusion: The framework enables more extensive and cost-effective LLM evaluation in healthcare domains like metabolic health

Abstract: Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.

[230] Sufficient, Necessary and Complete Causal Explanations in Image Classification

David A Kelly, Hana Chockler

Main category: cs.AI

TL;DR: Causal explanations for image classifiers that combine formal rigor with black-box computability, showing equivalence to logic-based explanations while being practical for vision models.

DetailsMotivation: Existing explanation methods for image classifiers lack formal rigor, while logic-based explanations are formally rigorous but not computable for image classifiers due to strict assumptions. Need explanations that are both formally rigorous and practically computable for vision models.

Method: Introduces causal explanations with formal properties equivalent to logic-based ones. Defines δ-complete explanations with confidence thresholds and 1-complete causal explanations. Implements black-box algorithms that subdivide images into sufficient and necessary components without needing model internals, gradients, or monotonicity assumptions.
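
A toy version of the black-box sufficiency test, in the spirit of the definitions rather than the paper's algorithm: occlude everything outside a candidate region and ask whether the classifier's label survives. The 8x8 "image" and the classifier below are synthetic.

```python
# Illustrative black-box sufficiency check on a synthetic image.
import numpy as np

def classifier(im):
    # hypothetical black-box model: class 1 iff the bright patch is visible
    return int(im[2:4, 2:4].mean() > 0.5)

img = np.zeros((8, 8)); img[2:4, 2:4] = 1.0
label = classifier(img)

def is_sufficient(region_mask):
    masked = np.where(region_mask, img, 0.0)   # occlude everything else
    return classifier(masked) == label

patch = np.zeros((8, 8), bool); patch[2:4, 2:4] = True
print(is_sufficient(patch), is_sufficient(~patch))   # True False
```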

Result: Algorithms efficiently compute all explanation types in ~6s per image on ResNet models. Different models show different patterns of sufficiency, necessity, and completeness. Methods are totally black-box and work without model knowledge or access to internals.

Conclusion: Causal explanations provide formally rigorous yet practically computable explanations for image classifiers, bridging the gap between formal logic-based approaches and practical black-box methods for vision models.

Abstract: Existing algorithms for explaining the outputs of image classifiers are based on a variety of approaches and produce explanations that frequently lack formal rigour. On the other hand, logic-based explanations are formally and rigorously defined, but their computability relies on strict assumptions about the model that do not hold for image classifiers. In this paper, we show that causal explanations, in addition to being formally and rigorously defined, enjoy the same formal properties as logic-based ones, while still lending themselves to black-box algorithms and being a natural fit for image classifiers. We prove formal properties of causal explanations and their equivalence to logic-based explanations. We demonstrate how to subdivide an image into its sufficient and necessary components. We introduce $δ$-complete explanations, which have a minimum confidence threshold, and 1-complete causal explanations, which are classified with the same confidence as the original image. We implement our definitions, and our experimental results demonstrate that different models have different patterns of sufficiency, necessity, and completeness. Our algorithms are efficiently computable, taking on average 6s per image on a ResNet model to compute all types of explanations, and are totally black-box, needing no knowledge of the model, no access to model internals, no access to gradients, nor requiring any properties, such as monotonicity, of the model.

[231] Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems

Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk

Main category: cs.AI

TL;DR: Bongard-RWR+ is a large-scale dataset of 5,400 instances for abstract visual reasoning using real-world-like images generated via VLM pipeline, revealing VLMs’ limitations in fine-grained concept recognition.

DetailsMotivation: Existing Bongard Problem datasets have limitations: synthetic images lack real-world complexity, real-world image datasets use high-level features reducing task difficulty, and the recent Bongard-RWR dataset is too small (60 instances) for robust evaluation.

Method: Created Bongard-RWR+ dataset using VLM pipeline: 1) Used Pixtral-12B to describe manually curated images and generate new descriptions aligned with abstract concepts, 2) Used Flux.1-dev to synthesize images from these descriptions, 3) Manually verified generated images reflect intended concepts, resulting in 5,400 instances.

Result: Evaluation of state-of-the-art VLMs across binary/multiclass classification and textual answer generation shows VLMs can recognize coarse-grained visual concepts but consistently struggle with discerning fine-grained concepts, highlighting reasoning limitations.

Conclusion: The Bongard-RWR+ dataset provides a challenging benchmark for abstract visual reasoning, revealing current VLMs’ limitations in fine-grained concept recognition despite advances in multimodal understanding.

Abstract: Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.

[232] $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts

Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun

Main category: cs.AI

TL;DR: SPECS is a latency-aware test-time scaling method that uses speculative decoding with a smaller model to generate candidates and evaluates them using signals from both a larger target model and a dedicated reward model, achieving comparable accuracy to beam search with reduced latency.

DetailsMotivation: Current test-time scaling methods for LLMs optimize for accuracy based on total compute resources (FLOPS) but overlook latency constraints, which directly impacts user experience. There's a need for methods that balance accuracy with latency reduction.

Method: SPECS uses a smaller, faster model to generate candidate sequences efficiently (speculative decoding), then evaluates these candidates using signals from both a larger target model and a dedicated reward model. It introduces reward-guided soft verification and a reward-based deferral mechanism for integration.
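
Schematically, each step drafts with the small model, scores the drafts with signals from the target and reward models, and defers to the target model when no draft scores well. The sketch below is a heavily simplified rendering: the drafting, scoring functions, and deferral threshold are all stand-in assumptions, not SPECS itself.

```python
# Schematic reward-guided speculative drafting with a deferral fallback.
def draft_candidates(prompt, k=4):
    return [f"{prompt} <draft {i}>" for i in range(k)]   # small, fast model

def target_logprob(seq):  return -len(seq) * 0.01        # stand-in large model
def reward(seq):          return float("draft 2" in seq) # stand-in reward model

def specs_step(prompt, defer_threshold=0.5):
    scored = [(target_logprob(c) + reward(c), c) for c in draft_candidates(prompt)]
    best_score, best = max(scored)
    if best_score < defer_threshold:       # reward-based deferral:
        return "<defer to target model>"   # let the large model generate itself
    return best

print(specs_step("Prove 1+1=2."))
```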

Result: On MATH500, AMC23 and OlympiadBench datasets, SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%. Theoretical analysis shows the algorithm converges to the solution of a KL-regularized reinforcement learning objective with increasing beam width.

Conclusion: SPECS provides an effective latency-aware test-time scaling approach that balances accuracy and latency, addressing the gap in current methods that overlook latency constraints while optimizing for accuracy.

Abstract: Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPS), often overlooking latency constraints. To address this gap, we propose $\texttt{SPECS}$, a latency-aware test-time scaling method inspired by speculative decoding. $\texttt{SPECS}$ uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on MATH500, AMC23 and OlympiadBench datasets show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to ~19.1%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective with increasing beam width.

[233] Goal Inference from Open-Ended Dialog

Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell

Main category: cs.AI

TL;DR: Online method for embodied AI agents to learn user goals from natural language conversations using LLMs and Bayesian inference to maintain uncertainty over goals

DetailsMotivation: Embodied AI agents need to efficiently learn diverse user goals and preferences through natural language dialog, requiring methods that can capture human preferences intuitively while maintaining uncertainty about goals to ensure reliable task execution

Method: Uses LLMs to extract natural language goal representations from conversations, prompts LLMs to role-play as humans with different goals, and uses corresponding likelihoods to perform Bayesian inference over potential goals to maintain uncertainty
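
The probabilistic core is a plain Bayes update over candidate goals, with likelihoods supplied by LLM role-play. A minimal sketch in which those role-play likelihoods are replaced by a hypothetical table:

```python
# Bayesian goal inference with stand-in LLM role-play likelihoods.
goals = ["buy vegan groceries", "buy cheap groceries"]
prior = {g: 0.5 for g in goals}

# P(utterance | goal), e.g. from prompting an LLM to role-play each goal
likelihood = {
    ("no dairy please", "buy vegan groceries"): 0.9,
    ("no dairy please", "buy cheap groceries"): 0.2,
}

utterance = "no dairy please"
unnorm = {g: prior[g] * likelihood[(utterance, g)] for g in goals}
z = sum(unnorm.values())
posterior = {g: p / z for g, p in unnorm.items()}
print(posterior)   # uncertainty over goals is maintained, not collapsed
```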

Result: Evaluated in text-based grocery shopping domain and AI2Thor robot simulation, showing improved performance compared to ablation baselines lacking explicit goal representation or probabilistic inference

Conclusion: Proposed approach enables embodied agents to learn diverse user goals online with uncertainty quantification, achieving similar flexibility to offline methods like RLHF but with online efficiency

Abstract: Embodied AI Agents are quickly becoming important and common tools in society. These embodied agents should be able to learn about and accomplish a wide range of user goals and preferences efficiently and robustly. Large Language Models (LLMs) are often used as they allow for rich, open-ended dialog-based interaction between the human and agent to accomplish tasks according to human preferences. In this thesis, we argue that for embodied agents that deal with open-ended dialog during task assistance: 1) AI Agents should extract goals from conversations in the form of Natural Language (NL) to be better at capturing human preferences as it is intuitive for humans to communicate their preferences on tasks to agents through natural language. 2) AI Agents should quantify/maintain uncertainty about these goals to ensure that actions are being taken according to goals that the agent is extremely certain about. We present an online method for embodied agents to learn and accomplish diverse user goals. While offline methods like RLHF can represent various goals but require large datasets, our approach achieves similar flexibility with online efficiency. We extract natural language goal representations from conversations with Large Language Models (LLMs). We prompt an LLM to role play as a human with different goals and use the corresponding likelihoods to run Bayesian inference over potential goals. As a result, our method can represent uncertainty over complex goals based on unrestricted dialog. We evaluate in a text-based grocery shopping domain and an AI2Thor robot simulation. We compare our method to ablation baselines that lack either explicit goal representation or probabilistic inference.

[234] Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer

Myung Ho Kim

Main category: cs.AI

TL;DR: SCL is a modular agent architecture separating cognition into 5 phases with soft symbolic control for explainable, controllable AI agents.

DetailsMotivation: Address fundamental architectural problems in LLM agents: entangled reasoning/execution, memory volatility, and uncontrolled action sequences. Current frameworks like ReAct and AutoGPT lack explainability and controllability.

Method: Structured Cognitive Loop (SCL) with 5-phase modular architecture: Retrieval, Cognition, Control, Action, Memory (R-CCAM). Soft Symbolic Control applies symbolic constraints to probabilistic inference while preserving neural flexibility.
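
A skeleton of what one R-CCAM turn might look like, under our reading of the five phases; the phase bodies below are placeholders, not the author's implementation. The Control phase illustrates soft symbolic governance by filtering a flawed plan against a symbolic policy before any action runs.

```python
# Placeholder R-CCAM loop: Retrieval, Cognition, Control, Action, Memory.
def retrieval(state):  state["facts"] = ["policy: never call a tool twice"]
def cognition(state):  state["plan"] = ["lookup", "lookup", "answer"]

def control(state):
    # Soft Symbolic Control: drop plan steps that violate the symbolic policy
    seen, allowed = set(), []
    for step in state["plan"]:
        if step not in seen:              # symbolic rule: no duplicate tool calls
            allowed.append(step); seen.add(step)
    return allowed

def action(step):  return f"executed {step}"

state, trace = {}, []
retrieval(state)                          # R: fetch facts and policies
cognition(state)                          # C: draft a (flawed) plan
for step in control(state):               # C: governance layer filters the plan
    trace.append(action(step))            # A: execute approved steps only
state["memory"] = trace                   # M: persist a full decision trace
print(trace)                              # ['executed lookup', 'executed answer']
```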

Result: Achieves zero policy violations, eliminates redundant tool calls, maintains complete decision traceability on multi-step conditional reasoning tasks. Outperforms existing frameworks like ReAct and AutoGPT.

Conclusion: SCL offers a practical path toward reliable, explainable, governable AI agents by connecting expert system principles with modern LLM capabilities through modular decomposition, adaptive symbolic governance, and transparent state management.

Abstract: Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). Soft Symbolic Control constitutes a dedicated governance layer within SCL, applying symbolic constraints to probabilistic inference while preserving the flexibility of neural reasoning and restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.

[235] GAI: Generative Agents for Innovation

Masahiro Sato

Main category: cs.AI

TL;DR: GAI framework enables multiple LLM agents with internal states to engage in collective reasoning for innovation, successfully replicating Dyson’s bladeless fan invention through analogy-driven dialogue.

DetailsMotivation: To investigate whether collective reasoning among generative agents can facilitate novel and coherent thinking that leads to innovation, and to develop a framework that replicates the innovation process.

Method: Proposes GAI framework with dynamic internal state processing architecture and analogy-driven dialogue scheme for multiple generative agents. Evaluated using Dyson’s bladeless fan invention as case study with fictional technical documents.

Result: Models with internal states significantly outperformed those without, achieving higher average scores and lower variance. Five heterogeneous agents with internal states successfully replicated key ideas of Dyson’s invention.

Conclusion: Internal states enable agents to refine ideas and construct/share more coherent concepts, demonstrating that collective reasoning among generative agents can facilitate innovation.

Abstract: This study examines whether collective reasoning among generative agents can facilitate novel and coherent thinking that leads to innovation. To achieve this, it proposes GAI, a new LLM-empowered framework designed for reflection and interaction among multiple generative agents to replicate the process of innovation. The core of the GAI framework lies in an architecture that dynamically processes the internal states of agents and a dialogue scheme specifically tailored to facilitate analogy-driven innovation. The framework’s functionality is evaluated using Dyson’s invention of the bladeless fan as a case study, assessing the extent to which the core ideas of the innovation can be replicated through a set of fictional technical documents. The experimental results demonstrate that models with internal states significantly outperformed those without, achieving higher average scores and lower variance. Notably, the model with five heterogeneous agents equipped with internal states successfully replicated the key ideas underlying Dyson’s invention. This indicates that the internal state enables agents to refine their ideas, resulting in the construction and sharing of more coherent and comprehensive concepts.

[236] AI-Assisted Decision Making with Human Learning

Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren

Main category: cs.AI

TL;DR: AI-assisted decision-making where algorithm selects features for human to consider, balancing short-term accuracy vs. educating human through feature selection.

DetailsMotivation: AI systems increasingly support human decision-making but final decisions remain with humans. Need to understand how algorithms should select features when humans learn from repeated interactions, balancing immediate accuracy vs. long-term human education.

Method: Framework where algorithm selects features for human to consider, human makes predictions based on their own model. Analyzes tradeoff between recommending informative features (educating human) vs. selecting features aligned with human’s existing understanding (short-term accuracy). Examines impact of algorithm’s patience (time-discount rate) and human’s learning ability.
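
The patience effect can be shown with two periods and made-up accuracies: an impatient algorithm keeps recommending features the human already weighs well, while a patient one accepts a short-term accuracy hit so the human learns the informative features. All numbers below are invented for illustration.

```python
def value(acc_now, acc_later, gamma):
    """Two-period discounted accuracy of the algorithm's objective."""
    return acc_now + gamma * acc_later

for gamma in (0.2, 0.9):                       # impatient vs patient algorithm
    aligned     = value(0.75, 0.75, gamma)     # features the human already uses well
    informative = value(0.60, 0.95, gamma)     # human must first learn their importance
    print(gamma, "informative" if informative > aligned else "aligned")
```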

Result: Optimal feature selection has clean combinatorial characterization, reducible to stationary sequence of feature subsets. As algorithm becomes more patient or human’s learning improves, algorithm selects more informative features, enhancing both prediction accuracy and human understanding.

Conclusion: AI-assisted decision-making involves fundamental tradeoff between short-term accuracy and human education through feature selection. Optimal strategy depends on algorithm’s patience and human’s learning ability, with patient algorithms favoring informative features that improve both accuracy and understanding over time.

Abstract: AI systems increasingly support human decision-making. In many cases, despite the algorithm’s superior performance, the final decision remains in human hands. For example, an AI may assist doctors in determining which diagnostic tests to run, but the doctor ultimately makes the diagnosis. This paper studies such AI-assisted decision-making settings, where the human learns through repeated interactions with the algorithm. In our framework, the algorithm – designed to maximize decision accuracy according to its own model – determines which features the human can consider. The human then makes a prediction based on their own less accurate model. We observe that the discrepancy between the algorithm’s model and the human’s model creates a fundamental tradeoff: Should the algorithm prioritize recommending more informative features, encouraging the human to learn their importance, even if it results in less accurate predictions in the short term until learning occurs? Or is it preferable to forgo educating the human and instead select features that align more closely with their existing understanding, minimizing the immediate cost of learning? Our analysis reveals how this trade-off is shaped by both the algorithm’s patience (the time-discount rate of its objective over multiple periods) and the human’s willingness and ability to learn. We show that optimal feature selection has a surprisingly clean combinatorial characterization, reducible to a stationary sequence of feature subsets that is tractable to compute. As the algorithm becomes more “patient” or the human’s learning improves, the algorithm increasingly selects more informative features, enhancing both prediction accuracy and the human’s understanding.

[237] Capturing Individual Human Preferences with Reward Features

André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle

Main category: cs.AI

TL;DR: The paper proposes adaptive reward models for RLHF that can specialize to individual users by learning general reward features and using linear combinations to adapt to specific preferences, showing benefits increase with rater diversity.

DetailsMotivation: Current RLHF approaches use a single reward model that doesn't distinguish between users, which is problematic in contexts with high disagreement potential like LLM training. There's a need for reward models that can adapt to individual user preferences.

Method: Proposes an adaptive reward model architecture that learns a set of general reward features, then adapts to specific users via linear combinations of these features. Derives PAC bounds showing error dependency on training examples and number of raters. Includes experiments with LLMs comparing adaptive vs non-adaptive approaches.
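
A toy rendering of the adaptation step, our construction rather than the paper's architecture: hold a small set of shared reward features fixed and fit user-specific linear weights from a few pairwise preferences under a Bradley-Terry objective.

```python
# Per-user linear adaptation over fixed reward features (hypothetical features).
import numpy as np

def features(response):         # stand-in for the learned general reward features
    return np.array([len(response) / 40, float(response.count("!"))])

# the user preferred the first response in each pair
pairs = [("Short and sweet!", "A much longer and rambling reply with filler"),
         ("Great idea!", "fine.")]

w = np.zeros(2)                 # user-specific weights over shared features
for _ in range(200):
    for win, lose in pairs:
        d = features(win) - features(lose)
        p = 1 / (1 + np.exp(-w @ d))   # Bradley-Terry preference probability
        w += 0.1 * (1 - p) * d         # gradient ascent on the log-likelihood
print(w)
```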

Result: Experiments show adaptive reward models outperform non-adaptive baselines, with benefits increasing with number of raters and preference heterogeneity. The model compares favorably to other adaptive approaches including in-context personalization methods.

Conclusion: Adaptive reward modeling is beneficial for RLHF in contexts with user disagreement. Learning general reward features enables efficient adaptation to individual preferences, even those not seen in training data.

Abstract: Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.

[238] The Correspondence Between Bounded Graph Neural Networks and Fragments of First-Order Logic

Bernardo Cuenca Grau, Eva Feng, Przemysław Andrzej Wałęga

Main category: cs.AI

TL;DR: GNN architectures are designed to match fragments of first-order logic, establishing theoretical connections between graph neural networks and logical expressiveness using finite model theory.

DetailsMotivation: While GNNs have shown broad applicability for graph-structured data, there's a need to better understand their expressive power and theoretical foundations, particularly in relation to logical formalisms.

Method: The authors propose GNN architectures that correspond precisely to fragments of first-order logic, including modal logics and two-variable fragments. They apply methods from finite model theory to analyze the logical expressiveness of GNNs.
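
To give the flavor of the correspondence: a graded modal formula such as "at least k neighbours satisfy p" is exactly one round of neighbourhood aggregation with a threshold. A toy example of that direction, ours rather than the paper's construction:

```python
# Graded modal formula evaluated as one round of message passing.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}      # toy graph
p = {0: False, 1: True, 2: True, 3: False}        # node-level predicate

def diamond_geq(k, prop, node):
    """Does the node have at least k neighbours where prop holds?"""
    return sum(prop[m] for m in adj[node]) >= k

print([diamond_geq(2, p, v) for v in adj])        # [True, False, False, False]
```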

Result: The research establishes a unifying framework that connects GNN architectures with specific fragments of first-order logic, providing theoretical grounding for understanding GNN expressive power.

Conclusion: The paper provides a formal theoretical foundation for understanding GNN expressive power by establishing precise correspondences between GNN architectures and fragments of first-order logic.

Abstract: Graph Neural Networks (GNNs) address two key challenges in applying deep learning to graph-structured data: they handle varying size input graphs and ensure invariance under graph isomorphism. While GNNs have demonstrated broad applicability, understanding their expressive power remains an important question. In this paper, we propose GNN architectures that correspond precisely to prominent fragments of first-order logic (FO), including various modal logics as well as more expressive two-variable fragments. To establish these results, we apply methods from finite model theory of first-order and modal logics to the domain of graph representation learning. Our results provide a unifying framework for understanding the logical expressiveness of GNNs within FO.

[239] Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

Bosung Kim, Prithviraj Ammanabrolu

Main category: cs.AI

TL;DR: ∞-THOR is a framework for long-horizon embodied AI tasks with scalable trajectory generation, a novel embodied QA task testing long-context reasoning, and a benchmark suite with complex tasks spanning hundreds of steps.

DetailsMotivation: To advance long-context understanding in embodied AI by addressing the limitations of current systems that struggle with extended reasoning and planning across complex, multi-step environments.

Method: Proposes a generation framework for unlimited long-horizon trajectories, introduces “Needle(s) in the Embodied Haystack” QA task, creates a dataset with ground-truth action sequences, and explores architectural adaptations like interleaved Goal-State-Action modeling and Context Parallelism for LLM-based agents.

Result: Experimental results highlight the challenges of the benchmark and provide insights into training strategies and model behaviors under long-horizon conditions, establishing a foundation for next-generation embodied AI systems.

Conclusion: ∞-THOR provides a comprehensive framework for advancing long-context reasoning in embodied AI, enabling the development of systems capable of robust, long-term planning and interaction.

Abstract: We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents’ long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.

[240] Drones that Think on their Feet: Sudden Landing Decisions with Embodied AI

Diego Ortiz Barbosa, Mohit Agrawal, Yash Malegaonkar, Luis Burbano, Axel Andersson, György Dán, Henrik Sandberg, Alvaro A. Cardenas

Main category: cs.AI

TL;DR: Embodied AI using large visual language models enables drones to make adaptive real-time decisions for emergency landings in dynamic urban environments, overcoming limitations of hand-coded recovery rules.

DetailsMotivation: Traditional approaches to drone safety rely on hand-coded recovery rules that cannot anticipate the vast range of real-world contingencies, creating incomplete safety systems that lack adaptability to sudden events.

Method: Uses embodied AI powered by large visual language models to provide commonsense reasoning, allowing drones to dynamically interpret surroundings and generate appropriate actions in real time within a simulated urban benchmark in Unreal Engine.

Result: Demonstrates that embodied AI enables adaptive recovery and decision-making pipelines previously infeasible to design by hand, allowing drones to make sudden maneuvers for safe landings in response to unexpected events.

Conclusion: Embodied AI advances resilience and safety in autonomous aerial systems by enabling a new class of adaptive recovery capabilities that overcome limitations of traditional hand-coded approaches.

Abstract: Autonomous drones must often respond to sudden events, such as alarms, faults, or unexpected changes in their environment, that require immediate and adaptive decision-making. Traditional approaches rely on safety engineers hand-coding large sets of recovery rules, but this strategy cannot anticipate the vast range of real-world contingencies and quickly becomes incomplete. Recent advances in embodied AI, powered by large visual language models, provide commonsense reasoning to assess context and generate appropriate actions in real time. We demonstrate this capability in a simulated urban benchmark in the Unreal Engine, where drones dynamically interpret their surroundings and decide on sudden maneuvers for safe landings. Our results show that embodied AI makes possible a new class of adaptive recovery and decision-making pipelines that were previously infeasible to design by hand, advancing resilience and safety in autonomous aerial systems.

[241] Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis

Main category: cs.AI

TL;DR: PROBE benchmark evaluates LLM-based agents’ proactivity through three capabilities: searching for unspecified issues, identifying bottlenecks, and executing resolutions, showing current models struggle with only 40% success rate.

DetailsMotivation: Current benchmarks for evaluating proactive LLM agents are limited to localized contexts and cannot test reasoning across diverse sources and longer time horizons, creating a gap in assessing true proactivity.

Method: PROBE decomposes proactivity into three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions, creating a comprehensive benchmark for evaluation.

Result: Even state-of-the-art models struggle with the benchmark, with best end-to-end performance of 40% achieved by both GPT-5 and Claude Opus-4.1, highlighting significant limitations in current autonomous agent systems.

Conclusion: The PROBE benchmark reveals current limitations in LLM-based agent proactivity and exposes promising research directions for improving autonomous action capabilities in agentic systems.

Abstract: LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.

[242] CaveAgent: Transforming LLMs into Stateful Runtime Operators

Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sirui Han, Yang You, Bo An, Yike Guo, Jun Song

Main category: cs.AI

TL;DR: CaveAgent is an LLM-based agent framework that shifts from text-centric to runtime-centric operation, using Python runtime as persistent state with semantic orchestration for complex task execution.

DetailsMotivation: Current LLM-based agents are constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. There's a need for more robust systems that can handle complex, interdependent tasks with persistent state management.

Method: CaveAgent introduces a dual-stream architecture that inverts conventional paradigms: 1) elevates persistent Python runtime as central state locus, 2) uses lightweight semantic stream as orchestrator, 3) implements Stateful Runtime Management to inject/manipulate/retrieve complex Python objects across turns, 4) provides runtime-integrated skill management system extending Agent Skills open standard.
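
The "runtime as state locus" idea can be pictured as LLM-emitted snippets executing against one persistent namespace, so objects survive across turns instead of being re-serialized into the text context. A minimal sketch, illustrative only and not CaveAgent's implementation:

```python
# Persistent runtime workspace: state lives in a namespace, not in text.
class Runtime:
    def __init__(self):
        self.ns = {}                       # objects persist here across turns

    def run(self, code):
        exec(code, self.ns)                # each turn mutates the same state

rt = Runtime()
rt.run("rows = [{'city': 'Oslo', 'temp': 3}, {'city': 'Rome', 'temp': 18}]")
rt.run("warm = [r for r in rows if r['temp'] > 10]")   # a later turn reuses rows
print(rt.ns["warm"])                       # [{'city': 'Rome', 'temp': 18}]
```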

Result: CaveAgent shows consistent improvement across challenging benchmarks, handles data scales that cause context overflow in JSON-based and code-based agents, provides programmatically verifiable feedback for automated evaluation, and establishes foundation for Reinforcement Learning with Verifiable Rewards (RLVR).

Conclusion: CaveAgent represents a paradigm shift from text-centric to runtime-centric agent operation, enabling robust handling of complex tasks through persistent state management, reduced context drift, and ecosystem interoperability through executable skill injections.

Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts tool use from “LLM-as-Text-Generator” to “LLM-as-Runtime-Operator.” CaveAgent introduces a dual-stream architecture that inverts the conventional paradigm: rather than treating the LLM’s text context as the primary workspace with tools as auxiliary, CaveAgent elevates the persistent Python runtime as the central locus of state, with a lightweight semantic stream serving as its orchestrator. Beyond leveraging code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces \textit{Stateful Runtime Management}: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications without information loss. Evaluations show consistent improvement across challenging benchmarks, enabling CaveAgent to handle data scales that cause context overflow in both JSON-based and code-based agents. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).

[243] Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

Sijia Li, Xinran Li, Shibo Chen, Jun Zhang

Main category: cs.AI

TL;DR: A novel local-to-global world model framework for offline multi-agent reinforcement learning that improves generalization by generating synthetic data with uncertainty-aware sampling.

DetailsMotivation: Existing offline MARL methods are overly conservative and struggle to generalize beyond dataset support. Model-based approaches can expand datasets but face challenges in accurately modeling complex multi-agent dynamics due to high dimensionality and non-stationarity.

Method: Proposes LOGO (local-to-global) world model that leverages easier-to-estimate local predictions to infer global state dynamics, capturing agent dependencies implicitly. Uses trained model to generate synthetic data and introduces uncertainty-aware sampling mechanism with adaptive weighting based on prediction uncertainty.
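
The uncertainty-aware sampling step can be sketched in a few lines. Below is a minimal illustration (the exponential weighting and the beta parameter are assumptions for exposition, not the paper's exact rule): synthetic transitions with high predicted uncertainty are down-weighted when drawing training batches.

```python
import numpy as np

# Hedged sketch of uncertainty-aware weighting for synthetic transitions.
# Less certain model rollouts receive less weight during batch sampling.

def uncertainty_weights(uncertainty: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Map per-sample prediction uncertainty to normalized sampling weights."""
    w = np.exp(-beta * uncertainty)
    return w / w.sum()

uncertainty = np.array([0.1, 0.5, 2.0, 0.2])   # e.g., encoder-based estimates
weights = uncertainty_weights(uncertainty)
print(weights)   # the high-uncertainty synthetic sample is strongly down-weighted

# Draw a training batch that favors reliable synthetic data.
batch_idx = np.random.choice(len(uncertainty), size=2, p=weights, replace=False)
```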

Result: Extensive experiments across 8 scenarios against 8 baselines show the method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.

Conclusion: The LOGO framework effectively addresses generalization challenges in offline MARL by combining local-to-global modeling with uncertainty-aware synthetic data generation, achieving superior performance with reduced computational overhead.

Abstract: Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model-based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non-stationarity, and complexity of multi-agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local-to-global (LOGO) world model, a novel framework that leverages local predictions, which are easier to estimate, to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent-wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state-action space. To ensure reliable policy learning, we further introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble-based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.

[244] Autonomous Business System via Neuro-symbolic AI

Cecil Pang, Hiroki Sayama

Main category: cs.AI

TL;DR: AUTOBUS is a neuro-symbolic system combining LLM-based AI agents, predicate-logic programming, and enterprise knowledge graphs to execute end-to-end business initiatives by modeling tasks with explicit conditions, data, rules, and API actions.

DetailsMotivation: Enterprise systems are siloed and rigid while LLMs lack deterministic execution of complex business logic, creating a gap for systems that can interpret natural language while ensuring verifiable business process execution.

Method: Combines LLM-based AI agents with predicate-logic programming and business-semantics-centric enterprise data in a neuro-symbolic architecture. Models initiatives as task networks with explicit conditions, represents enterprise data as knowledge graphs translated into logic facts, and uses AI agents to synthesize task-specific logic programs executed by a logic engine.
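
A toy sketch of the knowledge-graph-to-logic translation (the triple encoding and the rule below are hypothetical illustrations; AUTOBUS compiles enterprise semantics into programs for a real predicate-logic engine):

```python
# Hedged sketch: knowledge-graph entities rendered as logic facts and
# checked by a task pre-condition rule. Predicate names and the rule are
# invented for illustration; they are not AUTOBUS's actual schema.

facts = {
    ("type", "acme", "customer"),
    ("credit_ok", "acme", "true"),
    ("placed_by", "o1", "acme"),
}

def holds(pred: str, subj: str, obj: str) -> bool:
    return (pred, subj, obj) in facts

def can_fulfill(order: str) -> bool:
    """Toy pre-condition: the ordering customer has passed credit checks."""
    return any(
        holds("placed_by", order, c) and holds("credit_ok", c, "true")
        for (_, _, c) in facts
    )

print(can_fulfill("o1"))   # True -> the task's API action may be dispatched
```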

Result: Demonstrates accelerated time to market in a data-rich organization through a case study, with a reference implementation available on GitHub.

Conclusion: AUTOBUS provides a coherent neuro-symbolic architecture that combines the natural language understanding of LLMs with deterministic logic-based execution for enterprise business processes, with human oversight for high-impact decisions.

Abstract: Current business environments demand continuous reconfiguration of cross-functional processes, yet enterprise systems remain organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile, large language models (LLMs) excel at interpreting natural language and unstructured data but lack deterministic and verifiable execution of complex business logic. We introduce Autonomous Business System (AUTOBUS), a system that combines LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a coherent neuro-symbolic architecture for executing end-to-end business initiatives. AUTOBUS models an initiative as a network of tasks with explicit pre- and post-conditions, required data, evaluation rules, and API-level actions. Enterprise data is represented as a knowledge graph whose entities, relationships, and constraints are translated into logic facts and foundational rules, providing semantic grounding for reasoning. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs executed by a logic engine that enforces constraints and orchestrates actions. Humans define semantics and policies, curate tools, and oversee high-impact or ambiguous decisions. We present the AUTOBUS architecture and a case study that demonstrates accelerated time to market in a data-rich organization. A reference implementation of the case study is available at https://github.com/cecilpang/autobus-paper.

[245] Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

Yongxin Deng, Zhen Fang, Sharon Li, Ling Chen

Main category: cs.AI

TL;DR: SpikeScore is a method for cross-domain hallucination detection in LLMs that measures uncertainty fluctuations in multi-turn dialogues.

DetailsMotivation: Existing hallucination detection methods perform poorly when generalizing across domains, motivating the study of generalizable hallucination detection (GHD) for real-world LLM deployment.

Method: Proposes the SpikeScore metric, which quantifies abrupt uncertainty fluctuations in multi-turn dialogues following LLM responses, based on the observation that hallucination-initiated dialogues show larger uncertainty fluctuations than factual ones.
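
One plausible instantiation of such a score, shown below as a hedged sketch (the paper's exact definition may differ), is the largest turn-to-turn jump in a per-turn uncertainty sequence:

```python
import numpy as np

# Hedged sketch of a spike-style score: the largest abrupt change in model
# uncertainty across a simulated multi-turn dialogue. The max-of-diffs form
# is an assumption for illustration.

def spike_score(turn_uncertainty: np.ndarray) -> float:
    """Largest absolute turn-to-turn change in uncertainty."""
    return float(np.max(np.abs(np.diff(turn_uncertainty))))

factual = np.array([0.30, 0.28, 0.31, 0.29])        # stable across turns
hallucinated = np.array([0.32, 0.75, 0.40, 0.81])   # large fluctuations

print(spike_score(factual))        # 0.03
print(spike_score(hallucinated))   # 0.43 -> flag as likely hallucination
```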

Result: SpikeScore-based detection outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods across multiple LLMs and benchmarks.

Conclusion: SpikeScore provides effective cross-domain hallucination detection by leveraging universal patterns in uncertainty fluctuations, addressing the important GHD problem for practical LLM deployment.

Abstract: Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following LLMs’ initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on the phenomenon, we propose a new score SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying the effectiveness of our method in cross-domain hallucination detection.

[246] An Adaptive Differentially Private Federated Learning Framework with Bi-level Optimization

Jin Wang, Hui Ma, Fei Xing, Ming Yan

Main category: cs.AI

TL;DR: Adaptive differentially private federated learning framework for heterogeneous and privacy-constrained settings with local compression, adaptive gradient clipping, and constraint-aware aggregation.

DetailsMotivation: Federated learning faces challenges with device heterogeneity and Non-IID data causing unstable gradient updates. Differential privacy exacerbates these issues with fixed gradient clipping and noise injection leading to training oscillation and performance degradation.

Method: Proposes adaptive differentially private federated learning framework with: 1) lightweight local compressed module to regularize intermediate representations and constrain gradient variability, 2) adaptive gradient clipping strategy that dynamically adjusts thresholds based on historical update statistics, and 3) constraint-aware aggregation mechanism to suppress unreliable or noise-dominated client updates.
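
The adaptive clipping idea can be sketched with a standard quantile rule (the quantile-of-history heuristic and the noise calibration below are common DP-FL ingredients assumed for illustration, not necessarily the paper's exact design):

```python
import numpy as np

# Hedged sketch of adaptive clipping for differentially private updates.
# The threshold tracks a quantile of recent update norms instead of being
# fixed, and Gaussian noise is scaled to the current threshold.

class AdaptiveClipper:
    def __init__(self, quantile: float = 0.8, noise_multiplier: float = 1.0):
        self.history = []                    # past update norms
        self.quantile = quantile
        self.noise_multiplier = noise_multiplier

    def privatize(self, update: np.ndarray) -> np.ndarray:
        norm = np.linalg.norm(update)
        self.history.append(norm)
        # Threshold adapts to the empirical distribution of observed norms.
        c = np.quantile(self.history, self.quantile)
        clipped = update * min(1.0, c / (norm + 1e-12))
        # Gaussian noise calibrated to the current clipping threshold.
        noise = np.random.normal(0.0, self.noise_multiplier * c, update.shape)
        return clipped + noise

clipper = AdaptiveClipper()
for _ in range(5):
    g = np.random.randn(10)                  # stand-in for a client update
    print(np.linalg.norm(clipper.privatize(g)))
```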

Result: Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy compared to baseline methods.

Conclusion: The proposed framework effectively addresses challenges of device heterogeneity, Non-IID data, and differential privacy constraints in federated learning, achieving better stability and performance.

Abstract: Federated learning enables collaborative model training across distributed clients while preserving data privacy. However, in practical deployments, device heterogeneity and non-independent and identically distributed (Non-IID) data often lead to highly unstable and biased gradient updates. When differential privacy is enforced, conventional fixed gradient clipping and Gaussian noise injection may further amplify gradient perturbations, resulting in training oscillation and degraded model performance. To address these challenges, we propose an adaptive differentially private federated learning framework that explicitly targets model efficiency under heterogeneous and privacy-constrained settings. On the client side, a lightweight local compressed module is introduced to regularize intermediate representations and constrain gradient variability, thereby mitigating noise amplification during local optimization. On the server side, an adaptive gradient clipping strategy dynamically adjusts clipping thresholds based on historical update statistics to avoid over-clipping and noise domination. Furthermore, a constraint-aware aggregation mechanism is designed to suppress unreliable or noise-dominated client updates and stabilize global optimization. Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy.

[247] EduEVAL-DB: A Role-Based Dataset for Pedagogical Risk Evaluation in Educational Explanations

Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez, Francisco Jurado, Alvaro Ortigosa, Ruben Tolosana

Main category: cs.AI

TL;DR: EduEVAL-DB is a dataset for evaluating AI tutors’ pedagogical explanations, featuring teacher-role-based explanations with pedagogical risk annotations across five dimensions.

DetailsMotivation: There's a need for better evaluation frameworks for AI tutors and pedagogical evaluators, particularly for assessing the quality and risks of instructional explanations generated by large language models in educational contexts.

Method: Created a dataset of 854 explanations for 139 ScienceQA questions with one human-teacher explanation and six LLM-simulated teacher role explanations per question. Developed a pedagogical risk rubric with five dimensions, annotated explanations with binary risk labels through semi-automatic expert review, and conducted validation experiments comparing Gemini 2.5 Pro with Llama 3.1 8B for risk detection.

Result: The dataset enables evaluation of pedagogical explanation quality. Preliminary experiments show the dataset’s suitability for benchmarking models and that supervised fine-tuning on EduEVAL-DB can improve pedagogical risk detection in deployable local models.

Conclusion: EduEVAL-DB provides a valuable resource for developing and evaluating AI tutors with better pedagogical explanation capabilities, particularly for assessing and mitigating risks in educational AI systems.

Abstract: This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.

[248] EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen

Main category: cs.AI

TL;DR: Training AI agents on high-fidelity enterprise simulation environments (CoreCraft) improves task performance and enables transfer learning to out-of-distribution benchmarks.

DetailsMotivation: To develop AI agents that can perform complex, multi-step professional work in real enterprise settings, and to understand whether training on high-fidelity environments produces generalizable capabilities beyond the training distribution.

Method: Created CoreCraft - a comprehensive enterprise simulation of customer support with 2,500+ entities, 14 entity types, and 23 tools. Trained GLM 4.6 using Group Relative Policy Optimization (GRPO) with adaptive clipping on this environment for one epoch.
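
For readers unfamiliar with GRPO, its core is a group-relative advantage, sketched below in its standard form (the adaptive-clipping variant used here is not reproduced): rewards are normalized within each group of rollouts sampled for the same task.

```python
import torch

# Standard group-relative advantage computation at the heart of GRPO
# (a minimal sketch; the training loop and clipping logic are omitted).

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_groups, rollouts_per_group) rubric scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],    # task 1: mixed outcomes
                        [1.0, 1.0, 1.0, 1.0]])   # task 2: all rollouts pass
print(grpo_advantages(rewards))  # second row collapses to ~0: no learning signal
```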

Result: After training, task pass rate improved from 25.37% to 36.76% on held-out tasks. More importantly, the model showed transfer learning gains: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1).

Conclusion: Environment quality, diversity, and realism are key factors enabling generalizable agent capabilities. High-fidelity training environments with task-centric design, expert rubrics, and realistic workflows facilitate transfer learning to out-of-distribution tasks.

Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI’s suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.

cs.SD

[249] Speech to Speech Synthesis for Voice Impersonation

Bjorn Johnson, Jared Levy

Main category: cs.SD

TL;DR: STSSN is a speech-to-speech synthesis model for voice impersonation that combines speech recognition and synthesis techniques to perform style transfer, outperforming GAN-based approaches.

DetailsMotivation: While speech recognition and synthesis models have advanced significantly, speech-to-speech processing models remain underexplored. The paper aims to bridge this gap by creating a system for voice impersonation through speech-to-speech style transfer.

Method: Proposes Speech to Speech Synthesis Network (STSSN) that fuses state-of-the-art speech recognition and synthesis systems. The model performs speech-to-speech style transfer for voice impersonation by combining these two disciplines.

Result: STSSN generates realistic audio samples despite some capacity limitations. When benchmarked against a generative adversarial model performing similar tasks, STSSN produces more convincing results.

Conclusion: The proposed STSSN model demonstrates effectiveness in speech-to-speech style transfer for voice impersonation, showing promising results that outperform GAN-based approaches in generating realistic audio.

Abstract: Numerous models have shown great success in the fields of speech recognition as well as speech synthesis, but models for speech to speech processing have not been heavily explored. We propose Speech to Speech Synthesis Network (STSSN), a model based on current state of the art systems that fuses the two disciplines in order to perform effective speech to speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful, and succeeds in generating realistic audio samples despite a number of drawbacks in its capacity. We benchmark our proposed model by comparing it with a generative adversarial model which accomplishes a similar task, and show that ours produces more convincing results.

[250] Generative Audio Extension and Morphing

Prem Seetharaman, Oriol Nieto, Justin Salamon

Main category: cs.SD

TL;DR: A diffusion transformer (DiT) model for controllable audio generation that can extend audio clips forward/backward and morph between two audio references using masked latents and novel classifier-free guidance.

DetailsMotivation: Sound designers need tools to extend and morph sounds from their libraries for creative tasks. Current generative audio models lack sufficient controllability for these specific operations.

Method: Uses a diffusion transformer (DiT) with masked noisy latents and a novel variant of classifier-free guidance on these masked latents. Fine-tunes on stationary audio data to reduce hallucinations.
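
A rough sketch of guidance on masked latents is shown below (the mask convention, the update rule, and the `denoiser` interface are assumptions for exposition; the paper's exact guidance variant differs in details): the reference region is pinned while the masked region is denoised with guidance.

```python
import torch

# Hedged sketch of masked-latent classifier-free guidance for audio
# extension. `denoiser` stands in for a DiT; it is not defined here.

def masked_guidance_step(denoiser, z, z_ref, mask, cond, scale=3.0):
    """mask = 1 where new audio is generated, 0 where the reference is kept."""
    # Keep the reference latents fixed outside the generation region.
    z = mask * z + (1.0 - mask) * z_ref
    eps_cond = denoiser(z, cond)
    eps_uncond = denoiser(z, None)
    # Classifier-free guidance applied to the masked latents.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Forward extension: the reference fills the first half of the timeline.
B, D, T = 1, 64, 100
mask = torch.cat([torch.zeros(B, D, T // 2), torch.ones(B, D, T // 2)], dim=-1)
```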

Result: Achieves Fréchet Audio Distances comparable to real training data. Subjective listener tests show positive ratings. Can successfully extend audio forward/backward and morph between two audio references.

Conclusion: Enables more controllable and expressive generative sound frameworks, allowing sound designers to focus on creativity rather than repetitive tasks.

Abstract: In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.

[251] AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

Main category: cs.SD

TL;DR: AudioChat is a framework for audio foundation models that can generate, edit, and understand complex multi-source audio scenes called “audio stories” through LLM-based toolcalling agents and a novel Audio Transfusion Forcing objective.

DetailsMotivation: Current audio foundation models struggle with complex multi-source acoustic scenes (audio stories) that have multiple speakers and background/foreground sound effects, introducing new layers of semantic, temporal, and physical complexity.

Method: Proposes AudioChat framework with LLM-based toolcalling agents that simulate user-system interactions to generate training data, and introduces Audio Transfusion Forcing objective for simultaneous decomposition of high-level instructions via structured chain-of-thought reasoning and interactive multi-turn audio understanding/generation.

Result: Develops three new metrics to directly measure task performance for generation and editing instead of relying on distribution-based scoring, with a demo available to showcase capabilities.

Conclusion: AudioChat addresses the challenge of processing complex audio stories through a novel framework that enables audio foundation models to handle multi-source acoustic scenes with semantic, temporal, and physical complexity.

Abstract: Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based toolcalling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.

[252] AutoProsody: A Prosodic Feature Extraction Tool for Indian Languages

Preethi Thinakaran, Malarvizhi Muthuramalingam, Sooriya S, Anushiya Rachel Gladston, P. Vijayalakshmi, Hema A Murthy, T. Nagarajan

Main category: cs.SD

TL;DR: SIToBI tool for automatic prosodic annotation of Indian languages, providing syllable-level pitch, intensity, and break indices from speech signals.

DetailsMotivation: Prosodic information from speech is valuable but manual extraction is laborious, especially for Indian languages which are syllable-timed and need specialized tools.

Method: Developed SIToBI (Segmentation with Intensity, Tones and Break Indices) tool that provides time-aligned phoneme, syllable, and word transcriptions with syllable-level pitch contours, break indices, and relative intensity indices.
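
For a sense of the raw features involved, here is a hedged sketch using off-the-shelf signal processing (librosa is my choice for illustration; SIToBI's own pipeline is not reproduced, and syllable boundaries are assumed given):

```python
import numpy as np
import librosa

# Hedged sketch of syllable-level pitch and intensity features of the kind
# SIToBI reports. The example clip is a stand-in (real use would be speech).

y, sr = librosa.load(librosa.ex("trumpet"))                 # downloads a demo clip
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)   # pitch contour
rms = librosa.feature.rms(y=y)[0]                           # intensity proxy

def syllable_stats(start_s: float, end_s: float, hop: int = 512):
    """Mean pitch and intensity inside a given syllable interval."""
    a, b = int(start_s * sr / hop), int(end_s * sr / hop)
    return np.nanmean(f0[a:b]), rms[a:b].mean()

print(syllable_stats(0.5, 0.8))
```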

Result: Tool performs well when compared against manual annotations, demonstrated for Tamil, Hindi, and Indian English, with potential for extension to other Indian and syllable-timed languages.

Conclusion: SIToBI is an effective tool for automatic prosodic annotation of Indian languages, addressing the syllable-timed nature of these languages and reducing manual effort.

Abstract: The availability of prosodic information from speech signals is useful in a wide range of applications. However, deriving this information from speech signals can be a laborious task involving manual intervention. Therefore, the current work focuses on developing a tool that can provide prosodic annotations corresponding to a given speech signal, particularly for Indian languages. The proposed Segmentation with Intensity, Tones and Break Indices (SIToBI) tool provides time-aligned phoneme, syllable, and word transcriptions, syllable-level pitch contour annotations, break indices, and syllable-level relative intensity indices. The tool focuses more on syllable-level annotations since Indian languages are syllable-timed. Indians, regardless of the language they speak, may exhibit influences from other languages. As a result, other languages spoken in India may also exhibit syllable-timed characteristics. The accuracy of the annotations derived from the tool is analyzed by comparing them against manual annotations and the tool is observed to perform well. While the current work focuses on three languages, namely, Tamil, Hindi, and Indian English, the tool can easily be extended to other Indian languages and possibly other syllable-timed languages as well.

[253] Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis

Haoshen Wang, Xueli Zhong, Bingbing Lin, Jia Huang, Xingduo Pan, Shengxiang Liang, Nizhuan Wang, Wai Ting Siok

Main category: cs.SD

TL;DR: ProtoDisent-TTS: A prototype-based disentanglement framework for dysarthric speech that separates speaker timbre from pathological articulation using a pre-trained TTS backbone and pathology prototype codebook.

DetailsMotivation: Dysarthric speech has high variability and limited labeled data, making ASR and assistive technologies challenging. Existing methods often entangle speaker identity with pathological articulation, limiting controllability and robustness.

Method: Uses a pre-trained TTS backbone with pathology prototype codebook to factorize speaker timbre and dysarthric articulation in unified latent space. Employs dual-classifier objective with gradient reversal layer to enforce speaker embedding invariance to pathological attributes.
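
The gradient reversal layer is a standard construction and can be written compactly in PyTorch; the sketch below shows only that component (the TTS backbone, prototype codebook, and classifiers are omitted):

```python
import torch

# Standard gradient reversal layer (GRL): identity in the forward pass,
# negated (scaled) gradient in the backward pass. A pathology classifier
# placed on the reversed embedding pushes the encoder to remove pathology
# information from the speaker embedding.

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

speaker_emb = torch.randn(4, 256, requires_grad=True)
reversed_emb = GradReverse.apply(speaker_emb, 1.0)   # feed to pathology classifier
```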

Result: Enables bidirectional transformation between healthy and dysarthric speech on TORGO dataset, achieving consistent ASR performance gains and robust, speaker-aware speech reconstruction.

Conclusion: The framework provides interpretable and controllable representations of healthy/dysarthric speech patterns, improving disentanglement and performance for dysarthric speech applications.

Abstract: Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.

cs.LG

[254] MMCAformer: Macro-Micro Cross-Attention Transformer for Traffic Speed Prediction with Microscopic Connected Vehicle Driving Behavior

Lei Han, Mohamed Abdel-Aty, Younggun Kim, Yang-Jun Joo, Zubayer Islam

Main category: cs.LG

TL;DR: MMCAformer: A transformer-based model that integrates macro traffic flow features with micro driving behavior features from connected vehicle data for improved traffic speed prediction with uncertainty estimation.

DetailsMotivation: Existing traffic speed prediction methods rely on aggregated macroscopic data but ignore microscopic human driving behaviors that influence traffic dynamics. Connected vehicle data provides rich driving behavior features that could enhance prediction accuracy.

Method: Proposes Macro-Micro Cross-Attention Transformer (MMCAformer) using self-attention for macro traffic dependencies and cross-attention to capture spatiotemporal interactions between macro traffic status and micro driving behavior. Optimized with Student-t negative log-likelihood loss for uncertainty estimation.
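
The Student-t negative log-likelihood yields both a point prediction and heavy-tailed uncertainty; the sketch below implements the standard density directly (the prediction-head details are assumed, not taken from the paper):

```python
import math
import torch

# Student-t NLL: the network predicts location mu, scale sigma, and degrees
# of freedom nu per segment/time step. A minimal sketch of the loss only.

def student_t_nll(y, mu, sigma, nu):
    """y: targets; mu, sigma, nu: predicted location, scale, dof (sigma, nu > 0)."""
    z2 = ((y - mu) / sigma) ** 2
    return (
        -torch.lgamma((nu + 1) / 2)
        + torch.lgamma(nu / 2)
        + 0.5 * torch.log(nu * math.pi * sigma**2)
        + (nu + 1) / 2 * torch.log1p(z2 / nu)
    ).mean()

y = torch.tensor([55.0, 62.0])                       # observed speeds (mph)
mu = torch.tensor([53.0, 64.0], requires_grad=True)  # predicted speeds
sigma, nu = torch.tensor([4.0, 4.0]), torch.tensor([5.0, 5.0])
loss = student_t_nll(y, mu, sigma, nu)
loss.backward()                                      # trains the predictor
```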

Result: Experiments on four Florida freeways show MMCAformer outperforms baselines. Adding micro driving behavior features improves prediction accuracy (RMSE, MAE, MAPE reduced by 9.0%, 6.9%, 10.2%) and reduces uncertainty (predictive intervals decreased by 10.1-24.0%). Hard braking and acceleration are most influential features.

Conclusion: Incorporating micro driving behavior features from connected vehicle data significantly enhances traffic speed prediction accuracy and reduces uncertainty, especially under congested conditions, demonstrating the importance of behavioral insights for traffic prediction.

Abstract: Accurate speed prediction is crucial for proactive traffic management to enhance traffic efficiency and safety. Existing studies have primarily relied on aggregated, macroscopic traffic flow data to predict future traffic trends, whereas road traffic dynamics are also influenced by individual, microscopic human driving behaviors. Recent Connected Vehicle (CV) data provide rich driving behavior features, offering new opportunities to incorporate these behavioral insights into speed prediction. To this end, we propose the Macro-Micro Cross-Attention Transformer (MMCAformer) to integrate CV data-based micro driving behavior features with macro traffic features for speed prediction. Specifically, MMCAformer employs self-attention to learn intrinsic dependencies in macro traffic flow and cross-attention to capture spatiotemporal interplays between macro traffic status and micro driving behavior. MMCAformer is optimized with a Student-t negative log-likelihood loss to provide point-wise speed prediction and estimate uncertainty. Experiments on four Florida freeways demonstrate the superior performance of the proposed MMCAformer compared to baselines. Compared with only using macro features, introducing micro driving behavior features not only enhances prediction accuracy (e.g., overall RMSE, MAE, and MAPE reduced by 9.0%, 6.9%, and 10.2%, respectively) but also shrinks model prediction uncertainty (e.g., mean predictive intervals decreased by 10.1-24.0% across the four freeways). Results reveal that hard braking and acceleration frequencies emerge as the most influential features. Such improvements are more pronounced under congested, low-speed traffic conditions.

[255] A Few-Shot LLM Framework for Extreme Day Classification in Electricity Markets

Saud Alghumayjan, Ming Yi, Bolun Xu

Main category: cs.LG

TL;DR: LLM-based few-shot classification framework for predicting electricity price spikes using natural-language prompts of system state features.

DetailsMotivation: To develop a data-efficient approach for electricity price spike prediction that can work well with limited historical data, addressing challenges in power markets where data scarcity is common.

Method: Aggregates electricity demand, renewable generation, weather forecasts, and recent prices into statistical features, formats them as natural-language prompts, and feeds them to an LLM with general instructions for few-shot classification.
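
A hedged sketch of the prompt-construction step (feature names, values, and wording below are illustrative placeholders, not the paper's template):

```python
# Hedged sketch: render aggregated system-state features plus a few labeled
# examples into a natural-language few-shot prompt for the LLM.

def build_prompt(features: dict, examples: list) -> str:
    shots = "\n".join(
        f"Features: {ex['features']} -> Spike day: {ex['label']}"
        for ex in examples
    )
    return (
        "You classify whether tomorrow will see real-time electricity "
        "price spikes. Answer 'yes' or 'no' with a confidence in [0, 1].\n"
        f"{shots}\n"
        f"Features: {features} -> Spike day:"
    )

day = {"peak_load_MW": 74100, "wind_forecast_MW": 3200, "max_temp_F": 103,
       "mean_price_prev_day": 412.0}
examples = [{"features": {"peak_load_MW": 61000, "wind_forecast_MW": 11000,
                          "max_temp_F": 88, "mean_price_prev_day": 28.0},
             "label": "no"}]
print(build_prompt(day, examples))
```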

Result: Achieves performance comparable to supervised ML models (SVM, XGBoost) with full data, and outperforms them when limited historical data is available, using Texas electricity market data.

Conclusion: LLMs show potential as data-efficient tools for electricity price spike classification in data-scarce settings, demonstrating their applicability beyond traditional NLP tasks.

Abstract: This paper proposes a few-shot classification framework based on Large Language Models (LLMs) to predict whether the next day will have spikes in real-time electricity prices. The approach aggregates system state information, including electricity demand, renewable generation, weather forecasts, and recent electricity prices, into a set of statistical features that are formatted as natural-language prompts and fed to an LLM along with general instructions. The model then determines the likelihood that the next day would be a spike day and reports a confidence score. Using historical data from the Texas electricity market, we demonstrate that this few-shot approach achieves performance comparable to supervised machine learning models, such as Support Vector Machines and XGBoost, and outperforms the latter two when limited historical data are available. These findings highlight the potential of LLMs as a data-efficient tool for classifying electricity price spikes in settings with scarce data.

[256] Real-time Secondary Crash Likelihood Prediction Excluding Post Primary Crash Features

Lei Han, Mohamed Abdel-Aty, Zubayer Islam, Chenzhu Wang

Main category: cs.LG

TL;DR: Hybrid framework for real-time secondary crash prediction using traffic flow and environmental features without post-crash data, achieving 91% accuracy with low false alarm rate.

DetailsMotivation: Existing secondary crash prediction methods rely on post-crash features (crash type, severity) that are rarely available in real-time, limiting practical applicability for active traffic management systems.

Method: Proposes hybrid framework with dynamic spatiotemporal window to extract real-time traffic flow and environmental features from primary crash locations and upstream segments. Includes three models: primary crash model and two secondary crash models for different comparative scenarios. Uses ensemble learning with six ML algorithms and voting-based mechanism to combine outputs.
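
The voting step might look like the toy sketch below (equal weights and a 0.5 threshold are assumptions; the paper combines a primary-crash model with two scenario-specific secondary-crash models):

```python
# Hedged sketch of a majority-vote combination of the three model outputs.

def hybrid_vote(p_primary: float, p_crash_seg: float, p_upstream: float,
                threshold: float = 0.5) -> int:
    votes = [p >= threshold for p in (p_primary, p_crash_seg, p_upstream)]
    return int(sum(votes) >= 2)          # alarm only when a majority agrees

print(hybrid_vote(0.81, 0.64, 0.38))     # 1 -> issue a secondary-crash alert
```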

Result: Framework correctly identifies 91% of secondary crashes with false alarm rate of 0.20. AUC improves from 0.654, 0.744, and 0.902 for individual models to 0.952 for hybrid model, outperforming previous studies on Florida freeways.

Conclusion: The proposed real-time hybrid framework effectively predicts secondary crash likelihood without post-crash features, demonstrating superior performance for active traffic management applications.

Abstract: Secondary crash likelihood prediction is a critical component of an active traffic management system to mitigate congestion and adverse impacts caused by secondary crashes. However, existing approaches mainly rely on post-crash features (e.g., crash type and severity) that are rarely available in real time, limiting their practical applicability. To address this limitation, we propose a hybrid secondary crash likelihood prediction framework that does not depend on post-crash features. A dynamic spatiotemporal window is designed to extract real-time traffic flow and environmental features from primary crash locations and their upstream segments. The framework includes three models: a primary crash model to estimate the likelihood of secondary crash occurrence, and two secondary crash models to evaluate traffic conditions at crash and upstream segments under different comparative scenarios. An ensemble learning strategy integrating six machine learning algorithms is developed to enhance predictive performance, and a voting-based mechanism combines the outputs of the three models. Experiments on Florida freeways demonstrate that the proposed hybrid framework correctly identifies 91% of secondary crashes with a low false alarm rate of 0.20. The Area Under the ROC Curve improves from 0.654, 0.744, and 0.902 for the individual models to 0.952 for the hybrid model, outperforming previous studies.

[257] Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

Karan Bali, Jack Stanley, Praneet Suresh, Danilo Bzdok

Main category: cs.LG

TL;DR: Transformer attention head stability varies across layers and model sizes, with middle layers being least stable but most functionally important, and weight decay improves cross-instance robustness.

DetailsMotivation: To determine whether transformer circuits discovered in mechanistic interpretability are stable across different training instances or idiosyncratic to specific runs, which is crucial for safety-critical applications and scalable oversight.

Method: Systematically study stability across refits in transformer language models of various sizes, quantifying layer-by-layer similarity of attention head representations across independently initialized training runs with rigorous experiments.
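
The summary does not pin down the similarity measure; linear CKA is one standard choice for comparing matched head activations across runs, sketched below on synthetic data:

```python
import numpy as np

# Hedged sketch: linear CKA between activations of matched heads from two
# independently initialized runs on the same inputs (the paper's exact
# metric is not specified in this summary).

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X, Y: (num_tokens, head_dim) activation matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 64))
print(linear_cka(X, X))                                # 1.0: identical heads
print(linear_cka(X, rng.standard_normal((500, 64))))   # near 0: unrelated heads
```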

Result: Middle-layer heads are least stable yet most representationally distinct; deeper models show stronger mid-depth divergence; unstable heads in deeper layers become more functionally important; weight decay optimization substantially improves attention-head stability; residual stream is comparatively stable.

Conclusion: Cross-instance robustness of circuits is essential for scalable oversight and white-box monitorability of AI systems, establishing stability as a prerequisite for reliable mechanistic interpretability.

Abstract: In mechanistic interpretability, recent work scrutinizes transformer “circuits”: sparse, single- or multi-layer sub-computations that may reflect human-understandable functions. Yet, these network circuits are rarely acid-tested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle-layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid-depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention-head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross-instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white-box monitorability of AI systems.

[258] DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Haoxiang Sun, Lizhen Xu, Bing Zhao, Wotao Yin, Wei Wang, Boyu Yang, Rui Wang, Hu Wei

Main category: cs.LG

TL;DR: DeepVision-103K is a large-scale dataset for Reinforcement Learning with Verifiable Rewards (RLVR) training that enhances multimodal mathematical reasoning in Large Multimodal Models by providing diverse K12 math topics, extensive knowledge points, and rich visual elements.

DetailsMotivation: Existing RLVR datasets are limited by small-scale manual construction or recombination of prior resources, which restricts data diversity and coverage, thereby constraining further improvements in multimodal model performance for visual reflection and reasoning tasks.

Method: The authors introduce DeepVision-103K, a comprehensive dataset covering diverse K12 mathematical topics, extensive knowledge points, and rich visual elements for RLVR training of Large Multimodal Models.

Result: Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks and generalize effectively to general multimodal reasoning tasks, with analysis revealing enhanced visual perception, reflection, and reasoning capabilities.

Conclusion: DeepVision-103K is an effective dataset for advancing multimodal reasoning capabilities in Large Multimodal Models through RLVR training, demonstrating improved performance on mathematical and general multimodal reasoning tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision-103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision’s effectiveness for advancing multimodal reasoning. Data: https://huggingface.co/datasets/skylenage/DeepVision-103K

[259] PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng

Main category: cs.LG

TL;DR: PETS introduces a principled optimization framework for efficient test-time self-consistency that allocates sampling budgets based on question difficulty, reducing sampling costs by up to 75% while maintaining performance.

DetailsMotivation: Current test-time scaling methods require aggregating multiple stochastic reasoning trajectories, but achieving sample-efficient self-consistency under limited computational budgets remains challenging. There's a need for principled approaches to allocate sampling resources effectively.

Method: PETS introduces a principled optimization framework using a new measure called “self-consistency rate” (agreement with infinite-budget majority vote). It studies offline (all questions known) and online (sequential questions) settings. Offline approach models reasoning traces as workers in crowdsourcing, enabling theoretical guarantees and efficient majority-voting allocation. Online method adapts budgets to question difficulty on the fly while maintaining theoretical guarantees.
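
The self-consistency rate is easy to simulate when the answer distribution is assumed known, which makes difficulty-adaptive allocation intuitive (a hedged sketch; the paper estimates and optimizes this quantity under a real budget rather than simulating from known probabilities):

```python
import numpy as np
from collections import Counter

# Hedged sketch: how often a budget-n majority vote agrees with the
# infinite-budget majority, for an easy and a hard question.

def self_consistency_rate(p: np.ndarray, n: int, trials: int = 10_000) -> float:
    truth = int(np.argmax(p))                       # infinite-budget winner
    hits = 0
    for _ in range(trials):
        draws = np.random.choice(len(p), size=n, p=p)
        winner, _ = Counter(draws.tolist()).most_common(1)[0]
        hits += int(winner == truth)
    return hits / trials

easy = np.array([0.9, 0.05, 0.05])    # a few samples already suffice
hard = np.array([0.4, 0.35, 0.25])    # needs a much larger budget
for n in (3, 9, 27):
    print(n, self_consistency_rate(easy, n), self_consistency_rate(hard, n))
```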

Result: PETS consistently outperforms uniform allocation. On GPQA benchmark, PETS achieves perfect self-consistency while reducing sampling budget by up to 75% in offline setting and 55% in online setting compared to uniform allocation.

Conclusion: PETS provides a theoretically grounded, sample-efficient framework for test-time self-consistency that significantly reduces computational costs while maintaining or improving performance, with applications to various reasoning tasks.

Abstract: Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.

[260] Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

Yongzhong Xu

Main category: cs.LG

TL;DR: Geometric analysis reveals grokking in transformers involves low-dimensional confinement in an execution subspace with transverse curvature growth preceding generalization.

DetailsMotivation: To understand the poorly understood phenomenon of grokking - the delayed transition from memorization to generalization in small algorithmic tasks - by analyzing optimization dynamics in transformers trained on modular arithmetic.

Method: Geometric analysis of optimization dynamics using PCA of attention weight trajectories to identify low-dimensional execution subspace, measurement of commutator defects (non-commutativity of gradient steps) projected onto learned subspace, and causal intervention experiments.
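
The trajectory-PCA probe can be sketched on synthetic data: stack flattened weights from successive checkpoints and check how much variance the top principal component explains (illustrative only; the commutator-defect measurement is not reproduced here):

```python
import numpy as np

# Hedged sketch of the trajectory-PCA probe with a synthetic trajectory
# that mostly drifts along one direction, mimicking low-dimensional
# confinement of the optimization path.

rng = np.random.default_rng(0)
T, D = 200, 512                       # checkpoints x flattened weight dim
direction = rng.standard_normal(D)
traj = np.outer(np.linspace(0, 1, T), direction) + 0.01 * rng.standard_normal((T, D))

traj = traj - traj.mean(0)            # center before PCA
_, s, _ = np.linalg.svd(traj, full_matrices=False)
explained = s**2 / (s**2).sum()
print(explained[0])                   # close to 1: one PC dominates the trajectory
```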

Result: Training evolves predominantly within a low-dimensional execution subspace (single PC captures 68-83% variance), curvature grows sharply in orthogonal directions preceding generalization, and motion along learned subspace is necessary for grokking while artificially increasing curvature is insufficient.

Conclusion: Grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation, with consistent findings across learning rates and hyperparameter regimes.

Abstract: Grokking – the delayed transition from memorization to generalization in small algorithmic tasks – remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects – the non-commutativity of successive gradient steps – and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across this learning-rate range, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.

[261] LiveClin: A Live Clinical Benchmark without Leakage

Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang

Main category: cs.LG

TL;DR: LiveClin is a live benchmark for medical LLMs built from contemporary case reports to address data contamination and knowledge obsolescence issues in static benchmarks, featuring multimodal clinical scenarios evaluated by physicians.

DetailsMotivation: Current medical LLM evaluation suffers from data contamination (models trained on benchmark data) and knowledge obsolescence (static benchmarks don't reflect current medical knowledge), leading to inflated scores that don't reflect real-world clinical performance.

Method: Built LiveClin from contemporary peer-reviewed case reports updated biannually. Used AI-human workflow with 239 physicians to transform authentic patient cases into complex multimodal evaluation scenarios covering entire clinical pathways. Created 1,407 case reports with 6,605 questions.

Result: Evaluation of 26 models showed poor performance on real-world scenarios - top model achieved only 35.7% Case Accuracy. Human experts (Chief Physicians and Attending Physicians) outperformed most models, demonstrating the gap between current models and clinical expertise.

Conclusion: LiveClin provides a continuously evolving, clinically grounded framework for evaluating medical LLMs that better approximates real-world practice, helping guide development toward greater reliability and utility in clinical settings.

Abstract: The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real-world utility. Our data and code are publicly available at https://github.com/AQ-MedAI/LiveClin.

[262] Linear Convergence in Games with Delayed Feedback via Extra Prediction

Yuma Fujimoto, Kenshi Abe, Kaito Ariu

Main category: cs.LG

TL;DR: WOGDA algorithm with extra optimism accelerates convergence in bilinear games with feedback delays, achieving exponential rates that improve with additional future reward prediction.

DetailsMotivation: Feedback delays in multi-agent learning severely degrade performance, and convergence rates under delayed feedback remain unclear even for fundamental bilinear games. The paper aims to understand and improve convergence in such delayed settings.

Method: Proposes Weighted Optimistic Gradient Descent-Ascent (WOGDA) with extra optimism that predicts farther future rewards. Analyzes it as an approximation of Extra Proximal Point (EPP) method, comparing to classical Proximal Point (PP).

Result: Standard optimism achieves linear convergence at rate exp(-Θ(t/m^5)) after t iterations for delay m. Extra optimism tolerates larger step sizes and accelerates rate to exp(-Θ(t/(m^2 log m))). Experiments confirm accelerated convergence.

Conclusion: Extra optimism is a promising countermeasure against performance degradation caused by feedback delays in multi-agent learning, significantly accelerating convergence in bilinear games.

Abstract: Feedback delays are inevitable in real-world multi-agent learning. They are known to severely degrade performance, and the convergence rate under delayed feedback is still unclear, even for bilinear games. This paper derives the rate of linear convergence of Weighted Optimistic Gradient Descent-Ascent (WOGDA), which predicts future rewards with extra optimism, in unconstrained bilinear games. To analyze the algorithm, we interpret it as an approximation of the Extra Proximal Point (EPP), which is updated based on farther future rewards than the classical Proximal Point (PP). Our theorems show that standard optimism (predicting the next-step reward) achieves linear convergence to the equilibrium at a rate $\exp(-\Theta(t/m^{5}))$ after $t$ iterations for delay $m$. Moreover, employing extra optimism (predicting farther future reward) tolerates a larger step size and significantly accelerates the rate to $\exp(-\Theta(t/(m^{2}\log m)))$. Our experiments also show accelerated convergence driven by the extra optimism and are qualitatively consistent with our theorems. In summary, this paper validates that extra optimism is a promising countermeasure against performance degradation caused by feedback delays.

[263] Attending to Routers Aids Indoor Wireless Localization

Ayush Roy, Tahsin Fuad Hassan, Roshan Ayyalasomayajula, Vishnu Suresh Lokhande

Main category: cs.LG

TL;DR: The paper introduces attention mechanisms to Wi-Fi localization by weighting router contributions differently during triangulation, improving accuracy by over 30% compared to benchmarks.

DetailsMotivation: Current Wi-Fi localization methods fail to properly weight information from different routers during aggregation, leading to suboptimal convergence and reduced accuracy across diverse environments.

Method: Incorporates attention layers into standard machine learning localization architecture to weight each router’s contribution differently during triangulation, inspired by traditional weighted triangulation methods.
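
A minimal sketch of the attention-weighted aggregation idea (dimensions and the scoring head are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Hedged sketch of attention-weighted pooling over per-router features
# before a downstream location head.

class RouterAttention(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # learned relevance per router

    def forward(self, router_feats: torch.Tensor) -> torch.Tensor:
        """router_feats: (batch, num_routers, feat_dim) Wi-Fi features."""
        w = torch.softmax(self.score(router_feats), dim=1)   # per-router weight
        return (w * router_feats).sum(dim=1)                 # weighted fusion

feats = torch.randn(8, 5, 32)           # 8 samples, 5 routers, 32-d features
fused = RouterAttention(32)(feats)      # feeds the localization head
print(fused.shape)                      # torch.Size([8, 32])
```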

Result: Attention to Routers outperforms benchmark architecture by over 30% in accuracy on open-sourced datasets.

Conclusion: Applying attention mechanisms to router weighting significantly improves Wi-Fi localization performance by better emphasizing the relevance of each router’s information.

Abstract: Modern machine learning-based wireless localization using Wi-Fi signals continues to face significant challenges in achieving groundbreaking performance across diverse environments. A major limitation is that most existing algorithms do not appropriately weight the information from different routers during aggregation, resulting in suboptimal convergence and reduced accuracy. Motivated by traditional weighted triangulation methods, this paper introduces the concept of attention to routers, ensuring that each router’s contribution is weighted differently when aggregating information from multiple routers for triangulation. We demonstrate, by incorporating attention layers into a standard machine learning localization architecture, that emphasizing the relevance of each router can substantially improve overall performance. We also evaluate on open-sourced datasets and demonstrate that Attention to Routers outperforms the benchmark architecture by over 30% in accuracy.

[264] Machine Learning Argument of Latitude Error Model for LEO Satellite Orbit and Covariance Correction

Alex Moody, Penina Axelrad, Rebecca Russell

Main category: cs.LG

TL;DR: Machine learning approach to correct orbit propagation errors in LEO satellites, particularly addressing atmospheric drag mismodeling to maintain Gaussian uncertainty assumptions for PNT services.

Motivation: LEO satellites are being used for alternative PNT services, requiring accurate orbit propagation with realistic uncertainty quantification. The Gaussian uncertainty assumption breaks down due to atmospheric drag mismodeling, limiting the utility of VCM ephemerides for longer time horizons.

Method: Developed ML models (time-conditioned neural network and Gaussian Process) to predict argument of latitude error as Gaussian distribution using parameters from single VCM epoch and reverse propagation errors. The 1D model captures mismodeled drag effects that can be mapped to Cartesian state space.

Result: The learned models successfully correct error growth in argument of latitude for diverse LEO satellites, extending applicability of Gaussian assumption and improving orbit propagation accuracy. The correction method updates only dimensions of dominant error growth while maintaining physics-based VCM covariance propagation in other dimensions.

Conclusion: The ML-based correction extends utility of VCM ephemerides to longer time horizons without modifying existing propagator functionality, enabling more reliable PNT services from LEO satellites.

Abstract: Low Earth orbit (LEO) satellites are leveraged to support new position, navigation, and timing (PNT) service alternatives to GNSS. These alternatives require accurate propagation of satellite position and velocity with a realistic quantification of uncertainty. It is commonly assumed that the propagated uncertainty distribution is Gaussian; however, the validity of this assumption can be quickly compromised by the mismodeling of atmospheric drag. We develop a machine learning approach that corrects error growth in the argument of latitude for a diverse set of LEO satellites. The improved orbit propagation accuracy extends the applicability of the Gaussian assumption and enables modeling of the errors with a corrected mean and covariance. We compare the performance of a time-conditioned neural network and a Gaussian Process on datasets computed with an open source orbit propagator and publicly available Vector Covariance Message (VCM) ephemerides. The learned models predict the argument of latitude error as a Gaussian distribution given parameters from a single VCM epoch and reverse propagation errors. We show that this one-dimensional model captures the effect of mismodeled drag, which can be mapped to the Cartesian state space. The correction method only updates information along the dimensions of dominant error growth, while maintaining the physics-based propagation of VCM covariance in the remaining dimensions. We therefore extend the utility of VCM ephemerides to longer time horizons without modifying the functionality of the existing propagator.
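
A hypothetical sketch of the time-conditioned variant: a small network maps features from a single VCM epoch plus propagation time to the mean and standard deviation of the argument-of-latitude error. The feature dimension, layer sizes, and placeholder targets are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AoLErrorModel(nn.Module):
    """Maps single-VCM-epoch features plus propagation time dt to a
    Gaussian (mu, sigma) over the argument-of-latitude error."""
    def __init__(self, n_feats=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feats + 1, 64), nn.ReLU(),
                                 nn.Linear(64, 2))     # -> (mu, log_sigma)

    def forward(self, vcm_feats, dt):
        mu, log_sigma = self.net(torch.cat([vcm_feats, dt], -1)).unbind(-1)
        return mu, log_sigma.exp()

model = AoLErrorModel()
feats, dt = torch.randn(16, 8), torch.rand(16, 1) * 48.0  # hours ahead (assumed)
mu, sigma = model(feats, dt)
target = torch.zeros(16)                 # observed AoL errors (placeholder)
loss = nn.GaussianNLLLoss()(mu, target, sigma ** 2)
```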

[265] Omitted Variable Bias in Language Models Under Distribution Shift

Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael

Main category: cs.LG

TL;DR: A framework for analyzing distribution shifts in language models that separates observable and unobservable components, addressing omitted variable bias through worst-case generalization bounds.

Motivation: Modern language models perform well on in-distribution data but are brittle under distribution shifts. Current methods only address observable shifts, leaving unobservable variables that cause omitted variable bias, compromising both evaluation and optimization.

Method: Separates distribution shifts into observable and unobservable components. Introduces a framework that maps omitted variable strength to bounds on worst-case generalization performance under distribution shift. Uses these bounds for evaluation and optimization.

Result: The framework provides more principled measures of out-of-distribution performance, improves true OOD performance compared to standard methods, and enables inference about omitted variable strength when target distribution labels are available.

Conclusion: Addressing both observable and unobservable components of distribution shift is crucial for robust language models. The proposed framework offers better evaluation and optimization for OOD scenarios.

Abstract: Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated into observable and unobservable components, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on the worst-case generalization performance of language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.
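
One standard way such a bound can be instantiated, sketched under assumptions: a marginal-sensitivity-style model in which the unobserved shift may reweight examples by at most a factor `gamma` (the omitted-variable strength). The paper's actual bound and parameterization may differ.

```python
import numpy as np

def worst_case_loss(losses, gamma):
    """Sup of the weighted mean loss over per-example weights in
    [1/gamma, gamma] (normalized); attained by putting maximal weight
    on the k largest losses for some k, so we scan all k."""
    losses = np.sort(np.asarray(losses, dtype=float))
    n = len(losses)
    best = -np.inf
    for k in range(n + 1):
        w = np.r_[np.full(n - k, 1 / gamma), np.full(k, gamma)]
        best = max(best, float((w * losses).sum() / w.sum()))
    return best

per_example = np.random.default_rng(1).exponential(size=200)
print(worst_case_loss(per_example, 1.0),   # gamma=1: no unobserved shift
      worst_case_loss(per_example, 3.0))   # stronger omitted variables
```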

[266] Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency

Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad

Main category: cs.LG

TL;DR: DCC is a training-free inference method that evaluates and improves LLMs’ causal reasoning by checking consistency between counterfactual predictions without needing labeled data.

Motivation: LLMs show brittleness on counterfactual questions despite strong reasoning benchmarks, indicating weak causal reasoning. Existing labeled counterfactual datasets are limited in scale, creating a need for methods that can evaluate and improve causal reasoning without extensive labeled data.

Method: Double Counterfactual Consistency (DCC) is an inference-time method that verifies two key causal reasoning elements: causal intervention and counterfactual prediction. It works by checking consistency between counterfactual predictions without requiring labeled counterfactual data, and can be used as a training-free test-time rejection sampling criterion.

Result: DCC effectively evaluates causal reasoning abilities of leading LLMs across various reasoning tasks and interventions. It serves as an effective training-free test-time rejection sampling criterion that directly improves performance on reasoning tasks across multiple model families.

Conclusion: DCC provides a lightweight, data-efficient method for assessing and enhancing LLMs’ causal reasoning capabilities without requiring labeled counterfactual data, addressing limitations in current evaluation approaches.

Abstract: Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs’ causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model’s ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
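
A schematic of how such a rejection-sampling criterion could look. Here `llm`, `intervene`, and `revert` are hypothetical callables, and the round-trip check is an assumed simplification of the paper's double-consistency test, not its exact procedure.

```python
def dcc_accept(llm, intervene, revert, question, n_tries=4):
    """Keep a sampled answer only if applying an intervention and then
    reverting it returns the model to its original prediction."""
    for _ in range(n_tries):
        base = llm(question)
        round_trip = llm(revert(intervene(question)))
        if round_trip == base:          # the counterfactual-consistency check
            return base                 # accept this sample
    return None                         # reject: no consistent sample found
```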

[267] Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, Sanjeev Arora

Main category: cs.LG

TL;DR: A cost-effective inference pipeline achieves state-of-the-art performance on IMO math problems using general-purpose models, addressing grader failure through conjecture extraction and context detachment.

Motivation: Previous methods for solving IMO-level math problems either used custom/unreleased models or required prohibitively expensive inference on public models. There's a need for cost-effective solutions using general-purpose models.

Method: The pipeline addresses “Cognitive Well” failures (iterative refinement converging to wrong solutions) through conjecture extraction - isolating candidate lemmas from generated solutions and independently verifying them alongside their negations in fresh environments (context detachment).

Result: Achieves 67.1% performance on IMO-ProofBench Advanced using Gemini 3.0 Pro with average cost of ~31 USD per question, representing state-of-the-art at time of evaluation and more than doubling success rate of next best public pipeline.

Conclusion: The work demonstrates that best-in-class performance on challenging math reasoning tasks can be achieved cost-effectively using general-purpose models with careful pipeline design addressing specific failure modes.

Abstract: In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline’s internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.
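
A hypothetical sketch of the conjecture-extraction loop; the interfaces `solver`, `grader`, and `extract_lemmas` are placeholders, not the paper's code. The key point from the abstract is that each candidate lemma is checked in a fresh context, alongside its negation, so the grader cannot be pulled into the same Cognitive Well as the solver.

```python
def filter_by_conjectures(solver, grader, extract_lemmas, problem):
    """Reject a solution unless every extracted lemma survives
    independent verification in a fresh context, alongside its negation."""
    solution = solver(problem)
    for lemma in extract_lemmas(solution):      # candidate sub-claims
        # Fresh contexts: the grader never sees the original transcript.
        lemma_holds = grader(f"Is this claim true? {lemma}")
        negation_holds = grader(f"Is this claim true? NOT ({lemma})")
        if not lemma_holds or negation_holds:   # lemma fails, or its negation "passes"
            return None                         # discard; caller resamples
    return solution
```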

[268] Efficient Tail-Aware Generative Optimization via Flow Model Fine-Tuning

Zifan Wang, Riccardo De Santi, Xiaoyu Mo, Michael M. Zavlanos, Andreas Krause, Karl H. Johansson

Main category: cs.LG

TL;DR: TFFT is a distributional fine-tuning method for diffusion/flow models that uses Conditional Value-at-Risk to control tail behavior for reliability (left tail) or discovery (right tail), with efficient two-stage optimization.

Motivation: Existing fine-tuning methods for diffusion and flow models only maximize expected reward without controlling tail behavior, but tail control is essential for reliability (limiting low-reward failures) and discovery (prioritizing rare high-reward outcomes).

Method: Tail-aware Flow Fine-Tuning (TFFT) uses Conditional Value-at-Risk (CVaR) for tail shaping. It decomposes CVaR optimization into two stages: 1) lightweight one-dimensional threshold optimization, and 2) single entropy-regularized fine-tuning with a specific pseudo-reward, making it computationally comparable to standard expected fine-tuning.

Result: Demonstrated effectiveness across illustrative experiments, high-dimensional text-to-image generation, and molecular design applications.

Conclusion: TFFT provides a principled and efficient method for controlling tail behavior in diffusion/flow model fine-tuning, addressing both reliability and discovery goals through CVaR-based optimization.

Abstract: Fine-tuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.
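
Stage one of the decomposition has a closed form under the standard Rockafellar-Uryasev variational representation of CVaR (a sketch consistent with the abstract; the paper's exact objective may differ): for the left tail, the optimal threshold is the alpha-quantile of sampled rewards, and the pseudo-reward for stage two follows directly.

```python
# Left-tail CVaR_alpha(R) = max_tau { tau - E[(tau - R)_+] / alpha },
# maximized at tau* = alpha-quantile of R (Rockafellar-Uryasev form).
import numpy as np

def left_cvar_threshold(rewards, alpha):
    return float(np.quantile(rewards, alpha))    # stage 1: 1-D threshold

def pseudo_reward(r, tau, alpha):
    """Stage 2 fine-tunes against this reward instead of r itself."""
    return tau - np.maximum(tau - r, 0.0) / alpha

rewards = np.random.default_rng(0).normal(size=10_000)
tau = left_cvar_threshold(rewards, alpha=0.1)
print(tau, pseudo_reward(rewards, tau, 0.1).mean())  # mean ~ left CVaR_0.1
```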

[269] Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning

Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li

Main category: cs.LG

TL;DR: Continuous-time multi-agent reinforcement learning framework using physics-informed neural networks to solve Hamilton-Jacobi-Bellman equations for high-frequency dynamical systems

Motivation: Existing RL methods struggle with complex dynamical systems requiring high-frequency or irregular time interactions. Continuous-time RL shows promise but has been limited to single-agent domains due to curse of dimensionality in HJB equations and difficulty approximating centralized value functions in multi-agent settings.

Method: Proposes CT-MARL framework using physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. Introduces Value Gradient Iteration (VGI) module that aligns value learning with value-gradient learning by iteratively refining value gradients along trajectories to improve gradient fidelity.

Result: Method evaluated on continuous-time variants of standard benchmarks including multi-agent particle environment (MPE) and multi-agent MuJoCo. Consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.

Conclusion: The proposed CT-MARL framework with PINNs and VGI successfully addresses challenges of continuous-time multi-agent reinforcement learning, enabling effective learning in high-dimensional, high-frequency dynamical systems.

Abstract: Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton–Jacobi–Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.
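
A minimal PINN-style sketch of the two loss terms: penalize the HJB residual at sampled states and supervise the value gradient, in the spirit of VGI. The dynamics, reward, discount, policy, and gradient target below are toy assumptions, not the paper's CT-MARL setup.

```python
import torch
import torch.nn as nn

V = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # value net
s = torch.randn(256, 4, requires_grad=True)          # sampled joint states

def f(s, a): return -s + a                           # toy dynamics (assumption)
def r(s, a): return -(s ** 2).sum(-1, keepdim=True)  # toy reward (assumption)

a = torch.zeros_like(s)                              # placeholder joint policy
v = V(s)
grad_v = torch.autograd.grad(v.sum(), s, create_graph=True)[0]
# HJB residual for a discounted problem: r + <grad V, f> - rho * V = 0
residual = r(s, a) + (grad_v * f(s, a)).sum(-1, keepdim=True) - 0.1 * v
# VGI-style term: supervise grad V toward a refined estimate. Placeholder
# here; in VGI the target is iteratively refined along trajectories.
target_grad = grad_v.detach()
loss = residual.pow(2).mean() + (grad_v - target_grad).pow(2).mean()
loss.backward()
```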

[270] TopoFlow: Physics-guided Neural Networks for high-resolution air quality prediction

Ammar Kheder, Helmi Toropainen, Wenqing Peng, Samuel Antão, Jia Chen, Zhi-Song Liu, Michael Boy

Main category: cs.LG

TL;DR: TopoFlow: Physics-guided neural network for high-resolution air quality prediction using topography-aware attention and wind-guided patch reordering in vision transformers.

Motivation: Current air quality prediction systems lack explicit modeling of physical processes like topography and wind dynamics that govern pollutant transport and dispersion, limiting accuracy and reliability.

Method: Uses vision transformer architecture with two novel mechanisms: topography-aware attention to model terrain-induced flow patterns, and wind-guided patch reordering to align spatial representations with prevailing wind directions.

Result: Achieves PM2.5 RMSE of 9.71 ug/m3, 71-80% improvement over operational forecasting systems and 13% improvement over state-of-the-art AI baselines, with errors below China’s air quality threshold.

Conclusion: Principled integration of physical knowledge into neural networks can fundamentally advance air quality prediction, with consistent performance gains across pollutants and forecast lead times.

Abstract: We propose TopoFlow (Topography-aware pollutant Flow learning), a physics-guided neural network for efficient, high-resolution air quality prediction. To explicitly embed physical processes into the learning framework, we identify two critical factors governing pollutant dynamics: topography and wind direction. Complex terrain can channel, block, and trap pollutants, while wind acts as a primary driver of their transport and dispersion. Building on these insights, TopoFlow leverages a vision transformer architecture with two novel mechanisms: topography-aware attention, which explicitly models terrain-induced flow patterns, and wind-guided patch reordering, which aligns spatial representations with prevailing wind directions. Trained on six years of high-resolution reanalysis data assimilating observations from over 1,400 surface monitoring stations across China, TopoFlow achieves a PM2.5 RMSE of 9.71 ug/m3, representing a 71-80% improvement over operational forecasting systems and a 13% improvement over state-of-the-art AI baselines. Forecast errors remain well below China’s 24-hour air quality threshold of 75 ug/m3 (GB 3095-2012), enabling reliable discrimination between clean and polluted conditions. These performance gains are consistent across all four major pollutants and forecast lead times from 12 to 96 hours, demonstrating that principled integration of physical knowledge into neural networks can fundamentally advance air quality prediction.
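
An illustrative sketch of one assumed mechanism, wind-guided patch reordering: ViT patches are sorted by their projection onto the prevailing wind vector, so the sequence runs upwind to downwind. TopoFlow's actual reordering and attention details are not reproduced here.

```python
import torch

def wind_reorder(patches, coords, wind):
    """patches: (n, d) tokens; coords: (n, 2) patch centers;
    wind: (2,) prevailing wind direction."""
    order = torch.argsort(coords @ wind)    # upwind -> downwind
    return patches[order], order

patches = torch.randn(196, 128)             # 14 x 14 grid of patch tokens
coords = torch.stack(torch.meshgrid(
    torch.arange(14.), torch.arange(14.), indexing="ij"), -1).reshape(-1, 2)
reordered, order = wind_reorder(patches, coords, torch.tensor([1.0, 0.5]))
```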

[271] Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

Itamar Hadad, Guy Katz, Shahaf Bassan

Main category: cs.LG

TL;DR: Automated circuit discovery with provable guarantees using neural network verification techniques

Motivation: Prior circuit discovery methods lack provable guarantees over continuous input domains, relying on heuristics and approximations instead of formal verification.

Method: Leverage neural network verification to develop algorithms providing three types of guarantees: input domain robustness, robust patching, and minimality

Result: Algorithms yield circuits with substantially stronger robustness guarantees than standard methods, demonstrated on various vision models using state-of-the-art verifiers

Conclusion: Establishes principled foundation for provable circuit discovery with formal guarantees, uncovering theoretical connections between different guarantee types

Abstract: Automated circuit discovery is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with provable guarantees. We focus on three types of guarantees: (1) input domain robustness, ensuring the circuit agrees with the model across a continuous input region; (2) robust patching, certifying circuit alignment under continuous patching perturbations; and (3) minimality, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.
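
As a toy illustration of what "agreement over a continuous input region" means, here is a crude check via interval bound propagation, far looser than the state-of-the-art verifiers the paper employs; the two-layer network, pruning mask, and input box are assumptions.

```python
import numpy as np

def ibp(W1, b1, W2, b2, lo, hi):
    """Output bounds of W2 @ relu(W1 @ x + b1) + b2 over x in [lo, hi]."""
    c, r = (lo + hi) / 2, (hi - lo) / 2
    l1 = np.maximum(W1 @ c + b1 - np.abs(W1) @ r, 0)
    u1 = np.maximum(W1 @ c + b1 + np.abs(W1) @ r, 0)
    c2, r2 = (l1 + u1) / 2, (u1 - l1) / 2
    return W2 @ c2 + b2 - np.abs(W2) @ r2, W2 @ c2 + b2 + np.abs(W2) @ r2

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)
circuit_W1 = W1 * (np.abs(W1) > 0.5)        # "circuit": keep only strong edges
lo, hi = np.zeros(4), 0.1 * np.ones(4)      # continuous input box
lm, um = ibp(W1, b1, W2, b2, lo, hi)
lc, uc = ibp(circuit_W1, b1, W2, b2, lo, hi)
print(max((um - lc).max(), (uc - lm).max()))  # loose bound on |model - circuit|
```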

[272] HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind

Nigel Doering, Rahath Malladi, Arshia Sangwan, David Danks, Tauhidur Rahman

Main category: cs.LG

TL;DR: HiVAE is a hierarchical variational architecture that scales theory of mind reasoning to realistic spatiotemporal domains using a three-level VAE hierarchy inspired by human cognition’s belief-desire-intention structure.

Motivation: Existing theory of mind approaches focus on small human-understandable gridworld spaces, limiting their application to realistic domains. The authors aim to scale ToM reasoning to realistic spatiotemporal environments.

Method: HiVAE uses a three-level hierarchical variational autoencoder architecture inspired by the belief-desire-intention structure of human cognition. The hierarchy enables scaling to complex domains like a 3,185-node campus navigation task.

Result: The hierarchical structure achieves substantial performance improvements on the large-scale navigation task. However, a critical limitation is identified: learned latent representations lack explicit grounding to actual mental states.

Conclusion: While HiVAE successfully scales ToM reasoning to realistic domains, the lack of grounded mental state representations remains a challenge. The authors propose self-supervised alignment strategies and seek community feedback on grounding approaches.

Abstract: Theory of mind (ToM) enables AI systems to infer agents’ hidden goals and mental states, but existing approaches focus mainly on small, human-understandable gridworld spaces. We introduce HiVAE, a hierarchical variational architecture that scales ToM reasoning to realistic spatiotemporal domains. Inspired by the belief-desire-intention structure of human cognition, our three-level VAE hierarchy achieves substantial performance improvements on a 3,185-node campus navigation task. However, we identify a critical limitation: while our hierarchical structure improves prediction, learned latent representations lack explicit grounding to actual mental states. We propose self-supervised alignment strategies and present this work to solicit community feedback on grounding approaches.
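
A schematic sketch of a three-level latent hierarchy in the belief-desire-intention spirit; the shapes, single-Linear encoders, and decoding target are assumptions, not the released HiVAE.

```python
import torch
import torch.nn as nn

class TinyHiVAE(nn.Module):
    def __init__(self, obs_dim=32, z_dim=8):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Linear(obs_dim if i == 0 else z_dim, 2 * z_dim) for i in range(3)])
        self.dec = nn.Linear(z_dim, obs_dim)

    @staticmethod
    def sample(h):
        mu, log_var = h.chunk(2, -1)                 # Gaussian per level
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(-1).mean()
        return z, kl

    def forward(self, obs):
        x, kl_total = obs, 0.0
        for enc in self.enc:                         # belief -> desire -> intention
            x, kl = self.sample(enc(x))
            kl_total = kl_total + kl
        return self.dec(x), kl_total                 # prediction + KL term of the ELBO

pred, kl = TinyHiVAE()(torch.randn(16, 32))
```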

[273] Learning under noisy supervision is governed by a feedback-truth gap

Elan Schonfeld, Elias Wisnia

Main category: cs.LG

TL;DR: A fundamental feedback-truth gap emerges when feedback is processed faster than task structure can be evaluated, causing learners to prioritize feedback over truth across neural networks and human learning systems.

Motivation: The paper investigates a fundamental learning constraint: when feedback is processed faster than task structure can be evaluated, learners inevitably favor feedback over truth, creating a feedback-truth gap that affects learning under noisy supervision.

Method: Developed a two-timescale model showing the gap emerges when feedback and task evaluation rates differ. Tested across three systems: neural networks trained with noisy labels (30 datasets, 2,700 runs), human probabilistic reversal learning (N=292), and human reward/punishment learning with concurrent EEG (N=25). Truth was operationally defined in each context.

Result: The feedback-truth gap appeared universally across all systems but was regulated differently: dense networks accumulated it as memorization; sparse-residual scaffolding suppressed it; humans generated transient over-commitment that was actively recovered. Neural over-commitment (~0.04-0.10) was amplified tenfold into behavioral commitment (d = 3.3-3.9).

Conclusion: The feedback-truth gap is a fundamental constraint on learning under noisy supervision, and its consequences depend on the regulatory mechanisms each system employs to manage the discrepancy between feedback processing and task structure evaluation.

Abstract: When feedback is absorbed faster than task structure can be evaluated, the learner will favor feedback over truth. A two-timescale model shows this feedback-truth gap is inevitable whenever the two rates differ and vanishes only when they match. We test this prediction across neural networks trained with noisy labels (30 datasets, 2,700 runs), human probabilistic reversal learning (N = 292), and human reward/punishment learning with concurrent EEG (N = 25). In each system, truth is defined operationally: held-out labels, the objectively correct option, or the participant’s pre-feedback expectation - the only non-circular reference decodable from post-feedback EEG. The gap appeared universally but was regulated differently: dense networks accumulated it as memorization; sparse-residual scaffolding suppressed it; humans generated transient over-commitment that was actively recovered. Neural over-commitment (~0.04-0.10) was amplified tenfold into behavioral commitment (d = 3.3-3.9). The gap is a fundamental constraint on learning under noisy supervision; its consequences depend on the regulation each system employs.
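
A toy two-timescale simulation in the spirit of the model (the exact dynamics are assumptions): a feedback-driven belief updated at a fast rate and a truth-driven belief updated at a slow rate diverge, and the gap shrinks when the two rates match.

```python
import numpy as np

def mean_gap(eta_feedback, eta_truth, T=5000, noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    latent = 1.0                          # true task structure
    b_f = b_t = 0.0                       # feedback- / truth-driven beliefs
    gaps = []
    for _ in range(T):
        y = latent + noise * rng.standard_normal()   # noisy feedback signal
        b_f += eta_feedback * (y - b_f)              # fast feedback absorption
        b_t += eta_truth * (latent - b_t)            # slow structure evaluation
        gaps.append(abs(b_f - b_t))
    return float(np.mean(gaps[T // 2:]))

print(mean_gap(0.5, 0.02), mean_gap(0.05, 0.05))  # mismatched vs. matched rates
```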

[274] VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training – A Chess Case Study

Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

Main category: cs.LG

TL;DR: Verbalized Action Masking (VAM) improves RL post-training for LLMs by verbalizing action masks in prompts and using iterative pruning to enhance exploration in sparse feedback environments like chess.

Motivation: Exploration is a key bottleneck in RL post-training of LLMs, where sparse feedback and large action spaces often lead to premature collapse into repetitive behaviors. The paper addresses the need for more effective exploration mechanisms in this context.

Method: Proposes Verbalized Action Masking (VAM) which verbalizes an action mask in the prompt and enforces the model to output actions only from the masked set. Introduces iterative action-space pruning: if target action isn’t sampled, remove valid sampled actions from mask and resample under reduced candidate set, repeating until target is sampled or budget exhausted.

Result: VAM improves learning efficiency and final performance over strong baselines across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL). Evaluated in two training regimes: engine-play regime (states via play against engine) and fixed-dataset regime (training from fixed dataset with verifier scores).

Conclusion: Verbalized masking serves as a practical mechanism for controllable exploration in LLM RL post-training, addressing exploration challenges in sparse feedback environments with large action spaces.

Abstract: Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
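
An illustrative sketch of the iterative action-space pruning loop described in the abstract; `llm_sample` and the prompt format are placeholders, not the paper's code.

```python
def sample_with_pruning(llm_sample, state, legal_moves, target, budget=8):
    """Verbalize the mask in the prompt; if the target move is not
    sampled, prune the sampled move and retry with a smaller mask."""
    mask = list(legal_moves)
    for _ in range(budget):
        prompt = f"Position: {state}\nChoose one move from: {', '.join(mask)}"
        move = llm_sample(prompt)
        if move == target:
            return move                  # target action finally sampled
        if move in mask and len(mask) > 1:
            mask.remove(move)            # prune, then resample under reduced set
    return None                          # budget exhausted
```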

[275] A Residual-Aware Theory of Position Bias in Transformers

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue

Main category: cs.LG

TL;DR: Theoretical analysis reveals that residual connections prevent attention collapse in Transformers, explaining why attention doesn’t collapse to first token as predicted, and provides architectural explanation for Lost-in-the-Middle phenomenon.

Motivation: Transformer models show systematic position bias, but theoretical predictions of attention collapse to first token don't match practical observations. The architectural origins of position bias remain poorly understood.

Method: Developed a residual-aware theory of cumulative attention rollout that incorporates residual connections into analysis. Proved mathematically that at finite depth, causal Transformers induce U-shaped position bias.

Result: Residual connections prevent attention collapse under realistic conditions. At finite depth, Transformers show U-shaped position bias with attention concentrating on early and late tokens, explaining Lost-in-the-Middle phenomenon.

Conclusion: Residual connections are crucial architectural component that prevents attention collapse and creates systematic position biases in Transformers, providing principled explanation for observed attention patterns.

Abstract: Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
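
A numerical illustration in the spirit of the analysis (the standard residual-corrected rollout, with toy random weights and depth; not the paper's exact theorem): under causal attention, plain rollout mass drifts toward token 0 as depth grows, while folding in the residual path via (A + I)/2 slows the collapse and retains mass on late positions at finite depth.

```python
import numpy as np

n, depth = 12, 6
rng = np.random.default_rng(0)
A = np.tril(rng.random((n, n)))          # causal (lower-triangular) attention
A /= A.sum(1, keepdims=True)             # row-stochastic

plain, residual = np.eye(n), np.eye(n)
for _ in range(depth):
    plain = A @ plain                            # classical attention rollout
    residual = ((A + np.eye(n)) / 2) @ residual  # fold in the residual path

print(plain[-1].round(3))     # mass drifts toward position 0
print(residual[-1].round(3))  # residual path keeps mass on late positions too
```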

[276] Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao

Main category: cs.LG

TL;DR: Progressive Thought Encoding enables efficient RL training for large reasoning models by encoding intermediate reasoning into fixed-size vectors, eliminating need for full-cache rollouts while maintaining reasoning performance under memory constraints.

Motivation: Large reasoning models face efficiency barriers in RL training due to long rollouts requiring extensive memory for autoregressive decoding. Current sliding-window cache strategies disrupt long-context reasoning and degrade performance, creating a need for methods that maintain reasoning ability under fixed memory constraints.

Method: Progressive Thought Encoding is a parameter-efficient fine-tuning method that progressively encodes intermediate reasoning steps into fixed-size vector representations, enabling models to reason effectively under fixed-size caches without backpropagating through full-cache rollouts.

Result: Experiments on three models (Qwen2.5-3B/7B-Instruct, DeepSeek-R1-Distill-Llama-8B) across six mathematical benchmarks show +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning, with up to +23.4 accuracy improvement on AIME2024/2025 under tight cache budgets.

Conclusion: Progressive Thought Encoding improves reasoning accuracy while making RL training of large reasoning models substantially more efficient and scalable under real-world memory constraints, enabling better performance with limited computational resources.

Abstract: Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
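
A schematic sketch of one way such fixed-size encoding could work: learned-query cross-attention compresses a growing span of reasoning states into a constant number of summary vectors. This is an assumed mechanism for illustration, not the paper's method.

```python
import torch
import torch.nn as nn

class ThoughtEncoder(nn.Module):
    """Compress a growing span of reasoning states into a fixed number
    of summary vectors via learned-query cross-attention."""
    def __init__(self, d=64, n_summary=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_summary, d))
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)

    def forward(self, old_states):                   # (batch, t_old, d)
        q = self.queries.unsqueeze(0).expand(old_states.size(0), -1, -1)
        summary, _ = self.attn(q, old_states, old_states)
        return summary                               # (batch, n_summary, d)

summary = ThoughtEncoder()(torch.randn(2, 100, 64))  # 100 tokens -> 4 vectors
```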

[277] What is the Value of Censored Data? An Exact Analysis for the Data-driven Newsvendor

Rachitesh Kumar, Omar Mouchtaki

Main category: cs.LG

TL;DR: The paper studies offline data-driven newsvendor problem with censored demand data, providing exact worst-case regret analysis for inventory policies under demand censoring and showing how targeted exploration can improve performance.

Motivation: To address the practical challenge in inventory management where demand data is censored at inventory levels - only sales are observed, not true demand. This censoring creates fundamental limitations for learning from passive sales data, and the paper aims to understand what can and cannot be learned offline under such conditions.

Method: The authors provide a general procedure to compute exact worst-case regret of classical data-driven inventory policies over all demand distributions. Their main technical contribution reduces the infinite-dimensional, non-convex optimization problem to a finite-dimensional one, enabling exact characterization of policy performance for any sample size and censoring levels.

Result: The analysis shows that demand censoring fundamentally limits learning from passive sales data, but targeted exploration at high inventory levels can substantially improve worst-case guarantees, enabling near-optimal performance even under heavy censoring. Policies based on the “sales-as-demand” heuristic can suffer severe performance degradation as censored data accumulates.

Conclusion: The quality of point-of-sale information critically shapes what can be learned offline. While demand censoring presents fundamental limitations, strategic exploration can overcome these limitations, and treating sales as demand without accounting for censoring leads to poor performance.

Abstract: We study the offline data-driven newsvendor problem with censored demand data. In contrast to prior works where demand is fully observed, we consider the setting where demand is censored at the inventory level and only sales are observed; sales match demand when there is sufficient inventory, and equal the available inventory otherwise. We provide a general procedure to compute the exact worst-case regret of classical data-driven inventory policies, evaluated over all demand distributions. Our main technical result shows that this infinite-dimensional, non-convex optimization problem can be reduced to a finite-dimensional one, enabling an exact characterization of the performance of policies for any sample size and censoring levels. We leverage this reduction to derive sharp insights on the achievable performance of standard inventory policies under demand censoring. In particular, our analysis of the Kaplan-Meier policy shows that while demand censoring fundamentally limits what can be learned from passive sales data, just a small amount of targeted exploration at high inventory levels can substantially improve worst-case guarantees, enabling near-optimal performance even under heavy censoring. In contrast, when the point-of-sale system does not record stockout events and only reports realized sales, a natural and commonly used approach is to treat sales as demand. Our results show that policies based on this sales-as-demand heuristic can suffer severe performance degradation as censored data accumulates, highlighting how the quality of point-of-sale information critically shapes what can, and cannot, be learned offline.
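
A sketch of the Kaplan-Meier newsvendor policy on synthetic censored data (a standard construction; the paper analyzes its exact worst-case regret rather than this implementation). Sales are censored demand: the record is marked censored when demand hit the inventory ceiling.

```python
import numpy as np

def km_survival(sales, censored):
    """Kaplan-Meier estimate of demand survival P(D > level) from
    right-censored sales records."""
    order = np.argsort(sales)
    s, c = np.asarray(sales)[order], np.asarray(censored)[order]
    surv, at_risk = 1.0, len(s)
    curve = []
    for i in range(len(s)):
        if not c[i]:                       # an exact (uncensored) demand
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
        curve.append((int(s[i]), surv))
    return curve

def km_order_quantity(sales, censored, critical_ratio):
    for level, surv in km_survival(sales, censored):
        if 1 - surv >= critical_ratio:     # smallest level with CDF >= ratio
            return level
    return int(max(sales))                 # heavy censoring: order the max seen

rng = np.random.default_rng(0)
demand = rng.poisson(20, size=200)
inventory = rng.integers(10, 35, size=200)   # past stocking levels
sales = np.minimum(demand, inventory)        # what the point of sale records
censored = demand >= inventory               # stockout: true demand unknown
print(km_order_quantity(sales, censored, critical_ratio=0.7))
```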

[278] On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang

Main category: cs.LG

TL;DR: Two-layer neural networks learn modular addition through Fourier features, phase symmetry, and frequency diversification, enabling robust majority voting despite individual neuron noise.

Motivation: To provide a complete mechanistic interpretation of how neural networks learn the modular addition task, bridging the gap between known single-frequency Fourier features and the global solution, and explaining training dynamics including grokking.

Method: Theoretical analysis of two-layer neural networks using gradient flow analysis, formalizing diversification conditions (phase symmetry and frequency diversification), analyzing phase coupling dynamics, and characterizing the competitive landscape using ODE comparison lemma.

Result: Networks learn to approximate a flawed indicator function through collective majority voting enabled by phase symmetry, with frequencies competing within neurons based on initial spectral magnitude and phase alignment. Grokking is characterized as a three-stage process of memorization followed by generalization phases.

Conclusion: The paper provides a complete mechanistic understanding of how neural networks solve modular addition through Fourier feature learning, phase symmetry, and competitive dynamics, offering insights into training phenomena like grokking.

Abstract: We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the “winner” determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
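
A numerical check of the voting mechanism described above, using the standard Fourier construction rather than the paper's trained weights: each frequency contributes a cosine score that peaks at the correct sum, and summing over diverse frequencies makes the true answer stand out while off-target contributions cancel.

```python
import numpy as np

p = 97
a, b = 13, 58
freqs = np.arange(1, (p - 1) // 2 + 1)    # diverse frequencies "vote" together
c_grid = np.arange(p)
# each frequency k contributes cos(2*pi*k*(a + b - c)/p), peaking at c = a+b mod p
scores = np.cos(2 * np.pi * np.outer(freqs, a + b - c_grid) / p).sum(0)
print(int(c_grid[np.argmax(scores)]), (a + b) % p)   # both print 71
```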

[279] Position: Why a Dynamical Systems Perspective is Needed to Advance Time Series Modeling

Daniel Durstewitz, Christoph Jürgen Hemmer, Florian Hess, Charlotte Ricarda Doll, Lukas Eisenmann

Main category: cs.LG

TL;DR: The paper argues for adopting a dynamical systems perspective in time series modeling, suggesting that understanding underlying dynamical systems can improve forecasting, enable long-term statistical predictions, and provide theoretical insights about performance bounds and generalization.

Motivation: Current time series modeling approaches lack theoretical grounding and may not fully leverage the fact that time series data typically originates from underlying dynamical systems. The field needs a dynamical systems perspective to advance beyond current limitations.

Method: The paper reviews dynamical systems theory concepts, methods, measures, and models, and discusses dynamical systems reconstruction approaches that aim to infer surrogate models of underlying dynamical systems from observational data.

Result: The paper argues that dynamical systems-based models offer advantages including: better short-term forecasting, ability to predict long-term statistics, theoretical insights about performance bounds, generalization to unseen regimes (like tipping points), and potential control strategies.

Conclusion: Adopting a dynamical systems perspective can advance time series modeling significantly, enabling better forecasting with lower computational and memory requirements. The paper provides specific suggestions for translating dynamical systems reconstruction insights into time series modeling.

Abstract: Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation models. With a lot of hype and industrial demand in this field, it is not always clear how much progress there really is. To advance TS forecasting and analysis to the next level, here we argue that the field needs a dynamical systems (DS) perspective. TS of observations from natural or engineered systems almost always originate from some underlying DS, and arguably access to its governing equations would yield theoretically optimal forecasts. This is the promise of DS reconstruction (DSR), a class of ML/AI approaches that aim to infer surrogate models of the underlying DS from data. But models based on DS principles offer other profound advantages: Beyond short-term forecasts, they enable to predict the long-term statistics of an observed system, which in many practical scenarios may be the more relevant quantities. DS theory furthermore provides domain-independent theoretical insight into mechanisms underlying TS generation, and thereby will inform us, e.g., about upper bounds on performance of any TS model, generalization into unseen regimes as in tipping points, or potential control strategies. After reviewing some of the central concepts, methods, measures, and models in DS theory and DSR, we will discuss how insights from this field can advance TS modeling in crucial ways, enabling better forecasting with much lower computational and memory footprints. We conclude with a number of specific suggestions for translating insights from DSR into TS modeling.

[280] ML-driven detection and reduction of ballast information in multi-modal datasets

Yaroslav Solovko

Main category: cs.LG

TL;DR: A multimodal framework for detecting and reducing redundant/low-utility information (ballast) across various data types using multiple analytical techniques, achieving significant dimensionality reduction with minimal performance impact.

Motivation: Modern datasets often contain redundant or low-utility information (ballast) that increases dimensionality, storage requirements, and computational costs without contributing meaningful analytical value, creating inefficiencies in machine learning pipelines.

Method: Proposes a generalized multimodal framework using entropy, mutual information, Lasso, SHAP, PCA, topic modeling, and embedding analysis to identify ballast features across structured, semi-structured, unstructured, and sparse data types. Introduces a novel Ballast Score to integrate these signals into a unified cross-modal pruning strategy.

Result: Experimental results show that significant portions of feature space (often exceeding 70% in sparse or semi-structured data) can be pruned with minimal or even improved classification performance, along with substantial reductions in training time and memory footprint.

Conclusion: The framework identifies distinct ballast typologies (statistical, semantic, infrastructural) and offers practical guidance for creating leaner, more efficient machine learning pipelines through systematic ballast reduction.

Abstract: Modern datasets often contain ballast: redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Using diverse datasets, entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal or even improved classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g. statistical, semantic, infrastructural), and offers practical guidance for leaner, more efficient machine learning pipelines.
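
A hypothetical sketch of how a combined Ballast Score might fuse per-feature signals (the paper's exact weighting and signal set are not reproduced): here, low mutual information with the target and near-zero variance are rank-normalized and averaged, and high scores mark candidate ballast.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def ballast_scores(X, y):
    mi = mutual_info_classif(X, y, random_state=0)
    var = X.var(axis=0)
    def rank01(v):                         # rank-normalize a signal to [0, 1]
        return np.argsort(np.argsort(v)) / (len(v) - 1)
    return (1 - rank01(mi) + 1 - rank01(var)) / 2   # high = likely ballast

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.c_[y + 0.1 * rng.standard_normal(500),   # informative feature
          rng.standard_normal(500),             # pure noise
          np.full(500, 3.0)]                    # constant: textbook ballast
print(ballast_scores(X, y).round(2))            # last column scores highest
```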

[281] Construction of a classification model for dementia among Brazilian adults aged 50 and over

F. S. Menezes, M. C. F. G. Barretto, E. Q. C. Garcia, T. A. E. Ferreira, J. G. Alvez

Main category: cs.LG

TL;DR: A dementia classification model for Brazilian middle-aged/elderly using Random Forest and logistic regression on ELSI-Brazil data, identifying key risk/protective factors with RF outperforming logistic regression.

Motivation: To develop a dementia classification model for middle-aged and elderly Brazilians using low-cost, modifiable variables to identify vulnerable individuals and inform public health policies for dementia prevention.

Method: Observational cross-sectional study using Brazilian Longitudinal Study of Aging (ELSI-Brazil) data (n=9,412). Combined variable selection with multivariable analysis using Random Forest and logistic regression to estimate dementia risk.

Result: Dementia prevalence was 9.6%. Key risk factors: illiteracy (OR=7.42), age ≥90 (OR=11.00), low weight (OR=2.11), low handgrip strength (OR=2.50), black skin color (OR=1.47), physical inactivity (OR=1.61), hearing loss (OR=1.65), depressive symptoms (OR=1.72). Protective factors: higher education (OR=0.44), life satisfaction (OR=0.72), employment (OR=0.78). RF outperformed logistic regression with AUC=0.776, sensitivity=0.708, specificity=0.702.

Conclusion: The study reinforces dementia’s multidimensional nature and importance of accessible factors for identifying vulnerable individuals. Strengthening brain health-focused public policies can improve resource allocation in primary care and dementia prevention in Brazil.

Abstract: To build a dementia classification model for middle-aged and elderly Brazilians, implemented in Python, combining variable selection and multivariable analysis, using low-cost variables with modification potential. Observational study with a predictive modeling approach using a cross-sectional design, aimed at estimating the chances of developing dementia, using data from the Brazilian Longitudinal Study of Aging (ELSI-Brazil), involving 9,412 participants. Dementia was determined based on neuropsychological assessment and informant-based cognitive function. Analyses were performed using Random Forest (RF) and multivariable logistic regression to estimate the risk of dementia in the middle-aged and elderly populations of Brazil. The prevalence of dementia was 9.6%. The highest odds of dementia were observed in illiterate individuals (Odds Ratio (OR) = 7.42), individuals aged 90 years or older (OR = 11.00), low weight (OR = 2.11), low handgrip strength (OR = 2.50), self-reported black skin color (OR = 1.47), physical inactivity (OR = 1.61), self-reported hearing loss (OR = 1.65), and presence of depressive symptoms (OR = 1.72). Higher education (OR=0.44), greater life satisfaction (OR=0.72), and being employed (OR=0.78) were protective factors. The RF model outperformed logistic regression, achieving an area under the ROC curve of 0.776, with a sensitivity of 0.708, a specificity of 0.702, an F1-score of 0.311, a G-means of 0.705, and an accuracy of 0.703. Conclusion: The findings reinforce the multidimensional nature of dementia and the importance of accessible factors for identifying vulnerable individuals. Strengthening public policies focused on promoting brain health can contribute significantly to the efficient allocation of resources in primary care and dementia prevention in Brazil.
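
A minimal sketch of the modeling recipe described above, with synthetic data standing in for ELSI-Brazil (which is not bundled here) at roughly the reported 9.6% prevalence: Random Forest versus logistic regression, compared by ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=9412, n_features=20, weights=[0.904],
                           random_state=0)        # ~9.6% positive class
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
for model in (RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    auc = roc_auc_score(yte, model.fit(Xtr, ytr).predict_proba(Xte)[:, 1])
    print(type(model).__name__, round(auc, 3))
```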

[282] Exact Certification of Data-Poisoning Attacks Using Mixed-Integer Programming

Philip Sosnin, Jodie Knapp, Fraser Kennedy, Josh Collyer, Calvin Tsay

Main category: cs.LG

TL;DR: A verification framework for neural network training that provides sound and complete guarantees against data poisoning attacks using mixed-integer quadratic programming.

Motivation: Current methods lack formal guarantees for data poisoning attacks during neural network training. There's a need for a framework that can provide both sound and complete certification of training-time robustness against adversarial data manipulation.

Method: Formulates adversarial data manipulation, model training, and test-time evaluation as a single mixed-integer quadratic programming (MIQCP) problem. The framework encodes both gradient-based training dynamics and model evaluation at test time to enable exact certification.

Result: Experimental evaluation on small models confirms the approach delivers a complete characterization of robustness against data poisoning. Finding the global optimum provably yields worst-case poisoning attacks while bounding effectiveness of all possible attacks.

Conclusion: The framework enables the first exact certification of training-time robustness against data poisoning attacks, providing both sound and complete guarantees for neural network security during training.

Abstract: This work introduces a verification framework that provides both sound and complete guarantees for data poisoning attacks during neural network training. We formulate adversarial data manipulation, model training, and test-time evaluation in a single mixed-integer quadratic programming (MIQCP) problem. Finding the global optimum of the proposed formulation provably yields worst-case poisoning attacks, while simultaneously bounding the effectiveness of all possible attacks on the given training pipeline. Our framework encodes both the gradient-based training dynamics and model evaluation at test time, enabling the first exact certification of training-time robustness. Experimental evaluation on small models confirms that our approach delivers a complete characterization of robustness against data poisoning.
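
Schematically, the certification problem can be written as a bilevel program whose unrolled training dynamics become constraints (a generic sketch consistent with the abstract; the paper's exact MIQCP encoding is more detailed):

```latex
\begin{align*}
\max_{\delta \in \Delta}\ & \ell\bigl(f_{\theta_K}(x_{\mathrm{test}}),\, y_{\mathrm{test}}\bigr) \\
\text{s.t.}\ & \theta_{k+1} = \theta_k - \eta\,\nabla_\theta
    \mathcal{L}\bigl(\theta_k;\ D \oplus \delta\bigr), \qquad k = 0, \dots, K-1,
\end{align*}
```

where Δ is the attacker's manipulation budget and D ⊕ δ the poisoned training set; encoding the piecewise-linear activations inside the unrolled gradient steps with binary variables yields a mixed-integer quadratic program whose global optimum is a provably worst-case attack, and whose optimal value bounds all weaker attacks.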

[283] Beyond Message Passing: A Symbolic Alternative for Expressive and Interpretable Graph Learning

Chuqin Geng, Li Zhang, Haolin Ye, Ziyu Zhao, Yuhe Jiang, Tara Saba, Xinyu Wang, Xujie Si

Main category: cs.LG

TL;DR: SymGraph: A symbolic framework for graph neural networks that replaces continuous message passing with discrete structural hashing and topological role-based aggregation to overcome 1-WL expressivity limits while providing better interpretability and 10-100x speedups.

Motivation: GNNs are crucial in high-stakes domains like drug discovery but suffer from black-box nature and trustworthiness issues. Self-explainable GNNs exist but inherit limitations from standard message-passing backbones, including the 1-WL expressivity barrier and lack of fine-grained interpretability.

Method: Proposes SymGraph, a symbolic framework that replaces continuous message passing with discrete structural hashing and topological role-based aggregation. This approach theoretically surpasses the 1-WL barrier and avoids differentiable optimization overhead.

Result: SymGraph achieves state-of-the-art performance, outperforming existing self-explainable GNNs. It delivers 10x to 100x speedups in training time using only CPU execution. Generates rules with superior semantic granularity compared to existing rule-based methods.

Conclusion: SymGraph offers a promising approach for scientific discovery and explainable AI by overcoming fundamental limitations of traditional GNNs while providing better interpretability and computational efficiency.

Abstract: Graph Neural Networks (GNNs) have become essential in high-stakes domains such as drug discovery, yet their black-box nature remains a significant barrier to trustworthiness. While self-explainable GNNs attempt to bridge this gap, they often rely on standard message-passing backbones that inherit fundamental limitations, including the 1-Weisfeiler-Lehman (1-WL) expressivity barrier and a lack of fine-grained interpretability. To address these challenges, we propose SymGraph, a symbolic framework designed to transcend these constraints. By replacing continuous message passing with discrete structural hashing and topological role-based aggregation, our architecture theoretically surpasses the 1-WL barrier, achieving superior expressiveness without the overhead of differentiable optimization. Extensive empirical evaluations demonstrate that SymGraph achieves state-of-the-art performance, outperforming existing self-explainable GNNs. Notably, SymGraph delivers 10x to 100x speedups in training time using only CPU execution. Furthermore, SymGraph generates rules with superior semantic granularity compared to existing rule-based methods, offering great potential for scientific discovery and explainable AI.
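
A toy sketch of discrete structural hashing in the WL style (SymGraph's actual hashing and topological role aggregation are more elaborate): node labels are iteratively replaced by hashes of their neighborhood multisets, producing discrete structural roles with no floating-point message passing.

```python
def structural_hashes(adj, labels, rounds=2):
    """adj: {node: set(neighbors)}, labels: {node: hashable label}."""
    h = dict(labels)
    for _ in range(rounds):
        h = {v: hash((h[v], tuple(sorted(h[u] for u in adj[v]))))
             for v in adj}
    return h

# a 4-cycle with one pendant node: the pendant and its anchor get distinct roles
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3, 4}, 3: {0, 2}, 4: {2}}
print(structural_hashes(adj, {v: 0 for v in adj}))
```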

[284] Neural Proposals, Symbolic Guarantees: Neuro-Symbolic Graph Generation with Hard Constraints

Chuqin Geng, Li Zhang, Mark Zhang, Haolin Ye, Ziyu Zhao, Xujie Si

Main category: cs.LG

TL;DR: NSGGM is a neurosymbolic framework for molecule generation that combines neural models for scaffold proposals with symbolic SMT solvers for constraint satisfaction, offering formal guarantees and explicit controllability that pure neural methods lack.

Motivation: Pure deep neural approaches for molecule and graph generation have limitations in controllability and lack formal guarantees, making them black-box solutions that cannot ensure chemical validity or enforce user-specific constraints.

Method: Neuro-symbolic framework with autoregressive neural model proposing scaffolds and refining interaction signals, combined with CPU-efficient SMT solver that constructs full graphs while enforcing chemical validity, structural rules, and user constraints.
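
To make the symbolic half concrete, here is a minimal Z3 sketch of the kind of hard constraint an SMT solver can enforce during assembly: integer bond orders capped by per-atom valences. The toy molecule, caps, and variable names are our own; the paper's actual encoding is far richer.

```python
from z3 import Int, Solver, sat

valence = {0: 4, 1: 2, 2: 1}        # toy valence caps, e.g. C, O, H
pairs = [(0, 1), (0, 2), (1, 2)]
bond = {p: Int(f"b_{p[0]}_{p[1]}") for p in pairs}  # bond order per pair

s = Solver()
for b in bond.values():
    s.add(b >= 0, b <= 3)           # no bond, single, double, or triple
for atom, cap in valence.items():   # valence: sum of incident bond orders
    s.add(sum(bond[p] for p in pairs if atom in p) <= cap)
s.add(sum(bond.values()) >= 2)      # insist on a non-trivial structure

if s.check() == sat:
    m = s.model()
    print({p: m[b].as_long() for p, b in bond.items()})
```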

Result: Strong performance on both unconstrained and constrained generation tasks, matching state-of-the-art generative performance while offering explicit controllability and guarantees. Introduced Logical-Constraint Molecular Benchmark for testing strict hard-rule satisfaction.

Conclusion: Neurosymbolic modeling can achieve competitive generative performance while providing the explicit controllability and formal guarantees that pure neural methods cannot offer, making it suitable for workflows requiring verifiable compliance.

Abstract: We challenge black-box, purely deep neural approaches to molecule and graph generation, which are limited in controllability and lack formal guarantees. We introduce Neuro-Symbolic Graph Generative Modeling (NSGGM), a neurosymbolic framework that recasts molecule generation as a scaffold and interaction learning task with symbolic assembly. An autoregressive neural model proposes scaffolds and refines interaction signals, and a CPU-efficient SMT solver constructs full graphs while enforcing chemical validity, structural rules, and user-specific constraints, yielding molecules that are correct by construction, along with interpretable control that pure neural methods cannot provide. NSGGM delivers strong performance on both unconstrained and constrained generation tasks, demonstrating that neuro-symbolic modeling can match state-of-the-art generative performance while offering explicit controllability and guarantees. To evaluate more nuanced controllability, we also introduce a Logical-Constraint Molecular Benchmark, designed to test strict hard-rule satisfaction in workflows that require explicit, interpretable specifications together with verifiable compliance.

[285] Multi-Agent Lipschitz Bandits

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Main category: cs.LG

TL;DR: Decentralized multi-player stochastic bandits with continuous action spaces and collision avoidance, achieving near-optimal regret with time-independent coordination costs.

DetailsMotivation: Address the challenge of multi-agent coordination in continuous action spaces where collisions yield zero reward, aiming to design communication-free policies with coordination costs independent of time horizon.

Method: Propose a modular protocol: first solve multi-agent coordination via maxima-directed search to seat players on distinct high-value regions, then decouple into N independent single-player Lipschitz bandits.
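
The decoupled single-player stage can be illustrated with the classical fixed-discretization approach to Lipschitz bandits: with roughly T^{1/3} arms on [0,1], UCB attains the Õ(T^{2/3}) = Õ(T^{(d+1)/(d+2)}) rate for d = 1. The grid rule and noise level below are illustrative, and the maxima-directed coordination step is not shown.

```python
import numpy as np

def lipschitz_ucb(reward_fn, T):
    """Single-player Lipschitz bandit on [0, 1]: uniform discretization + UCB."""
    K = max(2, int(np.ceil(T ** (1 / 3))))
    arms = np.linspace(0, 1, K)
    counts, means = np.zeros(K), np.zeros(K)
    for t in range(T):
        if t < K:
            i = t                                   # pull each arm once
        else:
            i = int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / counts)))
        r = reward_fn(arms[i]) + np.random.normal(0, 0.1)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]      # running average
    return arms[int(np.argmax(means))]

print(lipschitz_ucb(lambda x: 1 - abs(x - 0.37), T=5000))
```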

Result: Achieve near-optimal regret bound of Õ(T^{(d+1)/(d+2)}) plus T-independent coordination cost, matching single-player rate, extending to general distance-threshold collision models.

Conclusion: First framework providing such guarantees for decentralized multi-player bandits with continuous action spaces and collision avoidance, enabling efficient coordination without communication.

Abstract: We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, with coordination costs that are independent of the time horizon $T$. We propose a modular protocol that first solves the multi-agent coordination problem – identifying and seating players on distinct high-value regions via a novel maxima-directed search – and then decouples the problem into $N$ independent single-player Lipschitz bandits. We establish a near-optimal regret bound of $\tilde{O}(T^{(d+1)/(d+2)})$ plus a $T$-independent coordination cost, matching the single-player rate. To our knowledge, this is the first framework providing such guarantees, and it extends to general distance-threshold collision models.

[286] A Unified Framework for Locality in Scalable MARL

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Main category: cs.LG

TL;DR: Novel decomposition of policy-induced interdependence in MARL reveals policy-dependent locality, establishing tighter spectral conditions for exponential decay and enabling localized policy improvement with theoretical guarantees.

DetailsMotivation: Current MARL approaches suffer from the curse of dimensionality. Existing conditions for exploiting locality via Exponential Decay Property are too conservative because they rely on worst-case environment bounds and ignore the regularizing effect of policies themselves.

Method: Develops a novel decomposition of policy-induced interdependence matrix H^π that separates environment sensitivity to state (E^s) and action (E^a) from policy sensitivity to state (Π(π)). Uses this framework to derive a general spectral condition ρ(E^s + E^aΠ(π)) < 1 for exponential decay, which is tighter than prior norm-based conditions.
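
A small numerical example makes the policy dependence concrete: for the same strongly action-coupled environment, a smooth policy keeps the spectral radius below one while a sharp policy pushes it above one. The matrices are invented for illustration.

```python
import numpy as np

E_s = np.array([[0.3, 0.1], [0.1, 0.3]])   # environment sensitivity to state
E_a = np.array([[0.2, 0.6], [0.6, 0.2]])   # strong cross-agent action coupling

for name, pi_sens in [("smooth policy", 0.2), ("sharp policy", 1.0)]:
    Pi = pi_sens * np.eye(2)               # policy sensitivity to state
    rho = max(abs(np.linalg.eigvals(E_s + E_a @ Pi)))
    print(f"{name}: spectral radius {rho:.2f} -> exponential decay: {rho < 1}")
# smooth policy: 0.56 (decay holds); sharp policy: 1.20 (condition fails)
```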

Result: Shows that locality can be policy-dependent - smooth policies can induce locality even in strongly action-coupled environments, revealing a fundamental locality-optimality tradeoff. Provides a provably-sound localized block-coordinate policy improvement framework with guarantees tied to the spectral radius.

Conclusion: Establishes that locality in MARL is not purely environment-dependent but can be induced by policy smoothness, offering tighter theoretical conditions and practical algorithms for scalable MARL through localized policy improvement.

Abstract: Scalable Multi-Agent Reinforcement Learning (MARL) is fundamentally challenged by the curse of dimensionality. A common solution is to exploit locality, which hinges on an Exponential Decay Property (EDP) of the value function. However, existing conditions that guarantee the EDP are often conservative, as they are based on worst-case, environment-only bounds (e.g., supremums over actions) and fail to capture the regularizing effect of the policy itself. In this work, we establish that locality can also be a \emph{policy-dependent} phenomenon. Our central contribution is a novel decomposition of the policy-induced interdependence matrix, $H^\pi$, which decouples the environment’s sensitivity to state ($E^{\mathrm{s}}$) and action ($E^{\mathrm{a}}$) from the policy’s sensitivity to state ($\Pi(\pi)$). This decomposition reveals that locality can be induced by a smooth policy (small $\Pi(\pi)$) even when the environment is strongly action-coupled, exposing a fundamental locality-optimality tradeoff. We use this framework to derive a general spectral condition $\rho(E^{\mathrm{s}}+E^{\mathrm{a}}\Pi(\pi)) < 1$ for exponential decay, which is strictly tighter than prior norm-based conditions. Finally, we leverage this theory to analyze a provably-sound localized block-coordinate policy improvement framework with guarantees tied directly to this spectral radius.

[287] Early-Warning Signals of Grokking via Loss-Landscape Geometry

Yongzhong Xu

Main category: cs.LG

TL;DR: The commutator defect (non-commuting gradient updates) serves as an early-warning signal for grokking (abrupt generalization) in transformers across arithmetic and sequence-learning tasks, with causal interventions showing it’s necessary for generalization.

DetailsMotivation: To determine if the grokking mechanism observed in modular arithmetic extends to sequence-learning tasks, and to identify universal precursors to delayed generalization in transformers.

Method: Studied two sequence-learning benchmarks (SCAN compositional generalization and Dyck-1 depth prediction) across learning rates, measuring commutator defect (curvature from non-commuting gradients). Used weight-space PCA and causal interventions (amplifying/suppressing non-commutativity) to test mechanistic role.
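
One finite-difference reading of the commutator defect: apply two SGD steps in both orders and measure how far the resulting weights diverge. This proxy is our own; the paper's exact estimator may differ.

```python
import numpy as np

def commutator_defect(grad_fn, w, batch_a, batch_b, lr=0.1):
    """||step_b(step_a(w)) - step_a(step_b(w))||: zero when updates commute
    (e.g. a locally linear loss), positive when curvature makes the order
    of gradient steps matter."""
    step = lambda w, batch: w - lr * grad_fn(w, batch)
    return np.linalg.norm(step(step(w, batch_a), batch_b)
                          - step(step(w, batch_b), batch_a))
```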

Result: Commutator defect rises before generalization with superlinear power law lead times (α≈1.18 for SCAN, ≈1.13 for Dyck). Spectral concentration is not universal. Amplifying non-commutativity accelerates grokking (32% on SCAN, 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it.

Conclusion: Commutator defect is a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers, with task-dependent sensitivity but universal necessity for grokking.

Abstract: Grokking – the abrupt transition from memorization to generalization after prolonged training – has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect – a curvature measure derived from non-commuting gradient updates – rises well before generalization, with lead times following a superlinear power law ($\alpha \approx 1.18$ for SCAN, $\approx 1.13$ for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (by roughly 32% on SCAN and roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity – modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate – yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.

[288] Fail-Closed Alignment for Large Language Models

Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong

Main category: cs.LG

TL;DR: The paper identifies a structural weakness in LLM alignment where refusal mechanisms are fail-open, proposes fail-closed alignment as a design principle, and implements a progressive alignment framework that creates redundant, independent refusal pathways for robust safety against jailbreaks.

DetailsMotivation: Current LLM alignment has a structural weakness: refusal mechanisms are fail-open, meaning they can be bypassed by suppressing a single dominant feature through prompt-based jailbreaks, causing alignment to collapse and leading to unsafe generation.

Method: Proposes fail-closed alignment as a design principle where refusal mechanisms remain effective even under partial failures via redundant, independent causal pathways. Implements a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces.
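
The per-round mechanics can be sketched with standard linear algebra: identify a refusal direction (difference of means is a common choice; the paper's identification step may differ), ablate it, and fine-tune so the model must rebuild refusal in a fresh subspace. The training call in the loop is hypothetical.

```python
import numpy as np

def refusal_direction(h_harmful, h_benign):
    """Difference-of-means direction between activations on harmful
    and benign prompts; rows are prompts, columns are hidden units."""
    d = h_harmful.mean(axis=0) - h_benign.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(h, d):
    """Remove the component of activations h along unit direction d."""
    return h - np.outer(h @ d, d)

# Progressive loop (schematic; finetune_with_ablation is hypothetical):
# directions = []
# for _ in range(num_rounds):
#     d = refusal_direction(acts_harmful, acts_benign)
#     model = finetune_with_ablation(model, d)  # refusal must re-form elsewhere
#     directions.append(d)
```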

Result: Across four jailbreak attacks, achieves strongest overall robustness while mitigating over-refusal and preserving generation quality with small computational overhead. Mechanistic analyses confirm models encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously.

Conclusion: Fail-closed alignment provides a principled foundation for robust LLM safety by ensuring refusal mechanisms remain effective even when individual pathways are compromised, offering empirical support for this approach through successful defense against jailbreak attacks.

Abstract: We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature, via prompt-based jailbreaks, can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.

[289] Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong

Main category: cs.LG

TL;DR: UniLeak is a framework that identifies universal activation directions in language models that consistently increase PII leakage when added to the residual stream, enabling both risk amplification and mitigation without training data access.

DetailsMotivation: To understand how privacy-sensitive behaviors like PII leakage are represented in language models' internal structure, and to develop methods to identify and control these representations without needing training data or groundtruth PII.

Method: Uses mechanistic interpretability to identify universal activation directions in the residual stream whose linear addition at inference time consistently increases PII generation likelihood. Relies only on self-generated text, not training data or groundtruth PII.
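
Adding a direction to the residual stream at inference time is a standard activation-steering operation that a PyTorch forward hook expresses directly; the layer index, strength, and direction below are placeholders, not values from the paper.

```python
import torch

def make_steering_hook(direction, alpha=4.0):
    """Forward hook that adds alpha * direction to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Usage with a HuggingFace-style decoder (layer path varies by architecture):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(d))
# outputs = model.generate(**inputs)
# handle.remove()
```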

Result: UniLeak successfully identifies model-specific directions that generalize across contexts and amplify PII generation probability with minimal impact on generation quality. Steering along these directions substantially increases PII leakage compared to existing prompt-based extraction methods.

Conclusion: PII leakage in language models can be understood as a superposition of latent signals in model representations, enabling both risk amplification and mitigation through identified universal activation directions.

Abstract: Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model’s residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or groundtruth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model’s representations, enabling both risk amplification and mitigation.

[290] Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal

Main category: cs.LG

TL;DR: Systematic evaluation of verification strategies for speculative decoding reveals Traversal Verification dominates OT-based methods; delayed tree expansion and neural selector improve OT-based performance to surpass Traversal Verification.

DetailsMotivation: Prior work has proposed various verification algorithms for i.i.d rollouts in multi-path speculative decoding, but their relative performance under matched settings remains unclear. The authors aim to systematically evaluate verification strategies and understand why some methods underperform.

Method: 1) Systematic evaluation of verification strategies across model families, tasks, and sampling regimes; 2) Analysis of why OT-based methods lag behind; 3) Proposal of delayed tree expansion that drafts a partial single path and delays i.i.d. branching; 4) Development of a dynamic neural selector that estimates expected block efficiency from draft and target features.
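
The drafting pattern in item 3 is simple to sketch: a deterministic single-path segment first, with i.i.d. branches started only at the delayed point. The draft-model methods next_token and sample_token are hypothetical stand-ins.

```python
def draft_delayed_tree(draft_model, prefix, path_len, branch_width, branch_len):
    """Draft one path for path_len tokens, then branch i.i.d. from its end."""
    path = list(prefix)
    for _ in range(path_len):              # single-path segment (no branching)
        path.append(draft_model.next_token(path))
    branches = []
    for _ in range(branch_width):          # delayed i.i.d. branching point
        b = list(path)
        for _ in range(branch_len):
            b.append(draft_model.sample_token(b))
        branches.append(b)
    return path, branches
```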

Result: Traversal Verification dominates consistently in initial evaluation. OT-based methods achieve high multi-token acceptance near root but miss opportunities deeper in draft tree. Delayed tree expansion preserves target distribution and improves root-node i.i.d rollouts. Neural selector enables OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across models, datasets, and sampling settings.

Conclusion: The systematic analysis reveals key insights about verification strategy performance in speculative decoding. The proposed delayed tree expansion and neural selector significantly improve OT-based methods, making them competitive with and even superior to Traversal Verification in practice.

Abstract: Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work has proposed various verification algorithms for i.i.d. rollouts, their relative performance under matched settings remains unclear. In this work, we first present a systematic evaluation of verification strategies across model families, tasks, and sampling regimes, and find that Traversal Verification dominates consistently, with OT-based methods lagging far behind. Our analysis uncovers that this occurs because OT-based methods achieve high multi-token acceptance near the root of the draft tree, while multi-token gains are most impactful deeper in the draft tree, where draft and target distributions diverge. Based on this insight, we propose delayed tree expansion, which drafts a partial single path, delaying the i.i.d. branching point. We show that delayed tree expansion preserves the target distribution and improves on root-node i.i.d. rollouts. Further, we develop a dynamic neural selector that estimates the expected block efficiency of optimal-transport-based verification methods from draft and target features, enabling context-dependent expansion decisions. Our neural selector allows OT-based methods like SpecInfer to outperform Traversal Verification for the first time, achieving 5% higher average throughput across a wide range of models, datasets, and sampling settings.

[291] Arcee Trinity Large Technical Report

Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Fern, Aria Harley, Conner Stewart, Colin Kealty, Maziyar Panahi, Simon Kirsten, Anushka Deshpande, Anneketh Vij, Arthur Bresnu, Pranav Veldurthi, Raghav Ravishankar, Hardik Bishnoi, DatologyAI Team, Arcee AI Team, Prime Intellect Team, Mark McQuade, Johannes Hagemann, Lucas Atkins

Main category: cs.LG

TL;DR: Arcee Trinity models are sparse Mixture-of-Experts architectures with varying sizes (6B-400B total parameters) trained on massive token datasets using modern architectural innovations and the Muon optimizer.

DetailsMotivation: To develop efficient large-scale language models using sparse Mixture-of-Experts architectures that activate only a subset of parameters per token, enabling better computational efficiency while maintaining performance.

Method: Three sparse MoE models with interleaved local/global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing. Trinity Large uses Soft-clamped Momentum Expert Bias Updates (SMEBU) for load balancing. All trained with Muon optimizer on 10-17 trillion tokens.
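
Sigmoid routing, one of the listed architectural choices, is compact enough to sketch. In the common formulation below, the load-balancing bias shifts which experts are selected but not the mixing weights; SMEBU's soft-clamped momentum update of that bias is the report's novelty and is not reproduced here.

```python
import torch

def sigmoid_route(x, gate_w, expert_bias, top_k=2):
    """Per-expert affinities via sigmoid (not softmax); the bias affects
    selection only, so mixing weights stay faithful to the affinities."""
    scores = torch.sigmoid(x @ gate_w)                     # [tokens, experts]
    idx = torch.topk(scores + expert_bias, top_k, dim=-1).indices
    w = torch.gather(scores, -1, idx)
    return idx, w / w.sum(dim=-1, keepdim=True)
```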

Result: Three models successfully trained with zero loss spikes: Trinity Nano (6B total, 1B active), Trinity Mini (26B total, 3B active), and Trinity Large (400B total, 13B active). Checkpoints available on Hugging Face.

Conclusion: The Arcee Trinity models demonstrate successful training of large-scale sparse MoE architectures with modern innovations, providing efficient language models with varying computational footprints.

Abstract: We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models’ modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.

[292] Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: Action Graph Policies (AGP) model action dependencies in multi-agent RL to enable coordination through global action dependency conditioning, achieving superior performance on coordination tasks.

DetailsMotivation: In multi-agent RL, successful decentralized decision-making requires not just good individual actions but compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. Current methods often fail to properly coordinate actions.

Method: Proposes Action Graph Policies (AGP) that model dependencies among agents’ available action choices. It constructs “coordination contexts” that enable agents to condition their decisions on global action dependencies, creating a more expressive joint policy than fully independent policies.

Result: AGP achieves 80-95% success on canonical coordination tasks with partial observability and anti-coordination penalties, where other MARL methods reach only 10-25%. It consistently outperforms baselines in diverse multi-agent environments.

Conclusion: AGP provides a theoretically grounded and empirically effective approach to modeling action dependencies in multi-agent coordination, enabling better synchronization and conflict avoidance than existing methods.

Abstract: Coordinating actions is the most fundamental form of cooperation in multi-agent reinforcement learning (MARL). Successful decentralized decision-making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we propose Action Graph Policies (AGP), which model dependencies among agents’ available action choices. AGP constructs what we call \textit{coordination contexts}, which enable agents to condition their decisions on global action dependencies. Theoretically, we show that AGPs induce a strictly more expressive joint policy than fully independent policies and can realize coordinated joint actions that are provably better than those of greedy execution, even under centralized value-decomposition methods. Empirically, we show that AGP achieves 80-95% success on canonical coordination tasks with partial observability and anti-coordination penalties, where other MARL methods reach only 10-25%. We further demonstrate that AGP consistently outperforms these baselines in diverse multi-agent environments.

[293] Malliavin Calculus as Stochastic Backpropagation

Kevin D. Oden

Main category: cs.LG

TL;DR: A unified framework connecting pathwise and score-function gradient estimators via Malliavin calculus, with a variance-aware hybrid estimator that adaptively combines both approaches for optimal variance reduction.

DetailsMotivation: To establish a rigorous theoretical connection between two fundamental gradient estimation techniques (pathwise/reparameterization and score-function/Malliavin) and develop a principled hybrid approach that leverages their complementary strengths for variance reduction in stochastic optimization.

Method: Shows both gradient estimators arise from Malliavin integration-by-parts identity, then introduces a unified hybrid estimator that adaptively combines pathwise and Malliavin gradients using empirical covariance structure to achieve minimum variance among unbiased linear combinations.
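
For two unbiased estimators of the same gradient, the minimum-variance convex combination has the textbook closed form w* = (Var[g_score] - Cov) / (Var[g_path] + Var[g_score] - 2 Cov). The per-coordinate batch estimate below is our simplification of the paper's covariance-based weighting.

```python
import numpy as np

def hybrid_gradient(g_path, g_score):
    """g_path, g_score: [n_samples, dim] arrays of pathwise and
    score-function gradient samples; returns the combined estimate."""
    var_p, var_s = g_path.var(axis=0), g_score.var(axis=0)
    cov = ((g_path - g_path.mean(0)) * (g_score - g_score.mean(0))).mean(0)
    denom = var_p + var_s - 2 * cov
    w = np.where(np.abs(denom) > 1e-12, (var_s - cov) / denom, 0.5)
    w = np.clip(w, 0.0, 1.0)      # guard against noisy covariance estimates
    return (w * g_path + (1 - w) * g_score).mean(axis=0)
```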

Result: Achieved 9% variance reduction on VAEs (CIFAR-10) and up to 35% on strongly-coupled synthetic problems. Provides closed-form finite-sample convergence bounds and demonstrates when hybrid approaches provide benefits versus limitations.

Conclusion: Malliavin calculus provides a conceptually unifying framework for stochastic gradient estimation, clarifying when hybrid approaches offer tangible benefits and when they face inherent limitations, particularly in non-stationary optimization landscapes.

Abstract: We establish a rigorous connection between pathwise (reparameterization) and score-function (Malliavin) gradient estimators by showing that both arise from the Malliavin integration-by-parts identity. Building on this equivalence, we introduce a unified and variance-aware hybrid estimator that adaptively combines pathwise and Malliavin gradients using their empirical covariance structure. The resulting formulation provides a principled understanding of stochastic backpropagation and achieves minimum variance among all unbiased linear combinations, with closed-form finite-sample convergence bounds. We demonstrate 9% variance reduction on VAEs (CIFAR-10) and up to 35% on strongly-coupled synthetic problems. Exploratory policy gradient experiments reveal that non-stationary optimization landscapes present challenges for the hybrid approach, highlighting important directions for future work. Overall, this work positions Malliavin calculus as a conceptually unifying and practically interpretable framework for stochastic gradient estimation, clarifying when hybrid approaches provide tangible benefits and when they face inherent limitations.

[294] WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning

Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, Junda Wu

Main category: cs.LG

TL;DR: WS-GRPO improves reasoning efficiency in language models by using outcome-derived continue/stop guidance to reduce redundant deliberation while maintaining accuracy.

DetailsMotivation: GRPO can lead to inefficient reasoning and overthinking due to extended deliberation, and controlling this behavior is difficult because length penalties are hard to calibrate and direct supervision for when to continue/stop is typically unavailable.

Method: WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals indicating when additional continuation is beneficial, converting terminal rewards into correctness-aware guidance over partial trajectories.

Result: WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines on reasoning benchmarks.

Conclusion: WS-GRPO effectively improves rollout efficiency in reasoning tasks by providing outcome-derived continue/stop guidance, reducing redundant deliberation without sacrificing accuracy.

Abstract: Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice, considering that (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that require longer reasoning, and penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties that are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. Thus, WS-GRPO supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and empirically show on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.

[295] Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh, Saadia Gabriel

Main category: cs.LG

TL;DR: Multi-objective alignment framework (MODPO) for therapeutic AI that balances patient preferences with clinical safety using direct preference optimization across six therapeutic criteria.

DetailsMotivation: Mental health care faces workforce shortages and cost constraints, while current AI alignment approaches fail to balance patient preferences with clinical safety, optimizing objectives independently.

Method: Surveyed 335 individuals with mental health experience to collect preference rankings, then developed multi-objective alignment framework using direct preference optimization with reward models for six criteria: empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy.
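
A minimal sketch of a margin-adjusted multi-objective DPO loss in the published MODPO style, as we read it: reward-model margins for the auxiliary criteria are weighted and subtracted from the DPO preference logit. Tensor names and the exact weighting scheme are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def modpo_loss(logp_c, logp_r, ref_c, ref_r, margins, weights, beta=0.1):
    """logp_c/logp_r: policy log-probs of chosen/rejected responses;
    ref_c/ref_r: reference-model log-probs; margins: [batch, K] reward
    margins (chosen minus rejected) for K auxiliary criteria."""
    logits = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    adjusted = logits - (weights * margins).sum(dim=-1)
    return -F.logsigmoid(adjusted).mean()
```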

Result: MODPO achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), with therapeutic criteria outperforming general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred.

Conclusion: Multi-objective alignment using DPO effectively balances therapeutic dimensions in mental health AI systems, addressing the limitations of single-objective optimization approaches.

Abstract: Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria – empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy – and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.

[296] Transforming Behavioral Neuroscience Discovery with In-Context Learning and AI-Enhanced Tensor Methods

Paimon Goulart, Jordan Steinhauser, Dawon Ahn, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis

Main category: cs.LG

TL;DR: AI-enhanced pipeline using In-Context Learning to accelerate behavioral neuroscience research on fear generalization in mice, automating data preparation and pattern interpretation for domain experts.

DetailsMotivation: Scientific discovery pipelines are typically complex and rigid, requiring domain experts to spend excessive time on data preparation and pipeline debugging rather than interpreting findings. The paper aims to transform behavioral neuroscience research by making AI tools accessible to domain experts without requiring AI expertise.

Method: Uses In-Context Learning (ICL) as an interface for domain experts to automate pipeline components without model training/fine-tuning. Introduces AI-enhanced tensor decomposition models for pattern discovery from heterogeneous behavioral neuroscience data. Focuses on fear generalization studies in mice.

Result: Superior performance compared to standard domain practices and reasonable ML baselines. Effective discovery validated by domain experts. Demonstrates remarkable efficacy in data preparation and pattern interpretation.

Conclusion: ICL provides a suitable interface for domain experts to leverage AI capabilities without AI expertise, accelerating scientific discovery pipelines in behavioral neuroscience while maintaining performance.

Abstract: Scientific discovery pipelines typically involve complex, rigid, and time-consuming processes, from data preparation to analyzing and interpreting findings. Recent advances in AI have the potential to transform such pipelines in a way that domain experts can focus on interpreting and understanding findings, rather than debugging rigid pipelines or manually annotating data. As part of an active collaboration between data science/AI researchers and behavioral neuroscientists, we showcase an example AI-enhanced pipeline, specifically designed to transform and accelerate the way that the domain experts in the team are able to gain insights out of experimental data. The application at hand is in the domain of behavioral neuroscience, studying fear generalization in mice, an important problem whose progress can advance our understanding of clinically significant and often debilitating conditions such as PTSD (Post-Traumatic Stress Disorder). We identify the emerging paradigm of “In-Context Learning” (ICL) as a suitable interface for domain experts to automate parts of their pipeline without the need for or familiarity with AI model training and fine-tuning, and showcase its remarkable efficacy in data preparation and pattern interpretation. Also, we introduce novel AI enhancements to a tensor decomposition model, which allow for more seamless pattern discovery from the heterogeneous data in our application. We thoroughly evaluate our proposed pipeline experimentally, showcasing its superior performance compared to what is standard practice in the domain, as well as against reasonable ML baselines that do not fall under the ICL paradigm, to ensure that we are not compromising performance in our quest for a seamless and easy-to-use interface for domain experts. Finally, we demonstrate effective discovery, with results validated by the domain experts in the team.

[297] Forecasting Anomaly Precursors via Uncertainty-Aware Time-Series Ensembles

Hyeongwon Kang, Jinwoo Park, Seunghun Han, Pilsung Kang

Main category: cs.LG

TL;DR: FATE is an unsupervised framework for detecting precursors-of-anomaly in time-series data using ensemble forecasting and predictive uncertainty, with a new evaluation metric PTaPR for early warning assessment.

DetailsMotivation: Most existing anomaly detection methods are reactive, detecting anomalies only after they occur, lacking proactive early warning capabilities. There's a need for systems that can anticipate anomalies before they happen to enable preventive maintenance and improve system reliability.

Method: FATE uses an ensemble of diverse time-series forecasting models to predict future values and quantifies predictive uncertainty through ensemble disagreement. It detects precursors-of-anomaly by identifying when ensemble models show high disagreement, signaling potential future anomalies without requiring ground-truth labels or reconstruction errors.
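
The core signal reduces to disagreement across the forecasting ensemble; a minimal version scores each window by the variance over models, with an alarm threshold calibrated on anomaly-free validation scores (the quantile rule is our assumption, not the paper's).

```python
import numpy as np

def precursor_score(forecasts):
    """forecasts: [n_models, horizon] predictions for the same window;
    returns mean variance across models -- high disagreement signals
    a possible precursor-of-anomaly."""
    return forecasts.var(axis=0).mean()

# alarm = precursor_score(f) > np.quantile(validation_scores, 0.99)
```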

Result: Experiments on five real-world benchmark datasets show FATE achieves average improvements of 19.9 percentage points in PTaPR AUC and 20.02 percentage points in early detection F1 score compared to baselines, while requiring no anomaly labels.

Conclusion: FATE provides an effective and practical approach for real-time unsupervised early warning in complex time-series environments, demonstrating superior performance in detecting precursors-of-anomaly through ensemble forecasting and uncertainty quantification.

Abstract: Detecting anomalies in time-series data is critical in domains such as industrial operations, finance, and cybersecurity, where early identification of abnormal patterns is essential for ensuring system reliability and enabling preventive maintenance. However, most existing methods are reactive: they detect anomalies only after they occur and lack the capability to provide proactive early warning signals. In this paper, we propose FATE (Forecasting Anomalies with Time-series Ensembles), a novel unsupervised framework for detecting Precursors-of-Anomaly (PoA) by quantifying predictive uncertainty from a diverse ensemble of time-series forecasting models. Unlike prior approaches that rely on reconstruction errors or require ground-truth labels, FATE anticipates future values and leverages ensemble disagreement to signal early signs of potential anomalies without access to target values at inference time. To rigorously evaluate PoA detection, we introduce Precursor Time-series Aware Precision and Recall (PTaPR), a new metric that extends the traditional Time-series Aware Precision and Recall (TaPR) by jointly assessing segment-level accuracy, within-segment coverage, and temporal promptness of early predictions. This enables a more holistic assessment of early warning capabilities that existing metrics overlook. Experiments on five real-world benchmark datasets show that FATE achieves an average improvement of 19.9 percentage points in PTaPR AUC and 20.02 percentage points in early detection F1 score, outperforming baselines while requiring no anomaly labels. These results demonstrate the effectiveness and practicality of FATE for real-time unsupervised early warning in complex time-series environments.

[298] Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Ziliang Zhao, Bi Xue, Emma Lin, Mengjiao Zhou, Kaustubh Vartak, Shakhzod Ali-Zade, Carson Lu, Tao Li, Bin Kuang, Rui Jian, Bin Wen, Dennis van der Staay, Yixin Bao, Eddy Li, Chao Deng, Songbin Liu, Qifan Wang, Kai Ren

Main category: cs.LG

TL;DR: MPZCH is a novel hash indexing method for recommendation system embedding tables that eliminates collisions through linear probing and active eviction policies, improving embedding quality while maintaining performance.

DetailsMotivation: Traditional hash-based indexing for embedding tables in recommendation systems suffers from collisions as unique IDs expand, degrading model performance and personalization quality. Hash collisions cause stale embedding inheritance where new features inherit outdated embeddings.

Method: Multi-Probe Zero Collision Hash (MPZCH) uses linear probing with configurable probing policies and active eviction. It employs auxiliary tensors and CUDA kernels to retire obsolete IDs and reset reassigned slots, preventing stale embedding inheritance. The system maintains efficiency through optimized implementation.
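
A toy version of the indexing logic, with Python's built-in hash and a plain list standing in for the auxiliary tensors and CUDA kernels: probe linearly from the hashed slot until a free or matching slot is found, so distinct live IDs never share an index.

```python
class MultiProbeZCH:
    """Toy multi-probe zero-collision index (illustrative only)."""
    def __init__(self, size, max_probes=32):
        self.size, self.max_probes = size, max_probes
        self.slots = [None] * size               # raw ID stored per slot

    def index(self, raw_id):
        base = hash(raw_id) % self.size
        for p in range(self.max_probes):
            i = (base + p) % self.size
            if self.slots[i] is None or self.slots[i] == raw_id:
                self.slots[i] = raw_id
                return i                         # unique slot: no collision
        self.slots[base] = raw_id                # table pressure: evict; the
        return base                              # real system also resets the
                                                 # reassigned embedding row
```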

Result: MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. It maintains training QPS and inference latency comparable to existing methods despite collision-mitigation overhead. The solution has been released in the open-source TorchRec library.

Conclusion: MPZCH effectively solves the embedding collision problem in large-scale recommendation systems through linear probing and active eviction, improving embedding quality while maintaining production-scale efficiency. The open-source release enables broader adoption.

Abstract: Embedding tables are critical components of large-scale recommendation systems, facilitating the efficient mapping of high-cardinality categorical features into dense vector representations. However, as the volume of unique IDs expands, traditional hash-based indexing methods suffer from collisions that degrade model performance and personalization quality. We present Multi-Probe Zero Collision Hash (MPZCH), a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods. Rigorous online experiments demonstrate that MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. The solution has been released within the open-source TorchRec library for the broader community.

[299] Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Akira Sakai, Yuma Ichikawa

Main category: cs.LG

TL;DR: The paper analyzes sign patterns in neural networks, showing that weight signs are largely inherited from initialization and rarely flip during training, leading to a “sign lock-in” phenomenon that enables sub-bit compression by treating signs as fixed.

DetailsMotivation: As model compression pushes below 1 bit per weight, the sign bit becomes a fixed-cost bottleneck. The paper investigates whether sign patterns in neural networks are learnable or random, and whether they can be compressed more efficiently.

Method: Analyzes sign matrices across Transformers, CNNs, and MLPs using spectral analysis and low-rank approximation. Develops “sign lock-in theory” - a stopping-time analysis of sign flips under SGD noise. Proposes gap-based initialization and outward-drift regularization to reduce sign flip rates.
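
Gap-based initialization is straightforward to sketch: draw Gaussian weights and push every magnitude at least a fixed gap away from zero, so SGD noise rarely carries a weight across the sign boundary. The gap and standard deviation below are guesses, not the paper's values.

```python
import torch

def gap_init(shape, std=0.02, gap=0.02):
    """Gaussian init whose magnitudes are shifted at least `gap` from zero."""
    w = torch.randn(shape) * std
    return torch.sign(w) * (w.abs() + gap)
```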

Result: Sign matrices are spectrally indistinguishable from random Rademacher matrices and resist low-rank approximation. Most weights retain initialization signs; flips occur via rare near-zero boundary crossings. The proposed methods reduce effective flip rate to ~10^-3 with only ~1 point perplexity increase.

Conclusion: Sign patterns in neural networks are largely inherited from initialization rather than learned, enabling sub-bit compression by treating signs as fixed. The “sign lock-in” phenomenon explains why signs rarely change during training, allowing for more aggressive compression.

Abstract: Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.

[300] Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control

Xiaocai Zhang, Neema Nassir, Milad Haghani

Main category: cs.LG

TL;DR: STDSH-MARL: A multi-agent reinforcement learning framework for multimodal traffic signal control using dual-stage hypergraph attention to model spatio-temporal dependencies and enable adaptive signal timing with public transportation priority.

DetailsMotivation: Traditional traffic signal control focuses on vehicle-centric performance, but modern urban corridors need to prioritize multimodal travelers, especially high-occupancy public transportation, requiring more sophisticated control frameworks.

Method: Proposes STDSH-MARL (Spatio-Temporal Dual-Stage Hypergraph based Multi-Agent Reinforcement Learning) with centralized training/decentralized execution. Uses dual-stage hypergraph attention to capture spatio-temporal dependencies across spatial and temporal hyperedges, and introduces hybrid discrete action space for joint phase configuration and green duration decisions.

Result: Experiments on corridor networks under five traffic scenarios show STDSH-MARL consistently improves multimodal performance with clear public transportation priority benefits. Outperforms state-of-the-art baselines, with ablation studies confirming temporal hyperedges as the most influential component.

Conclusion: STDSH-MARL provides an effective scalable framework for human-centric multimodal traffic signal control that successfully prioritizes public transportation while maintaining overall network performance.

Abstract: Human-centric traffic signal control in corridor networks must increasingly account for multimodal travelers, particularly high-occupancy public transportation, rather than focusing solely on vehicle-centric performance. This paper proposes STDSH-MARL (Spatio-Temporal Dual-Stage Hypergraph based Multi-Agent Reinforcement Learning), a scalable multi-agent deep reinforcement learning framework that follows a centralized training and decentralized execution paradigm. The proposed method captures spatio-temporal dependencies through a novel dual-stage hypergraph attention mechanism that models interactions across both spatial and temporal hyperedges. In addition, a hybrid discrete action space is introduced to jointly determine the next signal phase configuration and its corresponding green duration, enabling more adaptive signal timing decisions. Experiments conducted on a corridor network under five traffic scenarios demonstrate that STDSH-MARL consistently improves multimodal performance and provides clear benefits for public transportation priority. Compared with state-of-the-art baseline methods, the proposed approach achieves superior overall performance. Further ablation studies confirm the contribution of each component of STDSH-MARL, with temporal hyperedges identified as the most influential factor driving the observed performance gains.

[301] AdvSynGNN: Structure-Adaptive Graph Neural Nets via Adversarial Synthesis and Self-Corrective Propagation

Rong Fu, Muge Qi, Chunlei Meng, Shuo Yin, Kun Liu, Zhaolu Kang, Simon Fong

Main category: cs.LG

TL;DR: AdvSynGNN is a robust graph neural network architecture that combines multi-resolution structural synthesis, transformer-based attention modulation, adversarial propagation, and label refinement to handle structural noise and non-homophilous graph topologies.

DetailsMotivation: Graph neural networks suffer from performance degradation when dealing with structural noise or non-homophilous topologies, creating a need for more resilient node-level representation learning approaches.

Method: The framework uses multi-resolution structural synthesis with contrastive objectives for geometry-sensitive initializations, a transformer backbone with attention modulation based on topological signals, an integrated adversarial propagation engine (generator-discriminator for connectivity alterations), and label refinement via residual correction with confidence metrics.

Result: Empirical evaluations show the synergistic approach effectively optimizes predictive accuracy across diverse graph distributions while maintaining computational efficiency.

Conclusion: The study provides practical implementation protocols for robust deployment of AdvSynGNN in large-scale environments, offering a comprehensive solution for resilient graph learning.

Abstract: Graph neural networks frequently encounter significant performance degradation when confronted with structural noise or non-homophilous topologies. To address these systemic vulnerabilities, we present AdvSynGNN, a comprehensive architecture designed for resilient node-level representation learning. The proposed framework orchestrates multi-resolution structural synthesis alongside contrastive objectives to establish geometry-sensitive initializations. We develop a transformer backbone that adaptively accommodates heterophily by modulating attention mechanisms through learned topological signals. Central to our contribution is an integrated adversarial propagation engine, where a generative component identifies potential connectivity alterations while a discriminator enforces global coherence. Furthermore, label refinement is achieved through a residual correction scheme guided by per-node confidence metrics, which facilitates precise control over iterative stability. Empirical evaluations demonstrate that this synergistic approach effectively optimizes predictive accuracy across diverse graph distributions while maintaining computational efficiency. The study concludes with practical implementation protocols to ensure the robust deployment of the AdvSynGNN system in large-scale environments.

[302] Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

Minxin Zhang, Yuxuan Liu, Hayden Scheaffer

Main category: cs.LG

TL;DR: NAMO and NAMO-D are new optimizers combining orthogonalized momentum (from Muon) with norm-based Adam-type noise adaptation, showing improved performance in GPT-2 pretraining over AdamW and Muon.

DetailsMotivation: To create an optimizer that integrates orthogonalized momentum's structural advantages with principled noise adaptation mechanisms, addressing limitations of existing optimizers like Adam and Muon in large language model training.

Method: NAMO scales orthogonalized momentum using a single adaptive stepsize while preserving orthogonality. NAMO-D extends this by right-multiplying orthogonalized momentum by a diagonal matrix with clamped entries, enabling neuron-wise noise adaptation aligned with near block-diagonal Hessian structure.
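
A rough reconstruction of the NAMO update from this description: Muon-style Newton-Schulz orthogonalization of the momentum matrix, scaled by a single norm-based second-moment stepsize. The Newton-Schulz coefficients are those commonly used with Muon; the betas, epsilon, and exact normalization are our assumptions.

```python
import torch

def newton_schulz_orth(M, steps=5):
    """Approximately orthogonalize M (transpose tall matrices first, as Muon does)."""
    X = M / (M.norm() + 1e-7)
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic-iteration coefficients
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def namo_step(W, M, v, grad, lr=0.02, beta=0.95, beta2=0.999):
    """One sketched NAMO step; v is a scalar second moment, e.g. torch.tensor(0.0)."""
    M = beta * M + (1 - beta) * grad                 # momentum buffer
    v = beta2 * v + (1 - beta2) * grad.norm() ** 2   # norm-based second moment
    W = W - lr * newton_schulz_orth(M) / (v.sqrt() + 1e-8)
    return W, M, v
```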

Result: Both NAMO and NAMO-D outperform AdamW and Muon baselines in GPT-2 pretraining experiments. NAMO-D achieves further gains over NAMO through its clamping hyperparameter that balances well-conditioned updates with fine-grained noise adaptation.

Conclusion: The proposed optimizers successfully integrate orthogonalized momentum with noise adaptation, providing improved training performance for large language models while maintaining theoretical convergence guarantees in both deterministic and stochastic settings.

Abstract: Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers’ matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.

[303] MeGU: Machine-Guided Unlearning with Target Feature Disentanglement

Haoyu Wang, Zhuo Huang, Xiaolong Wang, Bo Han, Zhiwei Lin, Tongliang Liu

Main category: cs.LG

TL;DR: MeGU is a machine unlearning framework that uses MLLMs to guide concept-aware re-alignment for selective forgetting, addressing the trade-off between erasing target data and preserving model utility.

DetailsMotivation: Existing machine unlearning approaches face a fundamental trade-off: aggressively erasing target data degrades model utility on retained data, while conservative strategies leave residual target information intact. The authors identify that semantic class concepts are entangled at the feature-pattern level, which fundamentally limits current unlearning paradigms.

Method: MeGU leverages Multi-modal Large Language Models (MLLMs) to explicitly determine re-alignment directions for target samples by assigning semantically meaningful perturbing labels. It encodes inter-class conceptual similarities into a lightweight transition matrix and introduces a positive-negative feature noise pair to disentangle target concept influence. During finetuning, negative noise suppresses target-specific patterns while positive noise reinforces remaining associated features and aligns them with perturbing concepts.

Result: MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning. The framework achieves better balance between erasing target data influence and preserving model utility on retained data compared to existing approaches.

Conclusion: The proposed MeGU framework successfully addresses the fundamental trade-off in machine unlearning by leveraging MLLMs for concept-aware guidance and explicit feature disentanglement, enabling more effective and selective forgetting while maintaining model performance on retained data.

Abstract: The growing concern over training data privacy has elevated the “Right to be Forgotten” into a critical requirement, thereby raising the demand for effective Machine Unlearning. However, existing unlearning approaches commonly suffer from a fundamental trade-off: aggressively erasing the influence of target data often degrades model utility on retained data, while conservative strategies leave residual target information intact. In this work, the intrinsic representation properties learned during model pretraining are analyzed. It is demonstrated that semantic class concepts are entangled at the feature-pattern level, sharing associated features while preserving concept-specific discriminative components. This entanglement fundamentally limits the effectiveness of existing unlearning paradigms. Motivated by this insight, we propose Machine-Guided Unlearning (MeGU), a novel framework that guides unlearning through concept-aware re-alignment. Specifically, Multi-modal Large Language Models (MLLMs) are leveraged to explicitly determine re-alignment directions for target samples by assigning semantically meaningful perturbing labels. To improve efficiency, inter-class conceptual similarities estimated by the MLLM are encoded into a lightweight transition matrix. Furthermore, MeGU introduces a positive-negative feature noise pair to explicitly disentangle target concept influence. During finetuning, the negative noise suppresses target-specific feature patterns, while the positive noise reinforces remaining associated features and aligns them with perturbing concepts. This coordinated design enables selective disruption of target-specific representations while preserving shared semantic structures. As a result, MeGU enables controlled and selective forgetting, effectively mitigating both under-unlearning and over-unlearning.

[304] Synergizing Transport-Based Generative Models and Latent Geometry for Stochastic Closure Modeling

Xinghao Dong, Huchen Yang, Jin-long Wu

Main category: cs.LG

TL;DR: Flow matching in latent space enables fast single-step sampling for stochastic closure models, outperforming diffusion models in speed while maintaining physical fidelity through regularization.

DetailsMotivation: Diffusion models offer high-quality diverse samples for stochastic closure modeling but suffer from slow sampling speeds. The paper aims to develop faster generative approaches while maintaining physical fidelity for closure models in complex dynamical systems.

Method: Systematic comparison of transport-based generative models on 2D Kolmogorov flows. Uses flow matching in lower-dimensional latent space for single-step sampling. Implements explicit regularization (metric-preserving and geometry-aware constraints) and implicit regularization via joint training to control latent space distortion and ensure physical fidelity.
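
The training objective here is standard linear-path conditional flow matching, which the paper applies in a learned latent space; a minimal sketch (the latent encoder and the MP/GA regularizers are omitted):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny velocity field v_theta(x_t, t) for latent flow matching (sketch)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(v, z1):
    """Linear path x_t = (1-t) z0 + t z1; regression target is v* = z1 - z0."""
    z0 = torch.randn_like(z1)                  # noise endpoint
    t = torch.rand(z1.size(0), 1)
    xt = (1 - t) * z0 + t * z1
    return ((v(xt, t) - (z1 - z0)) ** 2).mean()

@torch.no_grad()
def sample_one_step(v, n, dim):
    """Single Euler step from t=0 to t=1; accurate when the learned path is near-straight,
    which is what enables the single-step sampling speedup."""
    z0 = torch.randn(n, dim)
    return z0 + v(z0, torch.zeros(n, 1))

v = VelocityNet(dim=32)
loss = flow_matching_loss(v, torch.randn(64, 32))
samples = sample_one_step(v, n=4, dim=32)
```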

Result: Flow matching in latent space achieves up to two orders of magnitude faster sampling than iterative diffusion approaches. Both explicit and implicit regularization preserve topological information from the original dynamical system’s manifold, enabling effective stochastic closure modeling with less training data.

Conclusion: Latent space flow matching with appropriate regularization provides a superior alternative to diffusion models for stochastic closure modeling, offering dramatically faster sampling while maintaining physical fidelity and requiring less training data.

Abstract: Diffusion models recently developed for generative AI tasks can produce high-quality samples while still maintaining diversity among samples to promote mode coverage, providing a promising path for learning stochastic closure models. Compared to other types of generative AI models, such as GANs and VAEs, the sampling speed is known as a key disadvantage of diffusion models. By systematically comparing transport-based generative models on a numerical example of 2D Kolmogorov flows, we show that flow matching in a lower-dimensional latent space is suited for fast sampling of stochastic closure models, enabling single-step sampling that is up to two orders of magnitude faster than iterative diffusion-based approaches. To control the latent space distortion and thus ensure the physical fidelity of the sampled closure term, we compare the implicit regularization offered by a joint training scheme against two explicit regularizers: metric-preserving (MP) and geometry-aware (GA) constraints. Besides offering a faster sampling speed, both explicitly and implicitly regularized latent spaces inherit the key topological information from the lower-dimensional manifold of the original complex dynamical system, which enables the learning of stochastic closure models without demanding a huge amount of training data.

[305] A Locality Radius Framework for Understanding Relational Inductive Bias in Database Learning

Aadi Joshi, Kavya Bhand

Main category: cs.LG

TL;DR: The paper introduces locality radius as a measure of structural neighborhood needed for schema prediction tasks, hypothesizing that model performance depends on alignment between task locality radius and GNN aggregation depth.

DetailsMotivation: To understand when multi-hop structural reasoning is actually necessary for relational schema tasks like foreign key discovery, since GNNs are commonly used but it's unclear when their relational inductive bias truly improves performance.

Method: Introduces locality radius as a formal measure of minimum structural neighborhood required for predictions, then conducts controlled empirical study across various schema tasks with multi-seed experiments, capacity-matched comparisons, statistical testing, scaling analysis, and synthetic radius-controlled benchmarks.
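
A toy illustration of the idea, assuming you have already measured accuracy at each capacity-matched aggregation depth; the paper's formal definition of locality radius is more precise than this empirical proxy:

```python
def estimate_locality_radius(acc_by_depth, tol=0.01):
    """Empirical proxy for locality radius: the smallest aggregation depth whose
    accuracy comes within `tol` of the best depth."""
    best = max(acc_by_depth.values())
    for depth in sorted(acc_by_depth):
        if acc_by_depth[depth] >= best - tol:
            return depth
    raise ValueError("empty accuracy dict")

# Example: accuracy saturates at 2 hops -> estimated radius 2, so deeper
# aggregation buys nothing for this task.
print(estimate_locality_radius({0: 0.61, 1: 0.74, 2: 0.83, 3: 0.835, 4: 0.83}))
```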

Result: Results reveal a consistent bias-radius alignment effect, showing that model performance depends critically on alignment between task locality radius and architectural aggregation depth.

Conclusion: The study provides insights into when structural reasoning is necessary for schema-level prediction tasks and establishes the importance of aligning GNN architecture depth with task-specific locality requirements.

Abstract: Foreign key discovery and related schema-level prediction tasks are often modeled using graph neural networks (GNNs), implicitly assuming that relational inductive bias improves performance. However, it remains unclear when multi-hop structural reasoning is actually necessary. In this work, we introduce locality radius, a formal measure of the minimum structural neighborhood required to determine a prediction in relational schemas. We hypothesize that model performance depends critically on alignment between task locality radius and architectural aggregation depth. We conduct a controlled empirical study across foreign key prediction, join cost estimation, blast radius regression, cascade impact classification, and additional graph-derived schema tasks. Our evaluation includes multi-seed experiments, capacity-matched comparisons, statistical significance testing, scaling analysis, and synthetic radius-controlled benchmarks. Results reveal a consistent bias-radius alignment effect.

[306] FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment

Chuiyang Meng, Ming Tang, Vincent W. S. Wong

Main category: cs.LG

TL;DR: FLoRG: Federated fine-tuning framework using single low-rank matrix with Gram matrix aggregation and Procrustes alignment to address challenges in federated LoRA.

DetailsMotivation: Address two key challenges in federated fine-tuning with LoRA: (1) error from separately aggregating two low-rank matrices, and (2) decomposition drift when recovering factors from aggregated product matrix.

Method: Proposes the FLoRG framework, which uses a single low-rank matrix for fine-tuning, aggregates its Gram matrix to eliminate aggregation error, and introduces Procrustes alignment to minimize decomposition drift between rounds.
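
A hedged numpy sketch of the two server-side ingredients named above, Gram-matrix aggregation and orthogonal Procrustes alignment; the matrix shapes and surrounding protocol are assumptions, since the summary does not pin them down:

```python
import numpy as np

def recover_factor(G):
    """Any F with F.T @ F = G; such a factor is unique only up to a left-orthogonal
    transform, which is exactly the decomposition drift Procrustes removes."""
    w, V = np.linalg.eigh(G)
    return np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def procrustes_align(F_new, F_prev):
    """Orthogonal Procrustes: the Omega minimizing ||Omega @ F_new - F_prev||_F
    is V U^T from the SVD of F_new @ F_prev.T; apply it to F_new."""
    P, _, Qt = np.linalg.svd(F_prev @ F_new.T)
    return (P @ Qt) @ F_new

# Round t: server averages client Gram matrices, refactors, aligns to round t-1.
client_grams = [np.eye(4) * (i + 1) for i in range(3)]   # toy r x r Grams
G_bar = np.mean(client_grams, axis=0)
F_t = procrustes_align(recover_factor(G_bar), F_prev=np.eye(4))
```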

Result: Outperforms five state-of-the-art baselines in downstream task accuracy, reduces communication overhead by up to 2041×, with theoretical convergence analysis showing tighter bounds with Procrustes alignment.

Conclusion: FLoRG effectively addresses federated LoRA challenges through single-matrix approach with Gram aggregation and Procrustes alignment, achieving superior performance and efficiency.

Abstract: Parameter-efficient fine-tuning techniques such as low-rank adaptation (LoRA) enable large language models (LLMs) to adapt to downstream tasks efficiently. Federated learning (FL) further facilitates this process by enabling collaborative fine-tuning across distributed clients without sharing private data. However, the use of two separate low-rank matrices in LoRA for federated fine-tuning introduces two types of challenges. The first challenge arises from the error induced by separately aggregating those two low-rank matrices. The second challenge occurs even when the product of two low-rank matrices is aggregated. The server needs to recover factors via matrix decomposition, which is non-unique and can introduce decomposition drift. To tackle the aforementioned challenges, we propose FLoRG, a federated fine-tuning framework which employs a single low-rank matrix for fine-tuning and aggregates its Gram matrix (i.e., the matrix of inner products of its column vectors), eliminating the aggregation error while also reducing the communication overhead. FLoRG minimizes the decomposition drift by introducing a Procrustes alignment approach which aligns the decomposed matrix between consecutive fine-tuning rounds for consistent updates. We theoretically analyze the convergence of FLoRG and prove that adopting the Procrustes alignment results in a tighter convergence bound. Experimental results across multiple LLM fine-tuning benchmarks demonstrate that FLoRG outperforms five state-of-the-art baseline schemes in the downstream task accuracy and can reduce the communication overhead by up to 2041$\times$.

[307] The Anxiety of Influence: Bloom Filters in Transformer Attention Heads

Peter Balogh

Main category: cs.LG

TL;DR: Transformer attention heads in language models can function as membership testers that detect whether tokens have appeared before in context, with some heads showing Bloom filter-like behavior and others being false positives.

DetailsMotivation: To understand how transformer attention heads implement specific computational functions, particularly identifying heads that serve as membership testers to detect token repetition in context.

Method: Analyzed attention heads across four language models (GPT-2 variants and Pythia-160M) to identify membership-testing behavior, using statistical analysis of false positive rates, capacity curves, and confound controls to distinguish genuine membership testers from artifacts.
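
The capacity-curve fit is just the classical Bloom filter formula fit to measured false-positive rates; a sketch with synthetic data standing in for a head's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def bloom_fpr(n, m, k):
    """Classical Bloom filter false-positive rate: p = (1 - exp(-k n / m))^k."""
    return (1.0 - np.exp(-k * n / m)) ** k

# Fit capacity m (bits) and hash count k to a head's FPR-vs-unique-tokens curve.
# Synthetic data below mimics the paper's L1H11 regime (m ~ 5 bits, saturating early).
np.random.seed(0)
n_unique = np.arange(1, 41, dtype=float)
measured = bloom_fpr(n_unique, m=5.0, k=1.0) + 0.005 * np.random.randn(40)
(m_hat, k_hat), _ = curve_fit(bloom_fpr, n_unique, measured, p0=(10.0, 1.0))
print(f"fitted capacity m ~ {m_hat:.1f} bits, k ~ {k_hat:.1f}")
```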

Result: Identified three genuine membership-testing heads forming a multi-resolution system in early layers (0-1), with two heads showing high-precision filtering (0-4% false positives) and one following classic Bloom filter capacity curves. These heads generalize broadly to any repeated token type and contribute to both repeated and novel token processing.

Conclusion: Transformer attention heads can implement specialized computational functions like membership testing, forming a taxonomically distinct category from other head types, with the surviving heads demonstrating robust functionality that withstands rigorous confound controls.

Abstract: Some transformer attention heads appear to function as membership testers, dedicating themselves to answering the question “has this token appeared before in the context?” We identify these heads across four language models (GPT-2 small, medium, and large; Pythia-160M) and show that they form a spectrum of membership-testing strategies. Two heads (L0H1 and L0H5 in GPT-2 small) function as high-precision membership filters with false positive rates of 0-4% even at 180 unique context tokens – well above the $d_\text{head} = 64$ bit capacity of a classical Bloom filter. A third head (L1H11) shows the classic Bloom filter capacity curve: its false positive rate follows the theoretical formula $p \approx (1 - e^{-kn/m})^k$ with $R^2 = 1.0$ and fitted capacity $m \approx 5$ bits, saturating by $n \approx 20$ unique tokens. A fourth head initially identified as a Bloom filter (L3H0) was reclassified as a general prefix-attention head after confound controls revealed its apparent capacity curve was a sequence-length artifact. Together, the three genuine membership-testing heads form a multi-resolution system concentrated in early layers (0-1), taxonomically distinct from induction and previous-token heads, with false positive rates that decay monotonically with embedding distance – consistent with distance-sensitive Bloom filters. These heads generalize broadly: they respond to any repeated token type, not just repeated names, with 43% higher generalization than duplicate-token-only heads. Ablation reveals these heads contribute to both repeated and novel token processing, indicating that membership testing coexists with broader computational roles. The reclassification of L3H0 through confound controls strengthens rather than weakens the case: the surviving heads withstand the scrutiny that eliminated a false positive in our own analysis.

[308] Unified Latents (UL): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

Main category: cs.LG

TL;DR: Unified Latents (UL) framework learns latent representations jointly regularized by diffusion prior and decoded by diffusion model, achieving competitive image and video generation quality with efficient training.

DetailsMotivation: To develop a unified framework for learning latent representations that combines the benefits of diffusion models with efficient latent space learning, addressing limitations of existing approaches that use pre-trained latents or separate training objectives.

Method: Proposes Unified Latents (UL) framework where encoder output noise is linked to diffusion prior’s minimum noise level, creating a simple training objective that provides tight upper bound on latent bitrate. Combines diffusion prior regularization with diffusion model decoding.

Result: Achieves competitive FID of 1.4 on ImageNet-512 with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. Sets new state-of-the-art FVD of 1.3 on Kinetics-600 video dataset.

Conclusion: UL framework provides effective approach for learning latent representations with diffusion models, achieving strong performance in both image and video generation tasks with improved training efficiency.

Abstract: We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder’s output noise to the prior’s minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.

[309] Operationalization of Machine Learning with Serverless Architecture: An Industrial Implementation for Harmonized System Code Prediction

Sai Vineeth Kandappareddigari, Santhoshkumar Jagadish, Gauri Verma, Ilhuicamina Contreras, Christopher Dignam, Anmol Srivastava, Benjamin Demers

Main category: cs.LG

TL;DR: A serverless MLOps framework for complete ML lifecycle management, demonstrated through HS code classification using text embeddings and deep learning models with 98% accuracy.

DetailsMotivation: To address the challenges in operationalizing ML systems for compliance-critical tasks like HS code classification, where frequent updates, ambiguous descriptions, and errors cause shipment delays and financial losses, while ensuring reproducibility, auditability, and cost-efficiency.

Method: Serverless MLOps framework with event-driven pipelines and managed services, using custom text embedding encoder with multiple deep learning architectures (Text-CNN achieving best results), automated A/B testing, auto-scaling, and standardized interfaces for model-agnostic support.
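
The paper's encoder and hyperparameters are not given here, but a generic Kim-style Text-CNN for short product descriptions looks like the sketch below; vocabulary size, channel counts, and class count are placeholders:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim-style Text-CNN: parallel 1-D convolutions of several widths over token
    embeddings, max-pooled over time, then a linear classifier over HS codes."""
    def __init__(self, vocab, emb=128, n_classes=5000, widths=(3, 4, 5), ch=100):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=0)
        self.convs = nn.ModuleList(nn.Conv1d(emb, ch, w) for w in widths)
        self.fc = nn.Linear(ch * len(widths), n_classes)

    def forward(self, tokens):                  # tokens: (B, T) int ids
        x = self.embed(tokens).transpose(1, 2)  # (B, emb, T) for Conv1d
        pooled = [c(x).relu().max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

logits = TextCNN(vocab=30000)(torch.randint(1, 30000, (2, 32)))
```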

Result: Achieved 98% accuracy on ground truth data for HS code prediction, with the pipeline ensuring reproducibility, auditability, SLA adherence, and cost-efficiency through deterministic classification with predictable latency and explainability.

Conclusion: The framework provides a replicable blueprint for operationalizing ML using serverless architecture, enabling enterprises to scale while optimizing performance and economics, with extensibility to transformer variants and LLM-based inference.

Abstract: This paper presents a serverless MLOps framework orchestrating the complete ML lifecycle (data ingestion, training, deployment, monitoring, and retraining) using event-driven pipelines and managed services. The architecture is model-agnostic, supporting diverse inference patterns through standardized interfaces, enabling rapid adaptation without infrastructure overhead. We demonstrate practical applicability through an industrial implementation for Harmonized System (HS) code prediction, a compliance-critical task where short, unstructured product descriptions are mapped to standardized codes used by customs authorities in global trade. Frequent updates and ambiguous descriptions make classification challenging, with errors causing shipment delays and financial losses. Our solution uses a custom text embedding encoder and multiple deep learning architectures, with Text-CNN achieving 98 percent accuracy on ground truth data. Beyond accuracy, the pipeline ensures reproducibility, auditability, and SLA adherence under variable loads via auto-scaling. A key feature is automated A/B testing, enabling dynamic model selection and safe promotion in production. Cost-efficiency drives model choice; while transformers may achieve similar accuracy, their long-term operational costs are significantly higher. Deterministic classification with predictable latency and explainability is prioritized, though the architecture remains extensible to transformer variants and LLM-based inference. The paper first introduces the deep learning architectures with simulations and model comparisons, then discusses industrialization through serverless architecture, demonstrating automated retraining, prediction, and validation of HS codes. This work provides a replicable blueprint for operationalizing ML using serverless architecture, enabling enterprises to scale while optimizing performance and economics.

[310] The Sound of Death: Deep Learning Reveals Vascular Damage from Carotid Ultrasound

Christoph Balada, Aida Romano-Martinez, Payal Varshney, Vincent ten Cate, Katharina Geschke, Jonas Tesarz, Paul Claßen, Alexander K. Schuster, Dativa Tibyampansha, Karl-Patrik Kresoja, Philipp S. Wild, Sheraz Ahmed, Andreas Dengel

Main category: cs.LG

TL;DR: ML framework extracts vascular damage representations from carotid ultrasound videos using hypertension as weak label, achieving CVD risk prediction comparable to conventional models.

DetailsMotivation: Cardiovascular diseases are the leading cause of mortality worldwide, but early risk detection is limited by available diagnostics. Carotid ultrasound is non-invasive and widely accessible but contains untapped structural and hemodynamic information that could provide better risk assessment.

Method: Machine learning framework that extracts clinically meaningful representations of vascular damage from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust, biologically plausible features and uses explainable AI to reveal what the model relies on.

Result: The model’s vascular damage representations strongly associate with established cardiovascular risk factors, comorbidities, and laboratory measures. High vascular damage stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models like SCORE2. Explainable AI reveals the model relies on vessel morphology and perivascular tissue characteristics.

Conclusion: Routine carotid ultrasound contains far more prognostic information than previously recognized. The approach provides a scalable, non-invasive, cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.

Abstract: Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, yet early risk detection is often limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, encodes rich structural and hemodynamic information that is largely untapped. Here, we present a machine learning (ML) framework that extracts clinically meaningful representations of vascular damage (VD) from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust features that are biologically plausible, interpretable, and strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. High VD stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2. Explainable AI analyses reveal that the model relies on vessel morphology and perivascular tissue characteristics, uncovering novel functional and anatomical signatures of vascular damage. This work demonstrates that routine carotid ultrasound contains far more prognostic information than previously recognized. Our approach provides a scalable, non-invasive, and cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.

[311] Online Learning with Improving Agents: Multiclass, Budgeted Agents and Bandit Learners

Sajad Ashkezari, Shai Ben-David

Main category: cs.LG

TL;DR: Paper studies learning with improvements model where agents can modify features for better labels, extending previous work with combinatorial dimensions for online learnability, multiclass analysis, bandit feedback, and cost modeling.

DetailsMotivation: To investigate and extend the learning with improvements model where agents can make small feature modifications to receive more desirable labels, addressing limitations in previous work.

Method: Develops combinatorial dimensions to characterize online learnability, analyzes multiclass scenarios, studies learnability under bandit feedback, and models agents’ costs for making improvements.

Result: Provides comprehensive theoretical framework for learning with improvements, extending previous results with new characterizations of learnability across different settings.

Conclusion: The paper significantly advances theoretical understanding of learning with improvements by providing new combinatorial dimensions and analyzing various practical scenarios including multiclass, bandit feedback, and cost considerations.

Abstract: We investigate the recently introduced model of learning with improvements, where agents are allowed to make small changes to their feature values to be warranted a more desirable label. We extensively extend previously published results by providing combinatorial dimensions that characterize online learnability in this model, by analyzing the multiclass setup, learnability in a bandit feedback setup, modeling agents’ cost for making improvements and more.

[312] i-PhysGaussian: Implicit Physical Simulation for 3D Gaussian Splatting

Yicheng Cao, Zhuo Huang, Yu Yao, Yiming Ying, Daoyi Dong, Tongliang Liu

Main category: cs.LG

TL;DR: i-PhysGaussian combines 3D Gaussian Splatting with implicit Material Point Method for stable physical simulation with large time steps

DetailsMotivation: Current 3D reconstruction-based simulators use explicit step-wise updates that are sensitive to time steps and suffer accuracy degradation with high-stiffness materials or quasi-static movement

Method: Couples 3D Gaussian Splatting (3DGS) with implicit Material Point Method (MPM) integrator, using implicit Newton-type optimization with GMRES solver to minimize momentum-balance residual
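
A generic Jacobian-free Newton-Krylov step of the kind described, with scipy's GMRES solving finite-difference Jacobian-vector products; the toy residual below is implicit Euler on a stiff linear ODE, not the paper's momentum-balance residual:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def implicit_step(v, residual, tol=1e-8, eps=1e-6, newton_iters=20):
    """Solve residual(v_next) = 0 for the end-of-step state with Newton's method;
    each Newton system is solved by GMRES using only residual evaluations."""
    v_next = v.copy()
    for _ in range(newton_iters):
        r = residual(v_next)
        if np.linalg.norm(r) < tol:
            break
        J = LinearOperator((v.size, v.size),
                           matvec=lambda p: (residual(v_next + eps * p) - r) / eps)
        dv, _ = gmres(J, -r)
        v_next = v_next + dv
    return v_next

# Toy stiff system dv/dt = -1000 v: the implicit Euler residual stays stable
# even at dt = 0.1, where an explicit step would blow up.
dt, v0 = 0.1, np.ones(3)
v1 = implicit_step(v0, lambda u: u - v0 + dt * 1000.0 * u)
```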

Result: Maintains stability at up to 20x larger time steps than explicit baselines, preserving structural coherence and smooth motion in complex dynamic transitions

Conclusion: The framework reduces time-step sensitivity and ensures physical consistency for more robust physical simulation

Abstract: Physical simulation predicts future states of objects based on material properties and external loads, enabling blueprints for both Industry and Engineering to conduct risk management. Current 3D reconstruction-based simulators typically rely on explicit, step-wise updates, which are sensitive to the time-step size and suffer from rapid accuracy degradation under complicated scenarios, such as high-stiffness materials or quasi-static movement. To address this, we introduce i-PhysGaussian, a framework that couples 3D Gaussian Splatting (3DGS) with an implicit Material Point Method (MPM) integrator. Unlike explicit methods, our solution obtains an end-of-step state by minimizing a momentum-balance residual through implicit Newton-type optimization with a GMRES solver. This formulation significantly reduces time-step sensitivity and ensures physical consistency. Our results demonstrate that i-PhysGaussian maintains stability at up to 20x larger time steps than explicit baselines, preserving structural coherence and smooth motion even in complex dynamic transitions.

[313] Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen

Main category: cs.LG

TL;DR: M-Attack-V2 improves black-box adversarial attacks on Large Vision-Language Models by addressing gradient variance issues in local crop matching through multi-crop alignment, auxiliary target alignment, and patch momentum techniques.

DetailsMotivation: Existing transfer-based attacks on LVLMs using local crop-level matching (like M-Attack) suffer from high-variance, nearly orthogonal gradients across iterations due to ViT translation sensitivity and structural asymmetry between source and target crops, leading to unstable optimization and poor attack performance.

Method: Proposes M-Attack-V2 with three key modules: 1) Multi-Crop Alignment (MCA) averages gradients from multiple independent local views per iteration to reduce variance; 2) Auxiliary Target Alignment (ATA) uses a small auxiliary set from semantically correlated distribution instead of aggressive target augmentation; 3) Patch Momentum replays historical crop gradients combined with refined patch-size ensemble (PE+).
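
A sketch of the Multi-Crop Alignment idea: average the alignment gradient over several independently sampled local views per iteration. Here `embed` (a surrogate vision encoder), `crop` (a random-crop transform), and the cosine objective are all assumptions of this sketch:

```python
import torch

def mca_gradient(x_adv, targets, embed, crop, n_views=8):
    """Average the crop-alignment gradient over n_views independent local views
    to damp the high-variance, near-orthogonal per-crop gradients."""
    x = x_adv.clone().requires_grad_(True)
    loss = 0.0
    for _ in range(n_views):
        sim = torch.nn.functional.cosine_similarity(embed(crop(x)), targets, dim=-1)
        loss = loss + sim.mean()
    (loss / n_views).backward()     # one averaged gradient per attack iteration
    return x.grad
```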

Result: Significantly improves transfer-based black-box attack success rates on frontier LVLMs: Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks.

Conclusion: M-Attack-V2 provides a simple, modular enhancement to M-Attack that addresses gradient variance issues in local matching, substantially improving black-box adversarial attacks on Large Vision-Language Models through better gradient stabilization and smoother target manifolds.

Abstract: Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.

[314] TIFO: Time-Invariant Frequency Operator for Stationarity-Aware Representation Learning in Time Series

Xihao Piao, Zheng Chen, Lingwei Zhu, Yushun Dong, Yasuko Matsubara, Yasushi Sakurai

Main category: cs.LG

TL;DR: TIFO is a plug-and-play method that addresses distribution shift in time series forecasting by learning stationarity-aware weights over frequency components, suppressing non-stationary frequencies to improve forecasting accuracy.

DetailsMotivation: Nonstationary time series forecasting suffers from distribution shift between training and test data due to different underlying distributions. Existing methods fail to capture time-evolving structures across samples and don't model complex temporal dependencies effectively.

Method: Proposes Time-Invariant Frequency Operator (TIFO) that learns stationarity-aware weights over the frequency spectrum across the entire dataset. It highlights stationary frequency components while suppressing non-stationary ones, mitigating distribution shift. The approach leverages Fourier transform’s implicit eigen-decomposition in frequency space.
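
A minimal version of a learned per-frequency weighting layer in the spirit of TIFO; the paper's stationarity-aware training of these weights is not reproduced here:

```python
import torch
import torch.nn as nn

class TIFOLayer(nn.Module):
    """One learnable weight per rFFT bin, shared across the dataset, that can
    damp non-stationary frequency components before the forecasting backbone."""
    def __init__(self, seq_len):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(seq_len // 2 + 1))

    def forward(self, x):                           # x: (B, T)
        spec = torch.fft.rfft(x, dim=-1)
        spec = spec * torch.sigmoid(self.weight)    # per-frequency gate in (0, 1)
        return torch.fft.irfft(spec, n=x.size(-1), dim=-1)

y = TIFOLayer(96)(torch.randn(4, 96))   # plug-and-play in front of any forecaster
```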

Result: Achieved 18 top-1 and 6 top-2 results out of 28 forecasting settings. Notable improvements: 33.3% and 55.3% average MSE improvements on ETTm2 dataset. Reduced computational costs by 60-70% compared to baseline methods, demonstrating strong scalability.

Conclusion: TIFO effectively addresses distribution shift in time series forecasting through frequency-space analysis, offering a plug-and-play solution that improves accuracy while reducing computational costs across diverse forecasting models.

Abstract: Nonstationary time series forecasting suffers from the distribution shift issue due to the different distributions that produce the training and test data. Existing methods attempt to alleviate the dependence by, e.g., removing low-order moments from each individual sample. These solutions fail to capture the underlying time-evolving structure across samples and do not model the complex time structure. In this paper, we aim to address the distribution shift in the frequency space by considering all possible time structures. To this end, we propose a Time-Invariant Frequency Operator (TIFO), which learns stationarity-aware weights over the frequency spectrum across the entire dataset. The weight representation highlights stationary frequency components while suppressing non-stationary ones, thereby mitigating the distribution shift issue in time series. To justify our method, we show that the Fourier transform of time series data implicitly induces eigen-decomposition in the frequency space. TIFO is a plug-and-play approach that can be seamlessly integrated into various forecasting models. Experiments demonstrate our method achieves 18 top-1 and 6 top-2 results out of 28 forecasting settings. Notably, it yields 33.3% and 55.3% improvements in average MSE on the ETTm2 dataset. In addition, TIFO reduces computational costs by 60%-70% compared to baseline methods, demonstrating strong scalability across diverse forecasting models.

[315] VP-VAE: Rethinking Vector Quantization via Adaptive Vector Perturbation

Linwei Zhai, Han Ding, Mingzhi Lin, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi

Main category: cs.LG

TL;DR: VP-VAE decouples representation learning from discretization by replacing quantization with structured latent perturbations during training, eliminating codebook collapse issues in VQ-VAEs.

DetailsMotivation: VQ-VAEs suffer from training instability and "codebook collapse" due to the coupling of representation learning and discrete codebook optimization, which limits their effectiveness in generative modeling.

Method: Replaces non-differentiable quantizer with distribution-consistent, scale-adaptive latent perturbations generated via Metropolis-Hastings sampling. Also introduces FSP (Finite Scalar Perturbation) variant for uniform latent assumptions that improves FSQ-style fixed quantizers.
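
A deliberately simplified, FSP-flavored version of the core idea: train against quantization-like noise instead of a quantizer, then round for real at inference. VP-VAE's Metropolis-Hastings, scale-adaptive sampler is richer than this uniform-noise stand-in:

```python
import torch

def perturb_latents(z, delta=0.25):
    """Training-time stand-in for a scalar quantizer: inject uniform noise with
    the same support as the rounding error, U(-delta/2, delta/2), so the decoder
    becomes robust to quantization without any codebook in the loop."""
    return z + (torch.rand_like(z) - 0.5) * delta

def quantize(z, delta=0.25):
    """Inference-time discretization: real rounding to a grid of step delta."""
    return torch.round(z / delta) * delta

z = torch.randn(8, 16)
z_train, z_infer = perturb_latents(z), quantize(z)
```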

Result: Improves reconstruction fidelity, achieves more balanced token usage, avoids training instability, and demonstrates effectiveness on both image and audio benchmarks.

Conclusion: VP-VAE provides a stable alternative to VQ-VAEs by decoupling representation learning from discretization, with FSP variant offering theoretical insights and practical improvements for fixed quantization approaches.

Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental to modern generative modeling, yet they often suffer from training instability and “codebook collapse” due to the inherent coupling of representation learning and discrete codebook optimization. In this paper, we propose VP-VAE (Vector Perturbation VAE), a novel paradigm that decouples representation learning from discretization by eliminating the need for an explicit codebook during training. Our key insight is that, from the neural network’s viewpoint, performing quantization primarily manifests as injecting a structured perturbation in latent space. Accordingly, VP-VAE replaces the non-differentiable quantizer with distribution-consistent and scale-adaptive latent perturbations generated via Metropolis–Hastings sampling. This design enables stable training without a codebook while making the model robust to inference-time quantization error. Moreover, under the assumption of approximately uniform latent variables, we derive FSP (Finite Scalar Perturbation), a lightweight variant of VP-VAE that provides a unified theoretical explanation and a practical improvement for FSQ-style fixed quantizers. Extensive experiments on image and audio benchmarks demonstrate that VP-VAE and FSP improve reconstruction fidelity and achieve substantially more balanced token usage, while avoiding the instability inherent to coupled codebook training.

[316] When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer

Shuqi Liu, Yuzhou Cao, Lei Feng, Bo An, Luke Ong

Main category: cs.LG

TL;DR: Multi-expert Learning to Defer faces inherent underfitting due to expert identifiability issues, which PiCCE solves by adaptively identifying reliable experts to reduce the problem to single-expert-like learning.

DetailsMotivation: Multi-expert Learning to Defer (L2D) is more challenging than single-expert L2D due to inherent underfitting problems caused by expert identifiability issues when learning which expert to trust from a diverse pool.

Method: Proposes PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence, effectively reducing multi-expert L2D to a single-expert-like learning problem.
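
One plausible reading of the selection rule, sketched with hypothetical inputs: score each expert by the empirical product of confidence and correctness on held-out data, then defer to the best one, reducing the problem to single-expert L2D:

```python
import numpy as np

def pick_expert(expert_conf, expert_correct):
    """Sketch of 'pick the confident and correct expert': rank experts by the
    mean of confidence * correctness over held-out samples."""
    scores = (expert_conf * expert_correct).mean(axis=0)   # (n_experts,)
    return int(np.argmax(scores))

# expert_conf[i, j]: expert j's confidence on sample i; expert_correct: 0/1 hits.
rng = np.random.default_rng(0)
j_star = pick_expert(rng.random((100, 3)), rng.integers(0, 2, (100, 3)))
```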

Result: Theoretical analysis shows statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate improved performance.

Conclusion: PiCCE successfully addresses the fundamental challenges of multi-expert L2D by resolving expert identifiability issues and inherent underfitting, providing a practical solution for real-world applications.

Abstract: Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier’s underfitting becomes inherent, which seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem absent in the single-expert case and renders existing underfitting remedies failed. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi expert underfitting. We further prove its statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.

[317] TimeOmni-VL: Unified Models for Time Series Understanding and Generation

Tong Guan, Sheng Pan, Johan Barthelemy, Zhao Li, Yujun Cai, Cesare Alippi, Ming Jin, Shirui Pan

Main category: cs.LG

TL;DR: TimeOmni-VL is a vision-centric framework that unifies time series understanding and generation through bidirectional mapping between time series and images, enabling understanding-guided generation.

DetailsMotivation: Current time series modeling faces a divide between numerical generation (which relies on superficial pattern matching) and semantic understanding (which struggles with high-fidelity numerical output). While unified multimodal models have bridged this gap in vision, this potential remains untapped for time series.

Method: Two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI) for near-lossless TS2I and I2TS conversions. (2) Understanding-guided generation using a novel TSUMM-Suite dataset with six understanding tasks coupled with two generation tasks, and a calibrated Chain-of-Thought approach.

Result: The unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.

Conclusion: TimeOmni-VL is the first vision-centric framework to unify time series understanding and generation, leveraging understanding as an explicit control signal for high-fidelity generation.

Abstract: Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics, coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.

[318] Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, Sijia Liu

Main category: cs.LG

TL;DR: ZO-Muon: A novel zeroth-order optimization method that combines subspace projection with Muon-style spectral optimization to accelerate convergence and improve accuracy for fine-tuning large models without backpropagation.

DetailsMotivation: Zeroth-order optimization is memory-efficient for fine-tuning large models but suffers from accuracy vs. query efficiency trade-offs. The paper aims to overcome this limitation by unifying subspace methods with spectral optimization techniques.

Method: Proposes ZO-Muon framework that combines: (1) projection-based subspace view to reduce gradient estimation variance using low-rank structure of model updates, and (2) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients.
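
A toy instance of subspace gradient orthogonalization: a two-point zeroth-order estimate along a random low-rank direction, orthogonalized via SVD (the idealized form of the Newton-Schulz iteration Muon uses); the scales and probe distribution are illustrative:

```python
import torch

def zo_muon_update(W, loss_fn, lr=1e-3, mu=1e-3, rank=8):
    """One sketch step: estimate the gradient with two function evaluations along
    a random low-rank probe, then apply the spectrally orthogonalized update."""
    U = torch.randn(W.size(0), rank) / rank ** 0.5
    V = torch.randn(rank, W.size(1)) / rank ** 0.5
    D = U @ V                                            # low-rank probe direction
    g = (loss_fn(W + mu * D) - loss_fn(W - mu * D)) / (2 * mu)
    G = g * D                                            # zeroth-order gradient estimate
    P, _, Qt = torch.linalg.svd(G, full_matrices=False)  # orthogonalize: G -> P Q^T
    return W - lr * (P @ Qt)

W = torch.randn(16, 16)
W = zo_muon_update(W, lambda M: (M ** 2).sum())          # no backprop anywhere
```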

Result: ZO-Muon significantly accelerates convergence and achieves win-win improvement in accuracy and query/runtime efficiency. Requires only 24.7% of queries compared to MeZO baseline for LLM fine-tuning on SST-2, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.

Conclusion: The unified framework of subspace gradient orthogonalization enables efficient zeroth-order optimization for large-scale model fine-tuning, offering substantial improvements over existing methods in both query efficiency and accuracy.

Abstract: Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and vision transformers (ViTs) demonstrate that ZO-Muon significantly accelerates convergence and achieves a win-win improvement in accuracy and query/runtime efficiency. Notably, compared to the popular MeZO baseline, ZO-Muon requires only 24.7% of the queries to reach the same SST-2 performance for LLM fine-tuning, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.

[319] In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks

Ayush Goel, Arjun Kohli, Sarvagya Somvanshi

Main category: cs.LG

TL;DR: Empirical comparison of transformer vs linear attention models on in-context learning for linear regression tasks, analyzing MSE, convergence, generalization, and depth effects.

DetailsMotivation: To understand how different attention mechanisms (quadratic vs linear) perform in-context learning on simple function classes like linear regression, and to identify their similarities and limitations.

Method: Empirical evaluation of transformer and linear attention models on the canonical linear regression task from Garg et al., measuring learning quality (MSE), convergence, generalization behavior, and effects of increasing model depth.
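
The two mechanisms under comparison, in minimal form; the linear variant uses the elu(x)+1 feature map common in the linear-transformers literature, which may differ from the paper's exact setup:

```python
import torch

def softmax_attention(Q, K, V):
    """Quadratic attention: explicit (n x n) score matrix."""
    A = torch.softmax(Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5, dim=-1)
    return A @ V

def linear_attention(Q, K, V):
    """Linear attention: positive feature maps let K^T V be summarized once,
    giving cost linear in sequence length."""
    Qp = torch.nn.functional.elu(Q) + 1
    Kp = torch.nn.functional.elu(K) + 1
    KV = Kp.transpose(-2, -1) @ V                            # (d, d) summary
    Z = Qp @ Kp.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer, (B, n, 1)
    return (Qp @ KV) / Z

x = torch.randn(1, 64, 32)
out_quadratic = softmax_attention(x, x, x)
out_linear = linear_attention(x, x, x)
```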

Result: Results illustrate both similarities and limitations of linear attention relative to quadratic attention in the in-context learning setting for linear regression.

Conclusion: The study provides empirical insights into how different attention mechanisms behave for in-context learning on simple function classes, highlighting trade-offs between linear and quadratic attention.

Abstract: Recent work has demonstrated that transformers and linear attention models can perform in-context learning (ICL) on simple function classes, such as linear regression. In this paper, we empirically study how these two attention mechanisms differ in their ICL behavior on the canonical linear-regression task of Garg et al. We evaluate learning quality (MSE), convergence, and generalization behavior of each architecture. We also analyze how increasing model depth affects ICL performance. Our results illustrate both the similarities and limitations of linear attention relative to quadratic attention in this setting.

[320] Continual uncertainty learning

Heisei Yonezawa, Ansei Yonezawa, Itsuro Kajiwara

Main category: cs.LG

TL;DR: A curriculum-based continual learning framework for robust control of nonlinear systems with multiple uncertainties, using DRL with domain randomization and model-based controllers to handle sim-to-real transfer in automotive vibration control.

DetailsMotivation: Robust control of mechanical systems with multiple uncertainties is challenging, especially when nonlinear dynamics and operating conditions are intertwined. Existing DRL with domain randomization often leads to sub-optimal policies and poor learning efficiency when handling all uncertainties simultaneously.

Method: Proposes a curriculum-based continual learning framework that decomposes complex control problems with multiple uncertainties into sequential learning tasks. Uses extended plant sets with gradually expanded uncertainties, incorporates model-based controllers for baseline performance, and employs residual learning for task-specific DRL optimization.
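
A schematic of the residual-learning curriculum, with `agent`, `mbc`, and `make_env` as placeholders for an RL stack and a gym-style environment API; only the control flow is meant to be informative:

```python
def train_curriculum(agent, mbc, make_env, stages, episodes_per_stage=1000):
    """Curriculum over uncertainty configurations: each stage widens the plant
    set, while the model-based controller (MBC) supplies the baseline action and
    the DRL agent only learns the residual on top of it."""
    for uncertainty_range in stages:            # e.g. widening stiffness/damping sets
        env = make_env(uncertainty_range)       # randomized plant set for this task
        for _ in range(episodes_per_stage):
            obs, done = env.reset(), False
            while not done:
                u = mbc(obs) + agent.act(obs)   # residual action on top of MBC baseline
                obs, reward, done, _ = env.step(u)
                agent.observe(obs, reward, done)
```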

Result: The method was applied to automotive powertrain vibration control, demonstrating robustness against structural nonlinearities and dynamic variations with successful sim-to-real transfer.

Conclusion: The proposed framework effectively handles multiple uncertainties in nonlinear systems through sequential learning, prevents catastrophic forgetting, and achieves robust control with practical industrial application.

Abstract: Robust control of mechanical systems with multiple uncertainties remains a fundamental challenge, particularly when nonlinear dynamics and operating-condition variations are intricately intertwined. While deep reinforcement learning (DRL) combined with domain randomization has shown promise in mitigating the sim-to-real gap, simultaneously handling all sources of uncertainty often leads to sub-optimal policies and poor learning efficiency. This study formulates a new curriculum-based continual learning framework for robust control problems involving nonlinear dynamical systems in which multiple sources of uncertainty are simultaneously superimposed. The key idea is to decompose a complex control problem with multiple uncertainties into a sequence of continual learning tasks, in which strategies for handling each uncertainty are acquired sequentially. The original system is extended into a finite set of plants whose dynamic uncertainties are gradually expanded and diversified as learning progresses. The policy is stably updated across the entire plant sets associated with tasks defined by different uncertainty configurations without catastrophic forgetting. To ensure learning efficiency, we jointly incorporate a model-based controller (MBC), which guarantees a shared baseline performance across the plant sets, into the learning process to accelerate the convergence. This residual learning scheme facilitates task-specific optimization of the DRL agent for each uncertainty, thereby enhancing sample efficiency. As a practical industrial application, this study applies the proposed method to designing an active vibration controller for automotive powertrains. We verified that the resulting controller is robust against structural nonlinearities and dynamic variations, realizing successful sim-to-real transfer.

[321] SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch

Ron Shapira Weber, Oren Freifeld

Main category: cs.LG

TL;DR: A PyTorch library for GPU-accelerated SoftDTW computation with improved numerical stability, memory efficiency, and support for arbitrary sequence lengths.

DetailsMotivation: Existing GPU implementations of SoftDTW have limitations including hard sequence-length caps of 1024, numerical instability in backward passes for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors.

Method: The library introduces three key innovations: (1) tiled anti-diagonal kernel execution to remove sequence-length constraints, (2) log-space backward pass to prevent floating-point overflow, and (3) fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor.
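
The numerical heart of the stability fix is computing the soft-min of the three DTW predecessors in log space; below is a reference CPU recursion (the library's CUDA kernels instead sweep anti-diagonals in tiles):

```python
import numpy as np

def soft_min_logspace(a, b, c, gamma):
    """softmin(x) = -gamma * logsumexp(-x / gamma), computed with the max-shift
    trick so small gamma does not overflow the exponentials."""
    z = -np.array([a, b, c]) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(D, gamma=0.1):
    """O(NM) SoftDTW recursion over a pairwise cost matrix D of shape (N, M)."""
    N, M = D.shape
    R = np.full((N + 1, M + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            R[i, j] = D[i - 1, j - 1] + soft_min_logspace(
                R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma)
    return R[N, M]

print(soft_dtw(np.random.rand(10, 12), gamma=0.01))   # stable at small gamma
```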

Result: The implementation achieves up to 98% memory reduction compared to prior work, supports arbitrary sequence lengths, provides full PyTorch autograd integration, and includes Soft-DTW Barycenter computation.

Conclusion: softdtw-cuda-torch provides a robust, memory-efficient GPU implementation of SoftDTW that overcomes key limitations of existing solutions, making it suitable for large-scale sequence alignment tasks.

Abstract: We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW Barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.

[322] CounterFlowNet: From Minimal Changes to Meaningful Counterfactual Explanations

Oleksii Furman, Patryk Marszałek, Jan Masłowski, Piotr Gaiński, Maciej Zięba, Marek Śmieja

Main category: cs.LG

TL;DR: CounterFlowNet uses Generative Flow Networks to generate sparse, high-quality counterfactual explanations for tabular data with heterogeneous features while satisfying user constraints.

DetailsMotivation: Existing counterfactual explanation methods struggle with generating multiple high-quality explanations that are sparse, work with heterogeneous tabular data, and respect user-defined constraints.

Method: Formulates counterfactual generation as sequential feature modification using conditional Generative Flow Networks (GFlowNet), trained to sample CFs proportionally to a reward function encoding validity, sparsity, proximity, and plausibility.
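
A sketch of the inference-time action masking, under an assumed action space of one increase/decrease pair per feature: immutability forbids both edits, monotonicity forbids the decrease, and no retraining is needed:

```python
import torch

def mask_actions(logits, immutable, monotone_up):
    """logits: (B, 2F), one 'increase' and one 'decrease' action per feature
    (an assumption of this sketch); immutable / monotone_up: (F,) bool masks."""
    inc, dec = logits.chunk(2, dim=-1)                       # (B, F) each
    inc = inc.masked_fill(immutable, float("-inf"))
    dec = dec.masked_fill(immutable | monotone_up, float("-inf"))
    return torch.cat([inc, dec], dim=-1)

probs = torch.softmax(mask_actions(torch.randn(4, 6),
                                   torch.tensor([True, False, False]),
                                   torch.tensor([False, True, False])), dim=-1)
```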

Result: Experiments on eight datasets show CounterFlowNet achieves superior trade-offs between validity, sparsity, plausibility, and diversity while fully satisfying given constraints.

Conclusion: CounterFlowNet provides an effective generative approach for generating high-quality, sparse counterfactual explanations for tabular data with heterogeneous features and user constraints.

Abstract: Counterfactual explanations (CFs) provide human-interpretable insights into model’s predictions by identifying minimal changes to input features that would alter the model’s output. However, existing methods struggle to generate multiple high-quality explanations that (1) affect only a small portion of the features, (2) can be applied to tabular data with heterogeneous features, and (3) are consistent with the user-defined constraints. We propose CounterFlowNet, a generative approach that formulates CF generation as sequential feature modification using conditional Generative Flow Networks (GFlowNet). CounterFlowNet is trained to sample CFs proportionally to a user-specified reward function that can encode key CF desiderata: validity, sparsity, proximity and plausibility, encouraging high-quality explanations. The sequential formulation yields highly sparse edits, while a unified action space seamlessly supports continuous and categorical features. Moreover, actionability constraints, such as immutability and monotonicity of features, can be enforced at inference time via action masking, without retraining. Experiments on eight datasets under two evaluation protocols demonstrate that CounterFlowNet achieves superior trade-offs between validity, sparsity, plausibility, and diversity with full satisfaction of the given constraints.

[323] Structured Prototype-Guided Adaptation for EEG Foundation Models

Jingying Ma, Feng Wu, Yucheng Xing, Qika Lin, Tianyu Liu, Chenyu Liu, Ziyu Jia, Mengling Feng

Main category: cs.LG

TL;DR: SCOPE framework improves EEG foundation model fine-tuning under limited supervision by using structured prototypes and confidence-aware pseudo-labels with lightweight adapters.

DetailsMotivation: EEG foundation models perform poorly with limited subject-level supervision in clinical settings due to structural mismatch between noisy supervision and highly plastic parameter spaces.

Method: Two-stage framework: 1) Learn geometry-regularized task priors, construct balanced class-level prototypes, produce confidence-aware pseudo-labels; 2) ProAdapter adapts frozen EEG foundation models via lightweight adapter conditioned on structured prototypes.
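
A compact sketch of stage one, assuming precomputed embeddings: class-mean prototypes, cosine agreement as confidence, and a threshold that filters unreliable pseudo-labels:

```python
import torch
import torch.nn.functional as F

def prototypes_and_pseudolabels(emb, labels, emb_unlab, n_classes, thresh=0.8):
    """Balanced class-level prototypes from labeled embeddings, then
    confidence-aware pseudo-labels for unlabeled data, kept only when cosine
    agreement with the nearest prototype is high (a sketch of SCOPE stage one)."""
    protos = torch.stack([emb[labels == c].mean(0) for c in range(n_classes)])
    sims = F.cosine_similarity(emb_unlab.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    conf, pseudo = sims.max(dim=1)
    keep = conf > thresh                      # drop low-agreement (unreliable) signals
    return protos, pseudo[keep], keep

protos, pseudo, keep = prototypes_and_pseudolabels(
    torch.randn(40, 32), torch.randint(0, 4, (40,)), torch.randn(100, 32), n_classes=4)
```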

Result: SCOPE consistently achieves strong performance and efficiency across three EEG tasks and five foundation model backbones under label-limited cross-subject settings.

Conclusion: SCOPE addresses the structural mismatch in EEG foundation model fine-tuning, enabling effective adaptation with limited supervision through structured prototype guidance.

Abstract: Electroencephalography (EEG) foundation models (EFMs) have achieved strong performance under full fine-tuning but exhibit poor generalization when subject-level supervision is limited, a common constraint in real-world clinical settings. We show that this failure stems not merely from limited supervision, but from a structural mismatch between noisy, limited supervision and the highly plastic parameter space of EFMs. To address this challenge, we propose SCOPE, a Structured COnfidence-aware Prototype-guided adaptation framework for EFM fine-tuning. SCOPE follows a two-stage pipeline. In the first stage, we construct reliable external supervision by learning geometry-regularized task priors, constructing balanced class-level prototypes over the resulting embeddings, and producing confidence-aware pseudo-labels from their agreement to filter unreliable signals on unlabeled data. In the second stage, we introduce ProAdapter, which adapts frozen EEG foundation models via a lightweight adapter conditioned on the structured prototypes. Experiments across three EEG tasks and five foundation model backbones demonstrate that SCOPE consistently achieves strong performance and efficiency under label-limited cross-subject settings.

[324] Learning a Latent Pulse Shape Interface for Photoinjector Laser Systems

Alexander Klemps, Denis Ilia, Pradeep Kr. Banerjee, Ye Chen, Henrik Tünnermann, Nihat Ay

Main category: cs.LG

TL;DR: A generative modeling framework using Wasserstein Autoencoders learns a differentiable latent interface for laser pulse shaping in photoinjectors, enabling efficient exploration of pulse design space and reducing reliance on expensive simulations.

DetailsMotivation: Systematic exploration of laser pulse shaping design space in photoinjectors is limited by the computational cost of brute-force pulse propagation simulations, creating a need for more efficient methods to optimize electron beam quality.

Method: Uses Wasserstein Autoencoders to learn a differentiable latent interface between pulse shaping and downstream beam dynamics, creating a continuous and interpretable latent space that captures pulse characteristics and enables smooth interpolation between pulse types.
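
Latent interpolation of the kind used to probe the manifold, assuming trained `encoder`/`decoder` halves of the WAE with batch-shaped inputs and a flat latent; a minimal sketch:

```python
import torch

@torch.no_grad()
def interpolate_pulses(encoder, decoder, pulse_a, pulse_b, steps=8):
    """Smooth transitions between pulse types via linear interpolation in the
    learned latent space; inputs are (1, pulse_len), latents (1, latent_dim)."""
    za, zb = encoder(pulse_a), encoder(pulse_b)
    ts = torch.linspace(0, 1, steps).view(-1, 1)
    return decoder((1 - ts) * za + ts * zb)   # (steps, pulse_len) intermediate pulses
```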

Result: The learned latent space is continuous, interpretable, and maintains high-fidelity reconstructions. Pulse families trace coherent trajectories, and the model generalizes from simulated to real experimental pulse measurements, accurately reconstructing pulses and embedding them consistently.

Conclusion: The generative modeling approach reduces reliance on expensive pulse-propagation simulations and facilitates downstream beam dynamics simulation and analysis by providing an efficient latent representation of laser pulse shapes.

Abstract: Controlling the longitudinal laser pulse shape in photoinjectors of Free-Electron Lasers is a powerful lever for optimizing electron beam quality, but systematic exploration of the vast design space is limited by the cost of brute-force pulse propagation simulations. We present a generative modeling framework based on Wasserstein Autoencoders to learn a differentiable latent interface between pulse shaping and downstream beam dynamics. Our empirical findings show that the learned latent space is continuous and interpretable while maintaining high-fidelity reconstructions. Pulse families such as higher-order Gaussians trace coherent trajectories, while standardizing the temporal pulse lengths shows a latent organization correlated with pulse energy. Analysis via principal components and Gaussian Mixture Models reveals a well behaved latent geometry, enabling smooth transitions between distinct pulse types via linear interpolation. The model generalizes from simulated data to real experimental pulse measurements, accurately reconstructing pulses and embedding them consistently into the learned manifold. Overall, the approach reduces reliance on expensive pulse-propagation simulations and facilitates downstream beam dynamics simulation and analysis.

[325] RLGT: A reinforcement learning framework for extremal graph theory

Ivan Damnjanović, Uroš Milivojević, Irena Đorđević, Dragan Stevanović

Main category: cs.LG

TL;DR: RLGT is a reinforcement learning framework for graph theory that systematizes previous work, supports various graph types, and aims to facilitate RL-based research in extremal graph theory.

DetailsMotivation: To create a unified RL framework for graph theory that systematizes previous scattered work, supports diverse graph types (undirected/directed, with/without loops, multiple edge colors), and provides optimized computational performance for extremal graph theory research.

Method: Develops RLGT framework with efficient graph representation, modular design, and computational optimizations to support reinforcement learning applications in graph theory, building on previous Deep Cross-Entropy RL approaches for combinatorial optimization problems.

Result: A novel RL framework that supports various graph types and provides systematic infrastructure for RL-based research in extremal graph theory, enabling future work in this interdisciplinary area.

Conclusion: RLGT successfully systematizes previous RL work in graph theory and provides a flexible, optimized framework that should facilitate future research at the intersection of reinforcement learning and extremal graph theory.

Abstract: Reinforcement learning (RL) is a subfield of machine learning that focuses on developing models that can autonomously learn optimal decision-making strategies over time. In a recent pioneering paper, Wagner demonstrated how the Deep Cross-Entropy RL method can be applied to tackle various problems from extremal graph theory by reformulating them as combinatorial optimization problems. Subsequently, many researchers became interested in refining and extending the framework introduced by Wagner, thereby creating various RL environments specialized for graph theory. Moreover, a number of problems from extremal graph theory were solved through the use of RL. In particular, several inequalities concerning the Laplacian spectral radius of graphs were refuted, new lower bounds were obtained for certain Ramsey numbers, and contributions were made to the Turán-type extremal problem in which the forbidden structures are cycles of length three and four. Here, we present Reinforcement Learning for Graph Theory (RLGT), a novel RL framework that systematizes the previous work and provides support for both undirected and directed graphs, with or without loops, and with an arbitrary number of edge colors. The framework efficiently represents graphs and aims to facilitate future RL-based research in extremal graph theory through optimized computational performance and a clean and modular design.
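
The Deep Cross-Entropy loop that RLGT systematizes is easy to sketch. Below is a minimal, dependency-light illustration on undirected graphs with a toy reward (edge count penalized by triangle count); the reward, population size, and smoothing constants are assumptions for illustration, not the RLGT API, whose environments and optimizations are far richer.

```python
import numpy as np

def count_triangles(adj):
    # Triangles in a simple undirected graph: trace(A^3) / 6.
    return int(np.trace(adj @ adj @ adj) // 6)

def decode(bits, n):
    # Map a vector of upper-triangular edge bits to an adjacency matrix.
    adj = np.zeros((n, n), dtype=int)
    adj[np.triu_indices(n, k=1)] = bits
    return adj + adj.T

def reward(bits, n):
    # Toy extremal objective: maximize edges while avoiding triangles.
    adj = decode(bits, n)
    return adj.sum() // 2 - 10 * count_triangles(adj)

def cross_entropy_search(n=10, pop=200, elite_frac=0.1, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m = n * (n - 1) // 2                    # number of candidate edges
    probs = np.full(m, 0.5)                 # independent Bernoulli policy
    for _ in range(iters):
        samples = (rng.random((pop, m)) < probs).astype(int)
        scores = np.array([reward(s, n) for s in samples])
        elite = samples[np.argsort(scores)[-int(pop * elite_frac):]]
        probs = 0.9 * probs + 0.1 * elite.mean(axis=0)   # smoothed update
    return probs

print(cross_entropy_search().round(2))
```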

[326] Efficient privacy loss accounting for subsampling and random allocation

Vitaly Feldman, Moshe Shenfeld

Main category: cs.LG

TL;DR: The paper presents an efficient method to compute privacy loss distributions for random allocation sampling in differential privacy, showing it matches or beats Poisson sampling for DP-SGD training.

DetailsMotivation: Random allocation sampling has shown utility advantages in differentially private optimization and aggregation, but existing theoretical analyses have shortcomings: privacy parameters aren't tight due to approximations, and computed parameters (hockey stick or Rényi divergence) introduce overheads in privacy loss accounting.

Method: Develops new tools for general privacy loss accounting based on PLD (privacy loss distribution) realization, enabling efficient computation of PLD for random allocation applied to any differentially private algorithm. Extends accurate privacy loss accounting to subsampling which previously required manual noise-mechanism-specific analysis.

Result: Demonstrates that random allocation’s privacy-utility trade-off for Gaussian mechanism is at least as good as Poisson subsampling, and random allocation is better suited for training via DP-SGD.

Conclusion: Random allocation sampling provides practical advantages for differential privacy applications, particularly for DP-SGD training, with efficient PLD computation enabling better privacy-utility trade-offs compared to traditional Poisson sampling.

Abstract: We consider the privacy amplification properties of a sampling scheme in which a user’s data is used in $k$ steps chosen randomly and uniformly from a sequence (or set) of $t$ steps. This sampling scheme has been recently applied in the context of differentially private optimization (Chua et al., 2024a; Choquette-Choo et al., 2025) and communication-efficient high-dimensional private aggregation (Asi et al., 2025), where it was shown to have utility advantages over the standard Poisson sampling. Theoretical analyses of this sampling scheme (Feldman & Shenfeld, 2025; Dong et al., 2025) lead to bounds that are close to those of Poisson sampling, yet still have two significant shortcomings. First, in many practical settings, the resulting privacy parameters are not tight due to the approximation steps in the analysis. Second, the computed parameters are either the hockey stick or Rényi divergence, both of which introduce overheads when used in privacy loss accounting. In this work, we demonstrate that the privacy loss distribution (PLD) of random allocation applied to any differentially private algorithm can be computed efficiently. When applied to the Gaussian mechanism, our results demonstrate that the privacy-utility trade-off for random allocation is at least as good as that of Poisson subsampling. In particular, random allocation is better suited for training via DP-SGD. To support these computations, our work develops new tools for general privacy loss accounting based on a notion of PLD realization. This notion allows us to extend accurate privacy loss accounting to subsampling, which previously required manual noise-mechanism-specific analysis.
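
For intuition on what PLD accounting computes, here is a toy grid-based accountant for the plain Gaussian mechanism; the grid bounds, resolution, and parameter values are arbitrary assumptions, and the paper's random-allocation accountant is considerably more sophisticated.

```python
import numpy as np

# Gaussian mechanism with sensitivity 1 and noise sigma: the privacy loss
# is N(mu, 2*mu) with mu = 1 / (2 * sigma^2). Composing k mechanisms adds
# the per-step losses, so the composed PLD is a k-fold convolution.
sigma, k, eps = 2.0, 8, 1.0
h, l_min, l_max = 1e-2, -5.0, 5.0        # grid resolution and range
grid = np.arange(l_min, l_max, h)

mu = 1.0 / (2.0 * sigma**2)
pmf = np.exp(-(grid - mu) ** 2 / (4.0 * mu))
pmf /= pmf.sum()                         # discretized single-step PLD

out = pmf
for _ in range(k - 1):
    out = np.convolve(out, pmf)          # support now starts at k * l_min
losses = k * l_min + h * np.arange(len(out))

# Hockey-stick divergence read off the composed PLD:
# delta(eps) = E[(1 - exp(eps - L))_+].
delta = float(np.sum(np.maximum(0.0, 1.0 - np.exp(eps - losses)) * out))
print(f"delta({eps}) after {k} compositions: {delta:.3e}")
```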

[327] LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala, Qisai Liu, Cody Fleming, Soumik Sarkar

Main category: cs.LG

TL;DR: LexiSafe: A lexicographic offline RL framework for safety-critical cyber-physical systems that prioritizes safety over reward optimization with theoretical guarantees.

DetailsMotivation: Existing offline safe RL methods lack structural mechanisms to prevent safety drift in cyber-physical systems where safety violations during training are unacceptable and only pre-collected data is available.

Method: Proposes LexiSafe with two formulations: LexiSafe-SC for single safety cost with safety-violation and performance-suboptimality bounds, and LexiSafe-MC for hierarchical safety requirements with multiple safety costs, both using lexicographic prioritization with structural bias.

Result: Empirically demonstrates reduced safety violations and improved task performance compared to constrained offline baselines, with theoretical sample-complexity guarantees.

Conclusion: LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making by unifying lexicographic prioritization with structural bias.

Abstract: Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.

[328] Flickering Multi-Armed Bandits

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Main category: cs.LG

TL;DR: Flickering Multi-Armed Bandits (FMAB) introduces a new bandit framework where available arms change dynamically based on previous selections, modeled as random graph processes with local movement constraints.

DetailsMotivation: Traditional multi-armed bandits assume a fixed set of available arms, but many real-world applications involve dynamically changing action spaces where available options depend on previous choices, such as in robotics navigation or resource-constrained environments.

Method: Proposes a two-phase algorithm: (1) exploration using lazy random walks to identify optimal arm, and (2) navigation and commitment for exploitation. Analyzes under i.i.d. Erdős-Rényi and Edge-Markovian random graph processes.

Result: Establishes high-probability and expected sublinear regret bounds for both graph settings. Shows exploration cost is near-optimal with matching information-theoretic lower bound. Validates with numerical simulations including robotic ground vehicle scenario.

Conclusion: FMAB provides a principled framework for bandit problems with dynamically changing action spaces, with theoretical guarantees and practical applications in robotics and constrained environments.

Abstract: We introduce Flickering Multi-Armed Bandits (FMAB), a new MAB framework where the set of available arms (or actions) can change at each round, and the available set at any time may depend on the agent’s previously selected arm. We model this constrained, evolving availability using random graph processes, where arms are nodes and the agent’s movement is restricted to its local neighborhood. We analyze this problem under two random graph models: an i.i.d. Erdős–Rényi (ER) process and an Edge-Markovian process. We propose and analyze a two-phase algorithm that employs a lazy random walk for exploration to efficiently identify the optimal arm, followed by a navigation and commitment phase for exploitation. We establish high-probability and expected sublinear regret bounds for both graph settings. We show that the exploration cost of our algorithm is near-optimal by establishing a matching information-theoretic lower bound for this problem class, highlighting the fundamental cost of exploration under local-move constraints. We complement our theoretical guarantees with numerical simulations, including a scenario of a robotic ground vehicle scouting a disaster-affected region.
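
To visualize the setting, the toy simulation below puts a plain UCB rule on top of the FMAB availability constraint: arms are nodes and each round the agent may only stay put or move to a neighbor under a fresh i.i.d. Erdős–Rényi draw. This is a deliberately simplified stand-in for the paper's two-phase (lazy-random-walk then commit) algorithm; all parameters are assumptions for illustration.

```python
import numpy as np

def simulate_fmab(n_arms=20, p_edge=0.3, horizon=2000, seed=0):
    # Arms are nodes; movement is restricted to the current local
    # neighborhood of an i.i.d. Erdos-Renyi availability process.
    rng = np.random.default_rng(seed)
    means = rng.uniform(0.1, 0.9, n_arms)
    node, counts, sums, regret = 0, np.zeros(n_arms), np.zeros(n_arms), 0.0
    for t in range(horizon):
        neighbors = np.flatnonzero(rng.random(n_arms) < p_edge)
        choices = np.append(neighbors, node)     # laziness: staying is free
        ucb = sums / np.maximum(counts, 1) + np.sqrt(
            2 * np.log(t + 1) / np.maximum(counts, 1))
        node = int(choices[np.argmax(ucb[choices])])
        reward = float(rng.random() < means[node])
        counts[node] += 1
        sums[node] += reward
        regret += means.max() - means[node]
    return regret

print("cumulative regret:", round(simulate_fmab(), 1))
```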

[329] SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

Rong Fu, Zijian Zhang, Wenxin Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong

Main category: cs.LG

TL;DR: SubQuad is an end-to-end pipeline for comparative analysis of adaptive immune repertoires that addresses computational bottlenecks through antigen-aware near-subquadratic retrieval, GPU acceleration, multimodal fusion, and fairness-constrained clustering.

DetailsMotivation: The motivation is to overcome two major bottlenecks in population-scale immune repertoire analysis: 1) near-quadratic computational cost of pairwise affinity evaluations, and 2) dataset imbalances that obscure clinically important minority clonotypes, which hampers comparative analysis at population scale.

Method: The method combines several techniques: 1) antigen-aware near-subquadratic retrieval with GPU-accelerated affinity kernels, 2) learned multimodal fusion with a differentiable gating module that adaptively weights complementary alignment and embedding channels per-pair, 3) compact MinHash prefiltering to reduce candidate comparisons, and 4) automated calibration routine that enforces proportional representation of rare antigen-specific subgroups through fairness-constrained clustering.

Result: On large viral and tumor repertoires, SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity metrics.

Conclusion: By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.

Abstract: Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires, SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.
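
The MinHash prefiltering idea is standard and worth a small sketch: signatures over k-mer sets let you estimate Jaccard similarity cheaply and prune pairs before any expensive alignment. The CDR3-like strings, hash salt scheme, k-mer size, and the 0.3 threshold below are illustrative assumptions, not SubQuad's configuration.

```python
import zlib
import numpy as np

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(items, n_hashes=64):
    # One salted hash per row; the row-wise minimum over the set is the
    # classic MinHash signature.
    sig = np.full(n_hashes, 2**32, dtype=np.int64)
    for item in items:
        for j in range(n_hashes):
            h = zlib.crc32(f"{j}:{item}".encode())
            sig[j] = min(sig[j], h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    # P[min-hashes agree] equals the Jaccard similarity of the two sets.
    return float(np.mean(sig_a == sig_b))

# Prefilter: only pairs clearing the threshold proceed to the expensive
# alignment / affinity-kernel stage.
seqs = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CAWSVSDLAKNIQYF"]
sigs = [minhash_signature(kmers(s)) for s in seqs]
for i in range(len(seqs)):
    for j in range(i + 1, len(seqs)):
        est = estimated_jaccard(sigs[i], sigs[j])
        print(f"pair ({i}, {j}): estimated Jaccard = {est:.2f}",
              "-> candidate" if est >= 0.3 else "-> pruned")
```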

[330] From Subtle to Significant: Prompt-Driven Self-Improving Optimization in Test-Time Graph OOD Detection

Luzhi Wang, Xuanshuo Fu, He Zhang, Chuang Liu, Xiaobao Wang, Hongbo Liu

Main category: cs.LG

TL;DR: SIGOOD is a self-improving graph OOD detection framework that uses prompt-enhanced graphs and energy preference optimization to amplify OOD signals through iterative refinement.

DetailsMotivation: Current graph OOD detection methods use one-pass inference that can't progressively correct erroneous predictions, limiting their ability to amplify OOD signals for better detection.

Method: Proposes SIGOOD with prompt-enhanced graphs and Energy Preference Optimization (EPO) loss that iteratively optimizes prompts in a self-improving loop to amplify OOD signals.

Result: Comprehensive evaluations on 21 real-world datasets confirm that SIGOOD is effective and outperforms existing methods.

Conclusion: SIGOOD provides an effective unsupervised framework for graph OOD detection through continuous self-learning and test-time training with prompt optimization.

Abstract: Graph Out-of-Distribution (OOD) detection aims to identify whether a test graph deviates from the distribution of graphs observed during training, which is critical for ensuring the reliability of Graph Neural Networks (GNNs) when deployed in open-world scenarios. Recent advances in graph OOD detection have focused on test-time training techniques that facilitate OOD detection without accessing potential supervisory information (e.g., training data). However, most of these methods employ a one-pass inference paradigm, which prevents them from progressively correcting erroneous predictions to amplify OOD signals. To this end, we propose a Self-Improving Graph Out-of-Distribution detector (SIGOOD), an unsupervised framework that integrates continuous self-learning with test-time training for effective graph OOD detection. Specifically, SIGOOD generates a prompt to construct a prompt-enhanced graph that amplifies potential OOD signals. To optimize prompts, SIGOOD introduces an Energy Preference Optimization (EPO) loss, which leverages energy variations between the original test graph and the prompt-enhanced graph. By iteratively optimizing the prompt within the detection model in a self-improving loop, the resulting optimal prompt-enhanced graph is ultimately used for OOD detection. Comprehensive evaluations on 21 real-world datasets confirm that our SIGOOD method is effective and outperforms existing approaches. The code is at https://github.com/Ee1s/SIGOOD.
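
SIGOOD's EPO loss is built on energy differences between the original test graph and its prompt-enhanced counterpart. The snippet below shows only the standard energy score such losses start from, on made-up logits; the graph prompting and preference optimization are the paper's contribution and are not reproduced here.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    # Energy-based OOD score: E(x) = -T * logsumexp(logits / T).
    # Lower energy means more in-distribution mass under the model.
    z = logits / temperature
    m = z.max(axis=-1, keepdims=True)
    return -temperature * (m[..., 0] + np.log(np.exp(z - m).sum(axis=-1)))

in_dist = np.array([[6.0, 0.5, 0.3], [5.2, 1.0, 0.1]])   # confident logits
ood = np.array([[1.1, 0.9, 1.0]])                        # flat logits
print("in-distribution energies:", energy_score(in_dist).round(2))
print("OOD energies            :", energy_score(ood).round(2))
```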

[331] Shortcut learning in geometric knot classification

Djordje Mihajlovic, Davide Michieletto

Main category: cs.LG

TL;DR: Investigates machine learning approaches to knot classification, discovers hidden non-topological features in training data, and provides a clean dataset and code to enable proper topological classification.

DetailsMotivation: Knot classification is important in various scientific fields, and while neural networks can solve complex classification tasks, it's unclear if they can truly learn topological features or rely on non-topological shortcuts.

Method: Analyzed ML shortcut methods for knot classification, discovered hidden non-topological features in Molecular Dynamics simulation data, and developed a clean dataset and code to generate knot embeddings that explore geometric state space while preserving topology.

Result: Found that ML models were using non-topological features from training data rather than learning true topological properties, necessitating a cleaner dataset for proper topological classification.

Conclusion: Provides foundational tools (dataset and code) to accelerate development of ML models that can genuinely solve geometric knot classification challenges by removing non-topological feature shortcuts.

Abstract: Classifying the topology of closed curves is a central problem in low-dimensional topology with applications beyond mathematics spanning protein folding, polymer physics and even magnetohydrodynamics. The central problem is how to determine whether two embeddings of a closed arc are equivalent under ambient isotopy. Given the striking ability of neural networks to solve complex classification tasks, it is therefore natural to ask if the knot classification problem can be tackled using Machine Learning (ML). In this paper, we investigate generic shortcut methods employed by ML to solve the knot classification challenge and specifically discover hidden non-topological features in training data generated through Molecular Dynamics simulations of polygonal knots, which ML models use to arrive at positive classification results. We then provide a rigorous foundation for future attempts to tackle the knot classification challenge using ML by developing a publicly available (i) dataset that aims to remove the potential of non-topological feature classification and (ii) code that can generate knot embeddings that faithfully explore a chosen geometric state space with fixed knot topology. We expect that our work will accelerate the development of ML models that can solve complex geometric knot classification challenges.

[332] 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Gabriel Mongaras, Eric C. Larson

Main category: cs.LG

TL;DR: 2Mamba improves linear attention transformers by simplifying Mamba-2, enhancing A-mask and hidden state order to achieve near-softmax accuracy with better memory efficiency for long contexts.

DetailsMotivation: Linear attention transformers are more efficient than softmax attention but suffer from reduced accuracy and expressiveness. The goal is to bridge this accuracy gap while maintaining efficiency advantages.

Method: First simplify Mamba-2 to its core components (Mamba-2S), then improve the A-mask and increase hidden state order to create 2Mamba. Also investigate elements that help surpass softmax attention accuracy.

Result: 2Mamba achieves accuracy nearly comparable to softmax attention while being much more memory efficient for long context lengths, with some elements potentially surpassing softmax attention accuracy.

Conclusion: The proposed 2Mamba method successfully bridges the accuracy gap between linear and softmax attention, offering efficient long-context processing while maintaining competitive accuracy.

Abstract: Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate additions to Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.
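
Why linear attention is memory-efficient is easiest to see as a recurrence. The sketch below is a first-order, scalar-decay recurrence in the spirit of Mamba-2's state-space form; 2Mamba's higher-order hidden state and reworked A-mask are not reproduced here, and all shapes and the decay range are illustrative assumptions.

```python
import numpy as np

def gated_linear_attention(q, k, v, a):
    # Causal linear attention as a recurrence. A scalar decay a_t plays
    # the role of Mamba-2's A-mask: S_t = a_t * S_{t-1} + k_t v_t^T,
    # o_t = q_t S_t. The state S is O(d^2) regardless of context length,
    # versus the O(T^2) score matrix of softmax attention.
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))
    out = np.zeros_like(v)
    for t in range(T):
        S = a[t] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
a = rng.uniform(0.8, 1.0, T)               # learned per-step decay in practice
print(gated_linear_attention(q, k, v, a).shape)   # (8, 4)
```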

[333] A feature-stable and explainable machine learning framework for trustworthy decision-making under incomplete clinical data

Justyna Andrys-Olek, Paulina Tworek, Luca Gherardini, Mark W. Ruddock, Mary Jo Kurt, Peter Fitzgerald, Jose Sousa

Main category: cs.LG

TL;DR: CACTUS is an explainable ML framework for biomedical data that maintains feature stability under missing data conditions while achieving competitive predictive performance.

DetailsMotivation: Machine learning models in biomedical applications face limited adoption due to poor robustness, limited interpretability, and instability of learned features under realistic data perturbations like missingness, undermining trust and reproducibility.

Method: CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. It’s benchmarked against random forests and gradient boosting methods using a real-world haematuria cohort of 568 patients under controlled missing data conditions.

Result: CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Feature stability provides information complementary to conventional performance metrics.

Conclusion: Feature stability is essential for assessing trustworthiness of ML models in biomedical applications. CACTUS offers a generalizable framework for trustworthy data-driven decision support by explicitly quantifying robustness to missing data and prioritizing interpretable, stable features.

Abstract: Machine learning models are increasingly applied to biomedical data, yet their adoption in high-stakes domains remains limited by poor robustness, limited interpretability, and instability of learned features under realistic data perturbations, such as missingness. In particular, models that achieve high predictive performance may still fail to inspire trust if their key features fluctuate when data completeness changes, undermining reproducibility and downstream decision-making. Here, we present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed to address these challenges in small, heterogeneous, and incomplete clinical datasets. CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. Using a real-world haematuria cohort comprising 568 patients evaluated for bladder cancer, we benchmark CACTUS against widely used machine learning approaches, including random forests and gradient boosting methods, under controlled levels of randomly introduced missing data. We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Our results show that feature stability provides information complementary to conventional performance metrics and is essential for assessing the trustworthiness of machine learning models applied to biomedical data. By explicitly quantifying robustness to missing data and prioritising interpretable, stable features, CACTUS offers a generalizable framework for trustworthy data-driven decision support.
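
The feature-stability metric the paper argues for is simple to operationalize. Below is a generic illustration on synthetic data, assuming a crude correlation-based ranking and mean imputation; CACTUS's abstraction and classifier differ, so treat this only as a sketch of measuring top-k overlap as missingness grows.

```python
import numpy as np

def top_k_stability(X, y, k=5, missing_rates=(0.0, 0.2, 0.4), seed=0):
    # Rank features by |correlation with the label| after mean imputation,
    # then measure overlap of the top-k set with the fully observed ranking.
    rng = np.random.default_rng(seed)
    rankings = []
    for rate in missing_rates:
        Xm = X.astype(float).copy()
        Xm[rng.random(X.shape) < rate] = np.nan
        Xm = np.where(np.isnan(Xm), np.nanmean(Xm, axis=0), Xm)
        scores = np.abs([np.corrcoef(Xm[:, j], y)[0, 1]
                         for j in range(X.shape[1])])
        rankings.append(set(np.argsort(scores)[-k:]))
    return [len(rankings[0] & r) / k for r in rankings]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200) > 0).astype(float)
print(top_k_stability(X, y))   # 1.0 at rate 0; overlap fractions after that
```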

[334] MDP Planning as Policy Inference

David Tolpin

Main category: cs.LG

TL;DR: The paper proposes treating episodic MDP planning as Bayesian inference over policies, using variational sequential Monte Carlo to approximate posterior policy distributions and acting via posterior predictive sampling.

DetailsMotivation: The authors aim to develop a Bayesian approach to MDP planning that treats policies as latent variables, allowing for uncertainty quantification over optimal behavior rather than just point estimates of optimal policies.

Method: They cast episodic MDP planning as Bayesian inference where policies have unnormalized probabilities of optimality monotone in expected return. They adapt variational sequential Monte Carlo (VSMC) for discrete domains, introducing a sweep to enforce policy consistency across revisited states and coupling transition randomness across particles to reduce simulator noise confounding.

Result: The method is evaluated across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising domains. The paper analyzes the structure of inferred policy distributions and compares behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences arising from policy-level uncertainty.

Conclusion: The Bayesian inference approach to MDP planning provides a principled way to quantify uncertainty over optimal policies, with posterior predictive sampling inducing stochastic control policies through a Thompson-sampling interpretation rather than entropy regularization.

Abstract: We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
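
The core construction, a posterior over policies whose unnormalized probability is monotone in expected return, can be shown on an MDP small enough to enumerate. The toy below replaces the paper's VSMC machinery with brute-force enumeration; the 2-state dynamics, inverse temperature `beta`, and sample counts are all illustrative assumptions.

```python
import numpy as np
from itertools import product

def episode_return(policy, rng, horizon=10):
    # Tiny 2-state MDP: action 1 switches state w.p. 0.9, action 0 stays;
    # reward 1 per step spent in state 1. policy[s] is the action in s.
    s, total = 0, 0.0
    for _ in range(horizon):
        if policy[s] == 1 and rng.random() < 0.9:
            s = 1 - s
        total += float(s == 1)
    return total

rng = np.random.default_rng(0)
policies = list(product([0, 1], repeat=2))        # 4 deterministic policies
returns = np.array([np.mean([episode_return(p, rng) for _ in range(200)])
                    for p in policies])

# Unnormalized optimality probability monotone in return: exp(beta * R).
beta = 2.0
posterior = np.exp(beta * returns)
posterior /= posterior.sum()
print(dict(zip(policies, posterior.round(3))))

# Acting by posterior predictive sampling: draw one policy per episode,
# a Thompson-sampling flavor rather than entropy regularization.
chosen = policies[rng.choice(len(policies), p=posterior)]
print("acting with policy:", chosen)
```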

[335] Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking

Afroditi Kolomvaki, Fangshuo Liao, Evan Dramko, Ziyun Guang, Anastasios Kyrillidis

Main category: cs.LG

TL;DR: Two-layer neural network training with Gaussian randomly masked inputs achieves linear convergence up to error proportional to mask variance, analyzed via Neural Tangent Kernel theory.

DetailsMotivation: The paper investigates training neural networks with partially corrupted or masked inputs, which is relevant to practical scenarios like sensor networks with missing data, privacy-preserving training where features are intentionally obscured, and federated learning where users have access to incomplete feature sets. Understanding convergence guarantees in such noisy input conditions is important for real-world applications.

Method: The authors use Neural Tangent Kernel (NTK) analysis to study the convergence of two-layer ReLU networks trained with Gaussian randomly masked inputs. A key technical contribution involves resolving the randomness within the non-linear activation function, which presents analytical challenges due to the interaction between random masking and ReLU non-linearity.

Result: The analysis demonstrates that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region that is proportional to the variance of the masking process. This provides theoretical guarantees for training under noisy input conditions.

Conclusion: The paper establishes theoretical convergence guarantees for neural network training with randomly masked inputs, showing that despite input corruption, networks can still converge linearly to a region whose size depends on the noise level. This has implications for robust training in practical scenarios with incomplete or corrupted data.

Abstract: We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or noisy input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask’s variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.

[336] Variational Grey-Box Dynamics Matching

Gurjeet Sangra Singh, Frantzeska Lavda, Giangiacomo Mercatali, Alexandros Kalousis

Main category: cs.LG

TL;DR: A grey-box method that integrates incomplete physics models into flow matching generative models, learning dynamics from observational trajectories without ground-truth physics parameters.

DetailsMotivation: Bridge the gap between black-box deep generative models (flow matching/diffusion) that neglect underlying physics, and interpretable physics-based simulation models that may have missing/unknown terms and cannot fully describe real-world observations.

Method: Integrates incomplete physics models directly into generative models using structured variational distribution within flow matching framework. Uses two latent encodings: one for missing stochasticity/multi-modal velocity, and another to encode physics parameters as latent variable with physics-informed prior. Also adapts framework to handle second-order dynamics.

Result: Experiments on representative ODE/PDE problems show performance on par with or superior to fully data-driven approaches and previous grey-box baselines, while preserving physics model interpretability.

Conclusion: The method successfully bridges physics-based and data-driven approaches, enabling learning from observational trajectories alone without ground-truth physics parameters in a simulation-free manner.

Abstract: Deep generative models such as flow matching and diffusion models have shown great potential in learning complex distributions and dynamical systems, but often act as black-boxes, neglecting underlying physics. In contrast, physics-based simulation models described by ODEs/PDEs remain interpretable, but may have missing or unknown terms, unable to fully describe real-world observations. We bridge this gap with a novel grey-box method that integrates incomplete physics models directly into generative models. Our approach learns dynamics from observational trajectories alone, without ground-truth physics parameters, in a simulation-free manner that avoids scalability and stability issues of Neural ODEs. The core of our method lies in modelling a structured variational distribution within the flow matching framework, by using two latent encodings: one to model the missing stochasticity and multi-modal velocity, and a second to encode physics parameters as a latent variable with a physics-informed prior. Furthermore, we present an adaptation of the framework to handle second-order dynamics. Our experiments on representative ODE/PDE problems show that our method performs on par with or superior to fully data-driven approaches and previous grey-box baselines, while preserving the interpretability of the physics model. Our code is available at https://github.com/DMML-Geneva/VGB-DM.

[337] Learning with Boolean threshold functions

Veit Elser, Manish Krishan Lal

Main category: cs.LG

TL;DR: A method for training neural networks with strictly ±1 activations and weights using nonconvex constraint satisfaction instead of gradient descent, achieving exact solutions on Boolean tasks.

DetailsMotivation: To develop a training method for neural networks operating on Boolean data that avoids gradient-based optimization challenges, produces interpretable models with ±1 weights, and can solve discrete learning problems where standard methods struggle.

Method: Uses constraint satisfaction with divide-and-concur decomposition: one constraint enforces Boolean threshold function (BTF) consistency between inputs, weights, and outputs with margin bounds; another imposes architectural concurrence. The reflect-reflect-relax (RRR) projection algorithm reconciles these constraints.

Result: Achieves exact solutions or strong generalization on tasks including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular automata learning, outperforming gradient-based methods in discrete domains.

Conclusion: Projection-based constraint satisfaction provides a viable alternative foundation for learning in discrete neural systems, offering benefits for interpretability and efficient inference through sparse, logically-equivalent representations.

Abstract: We develop a method for training neural networks on Boolean data in which the values at all nodes are strictly $\pm 1$, and the resulting models are typically equivalent to networks whose nonzero weights are also $\pm 1$. The method replaces loss minimization with a nonconvex constraint formulation. Each node implements a Boolean threshold function (BTF), and training is expressed through a divide-and-concur decomposition into two complementary constraints: one enforces local BTF consistency between inputs, weights, and output; the other imposes architectural concurrence, equating neuron outputs with downstream inputs and enforcing weight equality across training-data instantiations of the network. The reflect-reflect-relax (RRR) projection algorithm is used to reconcile these constraints. Each BTF constraint includes a lower bound on the margin. When this bound is sufficiently large, the learned representations are provably sparse and equivalent to networks composed of simple logical gates with $\pm 1$ weights. Across a range of tasks – including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular automata learning – the method achieves exact solutions or strong generalization in regimes where standard gradient-based methods struggle. These results demonstrate that projection-based constraint satisfaction provides a viable and conceptually distinct foundation for learning in discrete neural systems, with implications for interpretability and efficient inference.

[338] Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, Jeff Schneider

Main category: cs.LG

TL;DR: RICOL uses LLMs for temporal credit assignment via retrospective in-context learning to transform sparse rewards into dense training signals, improving sample efficiency in RL.

DetailsMotivation: Learning from self-sampled data with sparse feedback is challenging in RL. Traditional temporal credit assignment methods rely on task-specific value functions that suffer from poor sample efficiency and limited generalization.

Method: Proposes RICL (Retrospective In-Context Learning) that leverages pretrained LLMs to transform sparse rewards into dense advantage functions through in-context learning. Then introduces RICOL, an online learning framework that iteratively refines policies based on RICL’s credit assignment results.

Result: RICL accurately estimates advantage functions with limited samples and identifies critical states for temporal credit assignment. On four BabyAI scenarios, RICOL achieves comparable performance to traditional RL algorithms with significantly higher sample efficiency.

Conclusion: LLMs show potential for temporal credit assignment, enabling more sample-efficient and generalizable RL paradigms by leveraging pretrained knowledge for dense supervision signals.

Abstract: Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluations on four BabyAI scenarios show that RICOL achieves convergent performance comparable to traditional online RL algorithms with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.

[339] LORA-CRAFT: Cross-layer Rank Adaptation via Frozen Tucker Decomposition of Pre-trained Attention Weights

Kasun Dewage, Marianna Pensky, Suranadi De Silva, Shankadeep Mondal

Main category: cs.LG

TL;DR: CRAFT is a parameter-efficient fine-tuning method that applies Tucker tensor decomposition across transformer layers, freezing the decomposed factors and training only small adaptation matrices for efficient adaptation.

DetailsMotivation: To bridge two lines of work in parameter-efficient fine-tuning: tensor-based methods that decompose gradient updates across layers, and SVD-based methods that operate independently per layer. The goal is to achieve efficient adaptation with minimal parameters while maintaining competitive performance.

Method: CRAFT applies Tucker tensor decomposition via Higher-Order SVD (HOSVD) directly on pre-trained attention weight matrices organized as cross-layer 3D tensors. It freezes all resulting Tucker factors and trains only lightweight square transformation matrices applied to each factor matrix, requiring only 41K adaptation parameters.

Result: Experiments on GLUE benchmark using RoBERTa-base and RoBERTa-large show CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters, which is independent of model dimension and depth at fixed Tucker ranks.

Conclusion: CRAFT provides an effective parameter-efficient fine-tuning approach that bridges tensor decomposition methods, achieving strong performance with minimal trainable parameters through cross-layer Tucker decomposition and frozen factor adaptation.

Abstract: We introduce CRAFT (Cross-layer Rank Adaptation via Frozen Tucker), a parameter-efficient fine-tuning (PEFT) method that applies Tucker tensor decomposition to pre-trained attention weight matrices stacked across transformer layers and trains only small square adaptation matrices on the resulting frozen Tucker factors. Existing tensor-based PEFT methods decompose gradient updates: LoTR applies Tucker decomposition with shared factor matrices, while SuperLoRA groups and reshapes $\Delta W$ across layers before applying Tucker decomposition. Separately, methods like PiSSA apply SVD to pre-trained weights but operate independently per layer. CRAFT bridges these two lines of work: it performs full Tucker decomposition via Higher-Order SVD (HOSVD) directly on pre-trained weights organized as cross-layer 3D tensors, freezes all resulting factors, and adapts the model through lightweight trainable transformations applied to each factor matrix. Experiments on the GLUE benchmark using RoBERTa-base and RoBERTa-large demonstrate that CRAFT achieves competitive performance with existing methods while requiring only 41K Tucker adaptation parameters, a count independent of model dimension and depth at fixed Tucker ranks.
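
HOSVD of a cross-layer weight stack is a few lines of numpy, which helps make the frozen-factor design concrete. The layer count, dimension, and ranks below are placeholders, and the "trainable" matrix `M1` is only a sketch of CRAFT-style adaptation, not its training procedure.

```python
import numpy as np

def hosvd(T, ranks):
    # Higher-Order SVD: mode-n factors are the leading left singular
    # vectors of each mode unfolding; the core is T contracted with them.
    factors = []
    for mode in range(3):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :ranks[mode]])
    U0, U1, U2 = factors
    core = np.einsum('ijk,ia,jb,kc->abc', T, U0, U1, U2)
    return core, factors

# Stack per-layer attention weights into a (layers, d, d) tensor.
L, d = 12, 64
W = np.random.default_rng(0).standard_normal((L, d, d))
core, (U0, U1, U2) = hosvd(W, ranks=(4, 16, 16))

# CRAFT-style adaptation (sketch): everything above is frozen; only small
# square matrices applied to each factor are trained, e.g. U1 @ M1.
M1 = np.eye(16)                     # trainable 16x16, initialized at identity
print(core.shape, (U1 @ M1).shape)  # (4, 16, 16) (64, 16)
```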

[340] Variational inference via radial transport

Luca Ghafourpour, Sinho Chewi, Alessio Figalli, Aram-Alexandre Pooladian

Main category: cs.LG

TL;DR: radVI: A radial profile optimization method for variational inference that improves Gaussian approximations by optimizing over radial distributions, providing better coverage of target distributions.

DetailsMotivation: Standard Gaussian approximations in variational inference often fail to capture the correct radial profile of target distributions, leading to poor coverage and inaccurate approximations.

Method: Develops radVI algorithm that optimizes over radial profiles as an add-on to existing VI schemes (Gaussian VI, Laplace approximation). Uses Wasserstein space optimization theory and radial transport map regularity properties.

Result: Provides theoretical convergence guarantees for radVI based on recent developments in Wasserstein space optimization and Caffarelli-style radial transport map regularity.

Conclusion: radVI is an effective, computationally cheap enhancement to existing VI methods that improves radial profile approximation and provides better coverage of target distributions.

Abstract: In variational inference (VI), the practitioner approximates a high-dimensional distribution $\pi$ with a simple surrogate one, often a (product) Gaussian distribution. However, in many cases of practical interest, Gaussian distributions might not capture the correct radial profile of $\pi$, resulting in poor coverage. In this work, we approach the VI problem from the perspective of optimizing over these radial profiles. Our algorithm radVI is a cheap, effective add-on to many existing VI schemes, such as Gaussian (mean-field) VI and Laplace approximation. We provide theoretical convergence guarantees for our algorithm, owing to recent developments in optimization over the Wasserstein space (the space of probability distributions endowed with the Wasserstein distance) and new regularity properties of radial transport maps in the style of Caffarelli (2000).

[341] Provably Explaining Neural Additive Models

Shahaf Bassan, Yizhak Yisrael Elboher, Tobias Ladner, Volkan Şahin, Jan Kretinsky, Matthias Althoff, Guy Katz

Main category: cs.LG

TL;DR: Efficient algorithm for generating provably cardinally-minimal explanations for Neural Additive Models (NAMs) using logarithmic verification queries

DetailsMotivation: Existing post-hoc explanation methods for neural networks lack provable guarantees and are computationally infeasible for standard networks. There's a need for efficient methods that can generate explanations with provable guarantees, particularly for more interpretable model families like NAMs.

Method: Developed a model-specific algorithm for NAMs that generates provably cardinally-minimal explanations using only a logarithmic number of verification queries in input features, after parallelized preprocessing with logarithmic runtime in precision for each univariate NAM component.

Result: The algorithm outperforms existing methods for finding relaxed subset-minimal explanations, provides provably smaller explanations, substantially reduces computation time, and offers benefits unattainable by standard sampling-based techniques.

Conclusion: For Neural Additive Models, it’s possible to efficiently generate explanations with provable guarantees, making cardinally-minimal explanations feasible and providing advantages over existing methods despite solving a more difficult task.

Abstract: Despite significant progress in post-hoc explanation methods for neural networks, many remain heuristic and lack provable guarantees. A key approach for obtaining explanations with provable guarantees is by identifying a cardinally-minimal subset of input features which by itself is provably sufficient to determine the prediction. However, for standard neural networks, this task is often computationally infeasible, as it demands a worst-case exponential number of verification queries in the number of input features, each of which is NP-hard. In this work, we show that for Neural Additive Models (NAMs), a recent and more interpretable neural network family, we can efficiently generate explanations with such guarantees. We present a new model-specific algorithm for NAMs that generates provably cardinally-minimal explanations using only a logarithmic number of verification queries in the number of input features, after a parallelized preprocessing step with logarithmic runtime in the required precision is applied to each small univariate NAM component. Our algorithm not only makes the task of obtaining cardinally-minimal explanations feasible, but even outperforms existing algorithms designed to find the relaxed variant of subset-minimal explanations - which may be larger and less informative but easier to compute - despite our algorithm solving a much more difficult task. Our experiments demonstrate that, compared to previous algorithms, our approach provides provably smaller explanations than existing works and substantially reduces the computation time. Moreover, we show that our generated provable explanations offer benefits that are unattainable by standard sampling-based techniques typically used to interpret NAMs.
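
Additivity is what makes sufficiency checks tractable for NAMs: with free features replaced by their value ranges, whether a fixed subset determines the prediction's sign is an interval computation. The greedy version below is only a sketch of that reduction; the paper's algorithm instead returns provably cardinally-minimal sets with a logarithmic number of verification queries, and the example numbers are made up.

```python
import numpy as np

def greedy_sufficient_subset(contribs, lows, highs):
    # A NAM scores x as sum_i f_i(x_i). A fixed subset is sufficient if
    # letting the remaining features range over [low, high] cannot flip
    # the score's sign. Fix the "swingiest" features first.
    order = np.argsort(-np.maximum(contribs - lows, highs - contribs))
    fixed = np.zeros(len(contribs), dtype=bool)
    for i in order:
        lo = np.where(fixed, contribs, lows).sum()
        hi = np.where(fixed, contribs, highs).sum()
        if (lo > 0) == (hi > 0):            # sign already determined
            break
        fixed[i] = True
    return np.flatnonzero(fixed)

contribs = np.array([2.0, -0.5, 0.3])       # f_i(x_i) at the given input
lows = np.array([-3.0, -1.0, -1.0])         # min of each f_i over its domain
highs = np.array([3.0, 1.0, 1.0])
print(greedy_sufficient_subset(contribs, lows, highs))   # [0 1]
```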

[342] Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz

Main category: cs.LG

TL;DR: A position paper critiquing current ECG representation learning benchmarks and proposing expanded clinical evaluation targets with improved evaluation practices.

DetailsMotivation: Current ECG representation learning benchmarks are too narrow, focusing only on arrhythmia and waveform morphology while ignoring broader clinical information encoded in ECGs. The field needs more comprehensive evaluation aligned with clinically meaningful objectives.

Method: Proposes expanding downstream evaluation to include structural heart disease and patient-level forecasting. Outlines evaluation best practices for multi-label, imbalanced settings. Empirically evaluates three representative ECG pre-training approaches across six settings including standard benchmarks and new clinical targets.

Result: When proper evaluation practices are applied, current literature conclusions about best-performing representations change. Surprisingly, a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks, making it a reasonable baseline.

Conclusion: ECG representation learning needs broader clinical benchmarks and improved evaluation practices. Random encoders should be used as baselines, and the field should expand beyond current narrow benchmarks to include structural disease and forecasting tasks.

Abstract: This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature’s current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.

[343] MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai

Main category: cs.LG

TL;DR: MASPO introduces a unified RL framework with soft Gaussian gating, mass-adaptive constraints, and asymmetric risk control to address inefficiencies in existing RLVR methods for LLMs.

DetailsMotivation: Existing RLVR methods like GRPO use rigid trust region mechanisms that don't align with LLM optimization dynamics, causing inefficient gradient utilization, insensitive probability mass constraints, and asymmetric signal reliability issues.

Method: Proposes Mass-Adaptive Soft Policy Optimization (MASPO) with three components: 1) differentiable soft Gaussian gating for better gradient utility, 2) mass-adaptive limiter to balance exploration across probability spectrum, 3) asymmetric risk controller to align updates with signal confidence.

Result: Extensive evaluations show MASPO significantly outperforms strong baselines as a robust, all-in-one RLVR solution for LLMs.

Conclusion: MASPO provides a unified framework that addresses key limitations in existing RLVR methods for LLMs through adaptive, differentiable mechanisms that better align with LLM optimization dynamics.

Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: https://anonymous.4open.science/r/ma1/README.md.
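
The contrast between hard clipping and a soft gate is easy to caricature numerically: clipping zeroes the gradient outside a window, while a Gaussian gate only down-weights it. The functions and constants below are illustrative assumptions; MASPO's actual gating, mass-adaptive limiter, and asymmetric risk controller are more involved.

```python
import numpy as np

def hard_clip_gradient_mask(ratio, eps=0.2):
    # GRPO/PPO-style hard clipping: gradients vanish outside the window.
    return ((ratio >= 1 - eps) & (ratio <= 1 + eps)).astype(float)

def soft_gaussian_gate(ratio, width=0.2):
    # A differentiable alternative: updates are down-weighted smoothly,
    # never zeroed outright, so every token keeps some gradient signal.
    return np.exp(-((ratio - 1.0) ** 2) / (2 * width ** 2))

ratios = np.linspace(0.5, 1.5, 5)          # importance ratios pi/pi_old
print("ratios   :", ratios)
print("hard clip:", hard_clip_gradient_mask(ratios))
print("soft gate:", soft_gaussian_gate(ratios).round(3))
```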

[344] A Theoretical Framework for Modular Learning of Robust Generative Models

Corinna Cortes, Mehryar Mohri, Yutao Zhong

Main category: cs.LG

TL;DR: Modular LLM training framework combining domain experts via robust gating to match monolithic performance without heuristic data weighting.

DetailsMotivation: Large-scale generative models are resource-intensive and rely on heuristic dataset weighting. The paper aims to enable modular training by combining small domain experts to match monolithic performance robustly for any data mixture.

Method: Theoretical framework for modular generative modeling with pre-trained experts combined via gating mechanism. Defines normalized gating functions space G₁, formulates minimax game to find robust gate minimizing divergence to worst-case data mixture. Proves existence using Kakutani’s fixed-point theorem. Introduces Stochastic Primal-Dual algorithm and Structural Distillation for efficient inference.

Result: Empirical results on synthetic and real-world datasets confirm modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines. Shows modularity acts as strong regularizer with generalization bounds scaling with gate complexity.

Conclusion: Modular approach combining domain experts via robust gating can match or outperform monolithic training, eliminating need for heuristic data weighting while providing theoretical guarantees and practical efficiency.

Abstract: Training large-scale generative models is resource-intensive and relies heavily on heuristic dataset weighting. We address two fundamental questions: Can we train Large Language Models (LLMs) modularly, combining small, domain-specific experts to match monolithic performance, and can we do so robustly for any data mixture, eliminating heuristic tuning? We present a theoretical framework for modular generative modeling where a set of pre-trained experts are combined via a gating mechanism. We define the space of normalized gating functions, $G_{1}$, and formulate the problem as a minimax game to find a single robust gate that minimizes divergence to the worst-case data mixture. We prove the existence of such a robust gate using Kakutani’s fixed-point theorem and show that modularity acts as a strong regularizer, with generalization bounds scaling with the lightweight gate’s complexity. Furthermore, we prove that this modular approach can theoretically outperform models retrained on aggregate data, with the gap characterized by the Jensen-Shannon Divergence. Finally, we introduce a scalable Stochastic Primal-Dual algorithm and a Structural Distillation method for efficient inference. Empirical results on synthetic and real-world datasets confirm that our modular architecture effectively mitigates gradient conflict and can robustly outperform monolithic baselines.

[345] Revisiting Weight Regularization for Low-Rank Continual Learning

Yaoyue Zheng, Yin Zhang, Joost van de Weijer, Gido M van de Ven, Shaoyi Du, Xuetao Zhang, Zhiqiang Tian

Main category: cs.LG

TL;DR: EWC-LoRA: A parameter-efficient continual learning method that applies Elastic Weight Consolidation regularization to low-rank adapters to mitigate task interference while maintaining constant storage and inference costs.

DetailsMotivation: Weight regularization techniques like EWC are underexplored in parameter-efficient continual learning (PECL) with pre-trained models. Existing low-rank CL methods use task-specific modules, but there's a need for more efficient approaches that maintain constant storage/inference costs regardless of task count.

Method: Proposes EWC-LoRA which regularizes shared low-rank updates through Elastic Weight Consolidation. Uses low-rank representation to estimate parameter importance over full-dimensional space, keeping storage and inference costs constant across tasks.

Result: Extensive experiments show EWC-LoRA achieves superior stability-plasticity trade-off compared to existing low-rank CL approaches, demonstrating weight regularization remains effective even under low-rank parameterizations.

Conclusion: Weight regularization is an effective mechanism for mitigating task interference in parameter-efficient continual learning, providing a practical, computational- and memory-efficient solution for CL with pre-trained models.

Abstract: Continual Learning (CL) with large-scale pre-trained models (PTMs) has recently gained wide attention, shifting the focus from training from scratch to continually adapting PTMs. This has given rise to a promising paradigm: parameter-efficient continual learning (PECL), where task interference is typically mitigated by assigning task-specific modules during training, such as low-rank adapters. However, weight regularization techniques, such as Elastic Weight Consolidation (EWC), a key strategy in CL, remain underexplored in this new paradigm. In this paper, we revisit weight regularization in low-rank CL as a new perspective for mitigating task interference in PECL. Unlike existing low-rank CL methods, we mitigate task interference by regularizing a shared low-rank update through EWC, thereby keeping the storage requirement and inference costs constant regardless of the number of tasks. Our proposed method EWC-LoRA leverages a low-rank representation to estimate parameter importance over the full-dimensional space. This design offers a practical, computation- and memory-efficient solution for CL with PTMs, and provides insights that may inform the broader application of regularization techniques within PECL. Extensive experiments on various benchmarks demonstrate the effectiveness of EWC-LoRA, achieving a stability-plasticity trade-off superior to existing low-rank CL approaches. These results indicate that, even under low-rank parameterizations, weight regularization remains an effective mechanism for mitigating task interference. Code is available at: https://github.com/yaoyz96/low-rank-cl.
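
The EWC penalty on a shared low-rank update is compact enough to sketch directly. Below, the Fisher values are random placeholders (in practice they would be estimated from previous-task gradients), and the shapes and `lam` are assumptions; this shows only the penalty, not EWC-LoRA's importance estimation over the full-dimensional space.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=100.0):
    # Elastic Weight Consolidation: a quadratic pull toward the previous
    # task's solution, weighted by diagonal Fisher importance estimates.
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))

# The shared low-rank update is W + B @ A; only the (A, B) parameters are
# regularized, so cost is constant regardless of the number of tasks.
d, r = 64, 8
rng = np.random.default_rng(0)
A_old, B_old = rng.standard_normal((r, d)), rng.standard_normal((d, r))
A = A_old + 0.01 * rng.standard_normal((r, d))    # drifted on the new task
theta = np.concatenate([A.ravel(), B_old.ravel()])
theta_old = np.concatenate([A_old.ravel(), B_old.ravel()])
fisher = rng.uniform(0.0, 1.0, theta.size)        # placeholder importances
print("EWC penalty:", round(ewc_penalty(theta, theta_old, fisher), 4))
```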

[346] Be Wary of Your Time Series Preprocessing

Sofiane Ennadir, Tianze Wang, Oleg Smirnov, Sahar Asadi, Lele Cao

Main category: cs.LG

TL;DR: Theoretical analysis of normalization strategies’ impact on Transformer expressivity for time series, showing no single method consistently outperforms others and sometimes no normalization works best.

DetailsMotivation: Normalization and scaling are fundamental preprocessing steps in time series modeling, but their role in Transformer-based models remains underexplored from a theoretical perspective, despite their critical importance.

Method: Proposes a novel expressivity framework for time series that quantifies model’s ability to distinguish between similar/dissimilar inputs, derives theoretical bounds for Standard and Min-Max scaling, and validates empirically on classification and forecasting benchmarks using multiple Transformer models.

Result: No single normalization method consistently outperforms others; in some cases, omitting normalization entirely leads to superior performance. Choice of normalization strategy significantly influences model’s representational capacity depending on task and data characteristics.

Conclusion: Highlights critical role of preprocessing in time series learning and motivates need for more principled normalization strategies tailored to specific tasks and datasets.

Abstract: Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model’s ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can significantly influence the model’s representational capacity, depending on the task and data characteristics. We complement our theory with empirical validation on classification and forecasting benchmarks using multiple Transformer-based models. Our results show that no single normalization method consistently outperforms others, and in some cases, omitting normalization entirely leads to superior performance. These findings highlight the critical role of preprocessing in time series learning and motivate the need for more principled normalization strategies tailored to specific tasks and datasets.
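
To make the two analyzed preprocessing choices concrete, here is a hedged NumPy sketch of instance-based versus global Standard and Min-Max scaling (the axis convention and epsilon are illustrative assumptions, not the paper's code):

```python
import numpy as np

def standard_scale(x, per_instance=True):
    # x: (batch, length); per_instance=True normalizes each series by
    # its own statistics, False uses one global mean/std for the batch
    axis = 1 if per_instance else None
    mu = x.mean(axis=axis, keepdims=per_instance)
    sd = x.std(axis=axis, keepdims=per_instance)
    return (x - mu) / (sd + 1e-8)

def minmax_scale(x, per_instance=True):
    # Min-Max scaling to [0, 1] under the same instance/global switch
    axis = 1 if per_instance else None
    lo = x.min(axis=axis, keepdims=per_instance)
    hi = x.max(axis=axis, keepdims=per_instance)
    return (x - lo) / (hi - lo + 1e-8)
```

The paper's finding suggests treating this switch, along with the no-normalization option, as a tunable choice per task rather than a fixed default.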

[347] Canonicalizing Multimodal Contrastive Representation Learning

Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, Vikas Garg

Main category: cs.LG

TL;DR: Multimodal contrastive models trained independently with different architectures and data distributions exhibit a systematic geometric relationship: their embedding spaces are related by a single orthogonal transformation that simultaneously aligns both image and text encoders.

DetailsMotivation: While independently trained models often develop similar similarity notions, establishing explicit correspondences between representation spaces is crucial for multimodal models where consistency must hold within and across modalities. The paper investigates whether systematic geometric relationships exist between embedding spaces of independently trained multimodal contrastive models.

Method: Theoretical analysis and empirical validation across model families (CLIP, SigLIP, FLAVA) showing that if multimodal kernels agree on a small anchor set, then models must be related by a single orthogonal map. The method proves existence of orthogonal transformation Q such that representations align across independently trained models.

Result: Across different model families, the geometric relationship is well approximated by an orthogonal map (up to global mean shift). The same orthogonal transformation simultaneously aligns both image and text encoders across models, demonstrating consistent cross-modal alignment.

Conclusion: Independently trained multimodal contrastive models exhibit systematic geometric relationships via orthogonal transformations, enabling backward-compatible model upgrades without costly re-embedding and having implications for representation privacy.

Abstract: As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$) – trained on different distributions and with different architectures – does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift), i.e., there exists an orthogonal map $Q$ where $Q^\top Q = I$ such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders, i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set, i.e., $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$ and the same $Q$ maps images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/
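
A standard way to recover such a $Q$ from paired embeddings is the orthogonal Procrustes problem; the sketch below is the generic SVD solution, under the assumption (stated in the abstract) that both embedding sets are first mean-centered, and is not the authors' released code:

```python
import numpy as np

def fit_orthogonal_map(F, F_tilde):
    # F, F_tilde: (n, d) mean-centered embeddings of the same n inputs
    # from two models; returns orthogonal Q with F_tilde[i] ~ Q @ F[i].
    U, _, Vt = np.linalg.svd(F.T @ F_tilde)
    R = U @ Vt          # argmin over orthogonal R of ||F R - F_tilde||_F
    return R.T          # so that Q @ f(x) approximates f_tilde(x)
```

Per the paper's central claim, a $Q$ fitted on paired image embeddings should then also approximately map $g(y)$ to $\widetilde{g}(y)$ for text.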

[348] Asymptotic Smoothing of the Lipschitz Loss Landscape in Overparameterized One-Hidden-Layer ReLU Networks

Saveliy Baturin

Main category: cs.LG

TL;DR: Theoretical and empirical study showing that overparameterized ReLU networks have flatter loss landscapes with vanishing energy gaps between local and global minima as width increases, making sublevel sets connected.

DetailsMotivation: Understanding the topological properties of loss landscapes in neural networks, particularly how overparameterization affects connectivity between solutions and the existence of barriers between local minima.

Method: Theoretical analysis proving connectivity of solutions for convex Lipschitz losses with ℓ₁-regularization, and empirical evaluation using Dynamic String Sampling (DSS) to measure energy gaps on synthetic Moons dataset and Wisconsin Breast Cancer dataset.

Result: Theoretical proof that any two models at same loss level can be connected with arbitrarily small loss increase; asymptotic bound showing energy gap vanishes as width grows. Empirical results show wider networks have smaller energy gaps, with permutation test yielding p=0 for maximum gap reduction.

Conclusion: Overparameterization flattens loss landscapes and connects sublevel sets, explaining why gradient methods succeed in finding good solutions despite non-convexity.

Abstract: We study the topology of the loss landscape of one-hidden-layer ReLU networks under overparameterization. On the theory side, we (i) prove that for convex $L$-Lipschitz losses with an $\ell_1$-regularized second layer, every pair of models at the same loss level can be connected by a continuous path within an arbitrarily small loss increase $\varepsilon$ (extending a known result for the quadratic loss); (ii) obtain an asymptotic upper bound on the energy gap $\varepsilon$ between local and global minima that vanishes as the width $m$ grows, implying that the landscape flattens and sublevel sets become connected in the limit. Empirically, on a synthetic Moons dataset and on the Wisconsin Breast Cancer dataset, we measure pairwise energy gaps via Dynamic String Sampling (DSS) and find that wider networks exhibit smaller gaps; in particular, a permutation test on the maximum gap yields $p_{perm}=0$, indicating a clear reduction in the barrier height.

[349] Unlocking [CLS] Features for Continual Post-Training

Murat Onur Yildirim, Elif Ceren Gok Yildirim, Joaquin Vanschoren

Main category: cs.LG

TL;DR: TOSCA introduces a parameter-efficient continual learning method using sparse adapter-calibrator modules on the [CLS] token to balance stability-plasticity trade-off in foundation models.

DetailsMotivation: Address the stability-plasticity trade-off in continual learning for foundation models, where excessive plasticity causes forgetting and excessive stability limits adaptation, requiring minimal functional modifications.

Method: Introduces LuCA (Learn and Calibrate) adapter-calibrator modules, then deploys sparse LuCA modules on the last [CLS] token only (TOSCA), leaving foundation model intact while adapting via token-level sparse calibration.

Result: Achieves state-of-the-art performance with ~8 times fewer parameters compared to prior methods, reducing both training and inference complexity.

Conclusion: TOSCA effectively balances stability and plasticity in continual learning for foundation models through token-level sparse adaptation with minimal parameter overhead.

Abstract: Continual learning requires models to integrate new classes or domains over time while preserving previously acquired knowledge. Within this paradigm, foundation models often achieve strong performance, but they still remain subject to the stability-plasticity trade-off, where excessive plasticity leads to forgetting of prior knowledge, and excessive stability constrains the adaptation. This necessitates an effective post-training strategy that introduces minimal yet functional modifications. To address this challenge, we first introduce a new parameter-efficient fine-tuning module ‘Learn and Calibrate’, or LuCA, designed to acquire task-specific knowledge through an adapter-calibrator couple, enabling well-refined feature representations. Then, for each task, we deploy a sparse LuCA module on top of the last classification token [CLS] just before the classifier, which we refer to as ‘Token-level Sparse Calibration and Adaptation’, or TOSCA. By leaving the generalization capabilities of the foundation models intact and adapting exclusively via the last token, our approach achieves a harmonious balance between stability and plasticity while reducing both training and inference complexity. We demonstrate that TOSCA yields state-of-the-art performance while introducing ~8 times fewer parameters compared to prior methods.

[350] Towards Anytime-Valid Statistical Watermarking

Baihe Huang, Eric Xu, Kannan Ramchandran, Jiantao Jiao, Michael I. Jordan

Main category: cs.LG

TL;DR: Anchored E-Watermarking: A new watermarking framework for LLMs that combines optimal sampling with anytime-valid inference using e-values and test supermartingales.

DetailsMotivation: Existing watermarking methods for distinguishing machine-generated content from human text have two critical limitations: lack of principled approach for selecting sampling distributions and reliance on fixed-horizon hypothesis testing that precludes valid early stopping.

Method: Develops an e-value-based watermarking framework called Anchored E-Watermarking that unifies optimal sampling with anytime-valid inference. Uses an anchor distribution to approximate the target model, constructs test supermartingales for detection, and characterizes optimal e-value with respect to worst-case log-growth rate.

Result: Theoretical claims substantiated by simulations and evaluations on established benchmarks show the framework can significantly enhance sample efficiency, reducing average token budget required for detection by 13-15% relative to state-of-the-art baselines.

Conclusion: The proposed Anchored E-Watermarking framework addresses key limitations of existing methods by providing principled sampling distribution selection and valid anytime-inference capabilities for detecting machine-generated text.

Abstract: The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid anytime inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
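
The anytime-validity rests on a generic e-process argument: multiply per-observation e-values and stop whenever the running product crosses $1/\alpha$, which Ville's inequality bounds at Type-I error $\alpha$. A minimal sketch of that skeleton (not the paper's anchored e-value construction, which is its main contribution):

```python
def anytime_e_test(e_values, alpha=0.05):
    # The running product of e-values forms a test supermartingale under
    # H0 (no watermark); by Ville's inequality,
    # P(sup_t M_t >= 1/alpha | H0) <= alpha, so stopping whenever the
    # threshold is crossed keeps the Type-I error guarantee valid.
    m = 1.0
    for t, e in enumerate(e_values, start=1):
        m *= e
        if m >= 1.0 / alpha:
            return t        # detection declared after t tokens
    return None             # threshold never crossed: no detection
```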

[351] Guarding the Middle: Protecting Intermediate Representations in Federated Split Learning

Obaidullah Zaland, Sajib Mistry, Monowar Bhuyan

Main category: cs.LG

TL;DR: KD-UFSL: A privacy-preserving federated split learning framework combining k-anonymity and differential privacy to protect client data while maintaining model utility.

DetailsMotivation: Federated split learning enables decentralized training but exposes client data through intermediate representations (smashed data). Need privacy-preserving methods that balance privacy and utility for large-scale big data applications.

Method: Proposes k-anonymous differentially private UFSL (KD-UFSL) that combines microaggregation (k-anonymity) and differential privacy to protect intermediate representations. Demonstrates data-reconstruction attacks on smashed data and then applies privacy-enhancing techniques to mitigate risks.

Result: KD-UFSL increases mean squared error between actual and reconstructed images by up to 50% and decreases structural similarity by up to 40% on four benchmarking datasets. Maintains global model utility while improving privacy.

Conclusion: KD-UFSL effectively balances privacy and utility in federated split learning, making it suitable for large-scale big data applications where data privacy is critical.

Abstract: Big data scenarios, where massive, heterogeneous datasets are distributed across clients, demand scalable, privacy-preserving learning methods. Federated learning (FL) enables decentralized training of machine learning (ML) models across clients without data centralization. Decentralized training, however, introduces a computational burden on client devices. U-shaped federated split learning (UFSL) offloads a fraction of the client computation to the server while keeping both data and labels on the clients’ side. However, the intermediate representations (i.e., smashed data) shared by clients with the server are prone to exposing clients’ private data. To reduce exposure of client data through intermediate data representations, this work proposes k-anonymous differentially private UFSL (KD-UFSL), which leverages privacy-enhancing techniques such as microaggregation and differential privacy to minimize data leakage from the smashed data transferred to the server. We first demonstrate that an adversary can access private client data from intermediate representations via a data-reconstruction attack, and then present a privacy-enhancing solution, KD-UFSL, to mitigate this risk. Our experiments indicate that, alongside increasing the mean squared error between the actual and reconstructed images by up to 50% in some cases, KD-UFSL also decreases the structural similarity between them by up to 40% on four benchmarking datasets. More importantly, KD-UFSL improves privacy while preserving the utility of the global model. This highlights its suitability for large-scale big data applications where privacy and utility must be balanced.

[352] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

Main category: cs.LG

TL;DR: RPG (Regularized Policy Gradient) provides a unified framework for KL-regularized policy gradient methods in LLM reasoning, addressing KL direction, normalization, and off-policy estimation issues with theoretical clarity and practical improvements.

DetailsMotivation: KL regularization is widely used in policy gradient algorithms for LLMs, but there's confusion about KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimators, especially in off-policy settings. The paper aims to provide theoretical clarity and practical improvements.

Method: Develops Regularized Policy Gradient (RPG) view with unified derivation that: (1) unifies normalized/unnormalized KL variants, (2) specifies conditions for gradient-equivalent REINFORCE-style losses, (3) corrects off-policy importance-weighting mismatch in GRPO, and (4) introduces RPG-Style Clip for stable off-policy training.

Result: RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 percentage points over DAPO on mathematical reasoning benchmarks (AIME24, AIME25). At 8K context length, achieves 52% accuracy on AIME25, surpassing Qwen3-4B-Instruct (47%).

Conclusion: RPG provides a stable and scalable RL algorithm for LLM reasoning through KL-correct objective, clipped importance sampling, and iterative reference-policy updates, offering theoretical unification and practical performance improvements.

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet its design surface, namely the choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO’s KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. We extend our experiments to 8K context length, and RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme. Project Page: https://github.com/complex-reasoning/RPG.
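
For reference, the $k_3$ estimator the abstract discusses (from Schulman's KL-approximation note) can be written in a few lines; this is the generic estimator, not RPG's corrected off-policy weighting:

```python
import torch

def k3_penalty(logp_policy, logp_ref):
    # Estimates KL(pi || pi_ref) from samples x ~ pi:
    # with r = pi_ref(x) / pi(x), k3 = r - 1 - log r, which is
    # nonnegative per-sample and unbiased in expectation.
    log_r = logp_ref - logp_policy
    return log_r.exp() - 1.0 - log_r
```

RPG's point is that once samples come from a stale behavior policy, this term must carry the right importance weights for the surrogate's gradient to match the intended KL-regularized objective.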

[353] Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han

Main category: cs.LG

TL;DR: VCPO stabilizes asynchronous RL training for language models by controlling policy-gradient variance through effective sample size scaling and minimum-variance baselines.

DetailsMotivation: Asynchronous RL training increases throughput but causes high variance in critic-free policy-gradient methods like REINFORCE/GRPO due to stale rollouts and heavy-tailed importance ratios, leading to noisy gradients and unstable learning.

Method: VCPO uses two key techniques: (1) scales learning rate based on effective sample size to dampen unreliable updates, and (2) applies a closed-form minimum-variance baseline for off-policy settings without needing auxiliary value models.

Result: VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming various baselines, reduces training time by 2.5× while matching synchronous performance.

Conclusion: Explicit control of policy-gradient variance is crucial for reliable asynchronous RL at scale, and VCPO provides an effective stabilization method for REINFORCE/GRPO-style algorithms.

Abstract: Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5× while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
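
The diagnostic at the heart of ingredient (i) is easy to state concretely; a hedged sketch of the normalized effective sample size over a batch of importance ratios (the exact learning-rate schedule built on top of it is the paper's, not shown):

```python
import torch

def ess_fraction(log_ratios):
    # log_ratios: log( pi_current(a|s) / pi_behavior(a|s) ) per sample.
    # Normalized ESS = (sum w)^2 / (N * sum w^2): close to 1 when near
    # on-policy, close to 0 when a few heavy-tailed weights dominate.
    w = log_ratios.exp()
    n = w.numel()
    return w.sum().pow(2) / (n * w.pow(2).sum())
```

Scaling the base learning rate by this fraction is one direct way to dampen updates from heavily off-policy (stale) batches.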

[354] Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning

Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan

Main category: cs.LG

TL;DR: OSI-FL is a one-shot incremental federated learning framework that uses vision-language models and diffusion models to address communication overhead and catastrophic forgetting in federated learning with incremental data.

DetailsMotivation: Federated learning faces challenges with communication overhead and catastrophic forgetting when dealing with incremental data streams in privacy-sensitive, large-scale systems. Traditional FL requires multiple communication rounds and struggles with incremental learning scenarios.

Method: OSI-FL uses frozen vision-language models to extract category-specific embeddings from clients in a single communication round. A pre-trained diffusion model at the server synthesizes data similar to client distributions. Selective Sample Retention (SSR) identifies and retains the most informative samples per category-task pair to mitigate catastrophic forgetting.

Result: OSI-FL outperforms traditional and one-shot FL baselines in both class-incremental and domain-incremental scenarios across three benchmark datasets, effectively addressing communication overhead and catastrophic forgetting.

Conclusion: OSI-FL provides an effective solution for incremental federated learning by combining vision-language models, diffusion models, and selective sample retention to handle communication constraints and catastrophic forgetting simultaneously.

Abstract: Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, derived by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client’s data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks arriving incrementally need to retrain the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
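
A hedged sketch of the SSR selection step, reading "most informative" as highest per-sample loss (the abstract does not pin down the criterion, so that reading is an assumption), for one category-task pair:

```python
import torch

def selective_sample_retention(sample_losses, top_p=0.1):
    # Keep the top-p fraction of samples with the largest loss for one
    # category-task pair; OSI-FL replays the retained samples in later
    # training iterations to bound forgetting.
    k = max(1, int(top_p * sample_losses.numel()))
    return sample_losses.topk(k).indices
```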

[355] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Nathan S. de Lara, Florian Shkurti

Main category: cs.LG

TL;DR: SMAC is an offline RL method that learns actor-critics that can smoothly transition to online fine-tuning without performance drops by regularizing Q-functions to respect derivative relationships between policy scores and Q-function gradients.

DetailsMotivation: Current offline RL methods produce performant actor-critics, but when fine-tuned online with value-based algorithms, they suffer immediate performance drops due to loss landscape valleys between offline and online maxima.

Method: SMAC regularizes the Q-function during offline training to respect a first-order derivative equality between the policy’s score function and the action-gradient of the Q-function, ensuring offline maxima are connected to better online maxima via monotonic reward paths.

Result: SMAC achieves smooth transfer to online algorithms (Soft Actor-Critic and TD3) in 6/6 D4RL tasks, reducing regret by 34-58% over best baselines in 4/6 environments.

Conclusion: SMAC successfully addresses the offline-to-online fine-tuning problem by learning actor-critics that can transition to online RL without performance drops through careful regularization of Q-functions.

Abstract: Modern offline Reinforcement Learning (RL) methods find performant actor-critics, however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
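
The regularizer pins down a first-order identity that holds for soft (entropy-regularized) policies, $\pi(a \mid s) \propto \exp(Q(s,a)/\alpha)$, under which $\nabla_a Q(s,a) = \alpha \, \nabla_a \log \pi(a \mid s)$. A hedged autograd sketch of such a penalty (the interfaces `q_net` and `policy_dist_fn` are assumptions, not the paper's API):

```python
import torch

def score_match_penalty(q_net, policy_dist_fn, s, a, alpha=1.0):
    # Penalize || grad_a Q(s, a) - alpha * grad_a log pi(a|s) ||^2 so the
    # critic's action-gradient agrees with the policy's score function.
    a = a.detach().requires_grad_(True)
    dq = torch.autograd.grad(q_net(s, a).sum(), a, create_graph=True)[0]
    logp = policy_dist_fn(s).log_prob(a).sum()
    dlogp = torch.autograd.grad(logp, a, create_graph=True)[0]
    return (dq - alpha * dlogp).pow(2).sum(dim=-1).mean()
```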

[356] When to Trust the Cheap Check: Weak and Strong Verification for Reasoning

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

Main category: cs.LG

TL;DR: Paper formalizes weak-strong verification policies for LLMs, showing optimal policies have two-threshold structure and developing online algorithm to control acceptance/rejection errors.

DetailsMotivation: LLM reasoning increasingly occurs within verification loops with cheap internal checks (weak verification) and costly user feedback (strong verification). Need to formalize when to accept outputs based on weak verification vs. defer to strong verification given their different cost-reliability trade-offs.

Method: Formalizes weak-strong verification policies, introduces metrics for incorrect acceptance/rejection and strong-verification frequency. Shows optimal policies have two-threshold structure, analyzes how calibration and sharpness affect weak verifier value. Develops online algorithm that provably controls acceptance/rejection errors without assumptions on query stream, LLM, or weak verifier.

Result: Theoretical framework shows optimal verification policies admit two-threshold structure. Practical online algorithm developed that can control error rates without requiring distributional assumptions.

Conclusion: Provides formal framework for weak-strong verification in LLM systems, enabling principled decision-making about when to trust weak verification vs. defer to costly human feedback, with provable error control guarantees.

Abstract: Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak–strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. At the population level, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
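
The two-threshold structure is easy to state concretely; a minimal sketch, with `weak_score` standing in for whatever confidence the cheap check emits and the thresholds left as tunable assumptions:

```python
def two_threshold_policy(weak_score, t_low=0.2, t_high=0.9):
    # Accept when the weak verifier is confidently positive, reject when
    # confidently negative, and pay for strong verification only in the
    # ambiguous middle band between t_low and t_high.
    if weak_score >= t_high:
        return "accept"
    if weak_score <= t_low:
        return "reject"
    return "defer_to_strong"
```

The paper's online algorithm can then be read as adapting these two thresholds over time so the prescribed acceptance/rejection error rates hold without distributional assumptions.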

[357] Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting

Xinghong Fu, Yanhong Li, Georgios Papaioannou, Yoon Kim

Main category: cs.LG

TL;DR: Reverso introduces efficient time series foundation models using hybrid long convolution + linear RNN layers instead of large transformers, achieving comparable zero-shot forecasting performance with 100x smaller models.

DetailsMotivation: Current time series foundation models have scaled to hundreds of millions of parameters like in language/vision domains, but these large transformer models are inefficient and expensive for practical use. There's a need for more efficient models that maintain performance.

Method: Proposes hybrid models combining long convolution and linear RNN layers (DeltaNet layers) instead of large transformers. Uses data augmentation and inference strategies to improve performance. Creates Reverso family of models with this architecture.

Result: Small hybrid models match performance of larger transformer-based models while being more than 100 times smaller. Reverso models significantly push the performance-efficiency Pareto frontier for zero-shot time series forecasting.

Conclusion: Large-scale transformers are not necessary for time series foundation models. Efficient hybrid architectures with long convolution and linear RNN layers can achieve comparable zero-shot forecasting performance with dramatically reduced model size and computational cost.

Abstract: Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.

[358] Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang

Main category: cs.LG

TL;DR: MPO is a multimodal prompt optimization framework that jointly optimizes text and non-text prompts for MLLMs, outperforming text-only methods across images, videos, and molecules.

DetailsMotivation: Current prompt optimization methods are limited to text-only approaches, which restricts the full potential of multimodal LLMs that can process images, videos, and other modalities. There's a need to extend prompt optimization to the multimodal space.

Method: Proposes Multimodal Prompt Optimizer (MPO) - a unified framework that performs joint optimization of multimodal prompts through alignment-preserving updates and uses Bayesian-based selection strategy to guide candidate prompt selection based on earlier evaluations.

Result: MPO outperforms leading text-only optimization methods across diverse modalities including images, videos, and molecules, demonstrating the effectiveness of multimodal prompt optimization.

Conclusion: Multimodal prompt optimization is crucial for realizing the full potential of MLLMs, and MPO provides an effective framework for optimizing both textual and non-textual prompts across various modalities.

Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.

[359] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi

Main category: cs.LG

TL;DR: π-Flow introduces policy-based flow models that use network-free policies to predict dynamic flow velocities for fast ODE integration, achieving state-of-the-art results in few-step diffusion/flow models with improved quality-diversity trade-off.

DetailsMotivation: Existing few-step diffusion/flow models suffer from a quality-diversity trade-off due to format mismatch when distilling teacher velocity predictions into student shortcut predictions, leading to complex distillation procedures.

Method: π-Flow modifies student flow model output to predict network-free policies that generate dynamic flow velocities at future substeps, enabling fast ODE integration without extra network evaluations. Uses imitation distillation with ℓ₂ flow matching loss to match policy’s velocity to teacher’s along policy’s trajectory.

Result: Achieves 1-NFE FID of 2.85 on ImageNet 256², outperforming previous 1-NFE models of same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, achieves substantially better diversity than state-of-the-art DMD models while maintaining teacher-level quality.

Conclusion: π-Flow provides a stable and scalable training approach for few-step generative models that avoids the quality-diversity trade-off through policy-based flow modeling and imitation distillation.

Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy’s ODE trajectory to the teacher’s, we introduce a novel imitation distillation approach, which matches the policy’s velocity to the teacher’s along the policy’s trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher’s behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.

[360] FAMOSE: A ReAct Approach to Automated Feature Discovery

Keith Burghardt, Jienan Liu, Sadman Sakib, Yuning Hao, Bo Li

Main category: cs.LG

TL;DR: FAMOSE is an AI agent framework using ReAct paradigm for automated feature engineering in tabular data, achieving state-of-the-art performance on regression and near state-of-the-art on classification tasks.

DetailsMotivation: Feature engineering is a critical bottleneck in machine learning for tabular data, requiring substantial domain expertise to identify optimal features from exponentially large feature spaces. Current automated approaches lack the ability to intelligently explore and refine features.

Method: FAMOSE uses the ReAct (Reasoning + Acting) paradigm to create an autonomous agent that explores, generates, and refines features. It integrates feature selection and evaluation tools within an agent architecture, allowing iterative feature discovery and evaluation steps recorded in the LLM context window.

Result: FAMOSE achieves state-of-the-art for regression tasks (reducing RMSE by 2.0% on average) and near state-of-the-art for classification tasks (especially with >10K instances, increasing ROC-AUC by 0.23% on average). It demonstrates robustness to errors compared to other algorithms.

Conclusion: AI agents using ReAct paradigm are effective for inventive problem-solving like feature engineering. The iterative discovery-evaluation process, similar to few-shot prompting, guides LLMs to create better features by learning from what works and doesn’t work.

Abstract: Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that leverages the ReAct paradigm to autonomously explore, generate, and refine features while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE represents the first application of an agentic ReAct framework to automated feature engineering, especially for both regression and classification tasks. Extensive experiments demonstrate that FAMOSE is at or near the state-of-the-art on classification tasks (especially tasks with more than 10K instances, where ROC-AUC increases 0.23% on average), and achieves the state-of-the-art for regression tasks by reducing RMSE by 2.0% on average, while remaining more robust to errors than other algorithms. We hypothesize that FAMOSE’s strong performance is because ReAct allows the LLM context window to record (via iterative feature discovery and evaluation steps) what features did or did not work. This is similar to a few-shot prompt and guides the LLM to invent better, more innovative features. Our work offers evidence that AI agents are remarkably effective in solving problems that require highly inventive solutions, such as feature engineering.

[361] MARS: Margin-Aware Reward-Modeling with Self-Refinement

Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon

Main category: cs.LG

TL;DR: MARS: Margin-Aware Augmentation and Sampling Strategy for reward modeling that focuses augmentation on ambiguous preference pairs where reward models are uncertain, improving training efficiency and robustness.

DetailsMotivation: Training reliable reward models for alignment pipelines (RLHF/RLAIF) requires costly human-labeled preference data. Existing augmentation approaches are agnostic to reward model's estimation difficulty, motivating the need for adaptive strategies that target failure modes.

Method: Proposes MARS framework that concentrates augmentation on low-margin (ambiguous) preference pairs where reward model is most uncertain, and iteratively refines training distribution via hard-sample augmentation.

Result: Theoretical guarantees show the strategy increases average curvature of loss function, enhances information, and improves conditioning. Empirical results demonstrate consistent gains over uniform augmentation for robust reward modeling.

Conclusion: MARS provides an effective adaptive augmentation strategy for reward modeling that targets ambiguous cases, improving training efficiency and robustness in alignment pipelines.

Abstract: Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model’s estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets ambiguous cases and failure modes of the reward model. Our proposed framework, MARS, concentrates augmentation on low-margin (ambiguous) preference pairs where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing information and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
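
A hedged sketch of the selection step, assuming a Bradley-Terry-style reward model has already scored both responses of each preference pair (variable names are illustrative):

```python
import torch

def low_margin_indices(r_chosen, r_rejected, k):
    # The margin |r(chosen) - r(rejected)| is smallest exactly where the
    # reward model is least certain; MARS concentrates its augmentation
    # budget on these k most ambiguous pairs.
    margin = (r_chosen - r_rejected).abs()
    return margin.topk(k, largest=False).indices
```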

[362] A.R.I.S.: Automated Recycling Identification System for E-Waste Classification Using Deep Learning

Dhruv Talwar, Harsh Desai, Wendong Yin, Goutam Mohanty, Rafael Reveles

Main category: cs.LG

TL;DR: A.R.I.S. is a low-cost portable sorter for shredded e-waste that uses YOLOx deep learning model to classify metals, plastics, and circuit boards in real-time, achieving high accuracy and improving material recovery efficiency.

DetailsMotivation: Traditional electronic recycling processes suffer from significant resource loss due to inadequate material separation and identification capabilities, limiting material recovery efficiency.

Method: The system employs a YOLOx model to classify metals, plastics, and circuit boards in real time, achieving low inference latency with high detection accuracy, integrated with established sorting methods.

Result: Experimental evaluation yielded 90% overall precision, 82.2% mean average precision (mAP), and 84% sortation purity, demonstrating effective material classification and separation.

Conclusion: A.R.I.S. enhances material recovery efficiency and lowers barriers to advanced recycling adoption, supporting broader initiatives in extending product life cycles and reducing environmental impact.

Abstract: Traditional electronic recycling processes suffer from significant resource loss due to inadequate material separation and identification capabilities, limiting material recovery. We present A.R.I.S. (Automated Recycling Identification System), a low-cost, portable sorter for shredded e-waste that addresses this efficiency gap. The system employs a YOLOx model to classify metals, plastics, and circuit boards in real time, achieving low inference latency with high detection accuracy. Experimental evaluation yielded 90% overall precision, 82.2% mean average precision (mAP), and 84% sortation purity. By integrating deep learning with established sorting methods, A.R.I.S. enhances material recovery efficiency and lowers barriers to advanced recycling adoption. This work complements broader initiatives in extending product life cycles, supporting trade-in and recycling programs, and reducing environmental impact across the supply chain.

[363] Multi-Round Human-AI Collaboration with User-Specified Requirements

Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas

Main category: cs.LG

TL;DR: A principled framework for ensuring conversational AI improves human decision quality through counterfactual harm prevention and complementarity enforcement with user-defined rules and online guarantees.

DetailsMotivation: As humans increasingly rely on conversational AI for high-stakes decisions, there's a need for principled frameworks to ensure these interactions reliably improve decision quality without undermining human strengths.

Method: Formalizes counterfactual harm prevention and complementarity principles via user-defined rules, then introduces an online, distribution-free algorithm with finite sample guarantees that enforces these constraints over collaboration dynamics.

Result: The framework maintains prescribed violation rates even under nonstationary interaction dynamics, and tightening/loosening constraints produces predictable shifts in downstream human accuracy, confirming the principles serve as practical levers for steering collaboration.

Conclusion: The proposed framework provides a principled way to ensure conversational AI interactions improve human decision quality without needing to model or constrain human behavior, using counterfactual harm and complementarity as practical steering mechanisms.

Abstract: As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.

[364] On the Existence and Behavior of Secondary Attention Sinks

Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao

Main category: cs.LG

TL;DR: The paper identifies and analyzes “secondary attention sinks” - tokens that receive disproportionate attention in middle layers of transformer models, differing from primary sinks (like BOS tokens) in their emergence patterns and properties.

DetailsMotivation: Prior work identified attention sinks (like BOS tokens) that receive disproportionate attention, but this work discovers a new class of "secondary sinks" with fundamentally different properties that emerge in middle layers and have variable persistence.

Method: Extensive experiments across 11 model families analyzing where secondary sinks appear, their properties, formation mechanisms, and impact on attention. Investigates MLP modules in middle layers that create these sinks and their relationship to primary sinks.

Result: Secondary sinks are formed by specific middle-layer MLP modules that map token representations to align with primary sink directions. Their ℓ₂-norm determines sink score and persistence. Primary sinks weaken in middle layers as secondary sinks emerge. Larger models show more deterministic sink patterns with identifiable “sink levels” (3 in QwQ-32B, 6 in Qwen3-14B).

Conclusion: The paper reveals a new class of attention sinks with distinct properties from previously studied primary sinks, providing deeper understanding of attention mechanisms in transformers, especially in middle layers where complex interactions occur.

Abstract: Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. In contrast, we find secondary sinks that arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.
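
As a starting point for reproducing such analyses, one common diagnostic (not necessarily the paper's exact sink-score definition) scores each key token by the average attention mass it receives in a layer:

```python
import torch

def sink_scores(attn_weights):
    # attn_weights: (num_heads, q_len, k_len) post-softmax attention for
    # one layer; returns the mean mass each key token receives across
    # heads and query positions. Tokens with outsized scores in middle
    # layers are candidates for the paper's secondary sinks.
    return attn_weights.mean(dim=(0, 1))
```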

[365] Defining and Evaluating Physical Safety for Large Language Models

Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho

Main category: cs.LG

TL;DR: LLM physical safety benchmark for drone control reveals utility-safety tradeoff, with larger models showing better safety but advanced prompting still struggling with unintentional attacks.

DetailsMotivation: Address the unexplored risks of LLMs causing physical threats and harm in real-world robotic applications, specifically focusing on drone control safety evaluation.

Method: Developed comprehensive benchmark classifying drone safety risks into four categories: human-targeted threats, object-targeted threats, infrastructure attacks, and regulatory violations. Evaluated mainstream LLMs and tested prompt engineering techniques like In-Context Learning and Chain-of-Thought.

Result: Found undesirable trade-off between utility and safety, with code-generation-focused models performing poorly on safety. Larger models demonstrate better safety capabilities, especially in refusing dangerous commands. Advanced prompting improves safety but still struggles with unintentional attacks.

Conclusion: The benchmark facilitates design and evaluation of LLM physical safety, highlighting the need for better safety mechanisms in LLM-controlled robotic systems.

Abstract: Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at huggingface.co/spaces/TrustSafeAI/LLM-physical-safety.

[366] Point-DeepONet: Predicting Nonlinear Fields on Non-Parametric Geometries under Variable Load Conditions

Jangseop Park, Namwoo Kang

Main category: cs.LG

TL;DR: Point-DeepONet integrates PointNet with DeepONet to create a fast surrogate model for 3D structural analysis, predicting displacement and stress fields from point cloud geometries and variable loads.

DetailsMotivation: Traditional finite element simulations are computationally expensive for design optimization and real-time control. Existing deep learning surrogates struggle with complex 3D geometries and varying load conditions.

Method: Combines PointNet for learning geometric representations from raw point clouds with DeepONet architecture to fuse geometric embeddings with load conditions for predicting 3D displacement and von Mises stress fields.

Result: Achieves R² of 0.987 for displacement and 0.923 for von Mises stress, with 400x speedup over finite element analysis (seconds vs 19.32 minutes). Maintains accuracy on unseen load directions.

Conclusion: Point-DeepONet enables rapid, high-fidelity structural analysis for complex engineering workflows, offering excellent scalability and generalization capabilities.

Abstract: Nonlinear structural analyses in engineering often require extensive finite element simulations, limiting their applicability in design optimization and real-time control. Conventional deep learning surrogates often struggle with complex, non-parametric three-dimensional (3D) geometries and directionally varying loads. This work presents Point-DeepONet, an operator-learning-based surrogate that integrates PointNet into the DeepONet framework to learn a mapping from non-parametric geometries and variable load conditions to physical response fields. By leveraging PointNet to learn a geometric representation from raw point clouds, our model circumvents the need for manual parameterization. This geometric embedding is then synergistically fused with load conditions within the DeepONet architecture to accurately predict three-dimensional displacement and von Mises stress fields. Trained on a large-scale dataset, Point-DeepONet demonstrates high fidelity, achieving a coefficient of determination (R^2) reaching 0.987 for displacement and 0.923 for von Mises stress. Furthermore, to rigorously validate its generalization capabilities, we conducted additional experiments on unseen, randomly oriented load directions, where the model maintained exceptional accuracy. Compared to nonlinear finite element analyses that require about 19.32 minutes per case, Point-DeepONet provides predictions in mere seconds (approximately 400 times faster) while maintaining excellent scalability. These findings, validated through extensive experiments and ablation studies, highlight the potential of Point-DeepONet to enable rapid, high-fidelity structural analyses for complex engineering workflows.

[367] Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han

Main category: cs.LG

TL;DR: SISL is a self-improving skill learning method for meta-reinforcement learning that addresses noisy offline demonstrations through decoupled policy improvement and skill prioritization, enabling robust adaptation in long-horizon tasks.

DetailsMotivation: Skill-based meta-RL methods struggle with noisy offline demonstrations, which cause unstable skill learning and degraded performance in long-horizon environments. Existing approaches are highly susceptible to data quality issues.

Method: Proposes Self-Improving Skill Learning (SISL) with: 1) self-guided skill refinement using decoupled high-level and skill improvement policies, and 2) skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories.
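
A hedged sketch of the second component, maximum-return relabeling with prioritized sampling; the estimator interface, weighting scheme, and all names are illustrative assumptions:

```python
# Sketch: relabel each trajectory with the skill whose conditioned return
# estimate is highest, then derive sampling priorities from those returns.
import numpy as np

def prioritize(trajectories, return_estimator, skills, temperature=1.0):
    relabeled, best_returns = [], []
    for tau in trajectories:
        returns = np.array([return_estimator(tau, z) for z in skills])
        best = int(returns.argmax())
        relabeled.append((tau, skills[best]))      # maximum-return relabeling
        best_returns.append(returns[best])
    r = np.array(best_returns)
    w = np.exp((r - r.max()) / temperature)        # stabilized softmax weights
    return relabeled, w / w.sum()                  # sampling priorities
```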

Result: SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks by effectively mitigating the effects of noisy and suboptimal demonstration data.

Conclusion: SISL provides a robust solution for skill-based meta-RL that can handle noisy offline demonstrations, enabling stable skill learning and improved adaptation performance in challenging long-horizon environments.

Abstract: Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://github.com/epsilog/SISL.

[368] Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers

Zander W. Blasingame, Chen Liu

Main category: cs.LG

TL;DR: Rex: A new family of reversible exponential (stochastic) Runge-Kutta solvers for exact inversion of neural differential equations in generative models

DetailsMotivation: Current ODE/SDE solvers for deep generative models accumulate discretization errors that prevent exact inversion, which is crucial for applications requiring precision. Existing inversion methods have poor stability, low-order convergence, and are limited to ODE domain.

Method: Proposes Rex solvers using Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into a reversible one, enabling exact inversion of neural differential equations in generative models.
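
To make the key property concrete, here is an algebraically reversible solver, the reversible Heun scheme of Kidger et al. (2021). This is not Rex itself, whose exponential Lawson construction is more involved; it only illustrates the kind of exact inversion Rex provides:

```python
# Reversible Heun: the inverse step recovers the forward step's inputs
# exactly (up to float round-off), with no nonlinear solve.
import numpy as np

def reversible_heun_step(f, y, y_hat, t, h):
    y_hat_next = 2.0 * y - y_hat + h * f(t, y_hat)
    y_next = y + 0.5 * h * (f(t, y_hat) + f(t + h, y_hat_next))
    return y_next, y_hat_next

def reversible_heun_inverse(f, y_next, y_hat_next, t, h):
    y_hat = 2.0 * y_next - y_hat_next - h * f(t + h, y_hat_next)
    y = y_next - 0.5 * h * (f(t + h, y_hat_next) + f(t, y_hat))
    return y, y_hat
```

Applying `reversible_heun_inverse` to the output of `reversible_heun_step` returns the input pair bit-for-bit up to floating-point round-off, which is exactly the property standard explicit solvers lack.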

Result: Rex solvers enable exact inversion with improved stability and convergence. Demonstrated utility in sampling Boltzmann distributions with flow models and improving image generation/editing with diffusion models.

Conclusion: Rex provides a theoretically sound and practical solution for reversible integration in neural differential equations, addressing key limitations in current generative model inversion methods.

Abstract: Deep generative models based on neural differential equations have quickly become the state-of-the-art for numerous generation tasks across many different applications. These models rely on ODE/SDE solvers which integrate from a prior distribution to the data distribution. In many applications it is highly desirable to then integrate in the other direction. The standard solvers, however, accumulate discretization errors which don’t align with the forward trajectory, thereby prohibiting an exact inversion. In applications where the precision of the generative model is paramount this inaccuracy in inversion is often unacceptable. Current approaches to inverting these models result in significant downstream issues with poor stability and a low order of convergence; moreover, they are strictly limited to the ODE domain. In this work, we propose a new family of reversible exponential (stochastic) Runge-Kutta solvers, which we refer to as Rex, developed by applying Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into a reversible one. In addition to a rigorous theoretical analysis of the proposed solvers, we also empirically demonstrate the utility of Rex in improving the sampling of Boltzmann distributions with flow models and in improving image generation and editing capabilities with diffusion models.

[369] Oversmoothing, Oversquashing, Heterophily, Long-Range, and more: Demystifying Common Beliefs in Graph Machine Learning

Adrian Arnaiz-Rodriguez, Federico Errica

Main category: cs.LG

TL;DR: Critical analysis of common beliefs in graph machine learning regarding oversmoothing, oversquashing, homophily-heterophily dichotomy, and long-range tasks, refuting universal statements with counterexamples to clarify conceptual differences.

DetailsMotivation: The graph ML community has developed commonly accepted beliefs and assumptions (as universal statements) around key topics like oversmoothing, oversquashing, homophily-heterophily dichotomy, and long-range tasks. These beliefs are neither always true nor easy to distinguish, leading to ambiguities and misunderstandings that prevent researchers from addressing precise research questions.

Method: The authors make common beliefs explicit and encourage critical thinking by refuting universal statements via simple yet formally sufficient counterexamples. They analyze key topics in graph ML to clarify conceptual differences.

Result: The paper provides a critical examination of prevailing assumptions in graph ML, offering counterexamples that challenge universal statements about oversmoothing, oversquashing, homophily-heterophily relationships, and long-range task performance.

Conclusion: By exposing and refuting common universal statements in graph ML, the paper aims to help researchers focus on more clearly defined and targeted problems, moving beyond ambiguous assumptions that hinder progress in the field.

Abstract: After a renaissance phase in which researchers revisited the message-passing paradigm through the lens of deep learning, the graph machine learning community shifted its attention towards a deeper and practical understanding of message-passing’s benefits and limitations. In this paper, we notice how the fast pace of progress around the topics of oversmoothing and oversquashing, the homophily-heterophily dichotomy, and long-range tasks, came with the consolidation of commonly accepted beliefs and assumptions – in the form of universal statements – that are neither always true nor easy to distinguish from each other. We argue that this has led to ambiguities around the investigated problems, preventing researchers from focusing on and addressing precise research questions while causing a good deal of misunderstanding. Our contribution is to make such common beliefs explicit and encourage critical thinking around these topics, refuting universal statements via simple yet formally sufficient counterexamples. The end goal is to clarify conceptual differences, helping researchers address more clearly defined and targeted problems.

[370] Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

Main category: cs.LG

TL;DR: SSE is a hierarchical RL framework that improves long-horizon goal-conditioned tasks by separating reachable from unreachable subgoals using Frontier Experience Replay, reducing inefficient high-level planning.

DetailsMotivation: Long-horizon goal-conditioned RL faces challenges with distant goals and sparse rewards. Existing hierarchical/graph-based methods often fail due to subgoal infeasibility from conventional hindsight relabeling, leading to inefficient high-level planning.

Method: Proposes Strict Subgoal Execution (SSE) with Frontier Experience Replay (FER) to separate unreachable from admissible subgoals. FER uses failure and partial-success transitions to identify unreliable subgoals. Also includes decoupled exploration policy and path refinement adjusting edge costs based on low-level failures.
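
A minimal sketch of frontier-style subgoal filtering, assuming failure/attempt counts are accumulated from low-level rollouts; the thresholds and data layout are illustrative assumptions:

```python
# Sketch: prune subgoals whose observed failure rate marks them unreachable.
def admissible_subgoals(attempts, min_tries=5, max_fail_rate=0.5):
    # attempts[g] = (n_failures, n_tries) accumulated from low-level rollouts
    keep = []
    for g, (fails, tries) in attempts.items():
        if tries < min_tries or fails / tries <= max_fail_rate:
            keep.append(g)
    return keep
```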

Result: SSE consistently outperforms existing goal-conditioned and hierarchical RL methods across diverse long-horizon benchmarks in both efficiency and success rate.

Conclusion: SSE effectively addresses subgoal infeasibility in hierarchical RL for long-horizon tasks by improving subgoal reliability and reducing unnecessary high-level decisions through frontier-based filtering.

Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://github.com/Jaebak1996/SSE

[371] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

Shi Yin, Zujian Dai, Xinyang Pan, Lixin He

Main category: cs.LG

TL;DR: NextHAM: A neural E(3)-symmetry transformer model for predicting electronic-structure Hamiltonians with improved generalization across diverse materials.

DetailsMotivation: Current deep learning methods for Hamiltonian prediction struggle with generalization across diverse atomic types and structural patterns due to high-dimensional complexity. There's a need for more universal and efficient approaches that can handle the diversity of materials while maintaining accuracy.

Method: 1) Uses zeroth-step Hamiltonians from DFT initial charge density as informative descriptors and initial estimates, allowing the model to predict correction terms rather than full Hamiltonians. 2) Neural Transformer architecture with strict E(3)-symmetry for equivariance. 3) Novel training objective ensuring accuracy in both real and reciprocal space to prevent error amplification and “ghost states.” 4) Introduces Materials-HAM-SOC dataset with 17,000 material structures spanning 68 elements including SOC effects.
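
The residual idea in point (1) reduces to a one-liner; `equivariant_net` below stands in for the paper's E(3)-symmetric Transformer and is an assumption, not the authors' interface:

```python
# Sketch: the network predicts only a correction to a cheap zeroth-step
# Hamiltonian rather than the full matrix.
import torch

def predict_hamiltonian(equivariant_net, structure, H0):
    # H0: zeroth-step Hamiltonian built from the initial DFT charge density.
    delta_H = equivariant_net(structure, H0)   # learn only the correction
    return H0 + delta_H                        # estimate of the target Hamiltonian
```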

Result: NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures on the Materials-HAM-SOC benchmark, demonstrating strong generalization across diverse materials.

Conclusion: The proposed NextHAM framework advances universal deep learning for Hamiltonian prediction through methodological innovations (zeroth-step Hamiltonians, E(3)-symmetric transformer, dual-space training) and a comprehensive dataset, offering significant improvements in generalization and efficiency.

Abstract: Deep learning methods for electronic-structure Hamiltonian prediction have offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance a universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed from the initial charge density of DFT, as informative descriptors for the neural regression model at the input level and as initial estimates of the target Hamiltonian at the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of “ghost states” caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.

[372] Watermarking Diffusion Language Models

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

Main category: cs.LG

TL;DR: First watermark for diffusion language models (DLMs) that addresses the challenge of non-sequential token generation by applying watermarks in expectation over context and promoting tokens that increase watermark strength.

DetailsMotivation: Diffusion language models (DLMs) generate tokens in arbitrary order unlike autoregressive language models (ARLMs), making existing ARLM watermarking schemes incompatible since they rely on previously generated tokens. There's a need for watermarking tailored to the DLM paradigm.

Method: 1) Apply watermark in expectation over context even when some context tokens are undetermined, 2) Promote tokens that increase watermark strength when used as context for other tokens, while keeping the watermark detector unchanged from ARLM approaches.
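
A toy sketch of idea (1), built on the standard green-list watermark: when the context token is still masked, the logit bias is taken in expectation over the model's current distribution for it. The keyed hash stand-in and cutoff are simplifying assumptions; idea (2), scoring tokens by the strength they add as context, is not shown:

```python
# Sketch: expectation-over-context green-list biasing for a DLM position.
import torch

def greenlist_mask(context_token, vocab_size, gamma=0.5):
    g = torch.Generator().manual_seed(int(context_token))  # keyed hash stand-in
    perm = torch.randperm(vocab_size, generator=g)
    mask = torch.zeros(vocab_size)
    mask[perm[: int(gamma * vocab_size)]] = 1.0
    return mask

def watermarked_logits(logits, context_probs, vocab_size, delta=2.0):
    # context_probs: model's marginal over the (still-undetermined) context
    # token; the green-list bias is applied in expectation over it.
    expected_mask = sum(p * greenlist_mask(tok, vocab_size)
                        for tok, p in enumerate(context_probs) if p > 1e-4)
    return logits + delta * expected_mask
```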

Result: The DLM watermark achieves >99% true positive rate with minimal quality impact and similar robustness to existing ARLM watermarks, enabling reliable DLM watermarking for the first time.

Conclusion: Successfully developed the first effective watermarking scheme for diffusion language models that addresses the unique challenges of non-sequential token generation while maintaining high detection rates and minimal quality degradation.

Abstract: We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.

[373] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

Ximan Sun, Xiang Cheng

Main category: cs.LG

TL;DR: LRT-Diffusion introduces a risk-aware sampling method for diffusion policies in offline RL that uses sequential hypothesis testing with calibrated risk control instead of heuristic guidance.

DetailsMotivation: Current diffusion policies for offline RL use heuristic guidance at sampling time without proper statistical risk control, lacking interpretable risk budgets and principled uncertainty handling.

Method: Treats each denoising step as sequential hypothesis test between unconditional prior and state-conditional policy head, accumulating log-likelihood ratio and gating conditional mean with logistic controller calibrated to meet user-specified Type-I error level alpha.
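
A hedged sketch of the gating mechanism, assuming Gaussian heads with shared scale; the slope parameter and exact likelihood form are illustrative assumptions, and tau is assumed pre-calibrated under H0:

```python
# Sketch: accumulate a log-likelihood ratio across denoising steps and blend
# unconditional/conditional means through a calibrated logistic gate.
import torch

def lrt_gated_mean(mu_uncond, mu_cond, x_t, sigma, llr_state, tau, slope=1.0):
    # Per-step log-likelihood ratio of x_t under the two Gaussian heads.
    log_p1 = -((x_t - mu_cond) ** 2).sum() / (2 * sigma ** 2)
    log_p0 = -((x_t - mu_uncond) ** 2).sum() / (2 * sigma ** 2)
    llr_state = llr_state + (log_p1 - log_p0)
    gate = torch.sigmoid(slope * (llr_state - tau))    # evidence-driven gate
    mu = mu_uncond + gate * (mu_cond - mu_uncond)      # gated conditional mean
    return mu, llr_state
```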

Result: On D4RL MuJoCo tasks, LRT-Diffusion improves return-OOD trade-off over Q-guided baselines while honoring desired alpha, with theoretical guarantees of level-alpha calibration and stability bounds.

Conclusion: LRT-Diffusion is a drop-in inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL, especially beneficial when off-support errors dominate.

Abstract: Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.

[374] Semi-Supervised Preference Optimization with Limited Feedback

Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

Main category: cs.LG

TL;DR: SSPO enables semi-supervised preference optimization using minimal labeled data and large unlabeled datasets via optimal reward thresholding for pseudo-labeling.

DetailsMotivation: Current preference optimization methods require substantial paired feedback data, leading to high resource costs. The paper aims to reduce dependency on labeled data while maintaining alignment quality.

Method: Proposes Semi-Supervised Preference Optimization (SSPO) that learns from both limited pairwise preference labels and large unpaired samples. Key innovation is theoretical proof of optimal reward threshold that separates winning/losing responses, enabling principled pseudo-labeling of unlabeled data.
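
A minimal sketch of the pseudo-labeling step, assuming the optimal threshold has already been estimated; the pairing strategy shown is one simple choice among many and is our assumption:

```python
# Sketch: split unpaired responses at the reward threshold, then form
# (chosen, rejected) pairs for a DPO-style preference loss.
def pseudo_label(prompt, responses, reward_model, threshold):
    scored = [(reward_model(prompt, r), r) for r in responses]
    winners = [r for s, r in scored if s >= threshold]
    losers = [r for s, r in scored if s < threshold]
    return [(prompt, w, l) for w in winners for l in losers]
```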

Result: SSPO achieves remarkable data efficiency - training Mistral-7B-Instruct with just 1% of UltraFeedback data consistently surpasses strong baselines trained on 10% of UltraFeedback. Validated across multiple datasets.

Conclusion: SSPO effectively distills latent preferences from unlabeled data, maintaining human alignment while drastically reducing data acquisition costs, making preference optimization more accessible and efficient.

Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to significant resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.

[375] Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations

Christoph Lange, Isabel Thiele, Lara Santolin, Sebastian L. Riedel, Maxim Borisyak, Peter Neubauer, M. Nicolas Cruz Bournazou

Main category: cs.LG

TL;DR: Data augmentation technique for Raman spectroscopy using additive nature of spectra to generate training data with statistically independent labels, improving CNN performance when correlations differ between training and target datasets.

DetailsMotivation: Raman spectroscopy is popular in biotechnology for non-invasive monitoring, but CNNs require large datasets and can pick up non-linear dependencies. Historical data often exists but isn't used due to correlation differences between contexts.

Method: Exploits additive nature of spectra to generate additional data points with statistically independent labels from existing datasets. This reduces correlations between model predictions when training CNNs.
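
A minimal sketch of the additive augmentation, assuming access to (approximate) pure-component spectra; the concentration ranges and noise model are illustrative assumptions:

```python
# Sketch: synthesize spectra as weighted sums of pure-component spectra with
# independently sampled concentrations, so generated labels are uncorrelated.
import numpy as np

def augment(pure_spectra, n_samples, conc_ranges, noise_std=0.01, rng=None):
    # pure_spectra: (n_components, n_wavenumbers)
    rng = rng or np.random.default_rng()
    k, w = pure_spectra.shape
    # Independent labels: each concentration is sampled on its own.
    conc = np.column_stack([rng.uniform(lo, hi, n_samples)
                            for lo, hi in conc_ranges])      # (n_samples, k)
    spectra = conc @ pure_spectra + rng.normal(0, noise_std, (n_samples, w))
    return spectra, conc
```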

Result: Training CNNs on generated data improves performance on datasets where annotations don’t bear the same correlations as training data. Enables reuse of historical spectra for new contexts with different correlations.

Conclusion: The data augmentation technique allows building more robust models using historical data, demonstrated with synthetic spectra of Ralstonia eutropha batch cultivations for monitoring substrate, biomass and PHA concentrations.

Abstract: In biotechnology, Raman spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) that measures cell densities, substrate concentrations, and product concentrations. As it records vibrational modes of molecules, it provides this information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice to infer information about variables of interest from the spectra. However, biological processes are known for their complexity, for which convolutional neural networks (CNNs) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions, or the presence of additional substances. However, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels so that a network trained on such data exhibits low correlations between the model predictions. We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations over the course of the experiments.

[376] Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang

Main category: cs.LG

TL;DR: PREPO improves data efficiency in reinforcement learning with verifiable rewards (RLVR) by using prompt perplexity to guide learning progression and amplifying rollout discrepancy through relative entropy differentiation, achieving competitive results with up to 3x fewer rollouts.

DetailsMotivation: RLVR improves reasoning in LLMs but is computationally expensive due to many unproductive rollouts. The paper aims to improve data efficiency by leveraging intrinsic data properties that come at almost no additional cost during training.

Method: PREPO has two components: 1) Uses prompt perplexity as an indicator of model adaptability, enabling progression from easier to harder contexts. 2) Amplifies rollout discrepancy by differentiating their relative entropy, prioritizing sequences with higher exploration.
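
Both components can be sketched in a few lines; the exact weighting schemes below are assumptions, not the paper's formulas:

```python
# Sketch: (i) easy-to-hard curriculum by prompt perplexity,
# (ii) rollout priorities from entropy relative to the group.
import numpy as np

def prompt_perplexity(logprobs):
    return float(np.exp(-np.mean(logprobs)))   # per-token PPL of the prompt

def curriculum_order(prompts, logprobs_per_prompt):
    ppl = [prompt_perplexity(lp) for lp in logprobs_per_prompt]
    order = sorted(zip(ppl, prompts), key=lambda t: t[0])
    return [p for _, p in order]               # low PPL (easy) first

def rollout_priorities(entropies):
    # entropies: per-rollout mean token entropy; favor high-exploration
    # sequences by their entropy relative to the group mean.
    rel = np.asarray(entropies) - np.mean(entropies)
    w = np.exp(rel)
    return w / w.sum()
```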

Result: On Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than baselines while preserving competitive performance.

Conclusion: PREPO successfully improves data efficiency of RLVR through intrinsic data property utilization, with both empirical gains and theoretical analysis explaining the method’s rationale.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to the computation they require. This study investigates how simply leveraging intrinsic data properties, an almost free benefit during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond empirical gains, we provide theoretical and in-depth analyses explaining the underlying rationale of our method to improve the data efficiency of RLVR.

[377] Graph Machine Learning based Doubly Robust Estimator for Network Causal Effects

Seyedeh Baharan Khatami, Harsh Parikh, Haowei Chen, Sudeepa Roy, Babak Salimi

Main category: cs.LG

TL;DR: Proposes a novel method combining graph machine learning with double machine learning to estimate causal effects in social networks, addressing interference and network-induced confounding without strong prior assumptions.

DetailsMotivation: Existing methods for causal inference in social networks make strong assumptions about network-induced confounding mechanisms that rarely hold in high-dimensional networks. There's a need for methods that can handle interference (where neighbors' treatments affect outcomes) and network confounding without restrictive assumptions.

Method: Combines graph machine learning approaches with the double machine learning framework to enable accurate estimation of direct and peer effects using single observational social network data. The method is semiparametrically efficient under mild regularity conditions.
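
For orientation, the doubly robust score the framework computes has the familiar AIPW form; the sketch below is the generic scalar-treatment version with the GNN-based nuisance models abstracted away, and the network-interference handling elided:

```python
# Sketch: AIPW estimate given fitted outcome models and propensities.
import numpy as np

def aipw(y, t, mu1, mu0, e):
    # y: outcomes; t: binary treatments; mu1/mu0: predicted outcomes under
    # treatment/control; e: estimated propensities.
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))
```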

Result: Demonstrates semiparametric efficiency of the proposed estimator, allowing for consistent uncertainty quantification. Extensive simulations show the method is accurate, robust, and scalable. Applied to investigate impact of Self-Help Group participation on financial risk tolerance.

Conclusion: The proposed methodology effectively addresses causal inference challenges in social networks by combining graph ML with double ML, providing accurate estimation of direct and peer effects without strong prior assumptions about network confounding.

Abstract: We address the challenge of inferring causal effects in social network data. This setting poses challenges due to interference – where a unit’s outcome is affected by neighbors’ treatments – and network-induced confounding factors. While there is extensive literature focusing on estimating causal effects in social network setups, most existing approaches make prior assumptions about the form of network-induced confounding mechanisms. Such strong assumptions rarely hold, especially in high-dimensional networks. We propose a novel methodology that combines graph machine learning approaches with the double machine learning framework to enable accurate and efficient estimation of direct and peer effects using a single observational social network. We demonstrate the semiparametric efficiency of our proposed estimator under mild regularity conditions, allowing for consistent uncertainty quantification. We demonstrate that our method is accurate, robust, and scalable via an extensive simulation study. We use our method to investigate the impact of Self-Help Group participation on financial risk tolerance.

[378] Gradient Testing and Estimation by Comparisons

Xiwen Tao, Chenyi Zhang, Helin Wang, Yexin Zhang, Tongyang Li

Main category: cs.LG

TL;DR: Paper studies gradient testing and estimation using only comparison oracles, with classical and quantum algorithms for determining gradient direction.

DetailsMotivation: Many optimization problems involve functions where only comparison queries (which point has larger value) are available, not direct gradient access. Understanding gradient properties with limited information is fundamental.

Method: Designs algorithms using comparison oracles: 1) gradient testing algorithm to check if normalized gradient is close/far from given direction (O(1) queries), 2) gradient estimation algorithm to estimate normalized gradient direction (O(n log(1/ε)) queries), 3) quantum algorithm using superposition queries (O(log(n/ε)) queries).
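
To see what the oracle exposes, here is the comparison primitive and a toy use of it; the paper's O(1) test and O(n log(1/ε)) estimator are far more refined, and this sketch only recovers partial-derivative signs:

```python
# Toy: a comparison oracle and coordinate-wise gradient signs from it.
import numpy as np

def compare(f, x, u, v, delta=1e-4):
    # Oracle: which of the two nearby points has the larger function value?
    return f(x + delta * u) >= f(x + delta * v)

def coordinate_signs(f, x, delta=1e-4):
    n = len(x)
    signs = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        signs[i] = 1.0 if compare(f, x, e, -e, delta) else -1.0
    return signs
```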

Result: Proves optimality of classical gradient estimation algorithm (O(n log(1/ε)) queries). Shows quantum algorithm achieves exponential speedup in dimension dependence (O(log(n/ε)) vs O(n log(1/ε))).

Conclusion: Comparison oracles provide sufficient information for gradient testing and estimation, with quantum algorithms offering significant speedups. Results have implications for optimization with limited function access.

Abstract: We study gradient testing and gradient estimation of smooth functions using only a comparison oracle that, given two points, indicates which one has the larger function value. For any smooth $f\colon\mathbb R^n\to\mathbb R$, $\mathbf{x}\in\mathbb R^n$, and $\varepsilon>0$, we design a gradient testing algorithm that determines whether the normalized gradient $\nabla f(\mathbf{x})/\|\nabla f(\mathbf{x})\|$ is $\varepsilon$-close or $2\varepsilon$-far from a given unit vector $\mathbf{v}$ using $O(1)$ queries, as well as a gradient estimation algorithm that outputs an $\varepsilon$-estimate of $\nabla f(\mathbf{x})/\|\nabla f(\mathbf{x})\|$ using $O(n\log(1/\varepsilon))$ queries, which we prove to be optimal. Furthermore, we study gradient estimation in the quantum comparison oracle model where queries can be made in superpositions, and develop a quantum algorithm using $O(\log (n/\varepsilon))$ queries.

[379] SeqRisk: Transformer-augmented latent variable model for robust survival prediction with longitudinal data

Mine Öğretir, Miika Koskinen, Juha Sinisalo, Risto Renkonen, Harri Lähdesmäki

Main category: cs.LG

TL;DR: SeqRisk combines VAE/LVAE with transformers and Cox models for longitudinal healthcare risk prediction, handling irregular clinical data to identify high-risk patients.

DetailsMotivation: Traditional survival analysis uses single time-point data, failing to leverage longitudinal patient history and capture temporal patterns in clinical real-world data, which is often irregular, noisy, and sparse.

Method: SeqRisk integrates variational autoencoder (VAE) or longitudinal VAE (LVAE) with transformer-based sequence aggregation and Cox proportional hazards module for risk prediction, handling irregular longitudinal data and capturing long-range interactions.
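
A minimal sketch of the risk head: transformer-aggregated sequence embeddings feed a Cox partial-likelihood loss. The VAE/LVAE encoder is abstracted away, and the loss below assumes no tied event times for simplicity:

```python
# Sketch: negative Cox partial likelihood over predicted log-hazards.
import torch

def cox_partial_likelihood(risk, time, event):
    # risk: (N,) predicted log-hazards; time: (N,) follow-up; event: (N,) 0/1
    order = torch.argsort(time, descending=True)    # prefix = risk set
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)    # log sum over risk set
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp_min(1)
```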

Result: SeqRisk demonstrated robust performance under increasing data sparsity conditions, consistently surpassing existing approaches in predictive accuracy and generalizability while providing partial explainability.

Conclusion: The proposed method effectively handles challenging clinical longitudinal data, improves risk prediction accuracy, and offers insights into population characteristics for identifying high-risk patients.

Abstract: In healthcare, risk assessment of patient outcomes has long been based on survival analysis, i.e., modeling time-to-event associations. However, conventional approaches rely on data from a single time-point, making them suboptimal for fully leveraging longitudinal patient history and capturing temporal regularities. Focusing on clinical real-world data and acknowledging its challenges, we utilize latent variable models to effectively handle irregular, noisy, and sparsely observed longitudinal data. We propose SeqRisk, a method that combines variational autoencoder (VAE) or longitudinal VAE (LVAE) with a transformer-based sequence aggregation and Cox proportional hazards module for risk prediction. SeqRisk captures long-range interactions, enhances predictive accuracy and generalizability, and provides partial explainability of sample population characteristics to help identify high-risk patients. SeqRisk demonstrated robust performance under conditions of increasing sparsity, consistently surpassing existing approaches.

[380] Beyond Linear Surrogates: High-Fidelity Local Explanations for Black-Box Models

Sanjeev Shrestha, Rahul Dubey, Hui Liu

Main category: cs.LG

TL;DR: A novel local model-agnostic explanation method using MARS and N-ball sampling to generate high-fidelity explanations for black-box ML models.

DetailsMotivation: As black-box ML models become more complex and are adopted in high-stakes areas, there's a critical need for explanations of their predictions. Existing local explanation methods fall short in generating high-fidelity explanations.

Method: Proposes a local model-agnostic explanation method using multivariate adaptive regression splines (MARS) to model non-linear local boundaries and N-ball sampling strategies that sample perturbed samples directly from desired distributions instead of reweighting.
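
The N-ball sampling step has a simple exact recipe: normalize a Gaussian draw for the direction and take radius r = R · U^(1/n), which is the uniform law inside an n-ball. A MARS surrogate would then be fit on these perturbations (assumed here, not shown):

```python
# Sketch: uniform sampling inside an n-ball around an instance.
import numpy as np

def sample_n_ball(center, radius, n_samples, rng=None):
    rng = rng or np.random.default_rng()
    n = len(center)
    d = rng.normal(size=(n_samples, n))
    d /= np.linalg.norm(d, axis=1, keepdims=True)        # uniform directions
    r = radius * rng.uniform(size=(n_samples, 1)) ** (1.0 / n)
    return center + r * d
```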

Result: The method achieves higher local surrogate fidelity compared to baseline methods, with an average 32% reduction in RMSE across five benchmark datasets. Statistical analysis shows significantly better results across all datasets.

Conclusion: The approach advances explainable AI by providing more accurate local approximations of black-box models, benefiting both research and practitioner communities.

Abstract: With the increasing complexity of black-box machine learning models and their adoption in high-stakes areas, it is critical to provide explanations for their predictions. Existing local explanation methods fall short in generating high-fidelity explanations. This paper proposes a novel local model-agnostic explanation method to generate high-fidelity explanations using multivariate adaptive regression splines (MARS) and N-ball sampling strategies. MARS is used to model non-linear local boundaries that effectively capture the underlying behavior of the reference model, thereby enhancing the local fidelity. The N-ball sampling technique draws perturbed samples directly from a desired distribution instead of reweighting, leading to further improvement in the faithfulness. The performance of the proposed method was measured in terms of root mean squared error (RMSE) on five benchmark datasets with different kernel widths. Experimental results show that the proposed method achieves higher local surrogate fidelity compared to baseline local explanation methods, with an average reduction of 32% in root mean square error, indicating more accurate local approximations of the black-box model. Additionally, statistical analysis shows that across all benchmark datasets, the proposed approach's results were statistically significantly better. This paper advances the field of explainable AI by providing insights that can benefit the broader research and practitioner community.

[381] Risk-Aware Decision Making in Restless Bandits: Theory and Algorithms for Planning and Learning

Nima Akbarzadeh, Yossiri Adulyasak, Erick Delage

Main category: cs.LG

TL;DR: Risk-aware restless bandits with Whittle index solutions for planning and Thompson sampling for learning, applied to machine replacement and patient scheduling.

DetailsMotivation: Traditional restless bandits use risk-neutral objectives, but real-world applications often require risk-awareness to mitigate downside risks. The paper aims to incorporate risk considerations into restless bandit problems.

Method: Generalizes restless bandits with risk-aware objectives, establishes indexability conditions, provides Whittle index solutions for planning (finite-horizon non-stationary and infinite-horizon stationary MDPs), and proposes Thompson sampling for learning unknown transition probabilities.
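
A minimal sketch of the learning component's posterior-sampling step, assuming Dirichlet posteriors over each arm's transition rows; the risk-aware Whittle-index planner that consumes the sampled models is abstracted away:

```python
# Sketch: Thompson sampling of transition models from Dirichlet posteriors.
import numpy as np

def thompson_sample_models(counts, rng=None):
    # counts[arm, action, s, s'] = observed transition counts
    rng = rng or np.random.default_rng()
    n_arms, n_actions, n_states, _ = counts.shape
    models = np.empty(counts.shape, dtype=float)
    for k in range(n_arms):
        for a in range(n_actions):
            for s in range(n_states):
                models[k, a, s] = rng.dirichlet(1.0 + counts[k, a, s])
    return models
```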

Result: Shows bounded regret scaling sublinearly with episodes and quadratically with arms for Thompson sampling. Numerical experiments demonstrate efficacy in reducing risk exposure in machine replacement and patient scheduling applications.

Conclusion: Successfully incorporates risk-awareness into restless bandits with theoretical guarantees and practical applications, providing both planning and learning solutions for risk-sensitive decision-making.

Abstract: In restless bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless bandits problem with a risk-neutral objective by incorporating risk-awareness, which is particularly important in various real-world applications especially when the decision maker seeks to mitigate downside risks. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on Whittle index for the first time for the planning problem with finite-horizon non-stationary and for infinite-horizon stationary Markov decision processes. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless bandits is illustrated through a set of numerical experiments in the contexts of machine replacement and patient scheduling applications under both planning and learning setups.

[382] Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information

Maria-Florina Balcan, Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Keegan Harris, Zhiwei Steven Wu

Main category: cs.LG

TL;DR: Online learning algorithms for Stackelberg games with side information achieve improved O(√T) regret under bandit feedback via reduction to linear contextual bandits

DetailsMotivation: The paper addresses the problem of online learning in Stackelberg games where a leader interacts with a sequence of followers, with the leader observing contextual information before committing to a strategy. Previous algorithms achieved O(T^{2/3}) regret, and the authors aim to improve this to O(√T) under bandit feedback.

Method: The authors propose learning algorithms that rely on a reduction to linear contextual bandits in the utility space. In each round, a linear contextual bandit algorithm recommends a utility vector, which is then inverted to determine the leader’s mixed strategy. The approach is extended to settings where the leader’s utility function is unknown.
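
Very schematically, one round of the reduction looks as follows; `linucb`, `invert_to_mixed_strategy`, and `observe_payoff` are placeholders for the paper's components, and the inversion construction itself is not reproduced here:

```python
# Schematic round: a linear contextual bandit proposes a follower-utility
# vector for the observed context; an (abstracted) inversion maps it back
# to a leader mixed strategy, and bandit feedback updates the bandit.
def leader_round(context, linucb, invert_to_mixed_strategy, observe_payoff):
    u_hat = linucb.recommend(context)           # utility vector
    x = invert_to_mixed_strategy(u_hat)         # leader's mixed strategy
    reward = observe_payoff(x)                  # bandit feedback
    linucb.update(context, u_hat, reward)
    return x, reward
```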

Result: The algorithms achieve O(√T) regret under bandit feedback, improving from the previously best-known rates of O(T^{2/3}). The methods are applied to bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Empirical results show the algorithms outperform previous approaches in numerical simulations.

Conclusion: The paper presents efficient online learning algorithms for Stackelberg games with side information that achieve optimal regret rates through a novel reduction to linear contextual bandits. The approach is generalizable to settings with unknown utility functions and has practical applications in auction theory and Bayesian persuasion.

Abstract: We study the problem of online learning in Stackelberg games with side information between a leader and a sequence of followers. In every round the leader observes contextual information and commits to a mixed strategy, after which the follower best-responds. We provide learning algorithms for the leader which achieve $O(T^{1/2})$ regret under bandit feedback, an improvement from the previously best-known rates of $O(T^{2/3})$. Our algorithms rely on a reduction to linear contextual bandits in the utility space: In each round, a linear contextual bandit algorithm recommends a utility vector, which our algorithm inverts to determine the leader’s mixed strategy. We extend our algorithms to the setting in which the leader’s utility function is unknown, and also apply it to the problems of bidding in second-price auctions with side information and online Bayesian persuasion with public and private states. Finally, we observe that our algorithms empirically outperform previous results on numerical simulations.

[383] Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation

Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang

Main category: cs.LG

TL;DR: PSOFT is a parameter-efficient fine-tuning method that confines orthogonal transformations to the principal subspace of pre-trained weights to achieve semantic preservation, expressiveness, and multi-dimensional efficiency.

DetailsMotivation: Existing orthogonal fine-tuning methods struggle to balance expressiveness and efficiency (parameter counts, memory, computation) while preserving semantic representations of pre-trained models.

Method: PSOFT constructs principal subspace via matrix decomposition for compatible transformations with higher effective rank, establishes theoretical conditions to maintain subspace geometry for semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training.
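
A hedged sketch of the core transform: SVD yields the principal subspace, and a Cayley transform yields an exact orthogonal matrix from a small skew-symmetric parameter. The rank choice and parameterization details are assumptions, and the gradual orthogonality relaxation is not shown:

```python
# Sketch: orthogonal update confined to the principal subspace of W.
import torch

def psoft_like_update(W, A_params, rank):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    Ur, Sr, Vr = U[:, :rank], S[:rank], Vh[:rank]
    A = A_params - A_params.T                     # skew-symmetric (rank x rank)
    I = torch.eye(rank, device=W.device)
    R = torch.linalg.solve(I + A, I - A)          # Cayley: exactly orthogonal
    W_prin = Ur @ R @ torch.diag(Sr) @ Vr         # rotated principal part
    W_rest = W - Ur @ torch.diag(Sr) @ Vr         # residual part untouched
    return W_prin + W_rest
```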

Result: Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate PSOFT achieves semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT.

Conclusion: PSOFT offers a practical and scalable solution for parameter-efficient fine-tuning that simultaneously achieves semantic preservation, expressiveness, and efficiency across NLP and CV tasks.

Abstract: Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher effective rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.

[384] Temporal Graph Pattern Machine

Yijun Ma, Zehong Wang, Weixiang Sun, Yanfang Ye

Main category: cs.LG

TL;DR: TGPM is a foundation framework for temporal graph learning that learns generalized evolving patterns through interaction patches and self-supervised pre-training, achieving SOTA in link prediction with strong cross-domain transferability.

DetailsMotivation: Current temporal graph learning methods are task-centric with restrictive assumptions (short-term dependencies, static neighborhoods, retrospective time usage), which hinders discovery of transferable temporal evolution mechanisms. Need a framework that directly learns generalized evolving patterns.

Method: TGPM conceptualizes interactions as interaction patches via temporally-biased random walks to capture multi-scale structural semantics and long-range dependencies. Uses Transformer-based backbone to capture global temporal regularities. Introduces self-supervised pre-training tasks (masked token modeling and next-time prediction) to encode fundamental laws of network evolution.
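
A minimal sketch of a temporally-biased random walk over interaction events; the exponential decay form and data layout are illustrative assumptions:

```python
# Sketch: walk backwards through earlier neighbor-events, favoring recent
# ones but occasionally reaching long-range history.
import numpy as np

def temporal_walk(events_by_node, start, t0, length, beta=0.1, rng=None):
    # events_by_node[u] = list of (neighbor, timestamp) pairs
    rng = rng or np.random.default_rng()
    walk, node, t = [(start, t0)], start, t0
    for _ in range(length):
        past = [(v, ts) for v, ts in events_by_node.get(node, []) if ts < t]
        if not past:
            break
        w = np.array([np.exp(-beta * (t - ts)) for _, ts in past])
        v, ts = past[rng.choice(len(past), p=w / w.sum())]
        walk.append((v, ts))
        node, t = v, ts
    return walk
```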

Result: Extensive experiments show TGPM consistently achieves state-of-the-art performance in both transductive and inductive link prediction, demonstrating exceptional cross-domain transferability.

Conclusion: TGPM provides a foundation framework for temporal graph learning that shifts focus to learning generalized evolving patterns, enabling better transferability and performance across domains.

Abstract: Temporal graph learning is pivotal for deciphering dynamic systems, where the core challenge lies in explicitly modeling the underlying evolving patterns that govern network transformation. However, prevailing methods are predominantly task-centric and rely on restrictive assumptions – such as short-term dependency modeling, static neighborhood semantics, and retrospective time usage. These constraints hinder the discovery of transferable temporal evolution mechanisms. To address this, we propose the Temporal Graph Pattern Machine (TGPM), a foundation framework that shifts the focus toward directly learning generalized evolving patterns. TGPM conceptualizes each interaction as an interaction patch synthesized via temporally-biased random walks, thereby capturing multi-scale structural semantics and long-range dependencies that extend beyond immediate neighborhoods. These patches are processed by a Transformer-based backbone designed to capture global temporal regularities while adapting to context-specific interaction dynamics. To further empower the model, we introduce a suite of self-supervised pre-training tasks – specifically masked token modeling and next-time prediction – to explicitly encode the fundamental laws of network evolution. Extensive experiments show that TGPM consistently achieves state-of-the-art performance in both transductive and inductive link prediction, demonstrating exceptional cross-domain transferability. Our code has been released at https://github.com/antman9914/TGPM.

[385] Supervised Graph Contrastive Learning for Gene Regulatory Networks

Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima

Main category: cs.LG

TL;DR: SupGCL is a supervised graph contrastive learning method for gene regulatory networks that incorporates real biological perturbations from knockdown experiments as supervision, improving representation learning over conventional augmentation-based approaches.

DetailsMotivation: Current graph contrastive learning methods use artificial perturbations that may not reflect biological reality, leading to augmentation-free trends. However, real biological perturbations from experiments contain valuable information that should be leveraged rather than avoided.

Method: SupGCL uses a probabilistic formulation that generalizes conventional GCL by linking artificial augmentations with real perturbations measured in knockdown experiments, using the latter as explicit supervision for training GRN representations.
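
A hedged sketch of the supervision signal: a knockdown experiment supplies the positive view of a GRN, and an InfoNCE loss pulls (graph, perturbed-graph) embeddings together against in-batch negatives. The graph encoder and the paper's full probabilistic formulation are abstracted away:

```python
# Sketch: InfoNCE between original-GRN and knockdown-perturbed-GRN embeddings.
import torch
import torch.nn.functional as F

def info_nce(z, z_pert, temperature=0.2):
    # z, z_pert: (B, d) embeddings of original and perturbed GRNs
    z = F.normalize(z, dim=-1)
    z_pert = F.normalize(z_pert, dim=-1)
    logits = z @ z_pert.T / temperature           # (B, B) similarities
    labels = torch.arange(z.shape[0], device=z.device)
    return F.cross_entropy(logits, labels)
```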

Result: On patient-derived GRNs from three cancer types, SupGCL yields clearer disease-subtype structure, improves clustering in embedding space analysis, and consistently outperforms strong baselines on 13 downstream tasks spanning gene-level annotation and patient-level prediction.

Conclusion: Incorporating real biological perturbations as supervision in graph contrastive learning provides more biologically meaningful representations and improves performance on various downstream tasks in gene regulatory network analysis.

Abstract: Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and should be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided, but rather a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments, and using the latter as explicit supervision. On patient-derived GRNs from three cancer types, we train GRN representations with SupGCL and evaluate it in two regimes: (i) embedding space analysis, where it yields clearer disease-subtype structure and improves clustering, and (ii) task-specific fine-tuning, where it consistently outperforms strong graph representation learning baselines on 13 downstream tasks spanning gene-level functional annotation and patient-level prediction.

[386] GGBall: Graph Generative Model on Poincaré Ball

Tianci Bu, Chuanrui Wang, Hao Ma, Haoren Zheng, Xin Lu, Tailin Wu

Main category: cs.LG

TL;DR: GGBall: A hyperbolic framework for graph generation using hyperbolic VQ-VAE with Riemannian flow matching, achieving better hierarchical structure preservation than Euclidean methods.

DetailsMotivation: Euclidean geometry struggles to capture the exponential complexity and hierarchical structures in graphs. Hyperbolic geometry offers better representation for hierarchical data but hasn't been effectively integrated with modern generative models for graph generation.

Method: Combines Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with Riemannian flow matching using closed-form geodesics. Develops hyperbolic GNN and Transformer layers that operate entirely within the hyperbolic manifold for stability and scalability.
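
For reference, the standard Poincaré-ball primitives (curvature c = 1) that such a model builds on: Möbius addition and the exponential map, which give closed-form geodesics for Riemannian flow matching. These are the usual gyrovector formulas, not code from the paper:

```python
# Sketch: Mobius addition and exponential map on the Poincare ball (c = 1).
import torch

def mobius_add(x, y, eps=1e-6):
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = 1 + 2 * xy + x2 * y2
    return num / den.clamp_min(eps)

def exp_map(x, v, eps=1e-6):
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    lam = 2.0 / (1 - (x * x).sum(-1, keepdim=True)).clamp_min(eps)
    return mobius_add(x, torch.tanh(lam * v_norm / 2) * v / v_norm)
```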

Result: Reduces degree MMD by over 75% on Community-Small and over 40% on Ego-Small compared to state-of-the-art baselines, demonstrating improved ability to preserve topological hierarchies in generated graphs.

Conclusion: Hyperbolic geometry provides a powerful foundation for generative modeling of complex, structured, and hierarchical data domains like graphs, outperforming Euclidean approaches in preserving hierarchical structures.

Abstract: Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce GGBall, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75% on Community-Small and over 40% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at https://github.com/AI4Science-WestlakeU/GGBall.

[387] Two-Player Zero-Sum Games with Bandit Feedback

Elif Yılmaz, Christos Dimitrakakis

Main category: cs.LG

TL;DR: The paper proposes three algorithms for learning pure strategy Nash Equilibria in two-player zero-sum games with bandit feedback, achieving instance-dependent regret bounds comparable to existing methods.

DetailsMotivation: To address the problem of learning in two-player zero-sum games where the payoff matrix is unknown and must be estimated through bandit feedback, with a focus on deriving instance-dependent regret bounds which have received limited attention in the literature.

Method: Three algorithms based on Explore-Then-Commit (ETC) and action pair elimination frameworks: 1) ETC adapted to zero-sum games, 2) adaptive elimination leveraging ε-Nash Equilibrium property, and 3) extension with non-uniform exploration.

Result: Achieved instance-dependent regret upper bounds: O(Δ + √T) for ETC, and O(log(TΔ²)/Δ) for both adaptive elimination algorithms, where Δ is the suboptimality gap.

Conclusion: ETC and action pair elimination algorithms perform effectively in zero-sum game settings, achieving regret bounds comparable to existing methods while providing valuable insights through instance-dependent analysis.

Abstract: We study a two-player zero-sum game in which the row player aims to maximize their payoff against a competing column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) and action pair elimination frameworks. The first adapts ETC to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC and action pair elimination algorithms in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. In particular, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC in the zero-sum game setting and $O\left(\frac{\log(T \Delta^2)}{\Delta}\right)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $\Delta$ denotes the suboptimality gap. Our results therefore indicate that the ETC and action pair elimination algorithms perform effectively in zero-sum game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
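As a concrete illustration of the ETC template in this setting (a sketch under simplifying assumptions, not the paper's algorithm), one can explore every action pair uniformly, then commit to the saddle point of the empirical payoff matrix:

```python
# Minimal Explore-Then-Commit for a zero-sum matrix game with bandit
# feedback; payoff_oracle(i, j) returns a noisy payoff sample.
import numpy as np

def etc_zero_sum(payoff_oracle, n_rows, n_cols, m):
    est = np.zeros((n_rows, n_cols))
    for i in range(n_rows):                      # uniform exploration phase
        for j in range(n_cols):
            est[i, j] = np.mean([payoff_oracle(i, j) for _ in range(m)])
    i_star = int(np.argmax(est.min(axis=1)))     # maximin row
    j_star = int(np.argmin(est.max(axis=0)))     # minimax column
    return i_star, j_star, est                   # commit to (i*, j*)
```

The adaptive elimination variants instead discard action pairs whose confidence intervals rule out membership in an ε-Nash Equilibrium, which is where the logarithmic instance-dependent rates come from.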

[388] The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs

Zhiliang Chen, Alfred Wei Lun Leong, Shao Yong Ong, Apivich Hemachandra, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

Main category: cs.LG

TL;DR: JoBS is a method that jointly optimizes both data and model configurations for LLM training using a scaling-law-inspired performance predictor with Bayesian optimization, solving the chicken-and-egg dilemma of interdependent optimizations.

DetailsMotivation: The paper addresses the fundamental chicken-and-egg problem in LLM training where optimal data configurations depend on model configurations and vice versa. Existing methods optimize either data or model separately without considering their interaction, leading to suboptimal solutions.

Method: JoBS uses a scaling-law-inspired performance predictor to guide Bayesian optimization. It allocates part of the optimization budget to learn a predictor that estimates full-training performance from early training steps, then uses BO with this predictor to efficiently explore the joint space of data and model configurations.

Result: JoBS outperforms existing multi-fidelity Bayesian optimization baselines and separate data/model optimization approaches across diverse LLM tasks under the same optimization budget constraints.

Conclusion: The proposed joint optimization approach effectively solves the interdependent data-model configuration problem, enabling more efficient LLM training through intelligent budget allocation and performance prediction.

Abstract: Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO entirely with the predictor, effectively amortizing the cost of running full-training runs. We study JoBS’s average regret and devise the optimal budget allocation to minimize regret. JoBS outperforms existing multi-fidelity BO baselines, as well as data and model optimization approaches across diverse LLM tasks under the same optimization budget.
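One way to approximate the predictor idea, though not necessarily the paper's exact construction, is to fit a saturating power law to losses observed during the first few training steps and extrapolate to the full budget; a hedged sketch:

```python
# Hypothetical scaling-law-style extrapolation: L(s) ~ a * s^(-b) + c.
import numpy as np
from scipy.optimize import curve_fit

def predict_final_loss(steps, losses, full_steps):
    """Fit a power law to early (step, loss) pairs and extrapolate."""
    power_law = lambda s, a, b, c: a * np.power(s, -b) + c
    (a, b, c), _ = curve_fit(power_law, steps, losses,
                             p0=(losses[0], 0.5, losses[-1]), maxfev=10000)
    return power_law(full_steps, a, b, c)
```

In JoBS, such a cheap score would then serve as the objective that Bayesian optimization queries over the joint data-model configuration space, amortizing the cost of full training runs.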

[389] Generating Directed Graphs with Dual Attention and Asymmetric Encoding

Alba Carballo-Castro, Manuel Madeira, Yiming Qin, Dorina Thanou, Pascal Frossard

Main category: cs.LG

TL;DR: Directo is the first generative model for directed graphs using discrete flow matching with positional encodings for asymmetric relations and dual-attention mechanisms.

DetailsMotivation: Directed graphs are essential for modeling asymmetric relationships in various domains, but directed graph generation remains underexplored due to challenges in modeling edge directionality and lack of standardized benchmarks.

Method: Proposes Directo, a generative model for directed graphs built on discrete flow matching framework with: (i) principled positional encodings for asymmetric relations, (ii) dual-attention mechanism capturing both incoming and outgoing dependencies, and (iii) robust discrete generative framework.

Result: The method performs strongly across diverse settings and competes with specialized models for particular classes like directed acyclic graphs. A benchmark suite covering synthetic and real-world datasets is introduced.

Conclusion: Directo establishes a solid foundation for future research in directed graph generation, highlighting the effectiveness and generality of the approach.

Abstract: Directed graphs naturally model systems with asymmetric, ordered relationships, essential to applications in biology, transportation, social networks, and visual understanding. Generating such graphs enables tasks such as simulation, data augmentation and novel instance discovery; however, directed graph generation remains underexplored. We identify two key factors limiting progress in this direction: first, modeling edge directionality introduces a substantially larger dependency space, making the underlying distribution harder to learn; second, the absence of standardized benchmarks hinders rigorous evaluation. Addressing the former requires more expressive models that are sensitive to directional topologies. We propose Directo, the first generative model for directed graphs built upon the discrete flow matching framework. Our approach combines: (i) principled positional encodings tailored to asymmetric pairwise relations, (ii) a dual-attention mechanism capturing both incoming and outgoing dependencies, and (iii) a robust, discrete generative framework. To support evaluation, we introduce a benchmark suite covering synthetic and real-world datasets. It shows that our method performs strongly across diverse settings and even competes with specialized models for particular classes, such as directed acyclic graphs. Our results highlight the effectiveness and generality of our approach, establishing a solid foundation for future research in directed graph generation.
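To illustrate the dual-attention idea (a hedged sketch of the concept, not the paper's layer), one can run one attention pass masked by the adjacency matrix and one masked by its transpose, so each node aggregates outgoing and incoming dependencies separately:

```python
# Conceptual dual attention over a directed graph; A is a 0/1 adjacency.
import torch
import torch.nn.functional as F

def dual_attention(X, A, Wq, Wk, Wv):
    """X: (n, d) node features; Wq/Wk/Wv: (d, d) projection matrices."""
    def masked_attn(mask):
        scores = (X @ Wq) @ (X @ Wk).t() / (X.shape[1] ** 0.5)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        # Nodes with no neighbors yield all -inf rows; zero them out.
        return torch.nan_to_num(F.softmax(scores, dim=-1)) @ (X @ Wv)
    out_msg = masked_attn(A)       # attend along outgoing edges
    in_msg = masked_attn(A.t())    # attend along incoming edges
    return torch.cat([out_msg, in_msg], dim=-1)
```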

[390] Diffusion-Guided Pretraining for Brain Graph Foundation Models

Xinxu Wei, Rong Zhou, Lifang He, Yu Zhang

Main category: cs.LG

TL;DR: A diffusion-based pretraining framework for brain graph foundation models that uses structure-aware dropping/masking and topology-aware reconstruction to preserve brain connectivity semantics.

DetailsMotivation: Existing graph-based pretraining methods for brain signals use naive random dropping/masking that disrupts meaningful connectivity patterns, and graph-level readout/reconstruction schemes fail to capture global structural information, limiting representation robustness.

Method: Proposes a unified diffusion-based pretraining framework that: 1) uses diffusion to guide structure-aware dropping and masking strategies to preserve brain graph semantics while maintaining pretraining diversity, and 2) enables topology-aware graph-level readout and node-level global reconstruction by allowing embeddings to aggregate information from globally related regions.

Result: Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.

Conclusion: The diffusion-based framework effectively addresses limitations of existing methods by preserving brain graph semantics and enabling global structural information capture, leading to improved representation learning for brain signals.

Abstract: With the growing interest in foundation models for brain signals, graph-based pretraining has emerged as a promising paradigm for learning transferable representations from connectome data. However, existing contrastive and masked autoencoder methods typically rely on naive random dropping or masking for augmentation, which is ill-suited for brain graphs and hypergraphs as it disrupts semantically meaningful connectivity patterns. Moreover, commonly used graph-level readout and reconstruction schemes fail to capture global structural information, limiting the robustness of learned representations. In this work, we propose a unified diffusion-based pretraining framework that addresses both limitations. First, diffusion is designed to guide structure-aware dropping and masking strategies, preserving brain graph semantics while maintaining effective pretraining diversity. Second, diffusion enables topology-aware graph-level readout and node-level global reconstruction by allowing graph embeddings and masked nodes to aggregate information from globally related regions. Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.

[391] AXLearn: Modular, Hardware-Agnostic Large Model Training

Mark Lee, Chang Lan, Tom Gunter, John Peebles, Hanzhi Zhou, Kelvin Zou, Sneha Bangalore, Chung-Cheng Chiu, Nan Du, Xianzhi Du, Philipp Dufter, Ruixuan Hou, Haoshuo Huang, Dongseong Hwang, Xiang Kong, Jinhao Lei, Tao Lei, Meng Li, Li Li, Jiarui Lu, Zhiyun Lu, Yiping Ma, David Qiu, Vivek Rathod, Senyu Tong, Zhucheng Tu, Jianyu Wang, Yongqiang Wang, Zirui Wang, Floris Weers, Sam Wiseman, Guoli Yin, Bowen Zhang, Xiyou Zhou, Danyang Zhuo, Cheng Leong, Ruoming Pang

Main category: cs.LG

TL;DR: AXLearn is Apple’s production system for scalable training of large deep learning models, with a unique focus on modularity and hardware-agnostic training; it maintains constant complexity as the system scales.

DetailsMotivation: Need for scalable, high-performance training systems for large deep learning models that can handle increasing model complexity while maintaining development efficiency and hardware flexibility.

Method: Developed AXLearn with strict encapsulation between software components, enabling modular assembly and hardware-agnostic training. System maintains constant complexity scaling through well-designed interfaces.

Result: AXLearn achieves equivalent performance to state-of-the-art training systems while enabling rapid development (e.g., integrating RoPE across hundreds of modules with just 10 lines of code vs. hundreds in other systems).

Conclusion: AXLearn provides an effective production system for large-scale deep learning training at Apple, balancing performance, scalability, and development efficiency through modular design.

Abstract: AXLearn is a production system which facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for hardware-agnostic training. AXLearn’s internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on different hardware infrastructure. AXLearn maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in state-of-the-art training systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn at Apple.

[392] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

Main category: cs.LG

TL;DR: Automated pipeline for detecting unverbalized biases in LLMs by generating candidate bias concepts, testing them through statistical methods, and identifying biases not mentioned in chain-of-thought reasoning.

DetailsMotivation: LLMs often provide plausible chain-of-thought reasoning that may hide internal biases (unverbalized biases), making monitoring via stated reasoning unreliable. Existing bias evaluations require predefined categories and hand-crafted datasets, limiting scalability and discovery of unknown biases.

Method: Fully automated black-box pipeline that: 1) uses LLM autoraters to generate candidate bias concepts from task datasets, 2) tests each concept by generating positive/negative variations on progressively larger input samples, 3) applies statistical techniques for multiple testing and early stopping, and 4) flags concepts as unverbalized biases if they yield statistically significant performance differences while not being cited in CoT reasoning.

Result: Pipeline automatically discovered previously unknown biases in seven LLMs across three decision tasks (hiring, loan approval, university admissions), including Spanish fluency, English proficiency, and writing formality. Also validated manually identified biases from prior work (gender, race, religion, ethnicity).

Conclusion: The proposed approach provides a practical, scalable path to automatic task-specific bias discovery in LLMs, moving beyond predefined bias categories to uncover hidden biases not verbalized in reasoning traces.

Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model’s CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
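The per-concept statistical test at the pipeline's core can be sketched as a paired comparison with a multiple-testing correction; in this hypothetical sketch, `score_fn` and `vary` stand in for the task model and the autorater-driven variation generator, which are not implemented here.

```python
# Illustrative per-concept bias test with Bonferroni correction.
import numpy as np
from scipy import stats

def test_concept(score_fn, inputs, vary, alpha=0.05, n_concepts=20):
    """Flag a concept if positive vs. negative variations of the inputs
    produce a statistically significant paired difference in scores."""
    pos = np.array([score_fn(vary(x, direction="positive")) for x in inputs])
    neg = np.array([score_fn(vary(x, direction="negative")) for x in inputs])
    t_stat, p_value = stats.ttest_rel(pos, neg)
    significant = p_value < alpha / n_concepts   # Bonferroni correction
    return significant, float(np.mean(pos - neg))
```

A concept that passes this test but never appears as a justification in the model's CoT traces would then be flagged as an unverbalized bias.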

[393] Instance-Wise Adaptive Sampling for Dataset Construction in Approximating Inverse Problem Solutions

Jiequn Han, Kui Ren, Nathan Soedjak

Main category: cs.LG

TL;DR: Adaptive sampling framework for constructing compact training datasets for inverse problems, dynamically allocating sampling effort based on specific test instances to improve sample efficiency.

DetailsMotivation: Traditional learning-based approaches for inverse problems require large datasets drawn from prior distributions, which can be costly when priors have high intrinsic dimensions or high accuracy is needed. The authors aim to develop a more efficient method that adapts to specific test instances.

Method: Proposes an instance-wise adaptive sampling framework that dynamically allocates sampling effort based on the specific test instance. The method iteratively refines the training dataset conditioned on the latest prediction, tailoring the dataset to the geometry of the inverse map around each test instance.

Result: Demonstrated effectiveness in inverse scattering problems under two types of structured priors. Results show that the adaptive method’s advantage becomes more pronounced with more complex priors or higher accuracy requirements.

Conclusion: The adaptive sampling strategy offers a scalable and practical alternative to conventional fixed-dataset training regimes for inverse problems, with broad applicability beyond the specific inverse scattering problem studied.

Abstract: We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose inverse map from datasets drawn from a prior distribution, with the training process independent of the specific test instance. When the prior has a high intrinsic dimension or when high accuracy of the learned solution is required, a large number of training samples may be needed, resulting in substantial data collection costs. In contrast, our method dynamically allocates sampling effort based on the specific test instance, enabling significant gains in sample efficiency. By iteratively refining the training dataset conditioned on the latest prediction, the proposed strategy tailors the dataset to the geometry of the inverse map around each test instance. We demonstrate the effectiveness of our approach in the inverse scattering problem under two types of structured priors. Our results show that the advantage of the adaptive method becomes more pronounced in settings with more complex priors or higher accuracy requirements. While our experiments focus on a particular inverse problem, the adaptive sampling strategy is broadly applicable and readily extends to other inverse problems, offering a scalable and practical alternative to conventional fixed-dataset training regimes.
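A schematic of the instance-wise loop (placeholders only; the paper's refinement strategy is more specific): alternate between fitting a surrogate inverse map on locally concentrated samples and updating the prediction for the single test instance at hand.

```python
# Hypothetical adaptive-sampling loop; fit, forward, and sample_near are
# placeholders for the regressor, simulator, and local prior sampler.
def adaptive_inverse(y_obs, fit, forward, sample_near, x0, rounds=5, n=64):
    x_hat, data = x0, []
    for _ in range(rounds):
        xs = sample_near(x_hat, n)                # sample near current guess
        data.extend((forward(x), x) for x in xs)  # simulate paired data
        model = fit(data)                         # refit surrogate inverse
        x_hat = model(y_obs)                      # refine the prediction
    return x_hat
```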

[394] Slicing Wasserstein Over Wasserstein Via Functional Optimal Transport

Moritz Piening, Robert Beinert

Main category: cs.LG

TL;DR: Proposes Double-Sliced Wasserstein (DSW) metric as a computationally efficient alternative to Wasserstein over Wasserstein (WoW) distances for comparing meta-measures, leveraging 1D Wasserstein isometry and Gaussian process projections.

DetailsMotivation: Wasserstein over Wasserstein (WoW) distances are powerful for comparing datasets or distributions over images and shapes but computationally costly. Existing sliced WoW accelerations suffer from numerical instability due to reliance on parametric meta-measures or high-order moments.

Method: Leverages isometry between 1D Wasserstein space and quantile functions in L₂([0,1]) space. Introduces general sliced Wasserstein framework for arbitrary Banach spaces, defining sliced distance between 1D meta-measures via infinite-dimensional L₂-projections parametrized by Gaussian processes. Combines this with classical integration over Euclidean unit sphere to create DSW metric.

Result: DSW minimization is equivalent to WoW minimization for discretized meta-measures while avoiding unstable higher-order moments and achieving computational savings. Numerical experiments on datasets, shapes, and images validate DSW as a scalable substitute for WoW distance.

Conclusion: DSW provides a stable, computationally efficient alternative to WoW distances for comparing meta-measures, with applications in dataset comparison, shape analysis, and image processing.

Abstract: Wasserstein distances define a metric between probability measures on arbitrary metric spaces, including meta-measures (measures over measures). The resulting Wasserstein over Wasserstein (WoW) distance is a powerful, but computationally costly tool for comparing datasets or distributions over images and shapes. Existing sliced WoW accelerations rely on parametric meta-measures or the existence of high-order moments, leading to numerical instability. As an alternative, we propose to leverage the isometry between the 1d Wasserstein space and the quantile functions in the function space $L_2([0,1])$. For this purpose, we introduce a general sliced Wasserstein framework for arbitrary Banach spaces. Due to the 1d Wasserstein isometry, this framework defines a sliced distance between 1d meta-measures via infinite-dimensional $L_2$-projections, parametrized by Gaussian processes. Combining this 1d construction with classical integration over the Euclidean unit sphere yields the double-sliced Wasserstein (DSW) metric for general meta-measures. We show that DSW minimization is equivalent to WoW minimization for discretized meta-measures, while avoiding unstable higher-order moments and achieving computational savings. Numerical experiments on datasets, shapes, and images validate DSW as a scalable substitute for the WoW distance.
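The 1d isometry the construction rests on is easy to state computationally: for empirical measures with equally many atoms, the Wasserstein-2 distance is the L2 distance between sorted samples, i.e., between quantile functions. A minimal numpy sketch of this fact and of classical slicing:

```python
# 1d W2 via quantiles, and classical sliced W2 over random directions.
import numpy as np

def wasserstein2_1d(x, y):
    """W2 between two 1d empirical measures with equal atom counts."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def sliced_w2(X, Y, n_proj=50, seed=0):
    """Average 1d W2 over random unit directions in R^d."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        total += wasserstein2_1d(X @ theta, Y @ theta) ** 2
    return np.sqrt(total / n_proj)
```

DSW replaces the finite-dimensional directions with L2-projections of quantile functions parametrized by Gaussian processes, lifting the same recipe to meta-measures.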

[395] Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting

Xi Wang, James McInerney, Lequn Wang, Nathan Kallus

Main category: cs.LG

TL;DR: EAT (Entropy After ⟨/think⟩) is a method to detect and prevent overthinking in reasoning LLMs by monitoring token entropy after an appended stop-thinking token, enabling early exit and compute-efficient reasoning.

DetailsMotivation: Reasoning LLMs tend to overthink, continuing to revise answers even after reaching correct solutions, leading to inefficient token usage and wasted compute resources.

Method: Append a stop-thinking token (⟨/think⟩) and monitor the entropy of the following token during reasoning; thresholding the variance of this entropy trajectory under an exponential moving average yields a practical stopping rule once the entropy stabilizes.

Result: On MATH500 and AIME2025, EAT reduces token usage by 12-22% without harming accuracy; effective even in black-box settings using proxy models for logit access.

Conclusion: EAT provides a simple, inexpensive method to detect overthinking and enable adaptive compute allocation, improving reasoning efficiency in LLMs.

Abstract: Reasoning LLMs show improved performance with longer chains of thought. However, recent work has highlighted their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency from the distribution dynamics perspective by tracking Pass@1 for answers averaged over a large number of rollouts, and find the model often begins to always produce the correct answer early in the reasoning, making extra reasoning tokens wasteful. To detect and prevent overthinking, we propose a simple and inexpensive novel signal, Entropy After ⟨/think⟩ (EAT), for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (⟨/think⟩) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 12-22% without harming accuracy. EAT also remains effective in black-box settings where logits from the reasoning model are not accessible and EAT is computed with proxy models: we verified the feasibility via early stopping Llama 70B with a 1.5B model and Claude 3.7 with a local 4B model.
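A minimal sketch of the stopping rule (schematic; model and tokenizer calls are omitted, and the thresholds are placeholders): compute the entropy of the next-token distribution after appending the stop-thinking token, and exit once an exponential-moving-average estimate of the trajectory's variance falls below a threshold.

```python
# Illustrative EAT monitor; feed it the entropy measured after the
# appended stop-thinking token at each monitoring step.
import torch

def token_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1)

class EATMonitor:
    def __init__(self, beta=0.9, var_threshold=1e-3):
        self.beta, self.thr = beta, var_threshold
        self.mean, self.var = None, None

    def update(self, entropy):
        """EMA of the EAT trajectory's mean/variance; True => exit early."""
        e = float(entropy)
        if self.mean is None:
            self.mean, self.var = e, 0.0
            return False
        delta = e - self.mean
        self.mean += (1 - self.beta) * delta
        self.var = self.beta * self.var + (1 - self.beta) * delta ** 2
        return self.var < self.thr
```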

[396] Logit Distance Bounds Representational Similarity

Beatrix M. G. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz

Main category: cs.LG

TL;DR: The paper shows that while KL divergence can match predictions in distillation, it doesn’t guarantee linear representational similarity; logit-distance distillation better preserves teacher’s linear representational properties.

DetailsMotivation: The paper investigates whether approximate distributional closeness (like in distillation) implies approximate linear representational similarity, given that exact equality does imply such similarity due to identifiability results.

Method: Defines a representational dissimilarity measure based on models’ identifiability class and proves it’s bounded by logit distance. Shows KL divergence upper-bounds logit distance but provides weak control in practice. Conducts distillation experiments on synthetic and image datasets comparing KL-based vs logit-distance distillation.

Result: Logit-distance distillation yields students with higher linear representational similarity and better preservation of teacher’s linearly recoverable concepts compared to KL-based distillation.

Conclusion: KL-based distillation can match teacher’s predictions while failing to preserve linear representational properties; logit-distance distillation is more effective for preserving representational similarity and linearly recoverable concepts.

Abstract: For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models’ identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher’s predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts.
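The practical contrast between the two distillation objectives is compact enough to sketch; the snippet below shows KL on tempered softmax outputs versus a direct distance on (shift-centered) logits. This is a minimal illustration, not the paper's formal dissimilarity measure, which is defined at the level of identifiability classes.

```python
# KL-based vs. logit-distance distillation losses (illustrative).
import torch
import torch.nn.functional as F

def kl_distill(student_logits, teacher_logits, T=1.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True, reduction="batchmean",
    )

def logit_distance_distill(student_logits, teacher_logits):
    # Softmax is shift-invariant, so logits are compared after centering
    # out a per-example additive constant.
    s = student_logits - student_logits.mean(dim=-1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=-1, keepdim=True)
    return (s - t).pow(2).sum(-1).mean()
```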

[397] ArtNet: Hierarchical Clustering-Based Artificial Netlist Generator for ML and DTCO Application

Andrew B. Kahng, Seokhyeong Kang, Seonghyeon Park, Dooseok Yoon

Main category: cs.LG

TL;DR: ArtNet is an artificial netlist generator that creates realistic training data for ML-based chip design optimization, improving model generalization and enabling better design-technology co-optimization.

DetailsMotivation: ML and DTCO approaches for chip PPA optimization face limitations due to scarce diverse training data and long design flow turnaround times, requiring better data generation methods.

Method: ArtNet generates artificial netlists that replicate key topological characteristics of real designs, producing realistic datasets that match target parameters for ML training and DTCO exploration.

Result: ArtNet improves CNN-based DRV prediction F1 score by 0.16 through data augmentation, and ArtNet-generated mini-brains achieve up to 97.94% PPA match with full-scale block designs.

Conclusion: ArtNet effectively addresses data scarcity in chip design ML by generating realistic artificial netlists, enabling better model generalization and more efficient design space exploration.

Abstract: In advanced nodes, optimization of power, performance and area (PPA) has become highly complex and challenging. Machine learning (ML) and design-technology co-optimization (DTCO) provide promising mitigations, but face limitations due to a lack of diverse training data as well as long design flow turnaround times (TAT). We propose ArtNet, a novel artificial netlist generator designed to tackle these issues. Unlike previous methods, ArtNet replicates key topological characteristics, enhancing ML model generalization and supporting broader design space exploration for DTCO. By producing realistic artificial datasets that more closely match given target parameters, ArtNet enables more efficient PPA optimization and exploration of flows and design enablements. In the context of CNN-based DRV prediction, ArtNet’s data augmentation improves F1 score by 0.16 compared to using only the original (real) dataset. In the DTCO context, ArtNet-generated mini-brains achieve a PPA match up to 97.94%, demonstrating close alignment with design metrics of targeted full-scale block designs.

[398] Omni-iEEG: A Large-Scale, Comprehensive iEEG Dataset and Benchmark for Epilepsy Research

Chenda Duan, Yipeng Zhang, Sotaro Kanai, Yuanyi Ding, Atsuro Daida, Pengyue Yu, Tiancheng Zheng, Naoto Kuroda, Shaun A. Hussain, Eishi Asano, Hiroki Nariai, Vwani Roychowdhury

Main category: cs.LG

TL;DR: Omni-iEEG: A large-scale, harmonized intracranial EEG dataset for epilepsy research with 302 patients, 178 hours of recordings, and 36K expert-validated pathological event annotations.

DetailsMotivation: Current epilepsy research faces challenges with inconsistent iEEG formats, lack of standardized benchmarks, and limited annotated data across centers, hindering reproducibility and clinical translation of machine learning approaches.

Method: Harmonized heterogeneous iEEG formats, metadata, and recordings from multiple public sources to create a unified dataset with clinical metadata (seizure onset zones, resections, outcomes) and expert-validated pathological event annotations.

Result: Created Omni-iEEG with 302 patients, 178 hours of high-resolution iEEG recordings, 36K expert-validated annotations, and established clinically meaningful tasks with unified evaluation metrics.

Conclusion: Omni-iEEG serves as a foundation for reproducible, generalizable, and clinically translatable epilepsy research, bridging machine learning and clinical practice with standardized benchmarks.

Abstract: Epilepsy affects over 50 million people worldwide, and one-third of patients suffer drug-resistant seizures where surgery offers the best chance of seizure freedom. Accurate localization of the epileptogenic zone (EZ) relies on intracranial EEG (iEEG). Clinical workflows, however, remain constrained by labor-intensive manual review. At the same time, existing data-driven approaches are typically developed on single-center datasets that are inconsistent in format and metadata, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance. With extensive efforts to reconcile heterogeneous iEEG formats, metadata, and recordings across publicly available sources, we present Omni-iEEG, a large-scale, pre-surgical iEEG resource comprising 302 patients and 178 hours of high-resolution recordings. The dataset includes harmonized clinical metadata such as seizure onset zones, resections, and surgical outcomes, all validated by board-certified epileptologists. In addition, Omni-iEEG provides over 36K expert-validated annotations of pathological events, enabling robust biomarker studies. Omni-iEEG serves as a bridge between machine learning and epilepsy research. It defines clinically meaningful tasks with unified evaluation metrics grounded in clinical priors, enabling systematic evaluation of models in clinically relevant settings. Beyond benchmarking, we demonstrate the potential of end-to-end modeling on long iEEG segments and highlight the transferability of representations pretrained on non-neurophysiological domains. Together, these contributions establish Omni-iEEG as a foundation for reproducible, generalizable, and clinically translatable epilepsy research. The project page with dataset and code links is available at omni-ieeg.github.io/omni-ieeg.

[399] Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation

Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick

Main category: cs.LG

TL;DR: ConDA applies contrastive learning to diffusion model latents to create interpretable low-dimensional embeddings aligned with underlying dynamical factors, enabling smooth interpolation, extrapolation, and counterfactual editing while maintaining original diffusion rendering quality.

DetailsMotivation: Diffusion models generate high-quality outputs but their latent spaces are high-dimensional and not organized for interpretation or control. There's a need to make these latent spaces more interpretable and controllable for applications like editing, interpolation, and understanding underlying dynamics.

Method: ConDA uses contrastive learning on pretrained diffusion latents with auxiliary variables (time, stimulation parameters, facial action units) to learn a low-dimensional embedding. It separates editing and rendering by lifting embedding trajectories back to diffusion latents using a neighborhood-preserving kNN decoder.

Result: ConDA produces more interpretable and controllable latent structure than linear traversals and conditioning-based baselines across diverse domains including fluid dynamics, neural calcium imaging, neurostimulation, facial expression dynamics, and motor cortex activity.

Conclusion: Diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer, enabling interpretable control while maintaining the quality of original diffusion rendering.

Abstract: Diffusion models excel at generation, but their latent spaces are high dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plug-and-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing while rendering remains in the original diffusion space. ConDA separates editing and rendering by lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder and is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer.
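The decoder that separates editing from rendering can be sketched simply: an edited point in the contrastive embedding is lifted back to diffusion-latent space as a distance-weighted average of its nearest training neighbors' latents. The weighting scheme and k below are illustrative choices, not necessarily ConDA's.

```python
# Hypothetical neighborhood-preserving kNN lifter (numpy arrays assumed).
import numpy as np
from sklearn.neighbors import NearestNeighbors

class KNNLifter:
    def __init__(self, embeddings, latents, k=8):
        self.nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
        self.latents = latents  # (N, latent_dim) paired diffusion latents

    def lift(self, z, eps=1e-8):
        """Map an edited embedding z back to a diffusion latent."""
        dist, idx = self.nn.kneighbors(z[None, :])
        w = 1.0 / (dist[0] + eps)        # inverse-distance weights
        w /= w.sum()
        return (w[:, None] * self.latents[idx[0]]).sum(axis=0)
```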

[400] Alternatives to the Laplacian for Scalable Spectral Clustering with Group Fairness Constraints

Iván Ojeda-Ruiz, Young Ju Lee, Malcolm Dickens, Leonardo Cambisaca

Main category: cs.LG

TL;DR: Fair-SMW algorithm improves efficiency of fair spectral clustering by reformulating constrained optimization using Lagrangian method and Sherman-Morrison-Woodbury identity, achieving 2x faster computation while maintaining comparable fairness balance.

DetailsMotivation: Existing fair spectral clustering algorithms that incorporate group fairness (balance) constraints suffer from computational inefficiency. The study aims to enhance runtime performance while maintaining comparable fairness outcomes.

Method: Reformulates the constrained optimization problem using Lagrangian method and Sherman-Morrison-Woodbury identity to create Fair-SMW algorithm. Uses three alternatives to Laplacian matrix with different spectral gaps to generate multiple variations.

Result: Achieves 2x faster computation time than state-of-the-art methods while maintaining comparable balance. Also demonstrates flexibility to achieve twice as much balance when needed. Evaluated on real-world datasets including LastFM, FacebookNet, Deezer, and German.

Conclusion: Fair-SMW provides an efficient solution for fair spectral clustering with improved computational performance while maintaining fairness constraints, offering practical benefits for real-world applications requiring algorithmic fairness.

Abstract: Recent research has focused on mitigating algorithmic bias in clustering by incorporating fairness constraints into algorithmic design. Notions such as disparate impact, community cohesion, and cost per population have been implemented to enforce equitable outcomes. Among these, group fairness (balance) ensures that each protected group is proportionally represented within every cluster. However, incorporating balance as a metric of fairness into spectral clustering algorithms has led to computational times that can be improved. This study aims to enhance the efficiency of spectral clustering algorithms by reformulating the constrained optimization problem using a new formulation derived from the Lagrangian method and the Sherman-Morrison-Woodbury (SMW) identity, resulting in the Fair-SMW algorithm. Fair-SMW employs three alternatives to the Laplacian matrix with different spectral gaps to generate multiple variations of Fair-SMW, achieving clustering solutions with comparable balance to existing algorithms while offering improved runtime performance. We present the results of Fair-SMW, evaluated using the Stochastic Block Model (SBM) to measure both runtime efficiency and balance across real-world network datasets, including LastFM, FacebookNet, Deezer, and German. Fair-SMW achieves computation times twice as fast as the state-of-the-art, and is flexible enough to achieve twice as much balance.
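For reference, the identity the method is named after: when the constrained system reduces to inverting a low-rank update of an efficiently invertible matrix, the Sherman-Morrison-Woodbury identity replaces an n-by-n solve with a k-by-k one (k constraints, k much smaller than n). How Fair-SMW instantiates the factors is specific to its Laplacian alternatives and fairness constraints; the identity itself is standard:

```latex
% Sherman-Morrison-Woodbury identity (standard form):
\[
  (A + U C V)^{-1}
  = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1},
\]
% with A \in \mathbb{R}^{n \times n}, U \in \mathbb{R}^{n \times k},
% C \in \mathbb{R}^{k \times k}, V \in \mathbb{R}^{k \times n}.
```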

[401] Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

Zhuojin Li, Marco Paolieri, Leana Golubchik

Main category: cs.LG

TL;DR: A system for collaborative CPU-GPU execution on mobile devices using OpenCL SVM for synchronization and ML models for execution time prediction to optimize inference latency.

DetailsMotivation: Mobile devices have limited computing resources but unified memory architecture and narrower CPU-GPU performance gap create opportunities for collaborative execution to reduce inference latency of deep neural networks.

Method: Proposes lightweight synchronization using OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times of tasks on both CPU and GPU, accounting for GPU kernel performance characteristics and dispatch times.

Result: Achieves up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers on mobile platforms, close to theoretical maximums found by exhaustive search.

Conclusion: The approach effectively enables CPU-GPU co-execution on mobile devices by overcoming synchronization overhead and execution time prediction challenges, significantly improving DNN inference performance.

Abstract: Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel 5 smartphone).
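To see why learned latency predictors enable co-execution, consider the simplest possible scheduler (an illustrative sketch, not the paper's system): given predicted per-device latencies as a function of the fraction of work assigned, pick the split that minimizes the parallel makespan.

```python
# Toy split-ratio search; cpu_time and gpu_time stand in for the paper's
# learned execution-time models (including GPU dispatch overhead).
def best_split(cpu_time, gpu_time, grid=100):
    best_r, best_lat = 0.0, float("inf")
    for k in range(grid + 1):
        r = k / grid
        lat = max(cpu_time(r), gpu_time(1.0 - r))  # devices run in parallel
        if lat < best_lat:
            best_r, best_lat = r, lat
    return best_r, best_lat

# Example with toy linear predictors plus a fixed GPU dispatch cost:
r, lat = best_split(lambda f: 8.0 * f, lambda f: 0.5 + 4.0 * f)
```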

[402] Online Robust Reinforcement Learning with General Function Approximation

Debamita Ghosh, George K. Atia, Yue Wang

Main category: cs.LG

TL;DR: Online distributionally robust RL with function approximation that learns robust policies through interaction without prior data, using dual-driven fitted robust Bellman procedure with regret guarantees based on robust Bellman-Eluder dimension.

DetailsMotivation: Real-world RL systems suffer performance degradation when deployment environments differ from training environments. Existing distributionally robust RL approaches require strong data assumptions (generative models or large offline datasets) and are limited to tabular settings.

Method: Proposes a fully online DR-RL algorithm with general function approximation that learns solely through interaction. Uses a dual-driven fitted robust Bellman procedure that simultaneously estimates value function and worst-case backup operator.

Result: Establishes regret guarantees characterized by robust Bellman-Eluder dimension, covering broad class of phi-divergence uncertainty sets. Regret bounds are sublinear, don’t scale with state/action space sizes, and specialize to tight rates in structured problem classes.

Conclusion: The framework demonstrates practicality and scalability for online distributionally robust RL with function approximation, addressing limitations of existing approaches that require strong data assumptions and are restricted to tabular settings.

Abstract: In many real-world settings, reinforcement learning systems suffer performance degradation when the environment encountered at deployment differs from that observed during training. Distributionally robust reinforcement learning (DR-RL) mitigates this issue by seeking policies that maximize performance under the most adverse transition dynamics within a prescribed uncertainty set. Most existing DR-RL approaches, however, rely on strong data availability assumptions, such as access to a generative model or large offline datasets, and are largely restricted to tabular settings. In this work, we propose a fully online DR-RL algorithm with general function approximation that learns robust policies solely through interaction, without requiring prior knowledge or pre-collected data. Our approach is based on a dual-driven fitted robust Bellman procedure that simultaneously estimates the value function and the corresponding worst-case backup operator. We establish regret guarantees for online DR-RL characterized by an intrinsic complexity notion, the robust Bellman-Eluder dimension, covering a broad class of phi-divergence uncertainty sets. The resulting regret bounds are sublinear, do not scale with the size of the state or action spaces, and specialize to tight rates in structured problem classes, demonstrating the practicality and scalability of our framework.
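Schematically, the backup being fitted is the distributionally robust Bellman operator below (our paraphrase of the standard form; see the paper for the exact operator and its dual):

```latex
\[
  (\mathcal{T}^{\mathrm{rob}} V)(s)
  = \max_{a \in \mathcal{A}} \Big[ r(s,a)
    + \gamma \inf_{P \in \mathcal{U}_\phi(s,a)}
      \mathbb{E}_{s' \sim P}\big[ V(s') \big] \Big],
\]
% where U_phi(s,a) is a phi-divergence ball around the nominal transition
% kernel; the inner infimum is typically evaluated through its convex
% dual, which is what the "dual-driven" fitted procedure estimates.
```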

[403] On the Sample Complexity of Learning for Blind Inverse Problems

Nathan Buskulic, Luca Calatroni, Lorenzo Rosasco, Silvia Villa

Main category: cs.LG

TL;DR: Theoretical analysis of learning in blind inverse problems using Linear Minimum Mean Square Estimators (LMMSEs), deriving optimal estimators and establishing connections to Tikhonov regularization with convergence guarantees.

DetailsMotivation: Blind inverse problems where the forward operator is unknown present challenges for standard methods. While data-driven approaches show promise, they lack interpretability and theoretical guarantees, limiting reliability in applied domains like imaging.

Method: Analyzes blind inverse problems within the Linear Minimum Mean Square Estimators (LMMSEs) framework. Derives closed-form expressions for optimal estimators, establishes equivalences with Tikhonov-regularized formulations, and proves convergence results under source conditions.

Result: Provides theoretical analysis with closed-form optimal estimators, finite-sample error bounds characterizing performance as function of noise level, problem conditioning, and sample size. Validates findings through numerical experiments showing predicted convergence behavior.

Conclusion: The work provides rigorous theoretical foundations for learning in blind inverse problems, bridging the gap between empirical data-driven methods and theoretical guarantees, with implications for reliable application in imaging and other domains.

Abstract: Blind inverse problems arise in many experimental settings where the forward operator is partially or entirely unknown. In this context, methods developed for the non-blind case cannot be adapted in a straightforward manner. Recently, data-driven approaches have been proposed to address blind inverse problems, demonstrating strong empirical performance and adaptability. However, these methods often lack interpretability and are not supported by rigorous theoretical guarantees, limiting their reliability in applied domains such as imaging inverse problems. In this work, we shed light on learning in blind inverse problems within the simplified yet insightful framework of Linear Minimum Mean Square Estimators (LMMSEs). We provide a theoretical analysis, deriving closed-form expressions for optimal estimators and extending classical results. In particular, we establish equivalences with suitably chosen Tikhonov-regularized formulations, where the regularization depends explicitly on the distributions of the unknown signal, the noise, and the random forward operators. We also prove convergence results of the reconstruction error under appropriate source condition assumptions. Furthermore, we derive finite-sample error bounds that characterize the performance of learned estimators as a function of the noise level, problem conditioning, and number of available samples. These bounds explicitly quantify the impact of operator randomness and reveal the associated convergence rates as this randomness vanishes. Finally, we validate our theoretical findings through illustrative numerical experiments that confirm the predicted convergence behavior.
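For orientation, the classical estimator the analysis is built around (standard background, stated in our notation): for zero-mean signal $x$ and observations $y$, the linear minimum mean square estimate is

```latex
\[
  \hat{x} = \Sigma_{xy}\,\Sigma_{yy}^{-1}\, y,
  \qquad
  \Sigma_{xy} = \mathbb{E}[x y^\top], \quad
  \Sigma_{yy} = \mathbb{E}[y y^\top];
\]
% in the blind setting y = A x + n with a random operator A, both
% covariances average over the operator distribution as well as the
% signal and noise, which is where operator randomness enters the
% error bounds.
```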

[404] Reinforcement Learning to Discover a North-East Monsoon Index for Rainfall Prediction in Thailand

Kiattikun Chobtham

Main category: cs.LG

TL;DR: Novel North-East monsoon climate index optimized via Deep Q-Network improves long-term monthly rainfall prediction in Thailand using LSTM models.

DetailsMotivation: Existing global climate indices like ENSO have limitations for regional rainfall prediction in Thailand; there's a need for local-scale indices to improve predictive accuracy in specific Thai regions.

Method: Developed a novel North-East monsoon climate index from sea surface temperature; used Deep Q-Network reinforcement learning to optimize calculation areas; classified rainfall stations into 12 clusters; integrated optimized index into LSTM models for prediction.

Result: Incorporating the optimized index significantly improves long-term monthly rainfall prediction skill in most cluster areas and effectively reduces RMSE for 12-month-ahead forecasts.

Conclusion: The reinforcement learning-optimized local climate index approach enhances regional rainfall prediction accuracy in Thailand, addressing limitations of global indices.

Abstract: Accurately predicting long-term rainfall is challenging. Global climate indices, such as the El Niño-Southern Oscillation, are standard input features for machine learning. However, a significant gap persists regarding local-scale indices capable of improving predictive accuracy in specific regions of Thailand. This paper introduces a novel North-East monsoon climate index calculated from sea surface temperature to reflect the climatology of the boreal winter monsoon. To optimise the calculated areas used for this index, a Deep Q-Network reinforcement learning agent explores and selects the most effective rectangles based on their correlation with seasonal rainfall. Rainfall stations were classified into 12 distinct clusters to distinguish rainfall patterns between southern and upper Thailand. Experimental results show that incorporating the optimised index into Long Short-Term Memory models significantly improves long-term monthly rainfall prediction skill in most cluster areas. This approach effectively reduces the Root Mean Square Error for 12-month-ahead forecasts.
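The reward shaping behind the rectangle search is simple to state; in this hypothetical sketch, a candidate rectangle's index is the areal-mean SST and the reward is the strength of its correlation with the rainfall series (array shapes and names are illustrative).

```python
# Illustrative reward for a DQN agent proposing SST rectangles.
import numpy as np

def rectangle_reward(sst, rainfall, lat0, lat1, lon0, lon1):
    """sst: (T, n_lat, n_lon) gridded SST; rainfall: (T,) series."""
    index = sst[:, lat0:lat1, lon0:lon1].mean(axis=(1, 2))  # candidate index
    r = np.corrcoef(index, rainfall)[0, 1]
    return abs(r)   # strong (anti)correlation => useful predictor
```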

[405] Learning PDE Solvers with Physics and Data: A Unifying View of Physics-Informed Neural Networks and Neural Operators

Yilong Dai, Shengyu Chen, Ziyi Wang, Xiaowei Jia, Yiqun Xie, Vipin Kumar, Runlong Yu

Main category: cs.LG

TL;DR: A survey paper proposing a unifying framework to analyze Physics-Informed Neural Networks (PINNs) and Neural Operators (NOs) for solving PDEs, organizing methods along three dimensions: what is learned, how physics is integrated, and computational amortization.

DetailsMotivation: The field lacks a unified perspective to understand relationships, limitations, and appropriate roles of various physics-aware data-driven approaches for PDEs in scientific workflows, despite their increasing importance in modern computational processes.

Method: Proposes a unifying framework organizing methods along three dimensions: 1) What is learned (solution, operator, or model), 2) How physical structures are integrated into learning, and 3) How computational load is amortized across problem instances.

Result: Provides a systematic survey that reveals how challenges in learning-based PDE solvers can be understood as consequences of structural properties, enabling better understanding of relationships between PINNs and NOs.

Conclusion: The unifying perspective facilitates development of reliable learning-based PDE solvers and catalyzes synthesis of physics and data by providing a structured framework to analyze existing methods and their trade-offs.

Abstract: Partial differential equations (PDEs) are central to scientific modeling. Modern workflows increasingly rely on learning-based components to support model reuse, inference, and integration across large computational processes. Despite the emergence of various physics-aware data-driven approaches, the field still lacks a unified perspective to uncover their relationships, limitations, and appropriate roles in scientific workflows. To this end, we propose a unifying perspective to place two dominant paradigms: Physics-Informed Neural Networks (PINNs) and Neural Operators (NOs), within a shared design space. We organize existing methods from three fundamental dimensions: what is learned, how physical structures are integrated into the learning process, and how the computational load is amortized across problem instances. In this way, many challenges can be best understood as consequences of these structural properties of learning PDEs. By analyzing advances through this unifying view, our survey aims to facilitate the development of reliable learning-based PDE solvers and catalyze a synthesis of physics and data.
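As a concrete anchor for the "physics integrated into the loss" end of this design space, here is a minimal PINN residual loss for the 1d heat equation u_t = alpha * u_xx (a generic textbook sketch, not tied to any specific method in the survey); neural operators instead fit the solution operator from paired data, amortizing cost over problem instances.

```python
# Minimal PINN residual for u_t = alpha * u_xx via autograd.
import torch

def pinn_residual_loss(model, x, t, alpha=0.1):
    """model maps (x, t) -> u; returns mean squared PDE residual."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return ((u_t - alpha * u_xx) ** 2).mean()
```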

[406] Active Learning for Decision Trees with Provable Guarantees

Arshia Soltani Moakhar, Tanapoom Laoaron, Faraz Ghahremani, Kiarash Banihashem, MohammadTaghi Hajiaghayi

Main category: cs.LG

TL;DR: Theoretical analysis of active learning label complexity for decision trees as binary classifiers, providing first analysis of disagreement coefficient and presenting algorithm achieving polylogarithmic label queries under specific assumptions.

DetailsMotivation: To advance theoretical understanding of active learning label complexity for decision trees, particularly analyzing the disagreement coefficient (key parameter governing active learning) and developing algorithms with provable guarantees for binary classification.

Method: 1) Theoretical analysis of disagreement coefficient for decision trees under two assumptions: distinct feature dimensions in root-to-leaf paths and regular grid-like data structure. 2) Development of general active learning algorithm for binary classification with multiplicative error guarantee. 3) Combination of these results to design active learning algorithm for decision trees with polylogarithmic label queries.

Result: 1) First analysis of disagreement coefficient for decision trees showing polylogarithmic label complexity under stated assumptions, with polynomial complexity if assumptions relaxed. 2) First general active learning algorithm achieving (1+ε)-approximate classifier. 3) Active learning algorithm for decision trees using polylogarithmic label queries. 4) Label complexity lower bound showing near-optimal dependence on error tolerance ε.

Conclusion: The paper provides foundational theoretical results for active learning with decision trees, establishing conditions for efficient label complexity and presenting near-optimal algorithms, advancing theoretical understanding of active learning for structured classifiers.

Abstract: This paper advances the theoretical understanding of active learning label complexity for decision trees as binary classifiers. We make two main contributions. First, we provide the first analysis of the disagreement coefficient for decision trees, a key parameter governing active learning label complexity. Our analysis holds under two natural assumptions required for achieving polylogarithmic label complexity: (i) each root-to-leaf path queries distinct feature dimensions, and (ii) the input data has a regular, grid-like structure. We show these assumptions are essential, as relaxing them leads to polynomial label complexity. Second, we present the first general active learning algorithm for binary classification that achieves a multiplicative error guarantee, producing a $(1+\varepsilon)$-approximate classifier. By combining these results, we design an active learning algorithm for decision trees that uses only a polylogarithmic number of label queries in the dataset size, under the stated assumptions. Finally, we establish a label complexity lower bound, showing our algorithm’s dependence on the error tolerance $\varepsilon$ is close to optimal.

[407] Universal Diffusion-Based Probabilistic Downscaling

Roberto Molinaro, Niall Siegenheim, Henry Martin, Mark Frey, Niels Poulsen, Philipp Seitz, Marvin Vincent Gabler

Main category: cs.LG

TL;DR: A universal diffusion-based framework that probabilistically downscales low-resolution weather forecasts to high-resolution predictions without model-specific fine-tuning, improving both deterministic and probabilistic skill across diverse weather models.

DetailsMotivation: To create a scalable, model-agnostic solution for enhancing spatial resolution and uncertainty representation in weather forecasting without requiring specialized fine-tuning for each upstream weather model.

Method: Train a single conditional diffusion model on paired coarse-resolution inputs (~25 km) and high-resolution regional reanalysis targets (~5 km), then apply it in zero-shot manner to deterministic forecasts from various weather models to generate probabilistic high-resolution predictions.

Result: The downscaled forecasts consistently improve upon each model’s raw deterministic forecast, with substantial gains in probabilistic skill (CRPS) across diverse AI-based and numerical weather prediction systems for near-surface variables up to 90-hour lead times.

Conclusion: Diffusion-based downscaling provides an effective, scalable probabilistic interface for operational weather forecasting that enhances both spatial resolution and uncertainty representation without model-specific adaptation.

Abstract: We introduce a universal diffusion-based downscaling framework that lifts deterministic low-resolution weather forecasts into probabilistic high-resolution predictions without any model-specific fine-tuning. A single conditional diffusion model is trained on paired coarse-resolution inputs (~25 km resolution) and high-resolution regional reanalysis targets (~5 km resolution), and is applied in a fully zero-shot manner to deterministic forecasts from heterogeneous upstream weather models. Focusing on near-surface variables, we evaluate probabilistic forecasts against independent in situ station observations over lead times up to 90 h. Across a diverse set of AI-based and numerical weather prediction (NWP) systems, the ensemble mean of the downscaled forecasts consistently improves upon each model’s own raw deterministic forecast, and substantially larger gains are observed in probabilistic skill as measured by CRPS. These results demonstrate that diffusion-based downscaling provides a scalable, model-agnostic probabilistic interface for enhancing spatial resolution and uncertainty representation in operational weather forecasting pipelines.
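
The abstract does not spell out the sampler, but the generic pattern of conditional diffusion downscaling looks roughly like the sketch below, where the coarse forecast is interpolated to the target grid and fed to a denoiser as conditioning; `eps_model`, the grid sizes, and the noise schedule here are all assumptions, not the paper's configuration:

```python
# Generic conditional-diffusion downscaling sketch (not the paper's exact model).
import torch
import torch.nn.functional as F

@torch.no_grad()
def downscale(eps_model, coarse, steps=50, hr_shape=(1, 1, 256, 256)):
    # Interpolate the ~25 km input to the ~5 km target grid as conditioning.
    cond = F.interpolate(coarse, size=hr_shape[-2:], mode="bilinear",
                         align_corners=False)
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(hr_shape)                        # start from pure noise
    for t in reversed(range(steps)):                 # DDPM-style ancestral sampling
        eps = eps_model(torch.cat([x, cond], dim=1), t)
        x = (x - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

class DummyEps(torch.nn.Module):                     # stand-in for the trained denoiser
    def forward(self, xc, t):
        return torch.zeros_like(xc[:, :1])

member = downscale(DummyEps(), torch.randn(1, 1, 64, 64))
print(member.shape)   # repeat with fresh seeds to build the probabilistic ensemble
```

Zero-shot use then amounts to feeding any upstream model's deterministic forecast in as `coarse`, which is what makes the framework model-agnostic.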

[408] HPMixer: Hierarchical Patching for Multivariate Time Series Forecasting

Jung Min Choi, Vijaya Krishna Yalavarthi, Lars Schmidt-Thieme

Main category: cs.LG

TL;DR: HPMixer is a novel architecture for long-term multivariate time series forecasting that decouples periodic patterns and residual dynamics using hierarchical patching, learnable wavelet transforms, and channel mixing.

DetailsMotivation: Effectively capturing both periodic patterns and residual dynamics is essential for accurate long-term multivariate time series forecasting, but existing methods often struggle to model these components in a complementary manner within standard deep learning benchmarks.

Method: Proposes Hierarchical Patching Mixer (HPMixer) with: 1) Periodic component using learnable cycle module with nonlinear channel-wise MLP, 2) Residual component processed through Learnable Stationary Wavelet Transform (LSWT) for frequency-domain representations, 3) Channel-mixing encoder for inter-channel dependencies, and 4) Two-level non-overlapping hierarchical patching mechanism for multi-scale residual variations.

Result: Extensive experiments on standard multivariate benchmarks demonstrate that HPMixer achieves competitive or state-of-the-art performance compared to recent baselines.

Conclusion: HPMixer provides an effective framework for long-term multivariate time series forecasting by integrating decoupled periodicity modeling with structured, multi-scale residual learning.

Abstract: In long-term multivariate time series forecasting, effectively capturing both periodic patterns and residual dynamics is essential. To address this within standard deep learning benchmark settings, we propose the Hierarchical Patching Mixer (HPMixer), which models periodicity and residuals in a decoupled yet complementary manner. The periodic component utilizes a learnable cycle module [7] enhanced with a nonlinear channel-wise MLP for greater expressiveness. The residual component is processed through a Learnable Stationary Wavelet Transform (LSWT) to extract stable, shift-invariant frequency-domain representations. Subsequently, a channel-mixing encoder models explicit inter-channel dependencies, while a two-level non-overlapping hierarchical patching mechanism captures coarse- and fine-scale residual variations. By integrating decoupled periodicity modeling with structured, multi-scale residual learning, HPMixer provides an effective framework. Extensive experiments on standard multivariate benchmarks demonstrate that HPMixer achieves competitive or state-of-the-art performance compared to recent baselines.
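
The patching step is easy to picture in code. A minimal reading of "two-level non-overlapping hierarchical patching" follows; the patch sizes are invented for illustration:

```python
# Two-level non-overlapping patching of a multivariate series (illustrative sizes).
import torch

def hierarchical_patches(x, fine=8, coarse=32):
    """x: (batch, length, channels); length divisible by `coarse`."""
    B, L, C = x.shape
    fine_p = x.reshape(B, L // fine, fine, C)        # fine-scale residual patches
    coarse_p = x.reshape(B, L // coarse, coarse, C)  # coarse-scale residual patches
    return fine_p, coarse_p

x = torch.randn(4, 96, 7)                 # e.g., 96 time steps, 7 variables
fine_p, coarse_p = hierarchical_patches(x)
print(fine_p.shape, coarse_p.shape)       # (4, 12, 8, 7) and (4, 3, 32, 7)
```

Per the abstract, the two scales let the residual branch capture coarse- and fine-grained variations separately, alongside the LSWT frequency-domain features.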

cs.MA

[409] Guiding LLM-Based Human Mobility Simulation with Mobility Measures from Shared Data

Hua Yan, Heng Tan, Yu Yang

Main category: cs.MA

TL;DR: M2LSimu: A mobility measures-guided multi-prompt adjustment framework that uses population-level mobility measures to refine individual-level prompts for realistic human mobility simulation with LLMs.

DetailsMotivation: Current LLM-based human mobility simulation approaches generate individual trajectories independently without population-level coordination, failing to capture collective behaviors. There's a need for frameworks that can coordinate individual-level cognitive processes with population-level mobility patterns.

Method: M2LSimu uses mobility measures derived from shared data as guidance to refine individual-level prompts. It applies coarse-grained adjustment strategies guided by mobility measures, then progressively enables fine-grained individual-level adaptation while satisfying multiple population-level mobility objectives under budget constraints.

Result: M2LSimu significantly outperforms state-of-the-art LLM-based methods on two public datasets for human mobility simulation.

Conclusion: The framework successfully addresses the limitation of independent trajectory generation by incorporating population-level coordination through mobility measures guidance, enabling more realistic simulation of collective human mobility behaviors.

Abstract: Large-scale human mobility simulation is critical for many science domains such as urban science, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility trajectories by modeling individual-level cognitive processes. However, these approaches generate individual mobility trajectories independently, without any population-level coordination mechanism, and thus fail to capture the emergence of collective behaviors. To address this issue, we design M2LSimu, a mobility measures-guided multi-prompt adjustment framework that leverages mobility measures derived from shared data as guidance to refine individual-level prompts for realistic mobility generation. Our framework applies coarse-grained adjustment strategies guided by mobility measures, progressively enabling fine-grained individual-level adaptation while satisfying multiple population-level mobility objectives under a limited budget. Experiments show that M2LSimu significantly outperforms state-of-the-art LLM-based methods on two public datasets.
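
As an illustration of measure-guided adjustment, the sketch below computes one standard population-level mobility measure (radius of gyration) and turns the gap to a target value into a prompt hint; the specific measure, threshold, and adjustment rule are assumptions, not the paper's:

```python
# Population-level mobility measure guiding a coarse prompt adjustment (sketch).
import numpy as np

def radius_of_gyration(traj):
    """traj: (T, 2) array of visited (x, y) locations for one individual."""
    center = traj.mean(axis=0)
    return float(np.sqrt(((traj - center) ** 2).sum(axis=1).mean()))

rng = np.random.default_rng(0)
population = [rng.random((50, 2)) * 10 for _ in range(1000)]   # synthetic trajectories
rg = np.array([radius_of_gyration(t) for t in population])

target_rg_mean = 3.2   # hypothetical population-level value from shared data
if rg.mean() > target_rg_mean:
    prompt_hint = "Prefer destinations closer to your home location."
else:
    prompt_hint = "Occasionally visit locations farther from home."
print(round(float(rg.mean()), 2), "->", prompt_hint)
```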

[410] Self-Evolving Multi-Agent Network for Industrial IoT Predictive Maintenance

Rebin Saleh, Khanh Pham Dinh, Balázs Villányi, Truong-Son Hy

Main category: cs.MA

TL;DR: SEMAS: A self-evolving hierarchical multi-agent system for industrial IoT predictive maintenance that distributes specialized agents across Edge, Fog, and Cloud tiers for real-time anomaly detection with interpretability and low latency.

DetailsMotivation: Industrial IoT predictive maintenance needs real-time anomaly detection that is interpretable and computationally efficient. Traditional static models can't adapt to evolving conditions, while LLM-based monolithic systems are too resource-intensive for edge deployment.

Method: Hierarchical multi-agent system with Edge agents (lightweight feature extraction), Fog agents (ensemble detection with dynamic consensus voting), and Cloud agents (continuous policy optimization via PPO). Includes LLM-based response generation for explainability and federated knowledge aggregation.

Result: Superior anomaly detection performance with exceptional stability under adaptation, sustains accuracy across evolving contexts, and delivers substantial latency improvements enabling genuine real-time deployment on industrial benchmarks (Boiler Emulator and Wind Turbine).

Conclusion: Resource-aware, self-evolving multi-agent coordination is essential for production-ready industrial IoT predictive maintenance under strict latency and explainability constraints.

Abstract: Industrial IoT predictive maintenance requires systems capable of real-time anomaly detection without sacrificing interpretability or demanding excessive computational resources. Traditional approaches rely on static, offline-trained models that cannot adapt to evolving operational conditions, while LLM-based monolithic systems demand prohibitive memory and latency, rendering them impractical for on-site edge deployment. We introduce SEMAS, a self-evolving hierarchical multi-agent system that distributes specialized agents across Edge, Fog, and Cloud computational tiers. Edge agents perform lightweight feature extraction and pre-filtering; Fog agents execute diversified ensemble detection with dynamic consensus voting; and Cloud agents continuously optimize system policies via Proximal Policy Optimization (PPO) while maintaining asynchronous, non-blocking inference. The framework incorporates LLM-based response generation for explainability and federated knowledge aggregation for adaptive policy distribution. This architecture enables resource-aware specialization without sacrificing real-time performance or model interpretability. Empirical evaluation on two industrial benchmarks (Boiler Emulator and Wind Turbine) demonstrates that SEMAS achieves superior anomaly detection performance with exceptional stability under adaptation, sustains prediction accuracy across evolving operational contexts, and delivers substantial latency improvements enabling genuine real-time deployment. Ablation studies confirm that PPO-driven policy evolution, consensus voting, and federated aggregation each contribute materially to system effectiveness. These findings indicate that resource-aware, self-evolving multi-agent coordination is essential for production-ready industrial IoT predictive maintenance under strict latency and explainability constraints.
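
As a small illustration of the Fog tier's consensus step, here is a reliability-weighted vote over ensemble detector scores; the weighting rule is a guess at what "dynamic consensus voting" could look like, not SEMAS's published mechanism:

```python
# Reliability-weighted consensus over ensemble anomaly detectors (sketch).
import numpy as np

def consensus_vote(scores, reliabilities, threshold=0.5):
    """scores: per-detector anomaly scores in [0, 1];
    reliabilities: recent per-detector accuracies used as dynamic weights."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()
    fused = float(np.dot(w, scores))
    return fused > threshold, fused

flag, score = consensus_vote([0.9, 0.2, 0.7], [0.95, 0.60, 0.85])
print(flag, round(score, 3))   # detectors with better recent accuracy count more
```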

[411] AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence

Geunbin Yu

Main category: cs.MA

TL;DR: AdaptOrch: A framework for dynamic multi-agent orchestration that selects optimal coordination topologies (parallel, sequential, hierarchical, hybrid) based on task dependencies, outperforming static approaches even with identical models.

DetailsMotivation: As LLMs from different providers achieve similar benchmark performance, selecting single best models yields diminishing returns. The structural composition of how multiple agents are coordinated (orchestration topology) now dominates system-level performance over individual model capability.

Method: Formal framework with: (1) Performance Convergence Scaling Law formalizing when orchestration outweighs model selection; (2) Topology Routing Algorithm mapping task decomposition DAGs to optimal orchestration patterns in O(|V| + |E|) time; (3) Adaptive Synthesis Protocol with termination guarantees and heuristic consistency scoring for parallel outputs.

Result: Validated across coding (SWE-bench), reasoning (GPQA), and retrieval-augmented generation tasks. Topology-aware orchestration achieves 12-23% improvement over static single-topology baselines using identical underlying models.

Conclusion: Orchestration design is established as a first-class optimization target independent of model scaling, with structural composition of multi-agent coordination now dominating system performance over individual model capability.

Abstract: As large language models from diverse providers converge toward comparable benchmark performance, the traditional paradigm of selecting a single best model per task yields diminishing returns. We argue that orchestration topology – the structural composition of how multiple agents are coordinated, parallelized, and synthesized – now dominates system-level performance over individual model capability. We present AdaptOrch, a formal framework for task-adaptive multi-agent orchestration that dynamically selects among four canonical topologies (parallel, sequential, hierarchical, and hybrid) based on task dependency graphs and empirically derived domain characteristics. Our framework introduces three key contributions: (1) a Performance Convergence Scaling Law, formalizing conditions under which orchestration selection outweighs model selection; (2) a Topology Routing Algorithm that maps task decomposition DAGs to optimal orchestration patterns in O(|V| + |E|) time; and (3) an Adaptive Synthesis Protocol with provable termination guarantees and heuristic consistency scoring for parallel agent outputs. We validate AdaptOrch across coding (SWE-bench), reasoning (GPQA), and retrieval-augmented generation tasks, demonstrating that topology-aware orchestration achieves 12-23% improvement over static single-topology baselines, even when using identical underlying models. Our results establish orchestration design as a first-class optimization target independent of model scaling.
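
A linear-time router of the kind described is straightforward to sketch: one topological pass computes the DAG's depth and width, which then select a topology. The decision thresholds below are invented; only the O(|V| + |E|) pass and the four topology names come from the abstract:

```python
# Illustrative O(|V| + |E|) topology routing over a task-decomposition DAG.
from collections import deque

def route_topology(nodes, edges):
    adj = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    depth = {v: 1 for v in nodes}
    q = deque(v for v in nodes if indeg[v] == 0)
    while q:                                  # Kahn's algorithm: O(|V| + |E|)
        u = q.popleft()
        for w in adj[u]:
            depth[w] = max(depth[w], depth[u] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                q.append(w)
    D = max(depth.values())                   # critical-path length
    W = max(sum(1 for v in nodes if depth[v] == d) for d in range(1, D + 1))
    if D == 1:
        return "parallel"                     # no dependencies at all
    if W == 1:
        return "sequential"                   # a single chain of subtasks
    return "hierarchical" if D <= 3 else "hybrid"   # thresholds are made up

print(route_topology(["a", "b", "c", "d"], [("a", "c"), ("b", "c"), ("c", "d")]))
# -> "hierarchical": two independent subtasks feed a merge node, then a final step
```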

[412] Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

Xuefeng Wang, Lei Zhang, Henglin Pu, Husheng Li, Ahmed H. Qureshi

Main category: cs.MA

TL;DR: Proposes a continuous-time constrained MARL framework using physics-informed neural networks to handle safety constraints in multi-agent systems with irregular time intervals.

DetailsMotivation: Traditional discrete-time MARL struggles with complex multi-agent dynamics in high-frequency or irregular time-interval settings, and existing continuous-time MARL methods rarely handle safety constraints due to discontinuities that make HJB-based learning difficult.

Method: Introduces continuous-time constrained MDP formulation, transforms discrete MDPs into CT-CMDPs via epigraph-based reformulation, and solves using a novel physics-informed neural network (PINN)-based actor-critic method for stable optimization in continuous time.

Result: Demonstrates smoother value approximations, more stable training, and improved performance over safe MARL baselines on continuous-time safe multi-particle environments and safe multi-agent MuJoCo benchmarks.

Conclusion: The proposed CT-CMDP framework with PINN-based actor-critic method effectively handles safety constraints in continuous-time multi-agent systems, providing more robust and stable learning compared to existing approaches.

Abstract: Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to degraded performance and motivating the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve this by proposing a novel physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.
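
For readers unfamiliar with the device, the epigraph reformulation follows a standard pattern (sketched generically here; the paper's continuous-time, multi-agent version carries additional structure): a constrained problem $\min_{\pi} J(\pi)$ subject to $C(\pi) \le 0$ is rewritten with an auxiliary scalar $z$ as $\min_{z} z$ subject to $\min_{\pi} \max(J(\pi) - z, C(\pi)) \le 0$, so the inner policy optimization minimizes a single pointwise-maximum objective rather than enforcing a hard, potentially discontinuous constraint.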

[413] AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation

Siyu Wang, Ruotian Lu, Zhihao Yang, Yuchao Wang, Yanzhou Zhang, Lei Xu, Qimin Xu, Guojun Yin, Cailian Chen, Xinping Guan

Main category: cs.MA

TL;DR: AgentConductor: RL-optimized multi-agent system with LLM-based orchestrator that dynamically generates interaction topologies for code generation tasks based on task difficulty and execution feedback.

DetailsMotivation: Existing multi-agent systems for code generation use fixed interaction topologies that don't adapt to task difficulty or use execution feedback, leading to redundant communication and performance bottlenecks.

Method: Uses reinforcement learning with an LLM-based orchestrator to create dynamic DAG topologies. Features topological density function for communication-aware characterization and difficulty interval partitioning for precise density control.

Result: Achieves state-of-the-art accuracy across competition-level and foundational code datasets, outperforming strongest baseline by up to 14.6% pass@1 accuracy, 13% density reduction, and 68% token cost reduction.

Conclusion: AgentConductor demonstrates that adaptive, feedback-driven topology generation significantly improves multi-agent code generation performance while reducing communication overhead.

Abstract: Large language model (LLM)-driven multi-agent systems (MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition-level code generation. Recent studies demonstrate that carefully designed multi-agent workflows and communication graphs can significantly improve code generation performance by leveraging collaborative reasoning. However, existing methods neither adapt topology density to task difficulty nor iteratively refine the topology within an instance using execution feedback, which leads to redundant communication and performance bottlenecks. To address these issues, we propose AgentConductor: a reinforcement learning-optimized MAS with an LLM-based orchestrator agent as its core, which enables end-to-end feedback-driven dynamic generation of interaction topologies. For each query, AgentConductor infers agent roles and task difficulty, then constructs a task-adapted, density-aware layered directed acyclic graph (DAG) topology, underpinned by two key innovations. First, we design a novel topological density function that captures communication-aware mathematical characterizations of multi-agent interactions. Second, we adopt difficulty interval partitioning to avoid excessive pruning for precise topological density upper bound measurement per difficulty level and finer-grained control. Empirically, across three competition-level and two foundational code datasets, AgentConductor achieves state-of-the-art accuracy, outperforming the strongest baseline by up to 14.6% in pass@1 accuracy, 13% in density reduction, and 68% in token cost reduction.

[414] Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation

Yuhong Luo, Daniel Schoepflin, Xintong Wang

Main category: cs.MA

TL;DR: This paper studies algorithmic collusion risk using a meta-game framework with pretrained policies and in-game adaptation, evaluating RL and LLM-based strategies in repeated pricing games under symmetric/asymmetric cost settings.

DetailsMotivation: The paper addresses the debate around algorithmic collusion threat and regulatory intervention, noting limitations in existing evaluations that rely on long learning horizons, assumptions about counterparty rationality, and symmetry in hyperparameters/economic settings.

Method: Introduces a meta-game design where agents have pretrained policies with different strategic characteristics (competitive, cooperative, collusive). Formulates the problem as selecting a meta-strategy combining initial policy with in-game adaptation rule. Samples normal-form empirical games over meta-strategy profiles, computes game statistics (payoffs, regret), and constructs empirical best-response graphs to analyze strategic relationships.

Result: Evaluates both reinforcement-learning and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, presenting findings on algorithmic collusion feasibility and pricing strategy effectiveness in practical “test-time” environments.

Conclusion: The study provides insights into whether collusion can emerge under rational choices and how agents co-adapt toward cooperation or competition, with implications for understanding algorithmic collusion risks in real-world settings.

Abstract: The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about counterparty rationality in adopting collusive strategies, and symmetry in hyperparameters and economic settings among players. To study collusion risk, we introduce a meta-game design for analyzing algorithmic behavior under test-time constraints. We model agents as possessing pretrained policies with distinct strategic characteristics (e.g., competitive, naively cooperative, robustly collusive), and formulate the problem as selecting a meta-strategy that combines a pretrained, initial policy with an in-game adaptation rule. We seek to examine whether collusion can emerge under rational choices and how agents co-adapt toward cooperation or competition. To this end, we sample normal-form empirical games over meta-strategy profiles, compute relevant game statistics (e.g., payoffs against individuals and regret against an equilibrium mixture of opponents), and construct empirical best-response graphs to uncover strategic relationships. We evaluate both reinforcement-learning and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, and present findings on the feasibility of algorithmic collusion and the effectiveness of pricing strategies in practical “test-time” environments. The source code and the full paper with appendix are available at: https://github.com/chailab-rutgers/CollusionMetagame.
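
The empirical best-response graph at the heart of the method can be built from a sampled payoff table in a few lines; the strategies and payoffs below are fabricated for illustration:

```python
# Constructing an empirical best-response graph from a payoff table (sketch;
# the paper samples richer statistics such as regret against mixtures).
import numpy as np

strategies = ["competitive", "naive_coop", "robust_collusive"]
# payoff[i, j]: row player's mean payoff when playing i against j (made up).
payoff = np.array([[1.0, 1.8, 1.5],
                   [0.6, 2.0, 1.2],
                   [0.9, 2.4, 2.2]])

# Edge j -> i if i is the empirical best response to an opponent playing j.
for j, opponent in enumerate(strategies):
    br = strategies[int(np.argmax(payoff[:, j]))]
    print(f"vs {opponent}: best response = {br}")
# Sinks and cycles of this graph indicate candidate equilibria among
# meta-strategies, e.g., whether collusive play is self-reinforcing.
```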

[415] Fault Tolerant Multi-Agent Learning with Adversarial Budget Constraints

David Mguni, Yaqi Sun, Haojun Chen, Wanrong Yang, Amir Darabi, Larry Olanrewaju Orimoloye, Yaodong Yang

Main category: cs.MA

TL;DR: MARTA is a plug-and-play robustness layer for cooperative multi-agent RL that introduces a Switcher-Adversary mechanism to induce selective agent malfunctions, improving fault tolerance across various MARL domains.

DetailsMotivation: Agent malfunctions are critical failure modes in practical cooperative multi-agent systems but are underexplored in existing MARL theory, creating a need for robust fault-tolerant approaches.

Method: Introduces MARTA with a Switcher-Adversary mechanism that creates a fault-switching (N+2)-player Markov game where the Switcher chooses when/which agent fails and the Adversary controls faulty behavior via random or worst-case policies. Develops a Q-learning-type scheme with provable contraction properties.

Result: MARTA consistently improves robustness across Traffic Junction, Level-Based Foraging, MPE SimpleTag, and SMAC (v2), achieving performance gains up to 116.7% in SMAC, 21.4% in MPE SimpleTag, and 44.6% in LBF while reducing failure rates under mismatched fault regimes.

Conclusion: MARTA provides a theoretically grounded and practically deployable mechanism for fault-tolerant MARL that integrates seamlessly with existing algorithms without architectural modifications.

Abstract: We study robustness to agent malfunctions in cooperative multi-agent reinforcement learning (MARL), a failure mode that is critical in practice yet underexplored in existing theory. We introduce MARTA, a plug-and-play robustness layer that augments standard MARL algorithms with a Switcher-Adversary mechanism which selectively induces malfunctions in performance-critical states. This formulation defines a fault-switching $(N+2)$-player Markov game in which the Switcher chooses when and which agent fails, and the Adversary controls the resulting faulty behaviour via random or worst-case policies. We develop a Q-learning-type scheme and show that the associated Bellman operator is a contraction, yielding existence and uniqueness of the minimax value and convergence to a Markov perfect equilibrium. MARTA integrates seamlessly with MARL algorithms without architectural modification and consistently improves robustness across Traffic Junction (TJ), Level-Based Foraging (LBF), MPE SimpleTag, and SMAC (v2). In these domains, MARTA achieves large gains in final performance of up to 116.7% in SMAC, 21.4% in MPE SimpleTag, and 44.6% in LBF, while significantly reducing failure rates under train-test mismatched fault regimes. These results establish MARTA as a theoretically grounded and practically deployable mechanism for fault-tolerant MARL.

[416] Stigmergic Swarming Agents for Fast Subgraph Isomorphism

H. Van Dyke Parunak

Main category: cs.MA

TL;DR: ASSIST is an ant colony-inspired heuristic algorithm for maximum partial subgraph isomorphism that achieves linear time complexity in query size and constant time in data size for the combinatorial search phase.

DetailsMotivation: The maximum partial subgraph isomorphism problem is NP-complete with exponential complexity in naive approaches, and current heuristics have O(d²) complexity. There's a need for more efficient algorithms that can handle large graphs and support various matching problems like temporally ordered edges and inexact matches.

Method: ASSIST uses ant colony optimization inspired by traveling salesperson solutions. It first performs peering (matching individual nodes) in O(q·log(d)) time, then uses an iterative subgraph search approach where the combinatorial complexity is linear in query size and constant in data size through stigmergy-based optimization.

Result: The algorithm achieves significantly better time complexity than existing heuristics, with the iterative search phase being linear in query size and constant in data size. It also supports extensions for various matching problems that other heuristics struggle with.

Conclusion: ASSIST provides an efficient heuristic for subgraph isomorphism with improved scalability and flexibility for complex matching scenarios, making it suitable for large-scale graph analysis problems.

Abstract: Maximum partial subgraph isomorphism compares two graphs (nodes joined by edges) to find a largest common subgraph. A common use case, for graphs with labeled nodes, seeks to find instances of a \textit{query} graph with $q$ nodes in a (typically larger) \textit{data} graph with $d$ nodes. The problem is NP-complete, and naïve solutions are exponential in $q + d$. The fastest current heuristic has complexity $O(d^2)$. This paper outlines ASSIST (Approximate Swarming Subgraph Isomorphism through Stigmergy), inspired by the ant colony optimization approach to the traveling salesperson. After peering (identifying matching individual nodes in query and data) in time $O(q \cdot \log d)$, the time required for ASSIST’s iterative subgraph search, the combinatorially complex part of the problem, is linear in query size and constant in data size. ASSIST can be extended to support matching problems (such as temporally ordered edges, inexact matches, and missing nodes or edges in the data graph) that frustrate other heuristics.
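
The stigmergic core, evaporation plus quality-weighted reinforcement of a shared pheromone map over query-to-data node pairings, can be sketched as follows; ASSIST's actual peering, construction, and scheduling steps are richer:

```python
# Generic ant-colony (stigmergic) update for subgraph matching (sketch).
def aco_step(candidate_maps, pheromone, quality, rho=0.1):
    """One colony iteration over candidate query->data node assignments."""
    for k in pheromone:                       # evaporation: stale associations fade
        pheromone[k] *= (1.0 - rho)
    for m in candidate_maps:                  # reinforcement by solution quality
        score = quality(m)
        for pair in m.items():
            pheromone[pair] = pheromone.get(pair, 0.0) + score
    return pheromone

# Toy example: two partial assignments of query nodes {q1, q2} to data nodes.
maps = [{"q1": "d3", "q2": "d7"}, {"q1": "d3", "q2": "d9"}]
quality = lambda m: 2 if m.get("q2") == "d7" else 1   # pretend d7 matches more edges
pher = aco_step(maps, {}, quality)
print(pher)   # ('q1', 'd3') accumulates the most pheromone, guiding later ants
```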

cs.MM

[417] CAFE: Channel-Autoregressive Factorized Encoding for Robust Biosignal Spatial Super-Resolution

Hongjun Liu, Leyu Zhou, Zijianghao Yang, Rujun Han, Shitong Duan, Kuanjian Tang, Chao Yao

Main category: cs.MM

TL;DR: CAFE: A plug-and-play rollout generation scheme for spatial super-resolution of biosignals from low-density to high-density montages using progressive geometry-aligned reconstruction.

DetailsMotivation: Real-world biosignal deployments often use low-density montages due to hardware constraints, creating a need for spatial super-resolution methods that avoid artifact propagation and false non-local correlations common in existing approaches.

Method: Progressive rollout generation starting from low-density channels, recovering nearby channels first then expanding to distal regions. Uses step-wise supervision, teacher forcing with scheduled sampling, and autoregressive rollout across channel groups while reusing any temporal backbone as shared predictor.

Result: Evaluated on 4 modalities and 6 datasets, demonstrates plug-and-play generality across 3 backbones (MLP, Conv, Transformer) and achieves consistently better reconstruction than 5 representative baselines.

Conclusion: CAFE provides effective spatial super-resolution for biosignals with plug-and-play compatibility, addressing challenges of artifact propagation and false correlations in sparse measurements.

Abstract: High-density biosignal recordings are critical for neural decoding and clinical monitoring, yet real-world deployments often rely on low-density (LD) montages due to hardware and operational constraints. This motivates spatial super-resolution from LD observations, but heterogeneous dependencies under sparse and noisy measurements often lead to artifact propagation and false non-local correlations. To address this, we propose CAFE, a plug-and-play rollout generation scheme that reconstructs the full montage in geometry-aligned stages. Starting from the LD channels, CAFE first recovers nearby channels and then progressively expands to more distal regions, exploiting reliable local structure before introducing non-local interactions. During training, step-wise supervision is applied over channel groups and teacher forcing with epoch-level scheduled sampling along the group dimension is utilized to reduce exposure bias, enabling parallel computation across steps. At test time, CAFE performs an autoregressive rollout across groups, while remaining plug-and-play by reusing any temporal backbone as the shared predictor. Evaluated on $4$ modalities and $6$ datasets, CAFE demonstrates plug-and-play generality across $3$ backbones (MLP, Conv, Transformer) and achieves consistently better reconstruction than $5$ representative baselines.
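
The geometry-aligned rollout order is easy to make concrete: rank missing channels by distance to the nearest observed low-density channel and split them into autoregressive groups. This is an illustrative reading; CAFE's grouping details are not given in the abstract:

```python
# Near-to-far channel grouping for geometry-aligned rollout (sketch).
import numpy as np

def rollout_groups(positions, ld_idx, n_groups=3):
    """positions: (C, 3) electrode coordinates; ld_idx: observed LD channels."""
    d_to_ld = np.linalg.norm(positions[:, None, :] - positions[ld_idx][None, :, :],
                             axis=-1).min(axis=1)
    missing = np.setdiff1d(np.arange(len(positions)), ld_idx)
    order = missing[np.argsort(d_to_ld[missing])]   # near-to-far generation order
    return np.array_split(order, n_groups)          # autoregressive channel groups

rng = np.random.default_rng(0)
pos = rng.random((32, 3))                           # toy 32-channel montage
groups = rollout_groups(pos, ld_idx=np.array([0, 5, 11, 20]))
print([len(g) for g in groups])                     # e.g., [10, 9, 9]
```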

eess.AS

[418] CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages

Yuma Shirahata, Ryuichi Yamamoto

Main category: eess.AS

TL;DR: CC-G2PnP is a streaming grapheme-to-phoneme and prosody model using Conformer-CTC architecture for real-time text-to-speech with minimal look-ahead, outperforming baselines on Japanese datasets.

DetailsMotivation: To enable streaming connection between large language models and text-to-speech systems by developing a model that can process grapheme tokens chunk-by-chunk for real-time phonemic and prosodic label prediction, particularly for languages without explicit word boundaries like Japanese.

Method: Based on Conformer-CTC architecture, processes input grapheme tokens chunk by chunk with minimal look-ahead. Uses CTC decoder to learn alignment between graphemes and phonemes during training, eliminating dependency on explicit word boundaries.

Result: Significantly outperforms baseline streaming G2PnP model in accuracy of phonemic and prosodic label prediction on Japanese dataset, demonstrating effectiveness for unsegmented languages.

Conclusion: CC-G2PnP provides an effective streaming solution for grapheme-to-phoneme and prosody conversion that works well for languages without explicit word boundaries, enabling real-time text-to-speech applications.

Abstract: We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model to connect large language models and text-to-speech systems in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimal look-ahead size for each input token, the proposed model can consider future context in each token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
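
The chunked streaming pattern with bounded look-ahead reduces to the following skeleton; the model call is a toy stand-in, and the Conformer-CTC internals are not reproduced:

```python
# Chunk-by-chunk streaming with a fixed look-ahead window (sketch).
def stream_chunks(tokens, model_step, chunk=8, lookahead=2):
    """Emit outputs chunk by chunk while peeking `lookahead` future tokens."""
    outputs = []
    for start in range(0, len(tokens), chunk):
        context = tokens[start:start + chunk + lookahead]  # chunk + bounded future
        outputs.extend(model_step(context)[:chunk])        # keep the chunk's outputs
    return outputs

# Toy stand-in for the model: uppercase each grapheme "with context".
demo = stream_chunks(list("kyouhaiitenkidesune"), lambda ctx: [c.upper() for c in ctx])
print("".join(demo))
```

With a bounded look-ahead, latency is capped at `lookahead` tokens while each token still sees some future context, which is what stabilizes the label prediction.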

[419] Array-Aware Ambisonics and HRTF Encoding for Binaural Reproduction With Wearable Arrays

Yhonatan Gayer, Vladimir Tourbabin, Zamir Ben Hur, David Lou Alon, Boaz Rafaely

Main category: eess.AS

TL;DR: Novel method for binaural reproduction from arbitrary microphone arrays using array-aware optimization of Ambisonics encoding with HRTF pre-processing, improving spatial accuracy and perceptual quality.

DetailsMotivation: To address limitations in conventional Ambisonics encoding for binaural reproduction from arbitrary microphone arrays, particularly for wearable arrays and head rotations, by integrating array-specific information into the HRTF processing pipeline.

Method: Array-aware optimization of Ambisonics encoding through Head-Related Transfer Function (HRTF) pre-processing that integrates array-specific information into the HRTF processing pipeline.

Result: Objective evaluations show superior performance under simulated wearable-array and head rotations compared to conventional Ambisonics. Listening experiments confirm significantly higher perceptual ratings in both timbre and spatial quality.

Conclusion: The method offers a practical solution for spatial audio rendering in VR, AR, and wearable audio capture applications while maintaining full compatibility with standard Ambisonics.

Abstract: This work introduces a novel method for binaural reproduction from arbitrary microphone arrays, based on array-aware optimization of Ambisonics encoding through Head-Related Transfer Function (HRTF) pre-processing. The proposed approach integrates array-specific information into the HRTF processing pipeline, leading to improved spatial accuracy in binaural rendering. Objective evaluations demonstrate superior performance under simulated wearable-array and head rotations compared to the conventional Ambisonics encoding method. A listening experiment further confirms that the method achieves significantly higher perceptual ratings in both timbre and spatial quality. Fully compatible with standard Ambisonics, the proposed method offers a practical solution for spatial audio rendering in applications such as virtual reality, augmented reality, and wearable audio capture.

[420] Discrete optimal transport is a strong audio adversarial attack

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan

Main category: eess.AS

TL;DR: kDOT-VC is a voice conversion method using discrete optimal transport for domain adaptation that also serves as an effective adversarial attack against audio anti-spoofing systems.

DetailsMotivation: The paper aims to develop a voice conversion method with strong domain adaptation capabilities while also exploring its potential as an adversarial attack against modern audio anti-spoofing countermeasures.

Method: Uses probabilistic optimal transport to align frame-level WavLM embeddings of generated speech to a bona fide pool via entropic OT and top-k barycentric projection, then decodes with a neural vocoder.

Result: Demonstrates stronger domain adaptation than kNN-VC, SinkVC, and Gaussian OT methods, and shows effectiveness as a black-box adversarial attack against audio anti-spoofing systems.

Conclusion: Distribution-level alignment via optimal transport is a powerful and stable attack method for deployed countermeasures, highlighting vulnerabilities in current audio security systems.

Abstract: In this paper, we introduce the discrete optimal transport voice conversion ($k$DOT-VC) method. Comparison with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of our method. We use the probabilistic nature of optimal transport (OT) and show that $k$DOT-VC is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution-level alignment is a powerful and stable attack for deployed CMs.
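
The alignment step, entropic OT between generated and bona fide embedding pools followed by a top-k barycentric projection, can be sketched in numpy as below; the dimensions, k, and the regularization strength are invented, and the real pipeline operates on WavLM features and decodes with a vocoder:

```python
# Entropic OT alignment with a top-k barycentric projection (minimal sketch).
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals given a cost matrix C (n x m)."""
    n, m = C.shape
    C = C / C.mean()                            # normalize cost scale for stability
    K = np.exp(-C / eps)
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    return u[:, None] * K * v[None, :]          # transport plan P

gen = np.random.randn(100, 64)    # frame-level embeddings of generated speech
bona = np.random.randn(400, 64)   # unpaired bona fide pool
C = ((gen[:, None, :] - bona[None, :, :]) ** 2).sum(-1)
P = sinkhorn(C)

k = 8
aligned = np.empty_like(gen)
for i in range(len(gen)):
    top = np.argsort(P[i])[-k:]                 # keep the k largest transport weights
    w = P[i, top] / P[i, top].sum()
    aligned[i] = w @ bona[top]                  # top-k barycentric projection
# `aligned` frames would then be decoded back to audio with a neural vocoder.
```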

[421] Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

Main category: eess.AS

TL;DR: Resp-Agent: A multimodal system for respiratory auscultation using an active adversarial curriculum agent to address information loss from spectrogram conversion and limited/imbalanced data through modality-weaving diagnoser and flow matching generator.

DetailsMotivation: Two main challenges in deep learning-based respiratory auscultation: (1) inherent information loss when converting audio signals to spectrograms (discards transient acoustic events and clinical context), and (2) limited data availability exacerbated by severe class imbalance.

Method: 1) Active Adversarial Curriculum Agent (Thinker-A²CA) as central controller to identify diagnostic weaknesses and schedule targeted synthesis; 2) Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors; 3) Flow Matching Generator that adapts text-only LLM via modality injection to synthesize hard-to-diagnose samples; 4) Resp-229k benchmark corpus with 229k recordings paired with LLM-distilled clinical narratives.

Result: Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance.

Conclusion: The proposed Resp-Agent system effectively addresses fundamental challenges in respiratory auscultation through multimodal integration, active learning, and data synthesis, providing a robust solution for medical audio analysis with limited/imbalanced data.

Abstract: Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

eess.IV

[422] Structured Analytic Mappings for Point Set Registration

Wei Feng, Tengda Wei, Haiyong Zheng

Main category: eess.IV

TL;DR: Analytic-ICP: A non-rigid point set registration method using multivariate Taylor expansions for smooth deformations with quasi-linear time complexity.

DetailsMotivation: Existing non-rigid registration methods often rely on kernel functions or high-dimensional parameterizations, which can be computationally expensive and lack explicit closed-form representations. There's a need for efficient, structured approaches that can handle smooth deformations with low complexity.

Method: Uses multivariate Taylor expansion of vector-valued functions to construct a structured function space with truncated basis terms. Develops a quasi-Newton optimization algorithm that progressively lifts the identity map into higher-order analytic forms. Embeds this model into an ICP loop using nearest-neighbor correspondences.

Result: Analytic-ICP achieves higher accuracy and faster convergence than classical methods like CPD and TPS-RPM, particularly for small and smooth deformations. The method has quasi-linear time complexity and demonstrates effectiveness on 2D and 3D datasets.

Conclusion: The analytic approximation model provides a unified framework for rigid, affine, and nonlinear deformations with explicit closed-form representation, offering an efficient alternative to kernel-based methods for non-rigid point set registration.

Abstract: We present an analytic approximation model for non-rigid point set registration, grounded in the multivariate Taylor expansion of vector-valued functions. By exploiting the algebraic structure of Taylor expansions, we construct a structured function space spanned by truncated basis terms, allowing smooth deformations to be represented with low complexity and explicit form. To estimate mappings within this space, we develop a quasi-Newton optimization algorithm that progressively lifts the identity map into higher-order analytic forms. This structured framework unifies rigid, affine, and nonlinear deformations under a single closed-form formulation, without relying on kernel functions or high-dimensional parameterizations. The proposed model is embedded into a standard ICP loop – using (by default) nearest-neighbor correspondences – resulting in Analytic-ICP, an efficient registration algorithm with quasi-linear time complexity. Experiments on 2D and 3D datasets demonstrate that Analytic-ICP achieves higher accuracy and faster convergence than classical methods such as CPD and TPS-RPM, particularly for small and smooth deformations.
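
A stripped-down 2-D version of the idea, a closed-form polynomial (truncated Taylor) map refit inside an ICP loop, looks like the following; the paper uses a quasi-Newton lifting scheme over higher-order analytic forms rather than plain least squares:

```python
# Polynomial-basis deformation fit inside an ICP-style loop (rough 2-D sketch).
import numpy as np
from scipy.spatial import cKDTree

def poly_basis(P):
    """Second-order 2-D Taylor basis: [1, x, y, x^2, xy, y^2]."""
    x, y = P[:, 0], P[:, 1]
    return np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)

def analytic_icp(src, dst, iters=20):
    tree = cKDTree(dst)
    cur = src.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)              # nearest-neighbor correspondences
        B = poly_basis(src)
        # Closed-form least-squares map within the structured function space.
        coef, *_ = np.linalg.lstsq(B, dst[idx], rcond=None)
        cur = B @ coef
    return cur, coef

src = np.random.rand(200, 2)
dst = src + 0.05 * np.sin(3 * src)            # small, smooth deformation
aligned, _ = analytic_icp(src, dst)
print(np.abs(aligned - dst).mean())           # residual shrinks across iterations
```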

[423] Is there a relationship between Mean Opinion Score (MOS) and Just Noticeable Difference (JND)?

Jingwen Zhu, Hadi Amirpour, Wei Zhou, Patrick Le Callet

Main category: eess.IV

TL;DR: This paper investigates the relationship between Just Noticeable Difference (JND) and Mean Opinion Score (MOS) for video quality assessment, finding that while MOS values at JND points align with theory, reverse mapping from MOS to JND is ambiguous due to overlapping confidence intervals.

DetailsMotivation: Existing video quality metrics cover broad quality ranges, but premium streaming applications require finer granularity in high-quality scenarios. JND modeling is crucial for perceptual bitrate ladder construction, but the relationship between JND and the widely used MOS remains unclear.

Method: Conducted a Degradation Category Rating (DCR) subjective study based on an existing JND dataset to examine how MOS corresponds to the 75% Satisfied User Ratio (SUR) points of the 1st and 2nd JNDs.

Result: MOS values at JND points generally align with theoretical expectations (e.g., 4.75 for 75% SUR of 1st JND), but reverse mapping from MOS to JND is ambiguous due to overlapping confidence intervals across PVS indices. DCR studies with limited participants may not detect meaningful differences between reference and JND videos.

Conclusion: The relationship between JND and MOS is complex - while forward mapping works, reverse mapping is unreliable due to statistical limitations. This has implications for using MOS-based metrics in high-quality video streaming applications requiring JND-level precision.

Abstract: Evaluating perceived video quality is essential for ensuring high Quality of Experience (QoE) in modern streaming applications. While existing subjective datasets and Video Quality Metrics (VQMs) cover a broad quality range, many practical use cases, especially for premium users, focus on high-quality scenarios requiring finer granularity. Just Noticeable Difference (JND) has emerged as a key concept for modeling perceptual thresholds in these high-end regions and plays an important role in perceptual bitrate ladder construction. However, the relationship between JND and the more widely used Mean Opinion Score (MOS) remains unclear. In this paper, we conduct a Degradation Category Rating (DCR) subjective study based on an existing JND dataset to examine how MOS corresponds to the 75% Satisfied User Ratio (SUR) points of the 1st and 2nd JNDs. We find that while MOS values at JND points generally align with theoretical expectations (e.g., 4.75 for the 75% SUR of the 1st JND), the reverse mapping from MOS to JND is ambiguous due to overlapping confidence intervals across PVS indices. Statistical significance analysis further shows that DCR studies with limited participants may not detect meaningful differences between reference and JND videos.

[424] HybridPrompt: Bridging Generative Priors and Traditional Codecs for Mobile Streaming

Liming Liu, Jiangkai Wu, Haoyang Wang, Peiheng Wang, Zongming Guo, Xinggong Zhang

Main category: eess.IV

TL;DR: HybridPrompt combines generative neural codecs for keyframes with traditional codecs for other frames to achieve real-time 1080p video decoding at 150+ FPS on smartphones, improving perceptual quality while maintaining speed.

DetailsMotivation: Traditional codecs are fast but degrade under low bandwidth, while neural codecs offer better quality but are too slow for real-time mobile playback. The paper aims to combine the speed of traditional codecs with the perceptual quality of neural approaches.

Method: Uses hybrid architecture: generative model for keyframes, traditional codec for other frames. Makes traditional decoding differentiable for end-to-end optimization, using subsequent frames as supervision to align generative keyframes with traditional codec requirements. Includes two-stage generation strategy.

Result: Achieves real-time 1080p decoding at over 150 FPS on commercial smartphones. Outperforms pure neural baselines in speed by orders of magnitude while achieving 8% average LPIPS gain over traditional codecs at 200kbps.

Conclusion: HybridPrompt successfully combines the speed of traditional codecs with the perceptual quality of neural approaches, enabling real-time high-quality video playback on mobile devices through differentiable optimization and hybrid architecture.

Abstract: In Video on Demand (VoD) scenarios, traditional codecs are the industry standard due to their high decoding efficiency. However, they suffer from severe quality degradation under low bandwidth conditions. While emerging generative neural codecs offer significantly higher perceptual quality, their reliance on heavy frame-by-frame generation makes real-time playback on mobile devices impractical. We ask: is it possible to combine the blazing-fast speed of traditional standards with the superior visual fidelity of neural approaches? We present HybridPrompt, the first generative-based video system capable of achieving real-time 1080p decoding at over 150 FPS on a commercial smartphone. Specifically, we employ a hybrid architecture that encodes Keyframes using a generative model while relying on traditional codecs for the remaining frames. A major challenge is that the two paradigms have conflicting objectives: the “hallucinated” details from generative models often misalign with the rigid prediction mechanisms of traditional codecs, causing bitrate inefficiency. To address this, we demonstrate that the traditional decoding process is differentiable, enabling an end-to-end optimization loop. This allows us to use subsequent frames as additional supervision, forcing the generative model to synthesize keyframes that are not only perceptually high-fidelity but also mathematically optimal references for the traditional codec. By integrating a two-stage generation strategy, our system outperforms pure neural baselines by orders of magnitude in speed while achieving an average LPIPS gain of 8% over traditional codecs at 200kbps.

[425] Gaussian surrogates do well on Poisson inverse problems

Alexandra Spitzer, Lorenzo Baldassari, Valentin Derbanot, Ivan Dokmanić

Main category: eess.IV

TL;DR: Analysis of MSE performance of Poisson vs Gaussian surrogate objectives for inverse problems with Poisson-distributed measurements, showing Gaussian surrogates can achieve comparable MSE to Poisson MAP at low dose despite departing from Poisson likelihood.

DetailsMotivation: In imaging inverse problems with Poisson-distributed measurements, objectives are derived from Poisson likelihood but performance is evaluated by MSE. The paper investigates how much Poisson objectives matter for MSE, especially at low dose, and whether Gaussian surrogates can perform comparably.

Method: Theoretical analysis using a stylized diagonal model to study MSE of Poisson and Gaussian surrogate reconstruction objectives under Poisson noise. Examines: 1) unregularized Poisson maximum-likelihood estimator, 2) Poisson MAP with regularization, 3) heteroscedastic quadratic objective (normal approximation of Poisson data), 4) homoscedastic quadratic objective yielding linear estimator. Validates with numerical computed tomography experiments.

Result: Unregularized Poisson maximum-likelihood estimator can incur large MSE at low dose, while Poisson MAP mitigates instability through regularization. Both Gaussian surrogate objectives (heteroscedastic and homoscedastic) can achieve MSE comparable to Poisson MAP in low-dose regime, despite departing from Poisson likelihood. CT experiments confirm conclusions extend beyond theoretical analysis.

Conclusion: Gaussian surrogate objectives can provide comparable MSE performance to Poisson MAP for Poisson-distributed measurements in low-dose regimes, suggesting simpler Gaussian approaches may be sufficient for MSE-based evaluation despite theoretical mismatch with Poisson likelihood.

Abstract: In imaging inverse problems with Poisson-distributed measurements, it is common to use objectives derived from the Poisson likelihood. But performance is often evaluated by mean squared error (MSE), which raises a practical question: how much does a Poisson objective matter for MSE, even at low dose? We analyze the MSE of Poisson and Gaussian surrogate reconstruction objectives under Poisson noise. In a stylized diagonal model, we show that the unregularized Poisson maximum-likelihood estimator can incur large MSE at low dose, while Poisson MAP mitigates this instability through regularization. We then study two Gaussian surrogate objectives: a heteroscedastic quadratic objective motivated by the normal approximation of Poisson data, and a homoscedastic quadratic objective that yields a simple linear estimator. We show that both surrogates can achieve MSE comparable to Poisson MAP in the low-dose regime, despite departing from the Poisson likelihood. Numerical computed tomography experiments indicate that these conclusions extend beyond the stylized setting of our theoretical analysis.
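
Written out for a linear forward model $y \sim \mathrm{Poisson}(Ax)$, the three (unregularized) objectives compared take roughly the following forms; the exact parameterizations and priors are not given in the abstract, and the plug-in weight in the heteroscedastic surrogate is one common choice:

```python
# The three reconstruction objectives for a linear forward model A and counts y.
import numpy as np

def poisson_nll(x, A, y):
    lam = A @ x
    return np.sum(lam - y * np.log(lam))          # Poisson negative log-likelihood

def hetero_quadratic(x, A, y):
    w = 1.0 / np.maximum(y, 1.0)                  # plug-in variance ~ mean (Poisson)
    return 0.5 * np.sum(w * (A @ x - y) ** 2)     # heteroscedastic Gaussian surrogate

def homo_quadratic(x, A, y):
    return 0.5 * np.sum((A @ x - y) ** 2)         # homoscedastic (linear estimator)

rng = np.random.default_rng(0)
A = rng.random((50, 10))
x_true = rng.random(10)
y = rng.poisson(A @ x_true).astype(float)         # low-dose-style count data
x0 = np.full(10, 0.5)
print(poisson_nll(x0, A, y), hetero_quadratic(x0, A, y), homo_quadratic(x0, A, y))
```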

[426] Learning Perceptual Representations for Gaming NR-VQA with Multi-Task FR Signals

Yu-Chih Chen, Michael Wang, Chieh-Dun Wen, Kai-Siang Ma, Avinab Saha, Li-Heng Chen, Alan Bovik

Main category: eess.IV

TL;DR: MTL-VQA is a multi-task learning framework for no-reference video quality assessment of gaming videos that uses full-reference metrics as supervisory signals for pretraining without human labels.

DetailsMotivation: NR-VQA for gaming videos is challenging due to limited human-rated datasets and unique content characteristics like fast motion, stylized graphics, and compression artifacts. There's a need for effective methods that can work with limited labeled data.

Method: Multi-task learning framework that uses full-reference metrics as supervisory signals to learn perceptually meaningful features without human labels for pretraining. Jointly optimizes multiple FR objectives with adaptive task weighting to learn shared representations that transfer effectively to NR-VQA.

Result: Experiments on gaming video datasets show MTL-VQA achieves performance competitive with state-of-the-art NR-VQA methods across both MOS-supervised and label-efficient/self-supervised settings.

Conclusion: The approach demonstrates that multi-task learning with FR metrics as supervisory signals can effectively address NR-VQA challenges for gaming videos, especially in data-scarce scenarios.

Abstract: No-reference video quality assessment (NR-VQA) for gaming videos is challenging due to limited human-rated datasets and unique content characteristics including fast motion, stylized graphics, and compression artifacts. We present MTL-VQA, a multi-task learning framework that uses full-reference metrics as supervisory signals to learn perceptually meaningful features without human labels for pretraining. By jointly optimizing multiple full-reference (FR) objectives with adaptive task weighting, our approach learns shared representations that transfer effectively to NR-VQA. Experiments on gaming video datasets show MTL-VQA achieves performance competitive with state-of-the-art NR-VQA methods across both MOS-supervised and label-efficient/self-supervised settings.
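
One common way to realize adaptive task weighting over several FR regression targets is homoscedastic-uncertainty weighting (Kendall et al.); whether MTL-VQA uses this exact form is not stated, so treat the following as a generic sketch:

```python
# Learned per-task weighting for multi-task FR-metric regression (generic sketch).
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    """Combine per-FR-metric regression losses with learned log-variances."""
    def __init__(self, n_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, preds, targets):
        # preds / targets: lists of tensors, one per FR supervisory signal.
        total = 0.0
        for i, (p, t) in enumerate(zip(preds, targets)):
            mse = torch.mean((p - t) ** 2)
            total = total + torch.exp(-self.log_vars[i]) * mse + self.log_vars[i]
        return total

loss_fn = AdaptiveMultiTaskLoss(n_tasks=3)
preds = [torch.randn(8) for _ in range(3)]
targets = [torch.randn(8) for _ in range(3)]
print(loss_fn(preds, targets))   # noisier tasks are automatically down-weighted
```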

[427] Attention-Enhanced U-Net for Accurate Segmentation of COVID-19 Infected Lung Regions in CT Scans

Amal Lahchim, Lazar Davic

Main category: eess.IV

TL;DR: Modified U-Net with attention mechanisms for COVID-19 lung infection segmentation in CT scans, achieving state-of-the-art performance metrics.

DetailsMotivation: To develop an accurate and robust automated segmentation method for COVID-19 infected lung regions in CT scans, which is crucial for diagnosis, treatment planning, and disease monitoring.

Method: Modified U-Net architecture enhanced with attention mechanisms, combined with data augmentation and postprocessing techniques for improved segmentation of infected lung regions in COVID-19 CT scans.

Result: Achieved Dice coefficient of 0.8658 and mean IoU of 0.8316, outperforming other segmentation methods on the COVID-19 CT dataset.

Conclusion: The proposed method demonstrates superior segmentation performance for COVID-19 lung infections and shows promise for clinical applications, with future work planned for dataset expansion, 3D segmentation, and clinical deployment.

Abstract: In this study, we propose a robust methodology for automatic segmentation of infected lung regions in COVID-19 CT scans using convolutional neural networks. The approach is based on a modified U-Net architecture enhanced with attention mechanisms, data augmentation, and postprocessing techniques. It achieved a Dice coefficient of 0.8658 and mean IoU of 0.8316, outperforming other methods. The dataset was sourced from public repositories and augmented for diversity. Results demonstrate superior segmentation performance. Future work includes expanding the dataset, exploring 3D segmentation, and preparing the model for clinical deployment.
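
For reference, a standard additive attention gate of the kind used in attention-enhanced U-Nets looks like the following; the paper's exact variant is not specified in the abstract:

```python
# Additive attention gate (Attention U-Net style), as a generic reference.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.wg = nn.Conv2d(g_ch, inter_ch, 1)   # gating signal (decoder path)
        self.wx = nn.Conv2d(x_ch, inter_ch, 1)   # skip features (encoder path)
        self.psi = nn.Conv2d(inter_ch, 1, 1)

    def forward(self, g, x):
        a = torch.relu(self.wg(g) + self.wx(x))
        alpha = torch.sigmoid(self.psi(a))       # spatial attention map in [0, 1]
        return x * alpha                         # suppress irrelevant regions

gate = AttentionGate(64, 64, 32)
g = torch.randn(1, 64, 32, 32)                   # assumes g and x share spatial size
x = torch.randn(1, 64, 32, 32)
print(gate(g, x).shape)                          # torch.Size([1, 64, 32, 32])
```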

[428] Adversarial Deep Learning for Simultaneous Segmentation of Ventricular and White Matter Hyperintensities in Clinical MRI

Mahdi Bashiri Bawil, Mousa Shamsi, Abolhassan Shakeri Bavil

Main category: eess.IV

TL;DR: A deep learning framework for simultaneous segmentation of ventricles and white matter hyperintensities in MS patients, distinguishing normal from pathological lesions with improved accuracy and speed.

DetailsMotivation: Current MS diagnosis methods treat brain structures independently, struggle to differentiate normal from pathological hyperintensities, and perform poorly on anisotropic clinical MRI data, creating a need for more accurate and efficient segmentation approaches.

Method: Developed a 2D pix2pix architecture trained on FLAIR scans from 300 MS patients plus MSSEG2016 benchmark data. Compared five architectural variants through systematic ablation with 5-fold cross-validation, integrating adversarial training, attention-weighted discrimination, and adaptive hybrid loss.

Result: Final architecture (V5) achieved mean Dice 0.852±0.004 and HD95 4.87±0.13 mm across all classes. Outperformed six baseline methods, with adversarial training providing the largest single gain (+0.109 Dice). Processing required ~4 seconds per case, up to 36× faster than baselines.

Conclusion: The framework combines adversarial training, attention-weighted discrimination, and adaptive loss scheduling to achieve improved accuracy, clinically relevant lesion differentiation, and computational efficiency suitable for routine clinical workflows.

Abstract: Purpose: Multiple sclerosis (MS) diagnosis requires accurate assessment of white matter hyperintensities (WMH) and ventricular changes on brain MRI. Current methods treat these structures independently, struggle to differentiate normal from pathological hyperintensities, and perform poorly on anisotropic clinical data. We present a deep learning framework that simultaneously segments ventricles and WMH while distinguishing normal periventricular hyperintensities from pathological MS lesions. Methods: We developed a 2D pix2pix architecture trained on FLAIR scans from 300 MS patients combined with the MSSEG2016 benchmark (15 patients). Five architectural variants were compared through systematic ablation using 5-fold cross-validation with patient-level stratification, progressively integrating adversarial training, attention-weighted discrimination, and adaptive hybrid loss. Performance was assessed against six established methods using Dice coefficient, Hausdorff distance, precision, and recall. Results: The final architecture (V5) achieved mean Dice 0.852±0.004 and HD95 4.87±0.13 mm across all classes. Per-class performance: ventricles (Dice 0.907±0.002, HD95 3.00±0.51 mm), abnormal WMH (Dice 0.825±0.009, HD95 4.51±0.32 mm), normal WMH (Dice 0.677±0.007). V5 outperformed all baselines on local data for both ventricle and WMH segmentation. Ablation analysis confirmed adversarial training provided the largest single gain (+0.109 Dice). End-to-end processing required ~4 seconds per case, up to 36× faster than baseline methods. Conclusions: This systematically validated framework combines adversarial training, attention-weighted discrimination, and adaptive loss scheduling to achieve improved accuracy, clinically relevant lesion differentiation, and computational efficiency suitable for routine clinical workflows.
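A pix2pix-style segmentation setup implies a generator objective that mixes a segmentation term with an adversarial term. The sketch below shows one minimal form of such a hybrid loss; the paper's "adaptive hybrid loss" schedule is not described, so the fixed weight `lam` is a placeholder for whatever adaptation rule the authors use.

```python
import torch
import torch.nn as nn

def dice_loss(pred, target, eps=1e-6):
    # pred: softmax probabilities (B, C, H, W); target: one-hot masks (B, C, H, W)
    inter = (pred * target).sum(dim=(2, 3))
    denom = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def generator_loss(seg_probs, target_onehot, disc_fake_logits, lam=0.1):
    """Hybrid segmentation + adversarial objective in the pix2pix style.
    `lam` stands in for the paper's (unspecified) adaptive weighting."""
    adv = nn.functional.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))  # fool the discriminator
    return dice_loss(seg_probs, target_onehot) + lam * adv
```

The discriminator sees (FLAIR, mask) pairs and pushes the generator toward anatomically plausible multi-class maps, which is consistent with the ablation finding that the adversarial term contributes the largest single Dice gain.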

[429] AtlasPatch: Efficient Tissue Detection and High-throughput Patch Extraction for Computational Pathology at Scale

Ahmed Alagha, Christopher Leclerc, Yousef Kotp, Omar Metwally, Calvin Moras, Peter Rentopoulos, Ghodsiyeh Rostami, Bich Ngoc Nguyen, Jumanah Baig, Abdelhakim Khellaf, Vincent Quoc-Huy Trinh, Rabeb Mizouni, Hadi Otrok, Jamal Bentahar, Mahdi S. Hosseini

Main category: eess.IV

TL;DR: AtlasPatch is a scalable framework for whole-slide image preprocessing that combines foundation-model tissue detection with high-throughput patch extraction, achieving high precision and 16× speedup over existing methods.

DetailsMotivation: Whole-slide image preprocessing (tissue detection + patch extraction) is a major bottleneck for scaling computational pathology to large, heterogeneous cohorts. Current methods are slow and lack robustness across varying tissue conditions and artifacts.

Method: Couples foundation-model tissue detection (an efficiently adapted Segment-Anything model) with high-throughput patch extraction. Uses an annotated multi-cohort training set of ~30,000 WSI thumbnails. The framework is open-source, efficiently parallelized, and supports saving extracted patches or streaming them into feature extractors for on-the-fly embedding.

Result: Achieves high precision (0.986) tissue detection robust to varying conditions (brightness, fragmentation, artifacts). Reduces end-to-end WSI preprocessing time by up to 16× versus deep-learning pipelines without degrading downstream task performance.

Conclusion: AtlasPatch provides a scalable, efficient solution for WSI preprocessing that benefits both pathology departments (tissue detection/QC) and AI researchers (dataset creation/model training), with open-source availability for practical deployment.

Abstract: Whole-slide image (WSI) preprocessing, comprising tissue detection followed by patch extraction, is foundational to AI-driven computational pathology but remains a major bottleneck for scaling to large and heterogeneous cohorts. We present AtlasPatch, a scalable framework that couples foundation-model tissue detection with high-throughput patch extraction at minimal computational overhead. Our tissue detector achieves high precision (0.986) and remains robust across varying tissue conditions (e.g., brightness, fragmentation, boundary definition, tissue heterogeneity) and common artifacts (e.g., pen/ink markings, scanner streaks). This robustness is enabled by our annotated, heterogeneous multi-cohort training set of ~30,000 WSI thumbnails combined with efficient adaptation of the Segment-Anything (SAM) model. AtlasPatch also reduces end-to-end WSI preprocessing time by up to 16× versus widely used deep-learning pipelines, without degrading downstream task performance. The AtlasPatch tool is open-source, efficiently parallelized for practical deployment, and supports options to save extracted patches or stream them into common feature-extraction models for on-the-fly embedding, making it adaptable to both pathology departments (tissue detection and quality control) and AI researchers (dataset creation and model training). The AtlasPatch software package is available at https://github.com/AtlasAnalyticsLab/AtlasPatch.
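AtlasPatch's actual interface is documented in its repository; purely as an illustration of the mask-to-patch step such a pipeline performs, the hypothetical helper below tiles a slide and keeps only patches whose footprint in a low-resolution tissue mask (e.g., from a SAM-based detector) exceeds a coverage threshold. All names, defaults, and thresholds here are assumptions, not the AtlasPatch API.

```python
import numpy as np

def patch_coords_from_mask(tissue_mask, wsi_size, patch=256, min_tissue=0.5):
    """Return top-left (x, y) slide coordinates of patches whose footprint
    in the low-resolution mask is at least `min_tissue` fraction tissue.
    `tissue_mask` is a binary NumPy array (1 = tissue) at thumbnail scale."""
    wsi_w, wsi_h = wsi_size
    mh, mw = tissue_mask.shape
    sx, sy = mw / wsi_w, mh / wsi_h  # slide-to-mask scale factors
    coords = []
    for y in range(0, wsi_h - patch + 1, patch):
        for x in range(0, wsi_w - patch + 1, patch):
            # Footprint of this patch in mask space (at least 1x1 pixels).
            r0, r1 = int(y * sy), max(int((y + patch) * sy), int(y * sy) + 1)
            c0, c1 = int(x * sx), max(int((x + patch) * sx), int(x * sx) + 1)
            if tissue_mask[r0:r1, c0:c1].mean() >= min_tissue:
                coords.append((x, y))
    return coords
```

Because the tissue decision is made on the thumbnail-scale mask, full-resolution pixels are only ever read for patches that pass the threshold, which is the kind of saving that makes large speedups over per-patch deep-learning filters plausible.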

Last updated: 2026-03-06