Daily arXiv Papers - 2026-02-18

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

Main category: cs.CL

TL;DR: ZeroSyl is a training-free method to extract syllable boundaries and embeddings from frozen WavLM models using L2 norms of intermediate features, enabling effective speech language modeling without complex pipelines.

Motivation: Pure speech language models face challenges with excessively long sequences from discrete tokens of self-supervised speech encoders. Existing syllable-based methods like Sylber and SyllableLM require complex multi-stage training pipelines, motivating a simpler approach.

Method: ZeroSyl extracts syllable boundaries using L2 norms of features in WavLM’s intermediate layers without any training. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model.
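The pipeline is simple enough to sketch end-to-end. In this minimal numpy sketch the WavLM features are replaced by random stand-ins; placing boundaries at local minima of the per-frame L2 norm and the plain k-means loop are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def segment_by_norm_minima(feats, min_len=2):
    """Place segment boundaries at local minima of the per-frame L2 norm.

    `feats` is a (T, D) array standing in for one intermediate WavLM layer;
    exactly how ZeroSyl turns norms into boundaries is an assumption here.
    """
    norms = np.linalg.norm(feats, axis=1)  # (T,)
    bounds = [0]
    for t in range(1, len(norms) - 1):
        if (norms[t] < norms[t - 1] and norms[t] <= norms[t + 1]
                and t - bounds[-1] >= min_len):
            bounds.append(t)
    bounds.append(len(norms))
    return bounds

def pool_and_quantize(feats, bounds, k=4, iters=20, seed=0):
    """Mean-pool each segment, then discretize with a plain k-means loop."""
    segs = np.stack([feats[a:b].mean(axis=0)
                     for a, b in zip(bounds[:-1], bounds[1:])])
    rng = np.random.default_rng(seed)
    centers = segs[rng.choice(len(segs), size=min(k, len(segs)), replace=False)]
    for _ in range(iters):
        ids = np.argmin(((segs[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(len(centers)):
            if np.any(ids == c):
                centers[c] = segs[ids == c].mean(axis=0)
    return ids  # one discrete token per syllable-like segment

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 8))  # stand-in for real WavLM features
bounds = segment_by_norm_minima(feats)
tokens = pool_and_quantize(feats, bounds)
print(len(bounds) - 1, "segments ->", len(tokens), "tokens")
```

The resulting token sequence is what would feed the downstream language model.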

Result: ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show syllabic units have better scaling behavior for syntactic modeling while finer-grained units benefit lexical tasks.

Conclusion: ZeroSyl provides a simple, training-free method for syllable segmentation that enables effective speech language modeling, demonstrating competitive performance and better scaling properties than existing approaches.

Abstract: Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM’s intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.

Relevance: 9/10

[2] TAC: Timestamped Audio Captioning

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, Justin Salamon

Main category: cs.SD

TL;DR: TAC is a timestamped audio captioner that produces temporally grounded audio descriptions with varying detail, trained on synthetic mixtures to handle polyphonic scenes, and TAC-V extends it to audio-visual descriptions, both serving as semantic bridges for LLMs to achieve SOTA on multimodal understanding benchmarks.

Motivation: Current Large Audio Language Models struggle with overlapping events in complex acoustic scenes, leading to temporally inconsistent captions and hallucinations, necessitating better temporal grounding and detail in audio understanding.

Method: TAC is trained with a synthetic data pipeline constructing challenging dynamic mixtures from real-world audio sources for robust polyphonic learning. TAC-V extends this to audio-visual descriptions. Both serve as semantic bridges for text-only LLMs in TAC→LLM and TAC-V→LLM cascades.

Result: TAC outperforms all competing methods in event detection and dense captioning with low hallucination rate and accurate temporal grounding. TAC→LLM and TAC-V→LLM cascades achieve state-of-the-art scores on audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding benchmarks.

Conclusion: TAC provides temporally grounded audio descriptions that effectively bridge semantic gaps for LLMs, enabling superior multimodal understanding and reasoning across audio and audio-visual domains.

Abstract: Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a “semantic bridge” for a text-only reasoner: a simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascade achieves state-of-the-art scores on benchmarks for both audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning.

Relevance: 9/10

[3] MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin

Main category: cs.SD

TL;DR: MARS-Sep is a reinforcement learning framework for universal sound separation that aligns separation outputs with semantic preferences using a reward model derived from audio-text-vision encoders, improving semantic quality over traditional signal-level optimization.

Motivation: Current sound separation models optimized for low-level signal metrics often produce outputs contaminated with perceptually salient interference from acoustically similar sources, failing to achieve semantic purity. This misalignment between signal metrics and human perception parallels the alignment problem in LLMs.

Method: MARS-Sep reformulates separation as decision making using reinforcement learning. It learns a factorized Beta mask policy steered by a preference reward model derived from progressively-aligned audio-text-vision encoders. The framework uses a stable, clipped trust-region surrogate for optimization and directly incentivizes semantic consistency with query prompts.
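The "stable, clipped trust-region surrogate" is the familiar PPO-style objective. A minimal sketch, assuming per-mask log-likelihoods under the old and new Beta policies and advantages derived from the preference reward model (the paper's exact estimator is not specified here):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped trust-region objective (to be maximized).

    logp_* are per-mask log-likelihoods under the new/old Beta policies;
    `advantage` would come from the multimodal preference reward model.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the minimum keeps large policy updates from being rewarded.
    return np.minimum(unclipped, clipped).mean()

# A ratio far outside the trust region contributes only its clipped value.
obj = clipped_surrogate(np.array([0.0]), np.array([-1.0]), np.array([1.0]))
print(round(float(obj), 3))  # → 1.2
```

With a positive advantage and a likelihood ratio of e ≈ 2.72, the clip at 1 + eps caps the objective, which is what keeps the update inside the trust region.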

Result: Extensive experiments on multiple benchmarks show consistent gains in Text-, Audio-, and Image-Queried separation tasks, with notable improvements in both signal metrics and semantic quality compared to traditional approaches.

Conclusion: The preference alignment perspective effectively addresses the semantic contamination problem in sound separation, demonstrating that RL-based frameworks with multimodal reward models can produce more semantically consistent and perceptually satisfying separation outputs.

Abstract: Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We introduce a preference alignment perspective, analogous to aligning LLMs with human intent. To address this, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized by a stable, clipped trust-region surrogate. The reward, derived from a progressively-aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://github.com/mars-sep/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research

Houping Yue, Zixiang Di, Mei Jiang, Bingdong Li, Hao Hao, Yu Song, Bo Jiang, Aimin Zhou

Main category: cs.CL

TL;DR: EduResearchBench: A comprehensive evaluation platform for educational academic writing that decomposes research workflows into specialized modules and atomic tasks, with a curriculum learning approach and specialized model training.

Motivation: Existing LLM benchmarks lack fine-grained assessments for complex academic research workflows, particularly in educational scholarly writing, where holistic scoring obscures specific capability bottlenecks.

Method: Hierarchical Atomic Task Decomposition (HATD) framework breaks research workflows into 6 specialized modules and 24 atomic tasks; curriculum learning strategy; trained EduWrite model (30B) using 11K high-quality instruction pairs from 55K academic samples.

Result: EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that data quality density and staged training curricula are more decisive than parameter scale in vertical domains.

Conclusion: Specialized evaluation platforms with fine-grained task decomposition and curriculum learning enable better assessment and training of LLMs for academic writing, with domain-specific training outperforming larger general models.

Abstract: While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.

[2] Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal

Main category: cs.CL

TL;DR: Indic-TunedLens is an interpretability framework for multilingual LLMs that learns language-specific affine transformations to better decode intermediate activations for Indian languages, improving over standard methods especially for low-resource, morphologically rich languages.

Motivation: Multilingual LLMs are deployed in linguistically diverse regions like India, but interpretability tools remain English-centric. LLMs often operate in English-centric representation spaces, creating a need for cross-lingual interpretability tools for Indian languages.

Method: Introduces Indic-TunedLens that learns shared affine transformations for each target Indian language. Unlike standard Logit Lens which directly decodes intermediate activations, this method adjusts hidden states for each target language to align them with target output distributions for more faithful decoding.
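The contrast with the standard Logit Lens can be made concrete. A minimal sketch with random stand-ins for the hidden state and unembedding matrix; the per-language affine map (A, b) is the learned component, and the identity transform recovers the plain Logit Lens as a special case:

```python
import numpy as np

def tuned_lens_decode(h, A, b, W_U):
    """Decode an intermediate hidden state through a learned affine map.

    Logit Lens applies the unembedding W_U to h directly; a TunedLens-style
    probe first translates h with an affine transform (here, one per target
    language) so it lands in the final-layer representation space.
    """
    return (A @ h + b) @ W_U  # vocabulary logits

rng = np.random.default_rng(0)
d, v = 4, 6                       # hidden size, vocab size (toy values)
h = rng.normal(size=d)            # stand-in intermediate hidden state
W_U = rng.normal(size=(d, v))     # stand-in unembedding matrix

# With the identity transform, the tuned lens reduces to the Logit Lens.
logits_lens = tuned_lens_decode(h, np.eye(d), np.zeros(d), W_U)
logits_plain = h @ W_U
print(np.allclose(logits_lens, logits_plain))  # → True
```

In practice (A, b) would be trained so that lens logits match the model's final-layer output distribution, which is what makes the decoding more faithful.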

Result: Evaluated on 10 Indian languages using MMLU benchmark, significantly improves over state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Provides insights into layer-wise semantic encoding of multilingual transformers.

Conclusion: Indic-TunedLens offers a novel interpretability framework for Indian languages that addresses the English-centric bias in existing tools, enabling better understanding of how multilingual LLMs process diverse linguistic inputs.

Abstract: Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/AnonymousAccountACL/IndicTunedLens. Our code is available at https://github.com/AnonymousAccountACL/IndicTunedLens.

[3] CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding

Tahir Hussain, Saddam Hussain Khan

Main category: cs.CL

TL;DR: CGRA DeBERTa: A concept-guided residual domain augmentation transformer for theological QA over Islamic Hadith texts, achieving 97.85 EM score with parameter-efficient gating.

Motivation: Accurate QA over classical Islamic texts is challenging due to domain-specific semantics, long context dependencies, and concept-sensitive reasoning. Existing models struggle with theological nuance in Hadith corpora.

Method: Customized DeBERTa transformer with LoRA-based adaptations and residual concept-aware gating. Incorporates theological priors from Islamic Concept Dictionary (12 core terms) via Concept Guided Residual Blocks and importance-weighted attention gating mechanism.
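How such a gate might look: this sketch maps an importance score in [0, 1] linearly into the paper's reported scaling range [1.04, 3.00] and rescales token representations accordingly; the linear mapping and all names are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def concept_gate(token_reprs, importance, lo=1.04, hi=3.00):
    """Amplify concept-bearing tokens by an importance-weighted scale.

    `importance` in [0, 1] would come from attention against the Islamic
    Concept Dictionary; mapping it linearly into the reported range
    [1.04, 3.00] is an assumption made for illustration.
    """
    scale = lo + (hi - lo) * np.clip(importance, 0.0, 1.0)
    return token_reprs * scale[:, None], scale

# Three toy tokens: irrelevant, moderately relevant, core concept term.
reprs = np.ones((3, 4))
gated, scale = concept_gate(reprs, np.array([0.0, 0.5, 1.0]))
print(scale)
```

Tokens deemed irrelevant are left nearly unchanged (scale ≈ 1.04) while core concept terms are amplified up to 3.00, preserving context while strengthening domain-specific representations.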

Result: Achieved 97.85 EM score on 42,591 QA pairs from Sahih al-Bukhari and Sahih Muslim, surpassing BERT (75.87) and DeBERTa (89.77) by significant margins with only ~8% inference overhead.

Conclusion: Presents an efficient, interpretable, and accurate Hadith QA system that preserves theological nuance while maintaining computational efficiency, enabling scalable educational applications.

Abstract: Accurate QA over classical Islamic texts remains challenging due to domain-specific semantics, long context dependencies, and concept-sensitive reasoning. Therefore, a new CGRA DeBERTa, a concept-guided residual domain augmentation transformer framework, is proposed that enhances theological QA over Hadith corpora. The CGRA DeBERTa builds on a customized DeBERTa transformer backbone with lightweight LoRA-based adaptations and a residual concept-aware gating mechanism. The customized DeBERTa embedding block learns global and positional context, while Concept Guided Residual Blocks incorporate theological priors from a curated Islamic Concept Dictionary of 12 core terms. Moreover, the Concept Gating Mechanism selectively amplifies semantically critical tokens via importance-weighted attention, applying differential scaling from 1.04 to 3.00. This design preserves contextual integrity, strengthens domain-specific semantic representations, and enables accurate, efficient span extraction while maintaining computational efficiency. This paper reports the results of training CGRA using a specially constructed dataset of 42,591 QA pairs from the text of Sahih al-Bukhari and Sahih Muslim. While BERT achieved an EM score of 75.87 and DeBERTa one of 89.77, our model scored 97.85 and thus surpassed them by 8.08 on an absolute scale, all while adding approximately 8% inference overhead due to parameter-efficient gating. The qualitative evaluation noted better extraction, discrimination, and theological precision. This study presents Hadith QA systems that are efficient, interpretable, and accurate, and that scale to provide educational materials with the necessary theological nuance.

[4] AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking

Herbert Ullrich, Jan Drchal

Main category: cs.CL

TL;DR: 3rd place AVerImaTeC system combining retrieval-augmented generation with reverse image search for multimodal fact-checking at low cost

Motivation: To create an accessible, cost-effective multimodal fact-checking system that combines textual and visual evidence retrieval with large language model reasoning

Method: Three decoupled modules: 1) textual retrieval via similarity search, 2) image retrieval via API-accessed reverse image search, 3) generation using GPT5.1 with single multimodal LLM call per fact-check

Result: Competitive performance achieving 3rd place in AVerImaTeC shared task with average cost of $0.013 per fact-check using GPT5.1 via OpenAI Batch API

Conclusion: The system provides an accessible starting point for multimodal fact-checking research with reproducible code, published prompts, vector stores, and cost insights

Abstract: In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year’s retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just $0.013 on average using GPT5.1 via OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules - a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1 - which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme’s running costs and directions for further improvement.

[5] ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

Main category: cs.CL

TL;DR: ZeroSyl is a training-free method to extract syllable boundaries and embeddings from frozen WavLM models using L2 norms of intermediate features, enabling effective speech language modeling without complex pipelines.

Motivation: Pure speech language models face challenges with excessively long sequences from discrete tokens of self-supervised speech encoders. Existing syllable-based methods like Sylber and SyllableLM require complex multi-stage training pipelines, motivating a simpler approach.

Method: ZeroSyl extracts syllable boundaries using L2 norms of features in WavLM’s intermediate layers without any training. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model.

Result: ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show syllabic units have better scaling behavior for syntactic modeling while finer-grained units benefit lexical tasks.

Conclusion: ZeroSyl provides a simple, training-free method for syllable segmentation that enables effective speech language modeling, demonstrating competitive performance and better scaling properties than existing approaches.

Abstract: Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM’s intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.

[6] OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel

Main category: cs.CL

TL;DR: LLM agents struggle with opaque tools lacking clear documentation; ToolObserver framework iteratively refines tool documentation using execution feedback to improve performance in environments with underspecified tools.

Motivation: Real-world tools are often opaque with unclear best practices and failure modes, unlike the simple, perfectly documented tools assumed in most benchmarks. The paper aims to study whether LLM agents can improve performance with opaque tools through interaction and documentation refinement.

Method: Created OpaqueToolsBench with three task-oriented environments (general function calling, interactive chess playing, long-trajectory agentic search) providing underspecified tools. Proposed ToolObserver framework that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories.

Result: Existing automatic documentation methods are expensive and unreliable with opaque tools. ToolObserver outperforms existing methods across OpaqueToolsBench datasets, even in hard settings. It’s also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline for test-time tool exploration.

Conclusion: ToolObserver effectively addresses the challenge of opaque tools by enabling LLM agents to learn and refine tool documentation through interaction, significantly improving performance and efficiency in real-world tool-calling scenarios.

Abstract: Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general “search” APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.

[7] Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement

Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin, Dhruv Grewal, Lan Du

Main category: cs.CL

TL;DR: LX is a fine-tuned LLM for extracting 16 consumption emotions and 4 evaluation constructs from consumer text, outperforming GPT-4 and achieving high accuracy on survey responses and reviews.

Motivation: Accurately measuring consumer emotions and evaluations from unstructured text is a core challenge for marketing research, requiring better methods than existing models.

Method: Developed Linguistic eXtractor (LX), a fine-tuned large language model trained on consumer-authored text labeled with self-reported ratings of 16 emotions and 4 evaluation constructs.

Result: LX outperforms GPT-4 Turbo, RoBERTa, and DeepSeek with 81% macro-F1 accuracy on survey responses and >95% accuracy on Amazon/Yelp reviews. Application shows emotions predict product ratings and purchase behavior.

Conclusion: LX establishes new methodological foundation for consumer perception measurement, demonstrating how fine-tuned LLMs can advance marketing research with validated detection of marketing constructs from consumer data.

Abstract: Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice. This study introduces the Linguistic eXtractor (LX), a fine-tuned, large language model trained on consumer-authored text that also has been labeled with consumers’ self-reported ratings of 16 consumption-related emotions and four evaluation constructs: trust, commitment, recommendation, and sentiment. LX consistently outperforms leading models, including GPT-4 Turbo, RoBERTa, and DeepSeek, achieving 81% macro-F1 accuracy on open-ended survey responses and greater than 95% accuracy on third-party-annotated Amazon and Yelp reviews. An application of LX to online retail data, using seemingly unrelated regression, affirms that review-expressed emotions predict product ratings, which in turn predict purchase behavior. Most emotional effects are mediated by product ratings, though some emotions, such as discontent and peacefulness, influence purchase directly, indicating that emotional tone provides meaningful signals beyond star ratings. To support its use, a no-code, cost-free, LX web application is available, enabling scalable analyses of consumer-authored text. In establishing a new methodological foundation for consumer perception measurement, this research demonstrates new methods for leveraging large language models to advance marketing research and practice, thereby achieving validated detection of marketing constructs from consumer data.

[8] Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory

Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li, Jiaxi Zhou, Hualei Wang, Haohua Wang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang

Main category: cs.CL

TL;DR: Mnemis is a novel memory framework for LLMs that combines similarity-based retrieval (System-1) with hierarchical graph-based global reasoning (System-2) to improve memory organization and retrieval.

Motivation: Existing memory retrieval methods (RAG and Graph-RAG) rely primarily on similarity-based mechanisms, which struggle with scenarios requiring global reasoning or comprehensive coverage of all relevant information. There's a need for memory systems that can handle both local similarity and global structural relevance.

Method: Mnemis organizes memory into two structures: a base graph for similarity retrieval (System-1) and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies (System-2 Global Selection). This dual approach combines complementary retrieval routes to find memory items that are both semantically and structurally relevant.
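The two routes can be sketched on a toy memory: System-1 ranks items by cosine similarity, while System-2 walks a small semantic hierarchy top-down. The greedy descent and the toy embeddings are illustrative assumptions, not Mnemis's actual Global Selection procedure:

```python
import numpy as np

def system1_similar(q, items, k=2):
    """System-1: cosine-similarity retrieval over all base-graph items."""
    sims = items @ q / (np.linalg.norm(items, axis=1) * np.linalg.norm(q))
    return {int(i) for i in np.argsort(-sims)[:k]}

def system2_traverse(q, children, emb, node="root"):
    """System-2: top-down descent of the hierarchical graph, following the
    child whose summary embedding best matches the query; returns the leaf."""
    while children.get(node):
        node = max(children[node], key=lambda c: float(emb[c] @ q))
    return {node}

# Toy memory: 4 leaf items under a two-level semantic hierarchy.
items = np.eye(4)
emb = {"root": items.mean(0),
       "topicA": items[:2].mean(0), "topicB": items[2:].mean(0),
       0: items[0], 1: items[1], 2: items[2], 3: items[3]}
children = {"root": ["topicA", "topicB"], "topicA": [0, 1], "topicB": [2, 3]}

q = items[3]  # query most similar to item 3
hits = system1_similar(q, items, k=1) | system2_traverse(q, children, emb)
print(sorted(hits))  # → [3]
```

Taking the union of the two routes is what yields items that are both semantically (System-1) and structurally (System-2) relevant.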

Result: Mnemis achieves state-of-the-art performance on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini, outperforming all compared methods.

Conclusion: The integration of System-1 similarity search with System-2 global reasoning provides a more effective memory framework for LLMs, addressing limitations of purely similarity-based approaches and enabling better handling of complex memory retrieval scenarios.

Abstract: AI Memory, specifically how models organize and retrieve historical messages, is becoming increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, we propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strength from both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.

[9] NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering

Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu, Shuaishuai Cao, Yangchen Zeng, Yuhang Zhang, Xiaojing Du, Chuang Zhao, Kangning Cui, Simon Fong

Main category: cs.CL

TL;DR: NeuroSymActive: A neural-symbolic framework for Knowledge Graph Question Answering that combines differentiable reasoning with active exploration to reduce expensive graph lookups.

Motivation: Current large language models struggle with knowledge-intensive queries requiring structured multi-hop inference. Knowledge graphs provide symbolic grounding but integrating them with neural models is challenging due to inefficiency of naive embedding approaches and costliness of purely symbolic methods.

Method: Combines differentiable neural-symbolic reasoning layer with active, value-guided exploration controller. Uses soft-unification style symbolic modules with neural path evaluator and Monte-Carlo style exploration policy that prioritizes high-value path expansions.
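The value-guided exploration amounts to best-first search over knowledge-graph paths. A sketch with a toy graph and a stand-in path evaluator; the scoring function and the lookup-budget accounting are illustrative assumptions, not the paper's exact controller:

```python
import heapq

def explore(kg, value, start, answer, budget=10):
    """Best-first expansion of KG paths, popping the highest-value frontier
    path first; `value` stands in for the neural path evaluator. Counting
    each expansion against a budget mirrors the goal of fewer graph lookups.
    """
    frontier = [(-value([start]), [start])]  # max-heap via negated values
    lookups = 0
    while frontier and lookups < budget:
        _, path = heapq.heappop(frontier)
        if path[-1] == answer:
            return path, lookups
        lookups += 1  # one graph lookup per node expansion
        for nxt in kg.get(path[-1], []):
            if nxt not in path:  # avoid cycles
                heapq.heappush(frontier, (-value(path + [nxt]), path + [nxt]))
    return None, lookups

# Toy KG and a toy evaluator that prefers paths heading toward the answer.
kg = {"Q": ["a", "b"], "a": ["x"], "b": ["gold"], "x": [], "gold": []}
score = {"Q": 0.1, "a": 0.2, "b": 0.9, "x": 0.1, "gold": 1.0}
path, lookups = explore(kg, lambda p: score[p[-1]], "Q", "gold")
print(path, lookups)  # → ['Q', 'b', 'gold'] 2
```

Because the evaluator steers expansion toward the promising branch, the answer is reached in two lookups instead of exhaustively expanding the whole graph, which is the efficiency claim the paper makes.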

Result: Attains strong answer accuracy on standard KGQA benchmarks while reducing number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.

Conclusion: NeuroSymActive provides an effective modular framework for integrating neural and symbolic reasoning for knowledge-intensive question answering, balancing accuracy with computational efficiency.

Abstract: Large pretrained language models and neural reasoning systems have advanced many natural language tasks, yet they remain challenged by knowledge-intensive queries that require precise, structured multi-hop inference. Knowledge graphs provide a compact symbolic substrate for factual grounding, but integrating graph structure with neural models is nontrivial: naively embedding graph facts into prompts leads to inefficiency and fragility, while purely symbolic or search-heavy approaches can be costly in retrievals and lack gradient-based refinement. We introduce NeuroSymActive, a modular framework that combines a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller for Knowledge Graph Question Answering. The method couples soft-unification style symbolic modules with a neural path evaluator and a Monte-Carlo style exploration policy that prioritizes high-value path expansions. Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.

[10] Far Out: Evaluating Language Models on Slang in Australian and Indian English

Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi

Main category: cs.CL

TL;DR: Evaluation of slang awareness in Indian and Australian English across 7 language models using web-sourced and synthetic datasets reveals performance gaps between discriminative and generative tasks, with Indian English generally outperforming Australian English.

DetailsMotivation: Language models show systematic performance gaps with non-standard language varieties, but their ability to comprehend variety-specific slang remains underexplored for many languages, particularly Indian and Australian English.

Method: Constructed two datasets: WEB (377 web-sourced examples from Urban Dictionary) and GEN (1,492 synthetically generated usages). Evaluated 7 state-of-the-art language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS).

Result: 1) TWS outperforms TWP and TWP* (accuracy: 0.03 → 0.49); 2) WEB dataset performs better than GEN dataset; 3) Indian English tasks outperform Australian English across all models and datasets, with TWS showing largest disparity (0.44 → 0.54).

Conclusion: Reveals fundamental asymmetries between generative and discriminative competencies for variety-specific language, highlighting limitations in slang comprehension despite English being a technologically rich language.

Abstract: Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS versus TWP and TWP*, with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on the WEB versus GEN datasets, with average similarity scores increasing by 0.03 and 0.05 on the TWP and TWP* tasks, respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, average accuracy increasing from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly slang, even in a technologically rich language such as English.

[11] Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing

Main category: cs.CL

TL;DR: A framework using Task-Oriented Flowcharts (TOFs) for end-to-end customer service automation without manual orchestration, featuring decentralized distillation with small language models for privacy and data efficiency.

DetailsMotivation: Current customer service automation approaches either require complex modular systems with extensive agent orchestration or use oversimplified instruction schemas that lack guidance and generalizability. There's a need for orchestration-free, end-to-end automation that can handle procedural knowledge efficiently while addressing data scarcity and privacy concerns.

Method: 1) Define Task-Oriented Flowcharts (TOFs) components and evaluation metrics; 2) Develop cost-efficient flowchart construction algorithm to extract procedural knowledge from service dialogues; 3) Emphasize local deployment of small language models; 4) Propose decentralized distillation with flowcharts to address data scarcity and privacy issues in model training.

Result: Extensive experiments show superior quantitative and application performance compared to strong baselines and market products across various service tasks. The framework demonstrates effectiveness in end-to-end automation without manual intervention.

Conclusion: The TOF framework enables streamlined creation of future service automation through orchestration-free, end-to-end automation. By releasing a web-based system demonstration with case studies, the approach promotes practical adoption while addressing privacy and data efficiency concerns through decentralized distillation with small language models.

Abstract: Customer service automation has seen growing demand within digital transformation. Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability. This paper introduces an orchestration-free framework using Task-Oriented Flowcharts (TOFs) to enable end-to-end automation without manual intervention. We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues. We emphasize local deployment of small language models and propose decentralized distillation with flowcharts to mitigate data scarcity and privacy issues in model training. Extensive experiments validate the effectiveness in various service tasks, with superior quantitative and application performance compared to strong baselines and market products. By releasing a web-based system demonstration with case studies, we aim to promote streamlined creation of future service automation.

[12] Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language

Prathamesh Devadiga, Paras Chopra

Main category: cs.CL

TL;DR: The paper investigates whether LLMs can converse in languages absent from their training data using structured prompts rather than fine-tuning, focusing on Tulu language as a case study.

DetailsMotivation: To explore if LLMs can handle languages with minimal digital presence in training data through prompting techniques alone, addressing the challenge of low-resource languages.

Method: Combines explicit grammar documentation, negative constraints to suppress tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play, evaluated across three LLMs.

Result: Reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy; negative constraints provide consistent improvements (12-18 percentage points), grammar documentation effects vary by model architecture.

Conclusion: Structured prompts can enable LLMs to converse in languages virtually absent from training data, with negative constraints being particularly effective across different model architectures.

Abstract: Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We systematically tackle various challenges posed by absence of training data for Tulu by combining explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis reveals that negative constraints provide consistent improvements (12–18 percentage points), while grammar documentation effects vary by model architecture (8–22 points).

[13] The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

Main category: cs.CL

TL;DR: Vision Wormhole enables text-free communication between heterogeneous LLMs by repurposing VLMs’ visual interface as a universal port for inter-agent telepathy, reducing runtime overhead while maintaining reasoning fidelity.

DetailsMotivation: Current multi-agent systems using LLMs suffer from inefficient discrete text communication with runtime overhead and information loss. Existing latent state transfer methods either require homogeneous architectures or pair-specific translators, limiting scalability across diverse model families.

Method: Proposes Vision Wormhole framework with Universal Visual Codec that maps heterogeneous reasoning traces into shared continuous latent space, injecting them directly into receiver’s visual pathway. Uses hub-and-spoke topology to reduce alignment complexity from O(N²) to O(N) and label-free teacher-student distillation to align visual channel with text reasoning patterns.
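The claimed O(N²)-to-O(N) reduction is just translator counting: pairwise alignment needs one learned translator per ordered sender-receiver pair, while the hub-and-spoke design needs only an adapter into and out of the shared latent space per model. A back-of-the-envelope sketch:

```python
def pairwise_translators(n):
    # One learned translator per ordered sender -> receiver pair: O(N^2).
    return n * (n - 1)

def hub_and_spoke_adapters(n):
    # One encoder into and one decoder out of the shared latent hub
    # per model: O(N).
    return 2 * n
```

For ten heterogeneous models this is 90 pairwise translators versus 20 hub adapters, and each newly added model costs two adapters rather than a translator for every existing peer.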

Result: Extensive experiments across heterogeneous model families (Qwen-VL, Gemma) show Vision Wormhole reduces end-to-end wall-clock time while maintaining reasoning fidelity comparable to standard text-based multi-agent systems.

Conclusion: Vision Wormhole provides an efficient, model-agnostic communication framework for heterogeneous multi-agent systems by leveraging VLMs’ visual interface as a universal port, enabling scalable text-free communication with reduced runtime overhead.

Abstract: Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver’s visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas

[14] Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs

Joonatan Laato, Veera Schroderus, Jenna Kanerva, Jenni Kauppi, Virpi Lummaa, Filip Ginter

Main category: cs.CL

TL;DR: Using LLMs to categorize historical leisure activities from Finnish WWII evacuee interviews for social integration analysis

DetailsMotivation: Historical archives contain rich social data but extracted information (350K activity mentions, 71K unique names) is too granular for quantitative sociological analysis; need categorization framework to study social integration patterns

Method: Developed categorization framework with dimensions (activity type, socialness, regularity, physical demand); created gold-standard annotations; tested LLMs with voting approach across multiple runs; applied best model to label all 350K entities
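The voting step can be sketched as a plain majority vote over the labels produced by repeated model runs (the category names below are hypothetical; ties fall to the first-seen label via `Counter.most_common`):

```python
from collections import Counter

def majority_label(run_labels):
    """Aggregate labels from repeated LLM runs on the same entity by
    majority vote. Counter.most_common is stable, so ties break in
    favor of the label encountered first."""
    counts = Counter(run_labels)
    label, _ = counts.most_common(1)[0]
    return label
```

Applied per categorization dimension (activity type, socialness, regularity, physical demand), this turns noisy single-run outputs into one label per entity.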

Result: Open-weight LLM with voting approach closely matched expert judgments; successfully labeled all 350K entities, creating structured resource for social integration studies

Conclusion: LLMs can effectively apply sociological categorization schemas to historical text at scale, enabling quantitative analysis of social participation patterns from large archives

Abstract: Digitized historical archives make it possible to study everyday social life on a large scale, but the information extracted directly from text often does not directly allow one to answer the research questions posed by historians or sociologists in a quantitative manner. We address this problem in a large collection of Finnish World War II Karelian evacuee family interviews. Prior work extracted more than 350K mentions of leisure time activities and organizational memberships from these interviews, yielding 71K unique activity and organization names – far too many to analyze directly. We develop a categorization framework that captures key aspects of participation (the kind of activity/organization, how social it typically is, how regularly it happens, and how physically demanding it is). We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale. Using a simple voting approach across multiple model runs, we find that an open-weight LLM can closely match expert judgments. Finally, we apply the method to label the 350K entities, producing a structured resource for downstream studies of social integration and related outcomes.

[15] TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li

Main category: cs.CL

TL;DR: TAROT proposes a test-driven, capability-adaptive curriculum reinforcement fine-tuning method for LLM code generation that adapts curriculum difficulty to model capability levels using four-tier test suites.

DetailsMotivation: Current reinforcement fine-tuning approaches for LLM code generation overlook heterogeneous test case difficulty and granularity, leading to imbalanced reward signals and biased gradient updates during training.

Method: TAROT constructs four-tier test suites (basic, intermediate, complex, edge) for each problem, decouples curriculum progression from raw reward scores, and uses capability-conditioned evaluation to select from portfolio of curriculum policies.
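A minimal sketch of the two ingredients, assuming pass/fail results per tier and a scalar capability estimate (both hypothetical; the paper's actual curriculum-policy portfolio is richer than this two-way switch):

```python
TIERS = ["basic", "intermediate", "complex", "edge"]

def tier_pass_rates(results):
    """results: {tier: [bool, ...]} pass/fail per test case in that tier.
    Per-tier rates keep reward signals balanced across difficulty levels."""
    return {tier: sum(passed) / len(passed) for tier, passed in results.items()}

def curriculum_order(capability, threshold=0.5):
    """Capability-conditioned policy selection (illustrative): weaker
    models get the easy-to-hard progression, stronger models get the
    hard-first curriculum, mirroring the paper's reported finding."""
    return list(TIERS) if capability < threshold else list(reversed(TIERS))
```

Decoupling the tier ordering from raw reward scores, as above, is what lets the curriculum be chosen by capability rather than by incidental test-case difficulty.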

Result: Optimal curriculum for RFT in code generation depends on model capability: less capable models benefit from easy-to-hard progression, while more competent models excel with hard-first curriculum. TAROT consistently improves functional correctness and robustness.

Conclusion: TAROT provides reproducible method for adaptive curriculum design tailored to model capability, advancing LLM code generation by addressing test case heterogeneity and enabling stable optimization.

Abstract: Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model’s inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model’s capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.

[16] In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

Main category: cs.CL

TL;DR: LLM agents exhibit systematic source preferences when selecting information, prioritizing some sources over others regardless of content quality, which affects information presentation to users.

DetailsMotivation: While prior work focused on biases in LLM-generated content, this paper investigates how LLMs select and present information from different sources when acting as information agents, hypothesizing that they exhibit systematic latent source preferences.

Method: Conducted controlled experiments on twelve LLMs from six providers using both synthetic and real-world tasks to test for systematic source preferences, examining sensitivity to contextual framing and persistence despite explicit prompting.

Result: Several models consistently exhibited strong and predictable source preferences that were sensitive to contextual framing, could outweigh content influence, and persisted despite explicit prompting to avoid them, explaining phenomena like left-leaning news recommendation skews.

Conclusion: The findings advocate for deeper investigation into the origins of these preferences and mechanisms to provide users with transparency and control over biases in LLM-powered agents.

Abstract: Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms’ back-end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users’ attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors that influence what information LLMs select and present to users. We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), current LLMs exhibit systematic latent source preferences; that is, they prioritize information from some sources over others. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real-world tasks, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left-leaning skew in news recommendations in prior work. Our findings advocate for deeper investigation into the origins of these preferences, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM-powered agents.

[17] Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit

Aswathy Velutharambath, Amelie Wührl

Main category: cs.CL

TL;DR: Introduces Expectation Detection as a novel NLP task using Reddit medical posts, with the RedHOTExpect corpus (4.5K posts) silver-labeled by an LLM for analyzing patient treatment expectations online.

DetailsMotivation: Patient treatment expectations significantly impact treatment success, and online platforms like medical subreddits may reveal expectations patients don't share elsewhere. No prior NLP studies have examined expectations, creating a research gap.

Method: Introduces Expectation Detection task, creates RedHOTExpect corpus of 4.5K Reddit medical posts, uses LLM for silver-labeling with manual validation (~78% accuracy), analyzes linguistic patterns and content of patient expectations.

Result: Optimism and proactive framing are more pronounced in physical/treatment-related illness posts vs mental health contexts; patients mostly discuss benefits rather than negative outcomes; LLM achieves ~78% labeling accuracy.

Conclusion: Expectation Detection is a valuable NLP task with applications in opinion mining and product design; RedHOTExpect corpus enables study of patient expectations from online platforms; findings reveal domain-specific differences in expectation expression.

Abstract: Patients’ expectations towards their treatment have a substantial effect on the treatments’ success. While primarily studied in clinical settings, online patient platforms like medical subreddits may hold complementary insights: treatment expectations that patients feel it is unnecessary or uncomfortable to share elsewhere. Despite this, no studies examine what type of expectations users discuss online and how they express them. Presumably this is because expectations have not been studied in natural language processing (NLP) before. Therefore, we introduce the task of Expectation Detection, arguing that expectations are relevant for many applications, including opinion mining and product design. Subsequently, we present a case study for the medical domain, where expectations are particularly crucial to extract. We contribute RedHOTExpect, a corpus of Reddit posts (4.5K posts) to study expectations in this context. We use a large language model (LLM) to silver-label the data and validate its quality manually (label accuracy ~78%). Based on this, we analyze which linguistic patterns characterize expectations and explore what patients expect and why. We find that optimism and proactive framing are more pronounced in posts about physical or treatment-related illnesses compared to mental-health contexts, and that in our dataset, patients mostly discuss benefits rather than negative outcomes. The RedHOTExpect corpus can be obtained from https://www.ims.uni-stuttgart.de/data/RedHOTExpect

[18] LuxMT Technical Report

Nils Rehlinger

Main category: cs.CL

TL;DR: LuxMT is a Luxembourgish machine translation system based on Gemma 3 27B, fine-tuned for translation from Luxembourgish to French and English, showing strong improvements over baseline and even zero-shot German translation capabilities.

DetailsMotivation: To develop a specialized machine translation system for Luxembourgish, a low-resource language, by leveraging existing multilingual models and creating novel benchmarks and data filtering techniques.

Method: Fine-tuned Gemma 3 27B model using parallel corpus data from LuxAlign (multilingual news articles) and parliamentary transcripts, augmented with Google Translate. Used LuxEmbedder (Luxembourgish sentence embeddings) for data filtering to remove low-equivalence segment pairs.
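The LuxEmbedder-based filtering step amounts to dropping segment pairs whose source and target embeddings are insufficiently similar. A minimal sketch with a stand-in `embed` function and an assumed similarity threshold (both hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, embed, threshold=0.7):
    """Keep only (source, target) segment pairs whose embedded similarity
    clears the threshold; low-equivalence pairs are dropped."""
    return [(s, t) for s, t in pairs if cosine(embed(s), embed(t)) >= threshold]
```

In the real pipeline the sentence embedder is LuxEmbedder and the threshold would be tuned on held-out data; the logic of discarding low-equivalence pairs is the same.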

Result: LuxMT shows strong improvements over Gemma 3 baseline for Luxembourgish to French/English translation, and surprisingly good zero-shot performance for Luxembourgish to German despite no German training data. LuxEmbedder also shows potential as a quality estimation metric with strong correlation to reference-based metrics.

Conclusion: The system demonstrates effective adaptation of large language models for low-resource language translation, with promising zero-shot capabilities and potential for quality estimation, though further research is needed to fully assess the metric’s utility.

Abstract: We introduce LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish (LB) into French (FR) and English (EN). To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-DE using human-translated data from Luci, a tourist magazine about Luxembourg. Training data stems from LuxAlign, a parallel corpus of multilingual Luxembourgish news articles, and LB parliamentary transcripts augmented with Google Translate. We filter the data using LuxEmbedder, LB sentence embeddings, to remove low-equivalence segment pairs. Overall, LuxMT’s results suggest strong improvements over the Gemma 3 baseline, even for translating LB to German (DE), despite the training data not containing any DE. We also explore LuxEmbedder’s potential to be used as a quality estimation metric and find strong correlations with other reference-based metrics. However, we call for further research to fully assess the metric’s utility and advise using it with caution.

[19] Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination

Xiangyan Chen, Yujian Gan, Matthew Purver

Main category: cs.CL

TL;DR: Fine-Refine: A fine-grained refinement framework for reducing hallucinations in dialogue systems by decomposing responses into atomic units, verifying each unit with external knowledge, and iteratively correcting granular errors.

DetailsMotivation: Current LLMs suffer from hallucinations in dialogue systems, producing factually incorrect responses that mislead users and undermine trust. Existing refinement methods operate at the response level, overlooking that single responses may contain multiple verifiable/unverifiable facts.

Method: Proposes Fine-Refine framework that: 1) decomposes responses into atomic units, 2) verifies each unit using external knowledge, 3) assesses fluency via perplexity, and 4) iteratively corrects granular errors.
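The correction loop can be sketched as follows, with `verify` and `correct` as stand-ins for the external-knowledge checker and the unit-level corrector (the decomposition into atomic units and the perplexity-based fluency check are omitted):

```python
def refine(response_units, verify, correct, max_iters=3):
    """Iteratively rewrite only the atomic units that fail verification,
    leaving supported units untouched; response-level rewriting is avoided."""
    units = list(response_units)
    for _ in range(max_iters):
        failed = [i for i, unit in enumerate(units) if not verify(unit)]
        if not failed:  # every unit is supported; stop early
            break
        for i in failed:
            units[i] = correct(units[i])
    return units
```

The granularity is the point: a response containing one unsupported fact among several supported ones gets a targeted fix instead of a full rewrite.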

Result: Evaluated on the HybriDialogue and OpendialKG datasets, Fine-Refine substantially improves factuality, with up to a 7.63-point gain in dialogue fact score and a small trade-off in dialogue quality.

Conclusion: Fine-grained refinement at the atomic unit level is effective for reducing hallucinations in dialogue systems, significantly improving factual accuracy while maintaining reasonable dialogue quality.

Abstract: The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems. Such hallucinations produce factually incorrect responses that may mislead users and undermine system trust. Existing refinement methods for dialogue systems typically operate at the response level, overlooking the fact that a single response may contain multiple verifiable or unverifiable facts. To address this gap, we propose Fine-Refine, a fine-grained refinement framework that decomposes responses into atomic units, verifies each unit using external knowledge, assesses fluency via perplexity, and iteratively corrects granular errors. We evaluate factuality across the HybriDialogue and OpendialKG datasets in terms of factual accuracy (fact score) and coverage (Not Enough Information Proportion), and experiments show that Fine-Refine substantially improves factuality, achieving up to a 7.63-point gain in dialogue fact score, with a small trade-off in dialogue quality.

[20] DependencyAI: Detecting AI Generated Text through Dependency Parsing

Sara Ahmed, Tracy Hammond

Main category: cs.CL

TL;DR: DependencyAI: A simple, interpretable method for detecting AI-generated text using only linguistic dependency relation labels, achieving competitive performance across various settings.

DetailsMotivation: As LLMs become more prevalent, reliable detection of AI-generated text is critical for mitigating risks. Current methods often lack interpretability or require complex neural networks.

Method: Uses only linguistic dependency relation labels (syntactic relationships between words) as features for detection. Analyzes feature importance for interpretability and examines cross-domain generalization.
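A minimal sketch of the feature side: each document is reduced to the relative frequencies of its dependency relation labels, which can then feed any simple classifier (nearest-centroid below; the label vocabulary and centroids are illustrative, and the dependency parse itself would come from an off-the-shelf parser):

```python
from collections import Counter

def dep_features(dep_labels, vocab):
    """Relative frequency of each dependency relation label in a document,
    in a fixed vocabulary order."""
    counts = Counter(dep_labels)
    total = sum(counts.values()) or 1
    return [counts[label] / total for label in vocab]

def nearest_centroid(vec, centroids):
    """Assign the class whose centroid is closest in squared Euclidean
    distance (a deliberately simple, interpretable classifier)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sq_dist(vec, centroids[c]))
```

Because the features are just named syntactic relations, feature-importance inspection directly names the constructions that separate AI-generated from human-written text.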

Result: Achieves competitive performance across monolingual, multi-generator, and multilingual settings. Reveals syntactic structures distinguishing AI from human text. Shows systematic overprediction on unseen domains, indicating generator-specific writing styles affect cross-domain generalization.

Conclusion: Dependency relations alone provide robust signal for AI-generated text detection. DependencyAI establishes a strong linguistically grounded, interpretable, non-neural network baseline for this task.

Abstract: As large language models (LLMs) become increasingly prevalent, reliable methods for detecting AI-generated text are critical for mitigating potential risks. We introduce DependencyAI, a simple and interpretable approach for detecting AI-generated text using only the labels of linguistic dependency relations. Our method achieves competitive performance across monolingual, multi-generator, and multilingual settings. To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text. We also observe a systematic overprediction of certain models on unseen domains, suggesting that generator-specific writing styles may affect cross-domain generalization. Overall, our results demonstrate that dependency relations alone provide a robust signal for AI-generated text detection, establishing DependencyAI as a strong linguistically grounded, interpretable, and non-neural network baseline.

[21] ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng

Main category: cs.CL

TL;DR: ExpertWeaver is a training-free framework that converts pretrained dense models into sparse Mixture-of-Experts (MoE) architectures by leveraging activation patterns in Gated Linear Units (GLU), outperforming existing dense-to-MoE conversion methods.

DetailsMotivation: Training MoEs from scratch is expensive, and existing dense-to-MoE conversion methods break intrinsic activation patterns, leading to suboptimal expert construction. The authors aim to find a more natural way to convert dense models to MoEs by leveraging inherent structures in GLU mechanisms.

Method: ExpertWeaver analyzes fine-grained neural-wise activation patterns in GLU to discover coarse-grained MoE structures. It partitions neurons based on activation patterns into consistently activated universal neurons and dynamically activated specialized neurons, then constructs shared experts and specialized routed experts with layer-adaptive configurations without additional training.
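The neuron partition can be sketched from per-neuron activation frequencies alone (the cutoff value and the further grouping of specialized neurons into routed experts are illustrative assumptions, not the paper's layer-adaptive procedure):

```python
def partition_neurons(activation_freq, universal_cutoff=0.9):
    """Split GLU neurons by how often they activate across a corpus:
    consistently active neurons form the shared (universal) expert, the
    rest become candidates for specialized routed experts."""
    universal = [i for i, f in enumerate(activation_freq) if f >= universal_cutoff]
    specialized = [i for i, f in enumerate(activation_freq) if f < universal_cutoff]
    return universal, specialized
```

No gradient step is needed for this split, which is what makes the overall conversion training-free.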

Result: ExpertWeaver significantly outperforms existing dense-to-MoE conversion methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.

Conclusion: GLU mechanisms provide a natural blueprint for dense-to-MoE conversion, and ExpertWeaver effectively leverages inherent activation patterns to create high-quality sparse MoE architectures from pretrained dense models without additional training.

Abstract: Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches, which use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.

[22] Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite

Tim Fischer, Chris Biemann

Main category: cs.CL

TL;DR: Perspectives is an interactive tool for DH scholars to explore and organize large document collections through aspect-focused clustering with human-in-the-loop refinement, using document rewriting prompts and instruction-based embeddings.

Motivation: Digital Humanities scholars need tools to explore and organize large unstructured document collections to uncover patterns, topics, and sentiments for subsequent analysis, requiring flexible, interactive approaches that incorporate human expertise.

Method: Implements a flexible aspect-focused document clustering pipeline with human-in-the-loop refinement, using document rewriting prompts and instruction-based embeddings to define analytical lenses, plus tools for cluster refinement and embedding model fine-tuning.

Result: Provides an interactive document map that enables DH researchers to uncover topics, sentiments, and other relevant categories in large document collections, facilitating data preparation for in-depth analysis.

Conclusion: Perspectives successfully empowers DH scholars to interactively explore and organize large document collections through human-in-the-loop clustering refinement, bridging computational methods with human analytical expertise.

Abstract: This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives’s interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
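
The aspect-focused pipeline (rewrite through a lens, embed, cluster) can be sketched end to end. Everything here is a stand-in: the rewriting would be an LLM call and the embedder an instruction-tuned model in the real tool; we use a trivial character-count embedding and a tiny k-means just so the flow runs:

```python
import numpy as np

def rewrite(doc: str, aspect: str) -> str:
    # Hypothetical rewriting prompt; in Perspectives an LLM would restate
    # the document through the chosen analytical lens.
    return f"Summarize the {aspect} expressed in: {doc}"

def embed(texts, dim=16, seed=0):
    # Toy stand-in for an instruction-based embedding model: deterministic
    # random projection of byte counts, just to make the pipeline runnable.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((256, dim))
    feats = np.zeros((len(texts), 256))
    for i, t in enumerate(texts):
        for ch in t.encode("utf-8"):
            feats[i, ch] += 1.0
    return feats @ proj

def kmeans(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

docs = ["The harvest failed and prices rose.", "A joyful festival filled the town."]
views = [rewrite(d, "sentiment") for d in docs]
labels = kmeans(embed(views), k=2)
```

The human-in-the-loop part of the tool then amounts to letting the scholar merge, split, or relabel the resulting clusters and feed that back into the embedder.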

[23] jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, Han Xiao

Main category: cs.CL

TL;DR: Novel training regimen combining model distillation with task-specific contrastive loss produces compact, high-performance text embedding models (jina-embeddings-v5-text-small and nano) that exceed/match SOTA for similar sizes, support long texts (32k tokens), multilingual, and robust to truncation/quantization.

Motivation: Current text embedding models for semantic similarity tasks (information retrieval, clustering, classification) are typically trained with contrastive loss functions. The authors aim to develop more effective training methods for small embedding models that outperform purely contrastive or distillation-based approaches alone.

Method: Introduces a novel training regimen that combines model distillation techniques with task-specific contrastive loss. This hybrid approach leverages the strengths of both methods to produce compact, high-performance embedding models.

Result: Resulting models (jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano) exceed or match state-of-the-art performance for models of similar size. They support long texts (up to 32k tokens), are multilingual, and generate embeddings robust to truncation and binary quantization.

Conclusion: The combined distillation + contrastive loss approach is more effective for training small embedding models than using either method alone. The publicly released models (jina-embeddings-v5-text) advance embedding model development and demonstrate practical benefits for real-world applications.

Abstract: Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
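
The combination of distillation and a contrastive objective can be written down in a few lines. The abstract does not give the exact recipe, so the equal weighting `alpha=0.5` and the specific loss forms (MSE to the teacher, InfoNCE for the task) are our assumptions:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(q, pos, negs, tau=0.05):
    # Contrastive part: the query embedding should score its positive
    # above the negatives (standard InfoNCE over cosine similarities).
    logits = np.array([cos(q, pos)] + [cos(q, n) for n in negs]) / tau
    logits -= logits.max()
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

def distill_mse(student, teacher):
    # Distillation part: pull the small model's embedding toward the teacher's.
    return float(((student - teacher) ** 2).mean())

def combined_loss(student_q, teacher_q, pos, negs, alpha=0.5):
    return alpha * distill_mse(student_q, teacher_q) + (1 - alpha) * info_nce(student_q, pos, negs)

q = np.array([1.0, 0.0])
# Student matches the teacher and its positive: loss is near zero.
loss = combined_loss(q, teacher_q=q, pos=q, negs=[np.array([0.0, 1.0])])
```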

[24] Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL

Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu

Main category: cs.CL

TL;DR: SquRL uses reinforcement learning to enable LLMs to dynamically construct adaptive workflows for Text-to-SQL tasks, outperforming static workflow methods especially on complex and out-of-distribution queries.

Motivation: Current Text-to-SQL systems rely on single static workflows, limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to manually select methods, the paper aims to enable systems to adaptively construct workflows at inference time.

Method: Proposes SquRL, a reinforcement learning framework that enhances LLMs’ reasoning capability in adaptive workflow construction. Uses rule-based reward function with two training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency.

Result: Dynamic workflow construction consistently outperforms the best static workflow methods on widely-used Text-to-SQL benchmarks, with especially pronounced gains on complex and out-of-distribution queries.

Conclusion: Optimal dynamic policies consistently outperform static workflows, with performance gains driven by heterogeneity across candidate workflows. Reinforcement learning enables effective adaptive workflow construction for Text-to-SQL tasks.

Abstract: Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs’ reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The codes are available at https://github.com/Satissss/SquRL

[25] Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations

Chaithra Nerella, Chiranjeevi Yarra

Main category: cs.CL

TL;DR: A symptom-specific depression severity estimation framework using speech with symptom-guided cross-attention and emotion-aware representations for clinical screening.

Motivation: Existing depression prediction works treat it as binary or overall severity without symptom-level analysis, limiting clinical relevance. Need for symptom-specific modeling that aligns with clinical questionnaires like PHQ-8.

Method: Uses symptom-guided cross-attention to align PHQ-8 questionnaire items with emotion-aware speech representations. Introduces learnable symptom-specific parameters to control attention distribution sharpness based on how symptoms are expressed over time.

Result: Outperforms prior works on EDAIC dataset. Attention analysis shows higher attention assigned to utterances containing cues related to multiple depressive symptoms, demonstrating interpretability.

Conclusion: Symptom-guided and emotion-aware modeling is important for speech-based depression screening, providing clinically interpretable symptom-level analysis.

Abstract: Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties. However, most existing works treat depression prediction either as a binary label or an overall severity score without explicitly modeling symptom-specific information. This limits their ability to provide symptom-level analysis relevant to clinical screening. To address this, we propose a symptom-specific and clinically inspired framework for depression severity estimation from speech. Our approach uses a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations to identify which segments of a participant’s speech are more important to each symptom. To account for differences in how symptoms are expressed over time, we introduce a learnable symptom-specific parameter that adaptively controls the sharpness of attention distributions. Our results on EDAIC, a standard clinical-style dataset, demonstrate improved performance outperforming prior works. Further, analyzing the attention distributions showed that higher attention is assigned to utterances containing cues related to multiple depressive symptoms, highlighting the interpretability of our approach. These findings outline the importance of symptom-guided and emotion-aware modeling for speech-based depression screening.
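
The symptom-guided cross-attention with a learnable per-symptom sharpness parameter can be sketched directly. The shapes, random features, and the specific `tau` values are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
utterances = rng.standard_normal((5, d))   # emotion-aware features, one row per utterance
symptoms = rng.standard_normal((8, d))     # one query vector per PHQ-8 item

# Learnable per-symptom sharpness (the paper's adaptive parameter);
# these particular values are only for illustration.
tau = np.ones(8)
tau[0] = 5.0                               # symptom 0 attends very sharply

scores = symptoms @ utterances.T / np.sqrt(d)           # (8 symptoms x 5 utterances)
attn = np.stack([softmax(tau[s] * scores[s]) for s in range(8)])
pooled = attn @ utterances                              # symptom-specific speech summaries
```

A larger `tau` concentrates a symptom's attention on fewer utterances; a smaller one spreads it, matching the idea that symptoms differ in how they are expressed over time.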

[26] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

Main category: cs.CL

TL;DR: STAPO addresses RL fine-tuning instability in LLMs by identifying and masking spurious tokens that cause abnormal gradient updates, improving training stability and reasoning performance.

Motivation: Existing RL fine-tuning methods for LLMs suffer from late-stage performance collapse and unstable training due to heuristic techniques like entropy regularization. The paper identifies that instability is driven by a tiny fraction of tokens (0.01%) called spurious tokens that cause abnormally amplified gradient updates.

Method: The authors derive that token-wise policy gradient magnitude is negatively correlated with token probability and local policy entropy. They propose Spurious-Token-Aware Policy Optimization (STAPO) which selectively masks updates from spurious tokens and renormalizes the loss over valid tokens during RL fine-tuning.

Result: STAPO demonstrates superior entropy stability across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, achieving an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.

Conclusion: The paper presents a principled approach to RL fine-tuning stability by addressing the spurious token problem, offering improved training stability and reasoning performance for large language models.

Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
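
The mask-and-renormalize step is easy to see numerically. The toy probabilities and the 1e-3 cutoff below are our assumptions (the paper identifies roughly 0.01% of tokens as spurious via its gradient analysis, not a fixed probability threshold):

```python
import numpy as np

# Per-token probabilities for one *correct* response under the current policy.
probs = np.array([0.9, 0.85, 0.0001, 0.7, 0.95])
advantage = 1.0                          # sequence-level reward shared by every token
per_token_loss = -advantage * np.log(probs)

# One rare, near-zero-probability token dominates the update even though it
# contributed little to the reasoning outcome.
spurious = probs < 1e-3
valid = ~spurious

naive_loss = per_token_loss.mean()                        # inflated by the rare token
masked_loss = per_token_loss[valid].sum() / valid.sum()   # STAPO-style mask + renormalize
```

Here the single spurious token contributes -log(1e-4) ≈ 9.2 to the naive average, so masking it shrinks the loss (and hence the gradient magnitude) by an order of magnitude.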

[27] LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models

Ahmed Khaled Khamis, Hesham Ali

Main category: cs.CL

TL;DR: NileTTS: First publicly available Egyptian Arabic TTS dataset (38 hours) created via LLM-generated synthetic pipeline, with fine-tuned XTTS v2 model for improved dialectal speech synthesis.

Motivation: Address the severe under-resourcing of Egyptian Arabic in TTS research, as most Arabic TTS resources focus on Modern Standard Arabic and Gulf dialects despite Egyptian Arabic being the most widely understood Arabic dialect.

Method: Created 38-hour dataset using synthetic pipeline: LLMs generate Egyptian Arabic content → audio synthesis tools convert to speech → automatic transcription & speaker diarization → manual quality verification. Fine-tuned XTTS v2 multilingual TTS model on this dataset.

Result: Produced first publicly available Egyptian Arabic TTS dataset, reproducible synthetic data generation pipeline, and open-source fine-tuned model that outperforms baseline models trained on other Arabic dialects.

Conclusion: Successfully addresses Egyptian Arabic TTS resource gap with novel synthetic data approach, providing valuable resources for advancing dialectal speech synthesis research.

Abstract: Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic – the most widely understood Arabic dialect – severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLM) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.

[28] Revisiting Northrop Frye’s Four Myths Theory with Large Language Models

Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado

Main category: cs.CL

TL;DR: A computational framework mapping Jungian archetypes to character functions across Frye’s four narrative genres, validated using LLMs to analyze character-role correspondences in literature.

Motivation: To address the gap in computational approaches to Northrop Frye's narrative theory, which have focused on narrative patterns rather than character functions, by developing a character-based framework that complements existing pattern-based analysis.

Method: Derived four universal character functions (protagonist, mentor, antagonist, companion) from Jungian archetype theory, specialized into sixteen genre-specific roles across Frye’s four genres (comedy, romance, tragedy, satire). Validated using six state-of-the-art LLMs on 40 narrative works with 160 positive and 30 negative samples.

Result: LLMs achieved mean balanced accuracy of 82.5% with strong inter-model agreement (Fleiss’ κ = 0.600). Performance varied by genre (72.7% to 89.9%) and role (52.5% to 99.2%), with qualitative analysis showing variations reflect genuine narrative properties like functional distribution in romance and archetypal subversion in satire.

Conclusion: The character-based approach demonstrates LLM-supported methods’ potential for computational narratology and provides foundation for narrative generation and interactive storytelling applications.

Abstract: Northrop Frye’s theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than character functions. In this paper, we present a new character function framework that complements pattern-based analysis by examining how archetypal roles manifest differently across Frye’s genres. Drawing on Jungian archetype theory, we derive four universal character functions (protagonist, mentor, antagonist, companion) by mapping them to Jung’s psychic structure components. These functions are then specialized into sixteen genre-specific roles based on prototypical works. To validate this framework, we conducted a multi-model study using six state-of-the-art Large Language Models (LLMs) to evaluate character-role correspondences across 40 narrative works. The validation employed both positive samples (160 valid correspondences) and negative samples (30 invalid correspondences) to evaluate whether models both recognize valid correspondences and reject invalid ones. LLMs achieved substantial performance (mean balanced accuracy of 82.5%) with strong inter-model agreement (Fleiss’ κ = 0.600), demonstrating that the proposed correspondences capture systematic structural patterns. Performance varied by genre (ranging from 72.7% to 89.9%) and role (52.5% to 99.2%), with qualitative analysis revealing that variations reflect genuine narrative properties, including functional distribution in romance and deliberate archetypal subversion in satire. This character-based approach demonstrates the potential of LLM-supported methods for computational narratology and provides a foundation for future development of narrative generation methods and interactive storytelling applications.
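
The headline metric, balanced accuracy over valid (positive) and invalid (negative) correspondences, is the mean of per-class recall, so the 160/30 class imbalance does not reward a model that simply accepts everything. A minimal implementation:

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall: each class contributes equally regardless
    # of how many samples it has.
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A model that accepts every correspondence scores only 0.5 on an
# imbalanced set (perfect recall on positives, zero on negatives):
always_accept = balanced_accuracy([1] * 16 + [0] * 3, [1] * 19)
```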

[29] A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

Meirav Segal, Noa Linder, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo

Main category: cs.CL

TL;DR: A content-based framework for designing cyber refusal policies in LLMs that explicitly models offense-defense tradeoffs rather than relying on intent or offensive classification.

Motivation: Current LLM refusal systems for cybersecurity tasks use broad topic bans or offensive-focused taxonomies, leading to inconsistent decisions, over-restriction of legitimate defenders, and brittleness under obfuscation or request segmentation.

Method: Introduces a content-based framework with five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in technical substance rather than stated intent.

Result: The framework resolves inconsistencies in current frontier model behavior and enables organizations to construct tunable, risk-aware refusal policies.

Conclusion: Effective cyber refusal requires explicitly modeling offense-defense tradeoffs through content-based analysis rather than relying on intent or offensive classification alone.

Abstract: Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
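
A tunable policy over the five dimensions might look like the sketch below. The paper defines the dimensions but not a scoring rule, so the 0-1 scales, the weights, and the threshold are all our assumptions:

```python
from dataclasses import dataclass

@dataclass
class CyberRequest:
    # The framework's five content-based dimensions, scored here on 0-1.
    offensive_action_contribution: float
    offensive_risk: float
    technical_complexity: float
    defensive_benefit: float
    legitimate_frequency: float   # expected frequency among legitimate users

def should_refuse(r: CyberRequest, threshold: float = 0.5) -> bool:
    # Hypothetical tunable rule: refuse when offense outweighs defense by
    # more than the organization's chosen threshold.
    offense = 0.5 * r.offensive_action_contribution + 0.5 * r.offensive_risk
    defense = 0.5 * r.defensive_benefit + 0.5 * r.legitimate_frequency
    return offense - defense > threshold

# A common, defender-useful request (e.g. explaining a patched CVE)...
benign = CyberRequest(0.3, 0.2, 0.4, 0.9, 0.9)
# ...versus a high-risk request with little defensive value.
risky = CyberRequest(0.95, 0.9, 0.8, 0.1, 0.05)
refuse_benign = should_refuse(benign)
refuse_risky = should_refuse(risky)
```

Because the trade-off is explicit, an organization can tighten or loosen the policy by moving `threshold` or the weights, rather than editing a topic blocklist.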

[30] Rethinking Metrics for Lexical Semantic Change Detection

Roksana Goworek, Haim Dubossarsky

Main category: cs.CL

TL;DR: New semantic change detection metrics AMD and SAMD outperform traditional APD and PRT for lexical semantic change detection using contextualized embeddings.

Motivation: Current lexical semantic change detection (LSCD) relies heavily on limited metrics like Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT), despite advances in contextualized language model embeddings. There’s a need for more robust semantic change metrics.

Method: Introduces two new measures: Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD) that quantify semantic change via local correspondence between word usages across time periods. Evaluates these across multiple languages, encoder models, and representation spaces.

Result: AMD provides more robust performance, particularly under dimensionality reduction and with non-specialized encoders, while SAMD excels with specialized encoders. Both outperform traditional APD and PRT metrics.

Conclusion: LSCD can benefit from alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualized embedding-based analysis of semantic change over time.

Abstract: Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
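
Taking the metric names at face value (our reading; the paper gives the exact definitions), AMD averages each usage's distance to its nearest neighbour in the other time period, and SAMD symmetrizes the two directions:

```python
import numpy as np

def amd(X, Y):
    # Average Minimum Distance: for each usage embedding in period X, the
    # cosine distance to its nearest neighbour in period Y, averaged over X.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    dist = 1.0 - Xn @ Yn.T            # pairwise cosine distances
    return dist.min(axis=1).mean()

def samd(X, Y):
    # Symmetric variant: average the two directions.
    return 0.5 * (amd(X, Y) + amd(Y, X))
```

Identical usage sets score 0, while a word whose usages move into a new region of the embedding space scores higher; unlike APD, only local nearest-neighbour correspondences contribute, which is what makes the measure less sensitive to within-period spread.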

[31] Causal Effect Estimation with Latent Textual Treatments

Omri Feldman, Amar Venugopal, Jann Spiess, Amir Feder

Main category: cs.CL

TL;DR: An end-to-end pipeline for generating and estimating causal effects of latent textual interventions using sparse autoencoders for hypothesis generation/steering and covariate residualization for robust causal estimation.

Motivation: Understanding causal effects of text on downstream outcomes is important for many applications, but estimating these effects requires controlled experiments with systematic textual variation. While LLMs can generate text, producing and evaluating controlled variation needs careful attention due to text inherently conflating treatment and covariate information.

Method: 1) Hypothesis generation and steering via sparse autoencoders (SAEs) to induce variation in target features; 2) Robust causal estimation addressing bias from text conflating treatment and covariate information; 3) Covariate residualization solution to mitigate estimation error.

Result: The pipeline effectively induces variation in target features and mitigates estimation error. Naive estimation suffers significant bias, but the proposed covariate residualization solution provides robust causal effect estimation in text-as-treatment settings.

Conclusion: The end-to-end pipeline provides a robust foundation for causal effect estimation in text-as-treatment settings by addressing both computational and statistical challenges through SAE-based hypothesis generation and covariate residualization for bias mitigation.

Abstract: Understanding the causal effects of text on downstream outcomes is a central task in many applications. Estimating such effects requires researchers to run controlled experiments that systematically vary textual features. While large language models (LLMs) hold promise for generating text, producing and evaluating controlled variation requires more careful attention. In this paper, we present an end-to-end pipeline for the generation and causal estimation of latent textual interventions. Our work first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation. Our pipeline addresses both computational and statistical challenges in text-as-treatment experiments. We demonstrate that naive estimation of causal effects suffers from significant bias as text inherently conflates treatment and covariate information. We describe the estimation bias induced in this setting and propose a solution based on covariate residualization. Our empirical results show that our pipeline effectively induces variation in target features and mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.
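
The bias from conflated treatment and covariate information, and why residualization removes it, can be shown in a toy simulation with made-up coefficients (the residual-on-residual regression follows the standard Frisch-Waugh logic; the data here is not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
covariate = rng.standard_normal(n)                       # e.g. topic/style in the text
treatment = 0.8 * covariate + rng.standard_normal(n)     # latent feature, confounded
outcome = 2.0 * treatment + 3.0 * covariate + 0.1 * rng.standard_normal(n)

# Naive regression of outcome on treatment absorbs the covariate's effect:
naive = (treatment @ outcome) / (treatment @ treatment)

def residualize(v, z):
    # Remove the part of v explained by z (no-intercept OLS projection).
    return v - z * (z @ v) / (z @ z)

# Residualize both sides on the covariate, then regress residual on residual:
t_res = residualize(treatment, covariate)
y_res = residualize(outcome, covariate)
effect = (t_res @ y_res) / (t_res @ t_res)               # close to the true 2.0
```

The naive slope lands well above 2 because the covariate leaks through the treatment, while the residualized estimate recovers the true effect.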

[32] Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero

Main category: cs.CL

TL;DR: LLMs (GPT-4 variants and Mistral) achieve competitive performance on lemmatization and POS-tagging for four low-resource historical languages in few-shot/zero-shot settings, serving as effective annotation aids despite challenges with complex morphology.

Motivation: Low-resource languages present persistent challenges for NLP tasks like lemmatization and POS-tagging, and this paper investigates whether recent LLMs can address these tasks for historically and linguistically diverse under-resourced languages.

Method: Evaluated GPT-4 variants and open-weight Mistral models on lemmatization and POS-tagging for four languages (Ancient Greek, Classical Armenian, Old Georgian, Syriac) using few-shot and zero-shot settings with a novel benchmark of aligned training and out-of-domain test corpora, comparing against PIE (task-specific RNN baseline).

Result: LLMs without fine-tuning achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings, though challenges persist for languages with complex morphology and non-Latin scripts.

Conclusion: LLMs are a credible and relevant option for initiating linguistic annotation tasks in data-scarce scenarios, serving as effective aids for annotation despite remaining challenges with morphologically complex languages.

Abstract: Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
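
A few-shot annotation prompt for this setting might be assembled as below. The paper's actual prompts and tagset are not given in the abstract, so the template, field names, and example tokens are all hypothetical:

```python
def few_shot_prompt(examples, target, language="Ancient Greek"):
    # Hypothetical few-shot prompt builder for joint lemmatization and
    # POS-tagging of an under-resourced historical language.
    header = (f"You are an expert annotator of {language}.\n"
              "For each token, give its LEMMA and POS tag.\n\n")
    shots = "".join(f"Token: {t}\nLemma: {l}\nPOS: {p}\n\n"
                    for t, l, p in examples)
    return header + shots + f"Token: {target}\nLemma:"

prompt = few_shot_prompt([("λόγου", "λόγος", "NOUN"),
                          ("ἐστίν", "εἰμί", "VERB")], "θεοῦ")
```

In the zero-shot setting the `examples` list is simply empty, leaving only the instruction header and the target token.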

[33] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos

Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé

Main category: cs.CL

TL;DR: FineMuSe: A multimodal Spanish dataset for fine-grained sexism detection with hierarchical taxonomy, showing multimodal LLMs perform competitively with humans but struggle with co-occurring visual sexist types.

Motivation: Online sexism detection is challenging due to its various forms, and existing automated tools are often limited to binary classification, missing subtle manifestations due to lack of fine-grained, context-sensitive labels.

Method: Created FineMuSe dataset with multimodal content (text+images) in Spanish, developed hierarchical taxonomy covering sexism forms, non-sexism, irony/humor, and evaluated various LLMs for binary and fine-grained sexism detection.

Result: Multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism, but struggle to capture co-occurring sexist types when conveyed through visual cues.

Conclusion: The study demonstrates the value of multimodal fine-grained datasets for sexism detection and highlights both the potential and limitations of multimodal LLMs in understanding complex visual-textual interactions in sexist content.

Abstract: Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.

[34] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé

Main category: cs.CL

TL;DR: ChartEditBench: A benchmark for evaluating MLLMs’ ability to handle multi-turn, incremental chart editing through code, focusing on sustained context-aware interactions rather than one-shot generation.

DetailsMotivation: Current MLLMs perform well on single-turn chart generation but lack evaluation for real-world exploratory data analysis where users iteratively refine visualizations through multi-turn interactions requiring context maintenance, prior edit tracking, and adaptation to evolving preferences.

Method: Introduces ChartEditBench with 5,000 difficulty-controlled modification chains and a human-verified subset. Proposes evaluation framework combining execution-based fidelity checks, pixel-level visual similarity, and logical code verification to overcome LLM-as-a-Judge limitations.

Result: Experiments show state-of-the-art MLLMs degrade substantially in multi-turn settings due to error accumulation and breakdowns in shared context. They perform well on stylistic edits but frequently fail on data-centric transformations.

Conclusion: ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming, highlighting the need for MLLMs to better support sustained, context-aware interactions in real-world data analysis scenarios.

Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
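The three evaluation signals described above can be sketched roughly as follows. Everything here is an illustrative stand-in reconstructed from the abstract: the mean-absolute-difference similarity, the threshold, and the all-signals-must-pass rule are assumptions, not the benchmark's actual implementation.

```python
def pixel_similarity(img_a, img_b):
    """Similarity in [0, 1] between two equal-size grayscale grids
    (lists of rows, values 0-255), via mean absolute pixel difference."""
    flat_a = [v for row in img_a for v in row]
    flat_b = [v for row in img_b for v in row]
    mean_diff = sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)
    return 1.0 - mean_diff / 255.0

def evaluate_edit(rendered_ok, rendered_img, reference_img,
                  code_checks_pass, threshold=0.95):
    """Toy composite verdict: an edit passes only if the edited chart code
    executed, the render is visually close to the reference, and logical
    checks on the code succeed."""
    if not rendered_ok:          # execution-based fidelity check
        return False
    visual_ok = pixel_similarity(rendered_img, reference_img) >= threshold
    return visual_ok and code_checks_pass
```

Requiring all three signals to agree is one plausible way to keep an LLM judge out of the loop; the paper's exact aggregation may differ.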

[35] ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution

Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral, Vivek Gupta

Main category: cs.CL

TL;DR: Current multimodal LLMs struggle with fine-grained attribution/citation in structured data, showing poor evidence localization despite moderate QA accuracy.

DetailsMotivation: Multimodal LLMs can answer questions about structured data but lack transparency about answer sources; users need to know which specific rows/columns support answers for trust and traceability.

Method: Evaluated several mLLMs across different table formats (Markdown, JSON, images) and prompting strategies, measuring both question answering accuracy and evidence attribution accuracy.

Result: Clear gap between QA and attribution: moderate QA accuracy but near-random attribution for JSON inputs; models better at citing rows than columns; struggle more with textual formats than images; notable differences across model families.

Conclusion: Current mLLMs are unreliable for fine-grained attribution in structured data, limiting their use in applications requiring transparency and traceability.

Abstract: Multimodal Large Language Models (mLLMs) are often used to answer questions about structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
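The row-vs-column attribution gap can be measured with a simple recall-style score over cited cells. This sketch is illustrative only; the benchmark's exact metric is not specified in the abstract.

```python
def attribution_scores(predicted, gold):
    """Recall of cited evidence at cell, row, and column granularity.

    predicted, gold: sets of (row, column) pairs marking cited cells.
    Separate row and column scores mirror the paper's finding that models
    cite rows more reliably than columns.
    """
    pred_rows = {r for r, _ in predicted}
    gold_rows = {r for r, _ in gold}
    pred_cols = {c for _, c in predicted}
    gold_cols = {c for _, c in gold}

    def recall(pred, ref):
        return len(pred & ref) / len(ref) if ref else 1.0

    return {
        "cell": recall(predicted, gold),
        "row": recall(pred_rows, gold_rows),
        "col": recall(pred_cols, gold_cols),
    }
```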

[36] *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

Main category: cs.CL

TL;DR: The paper introduces *-PLUIE, a task-specific prompting variant of ParaPLUIE that uses perplexity-based evaluation for text quality assessment without text generation, achieving better correlation with human judgments at lower computational cost.

DetailsMotivation: Current LLM-as-a-judge methods for evaluating generated text quality are computationally expensive and require post-processing. The authors aim to create a more efficient alternative that maintains accuracy while reducing computational overhead.

Method: Builds upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over Yes/No answers without generating text. Introduces *-PLUIE, which uses task-specific prompting variants to improve alignment with human judgment while maintaining low computational cost.

Result: Experiments show that personalized *-PLUIE achieves stronger correlations with human ratings compared to baseline methods while maintaining significantly lower computational costs.

Conclusion: *-PLUIE provides an efficient alternative to traditional LLM-as-a-judge methods for text quality evaluation, offering better human alignment with reduced computational requirements through task-specific prompting and perplexity-based scoring.

Abstract: Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
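The core trick of a perplexity-based Yes/No judge is to read the model's next-token distribution instead of generating text. A minimal sketch, with a plain dict of log-probabilities standing in for a real LM's next-token output (the renormalization over just the two continuations is an assumption about the metric's details):

```python
import math

def yes_no_score(logprobs):
    """Turn next-token log-probabilities into a Yes-confidence in [0, 1]
    by renormalizing over only the "Yes" and "No" continuations.
    `logprobs` stands in for a real LM's next-token log-prob lookup;
    no text is generated and no post-processing is needed."""
    p_yes = math.exp(logprobs["Yes"])
    p_no = math.exp(logprobs["No"])
    return p_yes / (p_yes + p_no)
```

A single forward pass suffices per judged example, which is where the computational savings over a generate-then-parse judge come from.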

[37] Avey-B

Devang Acharya, Mohammad Hammoud

Main category: cs.CL

TL;DR: Reformulated Avey architecture for encoder-only paradigm with innovations in parameterization, normalization, and compression, outperforming Transformer encoders on token classification and information retrieval with better long-context scaling.

DetailsMotivation: Compact pretrained bidirectional encoders are crucial for industrial NLP under tight compute/memory constraints. While self-attention in BERT-style architectures provides good bidirectional contextualization, the authors aim to explore attention-free alternatives like Avey for the encoder-only paradigm to potentially improve efficiency and scalability.

Method: Reformulate Avey for encoder-only paradigm with several architectural innovations: 1) decoupled static and dynamic parameterizations, 2) stability-oriented normalization, and 3) neural compression techniques.

Result: The reformulated architecture consistently outperforms four widely used Transformer-based encoders on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

Conclusion: Attention-free architectures like the reformulated Avey can provide competitive or superior performance to Transformer-based encoders while offering better computational efficiency and scalability, particularly for long-context scenarios in industrial NLP applications.

Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention’s ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

[38] Differentiating Between Human-Written and AI-Generated Texts Using Automatically Extracted Linguistic Features

Georgios P. Georgiou

Main category: cs.CL

TL;DR: Study compares linguistic features between human-written and ChatGPT-generated essays, finding significant differences across phonological, morphological, syntactic, and lexical components despite AI’s apparent mimicry of human speech.

DetailsMotivation: To systematically quantify and compare linguistic features between human-written and AI-generated language, assessing AI's ability to emulate human writing, as prior research has focused on ChatGPT but lacked systematic linguistic analysis.

Method: Used human-authored essays as benchmark, prompted ChatGPT to generate equivalent-length essays, then analyzed both using Open Brain AI computational tool to extract measures of phonological, morphological, syntactic, and lexical constituents.

Result: Found significant differences across multiple linguistic features including specific types of consonants, nouns, adjectives, pronouns, adjectival/prepositional modifiers, and use of difficult words, despite AI-generated texts appearing to mimic human speech.

Conclusion: Highlights importance of automated tools for efficient language assessment and emphasizes need for enhanced AI training methodologies to improve production of more human-like text.

Abstract: While extensive research has focused on ChatGPT in recent years, very few studies have systematically quantified and compared linguistic features between human-written and artificial intelligence (AI)-generated language. This exploratory study aims to investigate how various linguistic components are represented in both types of texts, assessing the ability of AI to emulate human writing. Using human-authored essays as a benchmark, we prompted ChatGPT to generate essays of equivalent length. These texts were analyzed using Open Brain AI, an online computational tool, to extract measures of phonological, morphological, syntactic, and lexical constituents. Despite AI-generated texts appearing to mimic human speech, the results revealed significant differences across multiple linguistic features such as specific types of consonants, nouns, adjectives, pronouns, adjectival/prepositional modifiers, and use of difficult words, among others. These findings underscore the importance of integrating automated tools for efficient language assessment, reducing time and effort in data analysis. Moreover, they emphasize the necessity for enhanced training methodologies to improve the engineering capacity of AI for producing more human-like text.

[39] Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs

HaoYuan Hu, Mingcong Lu, Di Luo, XinYa Wu, Jiangcai Zhu, Taoye Yin, Zheng Li, Hao Wang, Shusheng Zhang, KeZun Zhang, KaiLai Shao, Chao Chen, Feng Wang

Main category: cs.CL

TL;DR: ISM introduces a novel masking scheme that combines bidirectional attention for context understanding with unidirectional attention for generation, enabling efficient multi-turn dialogue processing without sacrificing inference speed.

DetailsMotivation: Current LLMs struggle with multi-turn dialogues and context-intensive tasks. Prefix LLMs with bidirectional attention understand context better but are inefficient at inference due to KV-cache issues, while causal LLMs are fast but lack contextual understanding.

Method: Proposes Intermittent Semi-working Mask (ISM) - a parameter-free masking scheme that alternates bidirectional attention over query segments with unidirectional attention over answer segments, preserving global causality while enabling contextual understanding.

Result: ISM outperforms causal baselines on multi-turn dialogue and context-intensive tasks like mathematical reasoning, while maintaining inference latency comparable to standard causal LLMs and eliminating training triplet expansion.

Conclusion: ISM provides an efficient solution for context-intensive tasks by combining the strengths of bidirectional and unidirectional attention, making it practical for real-world deployment without sacrificing performance or speed.

Abstract: Multi-turn dialogues and context-intensive tasks challenge Large Language Models (LLMs) to integrate long histories without sacrificing generation quality. Although prefix LLMs can better exploit historical context via bidirectional attention on prefix tokens, they are rarely used in practice because multi-turn training requires many duplicated triplets, and the bidirectional prefix prevents KV-cache reuse at inference time, driving up cost and latency. To retain the contextual understanding of the prefix mask while preserving the inference-time efficiency of the causal mask, we introduce Intermittent Semi-working Mask (ISM), a masking scheme that injects sparse bidirectional attention into the causal backbone. ISM alternates bidirectional attention over query segments with unidirectional attention over answer segments, enabling the synthesis of in-context information while preserving global causality. This design eliminates triplet expansion during training and maintains KV-cache reuse during inference, yielding latency comparable to standard causal LLMs. ISM is architecture-agnostic and parameter-free, adding only minimal latency. Across extensive evaluations, ISM outperforms causal baselines not only on multi-turn dialogue, but also on context-intensive tasks like mathematical reasoning.
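The alternating mask pattern can be made concrete with a small builder. This is a reconstruction from the abstract, not the authors' code: the segment encoding and the exact within-segment rules are assumptions.

```python
def ism_mask(segments):
    """Build a boolean attention mask (mask[i][j] = True means token i may
    attend to token j) for a sequence of (kind, length) segments, where
    kind is "query" or "answer". Query tokens attend bidirectionally
    within their segment; answer tokens attend causally. Every token may
    attend to all earlier segments, and no token sees a later segment,
    so global causality across segments is preserved."""
    spans, start = [], 0
    for kind, length in segments:
        spans.append((kind, start, start + length))
        start += length
    n = start
    mask = [[False] * n for _ in range(n)]
    for kind, s, e in spans:
        for i in range(s, e):
            for j in range(s):          # all tokens in earlier segments
                mask[i][j] = True
            if kind == "query":          # bidirectional inside the query
                for j in range(s, e):
                    mask[i][j] = True
            else:                        # causal inside the answer
                for j in range(s, i + 1):
                    mask[i][j] = True
    return mask
```

Because answer segments stay causal, previously computed keys/values never need recomputation, which is consistent with the KV-cache reuse claim.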

[40] Human-like Affective Cognition in Foundation Models

Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman

Main category: cs.CL

TL;DR: Foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) demonstrate human-like or superior affective cognition abilities in understanding emotions, appraisals, expressions, and outcomes, often matching or exceeding human interparticipant agreement.

DetailsMotivation: To evaluate how well modern AI foundation models understand affective cognition - the relationships between appraisals, emotions, expressions, and outcomes - compared to human capabilities, given that emotional understanding is fundamental to human interaction.

Method: Created an evaluation framework based on psychological theory with 1,280 diverse scenarios exploring affective cognition relationships. Tested foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and 567 humans across carefully selected conditions, including chain-of-thought reasoning.

Result: Foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are “superhuman” - better predicting modal human judgments than the average human. All models benefit from chain-of-thought reasoning.

Conclusion: Foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior, demonstrating sophisticated affective cognition capabilities comparable to or exceeding human performance.

Abstract: Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and perform a variety of other affective inferences. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman" – they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.

[41] Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, Colleen Waickman

Main category: cs.CL

TL;DR: A psychiatry-focused clinical reasoning dataset for evaluating medical language models on nuanced decision-making tasks while assessing demographic bias.

DetailsMotivation: Current medical LM benchmarks oversimplify clinical practice, especially in psychiatry where fairness and bias issues are critical. Existing datasets rely on multiple-choice board exam questions and miss the daily ambiguities and complexities of mental healthcare delivery.

Method: Created an expert-annotated dataset spanning five domains of mental healthcare decision-making (treatment, diagnosis, documentation, monitoring, triage) without LM assistance. Designed with demographic variables (age, ethnicity, gender) that can be systematically manipulated to study bias. Includes preference datasets for ambiguous cases with multiple valid answers.

Result: Dataset enables systematic evaluation of 16 off-the-shelf and 6 health fine-tuned LMs on task accuracy, fairness impact of demographics, and consistency of free-form responses compared to human annotations.

Conclusion: Provides a more realistic benchmark for medical LMs in psychiatry that captures clinical reasoning complexities and enables systematic bias evaluation, addressing gaps in current medical LM evaluation.

Abstract: Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S.-centric dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables, e.g., for age or ethnicity, and are available for male, female, or non-binary-coded patients. This design enables systematic evaluations of model performance and bias by studying how demographic factors affect decision-making. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating sixteen off-the-shelf and six (mental) health fine-tuned LMs on category-specific task accuracy, on the fairness impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human-annotated samples.
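The dataset's demographic-variable design lends itself to a simple templating sweep: hold the clinical content fixed and instantiate every demographic combination, then compare model answers across variants. The template text, field names, and values below are invented for illustration, not drawn from the dataset.

```python
from itertools import product

def expand_demographics(template, variables):
    """Instantiate one templated clinical vignette per combination of
    demographic variable values, so that answers to otherwise identical
    questions can be compared across demographics."""
    keys = list(variables)
    return [template.format(**dict(zip(keys, combo)))
            for combo in product(*(variables[k] for k in keys))]

vignettes = expand_demographics(
    "A {age}-year-old {gender} patient reports low mood and poor sleep.",
    {"age": ["25", "70"], "gender": ["male", "female", "non-binary"]},
)
```

Any answer difference across the resulting variants is, by construction, attributable to the swapped demographic variables, which is the bias probe the dataset is designed to support.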

[42] Can Multimodal LLMs Perform Time Series Anomaly Detection?

Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu

Main category: cs.CL

TL;DR: MLLMs for time series anomaly detection via visual-textual reasoning, with benchmark and multi-agent framework

DetailsMotivation: Existing TSAD methods oversimplify real-world scenarios; MLLMs' potential for visual-textual anomaly detection remains unexplored despite human-like visualization capabilities

Method: Built VisualTimeAnomaly benchmark for comprehensive zero-shot evaluation, then proposed TSAD-Agents multi-agent framework with scanning, planning, detection, and checking agents that dynamically switch modalities and tools

Result: Study reveals key insights about MLLMs for TSAD; proposed framework enables automatic TSAD through synergistic agent collaboration

Conclusion: MLLMs show promise for TSAD through visual-textual reasoning; multi-agent framework provides systematic approach for automatic anomaly detection

Abstract: Time series anomaly detection (TSAD) has been a long-standing pillar problem in Web-scale systems and online infrastructures, such as service reliability monitoring, system fault diagnosis, and performance optimization. While large language models (LLMs) have demonstrated unprecedented capabilities in time series analysis, the potential of multimodal LLMs (MLLMs), particularly vision-language models, in TSAD remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. It motivates our research question: Can multimodal LLMs perform time series anomaly detection? Existing studies often oversimplify the problem by treating point-wise anomalies as special cases of range-wise ones or by aggregating point anomalies to approximate range-wise scenarios. They limit our understanding of realistic scenarios such as multi-granular anomalies and irregular time series. To address the gap, we build a VisualTimeAnomaly benchmark to comprehensively investigate zero-shot capabilities of MLLMs for TSAD, progressively from point-, range-, to variate-wise anomalies, and extends to irregular sampling conditions. Our study reveals several key insights into MLLMs for TSAD. Built on these findings, we propose an MLLMs-based multi-agent framework, TSAD-Agents, to achieve automatic TSAD. Our framework comprises scanning, planning, detection, and checking agents that synergistically collaborate to reason, plan, and self-reflect to enable automatic TSAD. These agents adaptively invoke tools such as traditional methods and MLLMs and dynamically switch between text and image modalities to optimize detection performance.

[43] The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

Main category: cs.CL

TL;DR: ToRR benchmark evaluates LLMs on table reasoning and robustness across 10 datasets and multiple table formats, revealing brittle performance and the importance of multi-format testing.

DetailsMotivation: Model performance on tabular data is underexplored, creating uncertainty about which models and prompt configurations to use for table-related tasks.

Method: Created ToRR benchmark with 10 datasets covering different table reasoning capabilities across domains, testing models across multiple table representation formats.

Result: Revealed brittle model behavior where even strong models fail to perform robustly on tabular data; no single table format consistently better but multi-format testing crucial.

Conclusion: Table understanding and reasoning remain significant challenges; testing across multiple formats and prompts is essential for reliable capability estimation.

Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
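The multi-format robustness test hinges on serializing one table many ways and checking whether answers stay consistent. A minimal sketch of two of the common representations (the helper names and example data are illustrative, not from the benchmark):

```python
import json

def to_markdown(header, rows):
    """Render a table as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)

def to_json(header, rows):
    """Render the same table as a JSON list of row objects."""
    return json.dumps([dict(zip(header, row)) for row in rows])

header = ["city", "population"]
rows = [["Oslo", 709000], ["Bergen", 286000]]
markdown_table = to_markdown(header, rows)
json_table = to_json(header, rows)
```

Prompting a model with both `markdown_table` and `json_table` (and further formats such as HTML or CSV) and comparing its answers is one way to operationalize the paper's finding that multi-format testing is needed for reliable capability estimates.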

[44] MedPlan: A Two-Stage RAG-Based System for Personalized Medical Plan Generation

Hsin-Ling Hsu, Cong-Tinh Dao, Luning Wang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Chun-Chieh Liao, Pengfei Hu, Xiaoxue Han, Chih-Ho Hsu, Dongsheng Luo, Wen-Chih Peng, Feng Liu, Fang-Ming Hung, Chenwei Wu

Main category: cs.CL

TL;DR: A novel framework that structures LLM reasoning for clinical treatment planning using SOAP methodology, featuring two-stage assessment and treatment generation with patient-specific context.

DetailsMotivation: Current LLM approaches for EHR focus on assessment rather than treatment planning, lack sequential reasoning like clinicians, miss patient-specific historical context, and fail to distinguish subjective vs objective clinical information.

Method: Introduces a framework using SOAP methodology with two-stage architecture: first generates clinical assessment from patient symptoms and objective data, then formulates structured treatment plan informed by assessment and enriched with patient-specific information through retrieval-augmented generation.

Result: Comprehensive evaluation shows the method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.

Conclusion: The framework successfully structures LLM reasoning to align with real-life clinician workflows, addressing critical limitations in current EHR treatment planning systems.

Abstract: Despite recent success in applying large language models (LLMs) to electronic health records (EHR), most systems focus primarily on assessment rather than treatment planning. We identify three critical limitations in current approaches: they generate treatment plans in a single pass rather than following the sequential reasoning process used by clinicians; they rarely incorporate patient-specific historical context; and they fail to effectively distinguish between subjective and objective clinical information. Motivated by the SOAP methodology (Subjective, Objective, Assessment, Plan), we introduce MedPlan, a novel framework that structures LLM reasoning to align with real-life clinician workflows. Our approach employs a two-stage architecture that first generates a clinical assessment based on patient symptoms and objective data, then formulates a structured treatment plan informed by this assessment and enriched with patient-specific information through retrieval-augmented generation. Comprehensive evaluation demonstrates that our method significantly outperforms baseline approaches in both assessment accuracy and treatment plan quality.
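The two-stage SOAP flow can be sketched as a short pipeline. The `generate` and `retrieve` callables, the prompt wording, and the toy stand-ins below are all illustrative assumptions; they mirror the described flow, not the authors' implementation.

```python
def medplan_style_plan(subjective, objective, generate, retrieve):
    """Two-stage SOAP-style flow: first draft the Assessment from the
    Subjective and Objective sections, then draft the Plan from that
    assessment plus retrieved patient-specific context (RAG)."""
    assessment = generate(
        f"Subjective: {subjective}\nObjective: {objective}\n"
        "Write the assessment.")
    context = retrieve(assessment)   # patient history relevant to the assessment
    plan = generate(
        f"Assessment: {assessment}\nPatient history: {context}\n"
        "Write the plan.")
    return assessment, plan

# Toy stand-ins so the flow runs end to end without a real LLM or retriever.
assessment, plan = medplan_style_plan(
    "fatigue, increased thirst", "HbA1c 8.1%",
    generate=lambda prompt: ("assessment-draft"
                             if prompt.endswith("Write the assessment.")
                             else "plan-draft"),
    retrieve=lambda query: "two prior visits; metformin started 2023",
)
```

The key structural point is that the plan prompt never sees the raw inputs alone: it is conditioned on the first-stage assessment plus retrieved history, matching the sequential reasoning the paper argues clinicians use.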

[45] Don’t Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Yuehan Qin, Shawn Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao, Xuezhe Ma

Main category: cs.CL

TL;DR: A retrieval-based framework that proactively identifies and addresses false premises in user queries before LLM generation to reduce hallucinations without requiring model logits or extensive fine-tuning.

DetailsMotivation: LLMs can produce hallucinated outputs when user queries contain false premises that contradict established facts. Existing approaches are computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting real-time efficiency.

Method: A retrieval-based framework that: 1) transforms user queries into logical representations, 2) applies retrieval-augmented generation (RAG) to assess premise validity using factual sources, and 3) incorporates verification results into LLM prompts to maintain factual consistency.

Result: The approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.

Conclusion: The proposed retrieval-based framework provides an efficient, proactive solution to address false premises in LLM queries, enhancing factual consistency without the computational overhead of existing methods.

Abstract: Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises: claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user’s query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM’s prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
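The verify-then-prompt flow can be sketched as below. The crude substring check, the verdict labels, and the prompt layout are illustrative stand-ins; the actual system uses a logical representation of the query and an LM-based validity check.

```python
def verify_and_augment(query, premises, retrieve):
    """Check each extracted premise against retrieved facts, then prepend
    the verdicts to the generation prompt so the LLM is warned about
    unsupported premises before it answers. `retrieve` stands in for a
    RAG retriever over factual sources; substring matching is a toy
    proxy for real entailment checking."""
    verdicts = []
    for premise in premises:
        evidence = retrieve(premise)
        supported = any(premise.lower() in fact.lower() for fact in evidence)
        verdicts.append((premise, "supported" if supported else "unsupported"))
    notes = "\n".join(f"- {p}: {v}" for p, v in verdicts)
    prompt = f"Premise check:\n{notes}\n\nAnswer the question: {query}"
    return verdicts, prompt

# Toy run: a false premise is flagged before any generation happens.
facts = ["The Eiffel Tower is in Paris."]
verdicts, prompt = verify_and_augment(
    "When did the Eiffel Tower open?",
    ["The Eiffel Tower is in Berlin"],
    retrieve=lambda premise: facts,
)
```

Because verification happens before generation and only inspects retrieved text, the approach needs neither model logits nor fine-tuning, matching the paper's stated constraints.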

[46] What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text

Aswathy Velutharambath, Kai Sassenberg, Roman Klinger

Main category: cs.CL

TL;DR: Paper challenges NLP deception detection, showing linguistic cues don’t generalize; introduces DeFaBel framework and finds models perform at chance on belief-based deception data.

DetailsMotivation: To critically examine whether deception can be reliably detected from written text alone, challenging prior studies that reported success but may have been driven by dataset artifacts rather than genuine linguistic cues.

Method: Introduces belief-based deception framework defining deception as misalignment between author’s claims and true beliefs. Constructs three DeFaBel corpora (German arguments, multilingual German/English) under varying conditions. Evaluates linguistic cues, feature-based models, pretrained language models, and instruction-tuned LLMs across datasets.

Result: Linguistic cues show negligible, statistically insignificant correlations with deception in DeFaBel. Models perform well on established datasets but near chance on DeFaBel. Predictive cues inconsistent across datasets, with low effect sizes even when significant.

Conclusion: Deception cannot be reliably inferred from linguistic cues alone; prior findings likely driven by dataset artifacts. Calls for rethinking how deception is studied and modeled in NLP, emphasizing the need for more rigorous frameworks like belief-based deception.

Abstract: Can deception be detected solely from written text? Cues of deceptive communication are inherently subtle, even more so in text-only communication. Yet, prior studies have reported considerable success in automatic deception detection. We hypothesize that such findings are largely driven by artifacts introduced during data collection and do not generalize beyond specific datasets. We revisit this assumption by introducing a belief-based deception framework, which defines deception as a misalignment between an author’s claims and true beliefs, irrespective of factual accuracy, allowing deception cues to be studied in isolation. Based on this framework, we construct three corpora, collectively referred to as DeFaBel, including a German-language corpus of deceptive and non-deceptive arguments and a multilingual version in German and English, each collected under varying conditions to account for belief change and enable cross-linguistic analysis. Using these corpora, we evaluate commonly reported linguistic cues of deception. Across all three DeFaBel variants, these cues show negligible, statistically insignificant correlations with deception labels, contrary to prior work that treats such cues as reliable indicators. We further benchmark against other English deception datasets following similar data collection protocols. While some show statistically significant correlations, effect sizes remain low and, critically, the set of predictive cues is inconsistent across datasets. We also evaluate deception detection using feature-based models, pretrained language models, and instruction-tuned large language models. While some models perform well on established deception datasets, they consistently perform near chance on DeFaBel. Our findings challenge the assumption that deception can be reliably inferred from linguistic cues and call for rethinking how deception is studied and modeled in NLP.
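
The cue analysis above reduces to correlating a per-document cue value with a binary deception label. A minimal version of that statistic (Pearson correlation with a binary variable, i.e. the point-biserial correlation) is sketched below; the paper's actual cue set and significance testing are not reproduced.

```python
import numpy as np

def cue_correlation(cue_values, labels):
    """Pearson correlation between a linguistic cue and binary
    deception labels (equivalent to point-biserial correlation)."""
    x = np.asarray(cue_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Toy usage: a cue that rises with the deception label correlates positively.
r = cue_correlation([1, 2, 3, 4], [0, 0, 1, 1])
```

Near-zero values of `r` across cues, as reported for DeFaBel, indicate the cue carries no linear signal about the label.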

[47] A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives

Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang

Main category: cs.CL

TL;DR: A systematic review of synthetic data generation using LLMs for biomedical applications, covering data modalities, generation methods, and evaluation approaches from 59 studies published 2020-2025.

Motivation: To address biomedical data challenges like scarcity, utility, and quality issues through synthetic data generation using LLMs, and to systematically review recent advances in this emerging field.

Method: Conducted a scoping review following PRISMA-ScR guidelines, searching literature from 2020-2025 across PubMed, ACM, Web of Science, and Google Scholar, including 59 relevant studies on synthetic data generation in biomedical contexts.

Result: Found predominant data modalities: unstructured texts (78.0%), tabular data (13.6%), multimodal sources (8.4%). Common generation methods: LLM prompting (74.6%), fine-tuning (20.3%), specialized models (5.1%). Evaluation approaches: intrinsic metrics (27.1%), human-in-the-loop (44.1%), LLM-based evaluations (13.6%). Identified limitations in data modalities, domain utility, resource accessibility, and evaluation standardization.

Conclusion: Synthetic data generation using LLMs shows promise for biomedical research but faces challenges in modality coverage, evaluation standardization, and accessibility. Future work should focus on developing standardized evaluation frameworks and improving accessibility for broader adoption.

Abstract: Synthetic data generation using large language models (LLMs) demonstrates substantial promise in addressing biomedical data challenges and shows increasing adoption in biomedical research. This study systematically reviews recent advances in synthetic data generation for biomedical applications and clinical research, focusing on how LLMs address data scarcity, utility, and quality issues with different modalities. We conducted a scoping review following PRISMA-ScR guidelines and searched literature published between 2020 and 2025 through PubMed, ACM, Web of Science, and Google Scholar. A total of 59 studies were included based on relevance to synthetic data generation in biomedical contexts. Among the reviewed studies, the predominant data modalities were unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%). Common generation methods included LLM prompting (74.6%), fine-tuning (20.3%), and specialized models (5.1%). Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%). However, limitations and key barriers persist in data modalities, domain utility, resource and model accessibility, and standardized evaluation protocols. Future efforts may focus on developing standardized, transparent evaluation frameworks and expanding accessibility to support effective applications in biomedical research.

[48] Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems

Jingyu Guo, Yingying Xu

Main category: cs.CL

TL;DR: LLM-based AI agents spontaneously develop stereotype-driven biases in multi-agent workplace interactions without predefined biases, with effects intensifying through hierarchical structures and interaction rounds.

Motivation: While AI systems are often presumed less susceptible to stereotypes than humans, previous research focused on biases inherited from training data. This study investigates whether stereotypes can emerge spontaneously in AI agent interactions, exploring a phenomenon that may be an emergent property of multi-agent systems rather than just training data artifacts.

Method: The researchers developed a novel experimental framework simulating workplace interactions with neutral initial conditions. They used LLM-based multi-agent systems and tracked stereotype emergence across different interaction rounds, decision-making power levels, and hierarchical structures. The study examined multiple LLM architectures and conducted comprehensive quantitative analysis of stereotype patterns.

Result: Four key findings: (1) LLM-based AI agents develop stereotype-driven biases despite starting without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision-making power, especially after introducing hierarchical structures; (3) systems exhibit human-like group effects including halo effects, confirmation bias, and role congruity; (4) stereotype patterns manifest consistently across different LLM architectures.

Conclusion: Stereotype formation in AI systems may arise as an emergent property of multi-agent interactions rather than merely from training data biases. The findings underscore the need for future research to explore underlying mechanisms and develop strategies to mitigate ethical impacts of emergent biases in AI systems.

Abstract: While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases. Previous studies have focused on biases inherited from training data, but whether stereotypes can emerge spontaneously in AI agent interactions merits further exploration. Through a novel experimental framework simulating workplace interactions with neutral initial conditions, we investigate the emergence and evolution of stereotypes in LLM-based multi-agent systems. Our findings reveal that (1) LLM-Based AI agents develop stereotype-driven biases in their interactions despite beginning without predefined biases; (2) stereotype effects intensify with increased interaction rounds and decision-making power, particularly after introducing hierarchical structures; (3) these systems exhibit group effects analogous to human social behavior, including halo effects, confirmation bias, and role congruity; and (4) these stereotype patterns manifest consistently across different LLM architectures. Through comprehensive quantitative analysis, these findings suggest that stereotype formation in AI systems may arise as an emergent property of multi-agent interactions, rather than merely from training data biases. Our work underscores the need for future research to explore the underlying mechanisms of this phenomenon and develop strategies to mitigate its ethical impacts.

[49] NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems

Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Chenyi Zi, Chen Zhang, Jia Li

Main category: cs.CL

TL;DR: NPG-Muse uses NP-hard graph problems as synthetic training data to enhance LLMs’ long chain-of-thought reasoning through SFT and RL, achieving strong performance across reasoning benchmarks.

Motivation: Current methods for developing Long CoT reasoning in LLMs rely on expensive human-curated datasets (math, code). The paper seeks scalable alternatives using NP-hard graph problems which inherently require deep reasoning, extensive exploration, and reflective strategies.

Method: Two-stage post-training: 1) Long-CoT Supervised Fine-Tuning on rejection-sampled NP-hard graph instances to enhance reasoning depth, 2) Reinforcement Learning with fine-grained reward design to sharpen reasoning efficiency.

Result: NPG-Muse models show substantially enhanced Long CoT reasoning, achieving consistent gains across mathematics, coding, logical, and graph reasoning benchmarks. NPG-Muse-7B surpasses QwQ-32B on NP-hard graph problems in both accuracy and reasoning efficiency.

Conclusion: NP-hard graph problems serve as an effective and scalable resource for advancing Long CoT reasoning in LLM post-training, providing a synthetic alternative to human-curated datasets.

Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are the core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long-CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. The resulting NPG-Muse-series models exhibit substantially enhanced Long CoT reasoning capabilities, achieving consistent gains across mathematics, coding, logical, and graph reasoning benchmarks. NPG-Muse-7B even surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLM post-training. Our implementation is available at https://github.com/littlewyy/NPG-Muse.
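
The rejection-sampling stage of the SFT pipeline can be sketched as follows. The generator and verifier below are toy stand-ins (a real pipeline would sample long-CoT completions from a model and check NP-hard graph solutions with a validity verifier); only the keep-only-verified-traces logic reflects the described method.

```python
import random

def verify(instance, answer):
    """Hypothetical checker, e.g. validating a proposed graph solution."""
    return answer == instance["target"]

def generate_candidates(instance, k, rng):
    """Stand-in for sampling k long-CoT completions from a model."""
    return [rng.choice(["A", "B", instance["target"]]) for _ in range(k)]

def rejection_sample_sft(instances, k=8, seed=0):
    """Keep only (prompt, answer) pairs whose answer passes the verifier."""
    rng = random.Random(seed)
    data = []
    for inst in instances:
        for ans in generate_candidates(inst, k, rng):
            if verify(inst, ans):
                data.append((inst["prompt"], ans))
                break  # one verified trace per instance suffices here
    return data

sft_data = rejection_sample_sft(
    [{"prompt": f"p{i}", "target": "C"} for i in range(5)], k=8
)
```

The verifiability of NP-hard solutions (cheap to check, hard to find) is exactly what makes this filtering scalable without human annotation.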

[50] LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning

Tiago Fernandes Tavares

Main category: cs.CL

TL;DR: LogiPart is a scalable framework for building interpretable hierarchical taxonomies from text corpora using LLMs on compact samples and efficient NLI-based evaluation, achieving constant token complexity per node.

Motivation: Current methods for discovering deep, steerable taxonomies face a trade-off between the efficiency of topic models and the prohibitive costs of LLM-integrated frameworks that require full-corpus conditioning.

Method: LogiPart decouples hierarchy growth from expensive LLM conditioning by using locally hosted LLMs on embedding-aware samples to generate taxonomic predicates, then evaluates them across the entire corpus using zero-shot Natural Language Inference (NLI) with fast graph-based label propagation.

Result: Evaluated across four diverse text corpora (≈140,000 documents), LogiPart achieves up to 96% average per-node routing accuracy on complex corpora, discovers meaningful functional axes like policy intent, and enables frontier-level analysis on consumer-grade hardware.

Conclusion: LogiPart makes hypothesis-driven taxonomic discovery feasible under realistic computational constraints by achieving constant O(1) generative token complexity per node relative to corpus size.

Abstract: The discovery of deep, steerable taxonomies in large text corpora is currently restricted by a trade-off between the surface-level efficiency of topic models and the prohibitive, non-scalable assignment costs of LLM-integrated frameworks. We introduce **LogiPart**, a scalable, hypothesis-first framework for building interpretable hierarchical partitions that decouples hierarchy growth from expensive full-corpus LLM conditioning. LogiPart utilizes locally hosted LLMs on compact, embedding-aware samples to generate concise natural-language taxonomic predicates. These predicates are then evaluated efficiently across the entire corpus using zero-shot Natural Language Inference (NLI) combined with fast graph-based label propagation, achieving constant O(1) generative token complexity per node relative to corpus size. We evaluate LogiPart across four diverse text corpora (totaling ≈140,000 documents). Using structured manifolds for **calibration**, we identify an empirical reasoning threshold at the 14B-parameter scale required for stable semantic grounding. On complex, high-entropy corpora (Wikipedia, US Bills), where traditional thematic metrics reveal an “alignment gap,” inverse logic validation confirms the stability of the induced logic, with individual taxonomic bisections maintaining an average per-node routing accuracy of up to 96%. A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture. LogiPart enables frontier-level exploratory analysis on consumer-grade hardware, making hypothesis-driven taxonomic discovery feasible under realistic computational and governance constraints.
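
The graph-based label propagation step lends itself to a small sketch: seed documents carry NLI verdicts (+1/−1) for a predicate, and the rest of the corpus inherits labels through a similarity graph. The update rule below (spread via a row-normalised graph, with seeds re-clamped each round) is a standard formulation and an assumption on our part, not necessarily LogiPart's exact implementation.

```python
import numpy as np

def propagate_labels(W, seed_labels, alpha=0.9, iters=50):
    """Iterative label propagation on a similarity graph.
    W: (n, n) nonnegative similarity matrix.
    seed_labels: length-n array, +1/-1 for NLI-labelled seeds, 0 otherwise."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0
    S = W / deg[:, None]                        # row-normalised transitions
    y = np.asarray(seed_labels, dtype=float)
    f = y.copy()
    for _ in range(iters):
        f = alpha * (S @ f) + (1 - alpha) * y   # spread, then re-clamp seeds
    return np.sign(f)

# Two disconnected pairs; one NLI-labelled seed per pair.
W = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
labels = propagate_labels(W, [1, 0, -1, 0])
```

Because the LLM only has to author the predicate once per node, the per-document cost stays with cheap NLI scoring and this propagation, which is where the O(1) generative token complexity comes from.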

[51] mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

Guy Dar

Main category: cs.CL

TL;DR: mini-vec2vec is an efficient, stable linear transformation method for aligning text embedding spaces without parallel data, improving upon the expensive and unstable vec2vec approach.

Motivation: The original vec2vec method for aligning text embedding spaces without parallel data is computationally expensive and unstable, limiting its practical adoption and scalability.

Method: Three-stage approach: 1) tentative matching of pseudo-parallel embedding vectors, 2) transformation fitting, and 3) iterative refinement to learn a linear transformation between embedding spaces.

Result: mini-vec2vec achieves orders of magnitude improvement in efficiency over vec2vec while matching or exceeding its alignment performance, with enhanced stability and interpretability.

Conclusion: The proposed linear transformation method provides a simple, efficient, and robust alternative for embedding space alignment, enabling broader adoption and scaling to new domains.

Abstract: We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data. vec2vec finds a near-perfect alignment, but it is expensive and unstable. We present mini-vec2vec, a simple and efficient alternative that requires substantially lower computational cost and is highly robust. Moreover, the learned mapping is a linear transformation. Our method consists of three main stages: a tentative matching of pseudo-parallel embedding vectors, transformation fitting, and iterative refinement. Our linear alternative exceeds the original instantiation of vec2vec by orders of magnitude in efficiency, while matching or exceeding their results. The method’s stability and interpretable algorithmic steps facilitate scaling and unlock new opportunities for adoption in new domains and fields.
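
Since the learned mapping is a linear transformation fit to pseudo-parallel pairs, the transformation-fitting stage can be illustrated with orthogonal Procrustes, one standard way to fit such a map; whether mini-vec2vec constrains the map to be orthogonal is our assumption here, not a claim from the paper.

```python
import numpy as np

def fit_orthogonal_map(X, Y):
    """Least-squares orthogonal W minimising ||X @ W - Y||
    (orthogonal Procrustes, via SVD of the cross-covariance)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage: recover a hidden rotation from matched (pseudo-parallel) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # source-space embeddings
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # ground-truth rotation
Y = X @ Q                                       # target-space embeddings
W = fit_orthogonal_map(X, Y)
err = np.abs(X @ W - Y).max()
```

The closed-form SVD solution is what makes this kind of linear fitting orders of magnitude cheaper and more stable than adversarial training; the iterative-refinement stage would re-match pairs under the current map and re-fit.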

[52] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo, Yanjie Sun, Chen Jason Zhang

Main category: cs.CL

TL;DR: A multimodal peer review simulation system using LLMs with visual-text integration, RAG from OpenReview data, and actionable feedback generation for manuscript revision.

Motivation: Existing academic peer review systems are limited to text-only inputs, lack contextual grounding, and don't provide actionable feedback, hindering effective manuscript revision before submission.

Method: Interactive web-based system integrating multimodal LLMs for textual and visual information processing, using retrieval-augmented generation (RAG) with web-scale OpenReview data, and converting reviews into actionable to-do lists using Action:Objective[#] format.

Result: System generates more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines, and provides effective structured guidance for manuscript revisions.

Conclusion: The framework advances transparent, human-centered scholarly assistance by enabling effective multimodal peer review simulation with actionable feedback for pre-submission manuscript improvement.

Abstract: While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.

[53] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

Amar Lakel

Main category: cs.CL

TL;DR: The paper proposes replacing “Large Language Models” with “Large Discourse Models” and then “Artificial Discursive Agents” to better capture their role in modeling human discourse and experience, advocating for public governance frameworks.

Motivation: The authors argue that current terminology (LLMs) inadequately captures the nature of generative models, which don't just process language but model human discourse and experience. They seek to shift from technical fascination/fear dichotomies toward responsible governance frameworks.

Method: Theoretical/philosophical analysis using an ontological triad framework distinguishing three regulatory instances: phenomenal world regularities, embodied cognition structuring, and socio-historical linguistic sedimentation. Proposes conceptual shift from LLM → LDM → ADA.

Result: A new conceptual framework for understanding generative models as discourse models that project human experience reified in training corpora, with implications for governance and societal integration.

Conclusion: The paper concludes that we need to move beyond fascination/fear dichotomies toward public trials and co-regulation involving State, industry, civil society, and academia to properly situate artificial discursive agents in social space.

Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category “Large Language Models” (LLM) with that of “Large Discourse Models” (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the “fascination/fear” dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

[54] Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes

Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim

Main category: cs.CL

TL;DR: Decoder-only LLMs have architectural limitations for event detection due to unidirectional context; Macro-F1 is better than Micro-F1 for evaluating long-tail event types; LoRA fine-tuning improves decoder-only models’ performance on minority classes.

Motivation: Address two limitations in event detection research: 1) unidirectional nature of decoder-only LLMs creates architectural bottlenecks for tasks requiring bidirectional context, and 2) reliance on Micro-F1 scores inflates performance by favoring majority classes rather than measuring true capability across diverse event types.

Method: Enhanced models with sentence context and used Low-Rank Adaptation (LoRA) during fine-tuning. Compared decoder-only baselines with context-enhanced versions, focusing on Macro-F1 as evaluation metric to better assess performance across long-tail event types.

Result: Models with sentence context outperformed canonical decoder-only baselines. LoRA fine-tuning provided substantial boost in Macro-F1 scores, especially for decoder-only models, demonstrating LoRA’s effectiveness for improving performance on long-tailed event classes.

Conclusion: Bidirectional context is crucial for event detection tasks, and Macro-F1 provides more representative evaluation than Micro-F1. LoRA fine-tuning effectively enhances LLMs’ performance on minority event classes, addressing architectural limitations of decoder-only models.

Abstract: The current state of event detection research has two notable re-occurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model’s ability across the long-tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during finetuning provides a substantial boost in Macro-F1 scores in particular, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs’ performance on long-tailed event classes.
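
The LoRA mechanism the paper relies on is easy to state concretely: the frozen weight is augmented with a trainable low-rank update scaled by alpha/r, with the up-projection initialized to zero so training starts exactly at the base model. A minimal numpy sketch with toy dimensions (the dimensions and scaling follow the original LoRA formulation, not this paper's specific configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 32, 4, 8

W = rng.normal(size=(d_in, d_out))       # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d_out))                 # trainable up-projection, zero init

def lora_forward(x):
    """Base path plus low-rank update, scaled by alpha / r."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(5, d_in))
# With B == 0 at init, the adapted model matches the frozen one exactly.
assert np.allclose(lora_forward(x), x @ W)
```

Only A and B (2 × 32 × 4 = 256 parameters here, versus 1,024 in W) receive gradients, which is why LoRA can cheaply adapt a model toward minority classes without full fine-tuning.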

[55] Embedding Retrofitting: Data Engineering for better RAG

Anantha Sharma

Main category: cs.CL

TL;DR: Data engineering framework improves knowledge graph quality for word embedding retrofitting by addressing annotation artifacts like hashtags that create spurious edges in noisy graphs.

Motivation: Retrofitting word embeddings with knowledge graphs improves domain-specific retrieval, but effectiveness depends on knowledge graph quality which suffers from annotation artifacts in real-world corpora.

Method: Proposes a data engineering framework to address data quality degradation from annotation artifacts, specifically analyzing how hashtag annotations inflate knowledge graph density and create spurious edges.

Result: On noisy graphs, all retrofitting techniques degrade performance (-3.5% to -5.2%). After preprocessing, EWMA retrofitting achieves +6.2% improvement, with +33.8% average improvement on quantitative synthesis questions. Preprocessing quality (10%+ swing) matters more than algorithm choice (3% gap).

Conclusion: Preprocessing quality is the primary determinant of retrofitting success, exceeding algorithmic differences, highlighting the importance of addressing annotation artifacts in knowledge graph construction.

Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation (−3.5% to −5.2%, p < 0.05). After preprocessing, EWMA retrofitting achieves +6.2% improvement (p = 0.0348) with benefits concentrated in quantitative synthesis questions (+33.8% average). The gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
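
The retrofitting objective referred to here is commonly instantiated as the Faruqui-style iterative update, which pulls each vector toward its graph neighbours while anchoring it to its original embedding; spurious hashtag edges enter through the neighbour sum, which is why graph quality dominates. A sketch assuming that standard update (not necessarily the paper's exact EWMA variant):

```python
import numpy as np

def retrofit(vectors, edges, alpha=1.0, beta=1.0, iters=10):
    """Faruqui-style retrofitting: move each vector toward its graph
    neighbours while staying close to its original embedding."""
    q_hat = {w: np.asarray(v, dtype=float) for w, v in vectors.items()}
    q = {w: v.copy() for w, v in q_hat.items()}
    nbrs = {w: [] for w in vectors}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    for _ in range(iters):
        for w, ns in nbrs.items():
            if not ns:
                continue  # isolated words keep their original vectors
            num = alpha * q_hat[w] + beta * sum(q[n] for n in ns)
            q[w] = num / (alpha + beta * len(ns))
    return q

# Two linked words are pulled toward each other but anchored to q_hat.
vecs = {"a": np.array([0.0, 0.0]), "b": np.array([2.0, 0.0])}
q = retrofit(vecs, [("a", "b")])
```

A spurious hashtag edge would add an unrelated term to `ns`, dragging the vector toward noise, which is the corruption mechanism the paper quantifies.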

[56] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao

Main category: cs.CL

TL;DR: OOMB: A memory-efficient training system for LLMs that enables training with extremely long contexts (up to 4M tokens) on a single GPU by using chunk-recurrent training with constant activation memory and optimized KV cache management.

Motivation: Training LLMs on long contexts is severely limited by GPU memory overhead from activations that scale linearly with sequence length, requiring large clusters with context parallelism for long-context training.

Method: Uses chunk-recurrent training framework with on-the-fly activation recomputation for constant activation memory (O(1)), plus paged memory manager for KV cache and gradients, asynchronous CPU offloading, and page-level sparse attention to manage KV cache growth.

Result: Achieves only 10MB memory increase per 10K additional tokens for Qwen2.5-7B, enabling training with 4M-token context on a single H200 GPU instead of requiring large clusters.

Conclusion: OOMB represents a substantial advance in resource efficiency for long-context LLM training by directly addressing memory bottlenecks through synergistic optimization techniques.

Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
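
The chunk-recurrent idea can be illustrated with plain causal attention: process queries one chunk at a time against the keys/values cached so far, so per-step activation size scales with the chunk, not the full sequence. This toy numpy version omits OOMB's recomputation, paging, offloading, and sparsity; it only demonstrates that chunked processing reproduces full causal attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Reference: full causal self-attention in one shot."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ V

def chunked_attention(Q, K, V, chunk=16):
    """Process queries chunk by chunk against the KV cache so far;
    per-step activations depend on chunk size, not sequence length."""
    n, d = Q.shape
    outs = []
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        Kc, Vc = K[:e], V[:e]                 # KV cache up to this chunk
        scores = Q[s:e] @ Kc.T / np.sqrt(d)
        # causal mask: query position must not attend to later keys
        future = np.arange(s, e)[:, None] < np.arange(e)[None, :]
        scores[future] = -np.inf
        outs.append(softmax(scores) @ Vc)
    return np.vstack(outs)

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(37, 8)) for _ in range(3))
```

In OOMB the remaining bottleneck is exactly the growing `K[:e]`, `V[:e]` cache, which the paged manager, CPU offloading, and page-level sparsity then attack.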

[57] LLMs Know More About Numbers than They Can Say

Fengting Yuchi, Li Du, Jason Eisner

Main category: cs.CL

TL;DR: LLMs struggle with numerical comparisons despite having internal magnitude representations that can be probed to accurately rank numbers, revealing a disconnect between internal representations and verbal reasoning.

Motivation: The paper investigates why LLMs fail at numerical comparisons with mixed notation (like "5.7 × 10² vs 580") despite their mathematical capabilities, exploring whether they truly understand number magnitudes.

Method: Probes hidden states of open-source LLMs to extract number magnitude representations, uses linear projections to recover log-magnitudes, trains classifiers to rank number pairs, and incorporates probe loss as auxiliary objective during fine-tuning.

Result: Hidden states encode number magnitudes (2.3% error on synthetic text, 19.06% on scientific papers) and rankings (90%+ accuracy), but LLMs only achieve 50-70% accuracy when explicitly asked. Fine-tuning with probe loss improves verbal accuracy by 3.22%.

Conclusion: LLMs have internal magnitude representations that don’t fully translate to explicit reasoning; improving these representations via auxiliary objectives enhances numerical reasoning capabilities.

Abstract: Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: “Which is larger, 5.7 × 10² or 580?” This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe’s log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models’ internal magnitude representations can enhance their numerical reasoning capabilities. Our code is available at https://github.com/VCY019/Numeracy-Probing.
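
The probing setup can be mimicked on synthetic data: if some direction in the hidden state carries log-magnitude, a least-squares linear probe recovers it. The "hidden states" below are fabricated for illustration; only the probe-fitting step (a single linear projection plus bias) mirrors the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500                                  # toy hidden size and samples
numbers = 10 ** rng.uniform(-2, 6, size=n)
logs = np.log10(numbers)

# Fabricated "hidden states": one direction carries log-magnitude, plus noise.
direction = rng.normal(size=d)
H = logs[:, None] * direction[None, :] + 0.1 * rng.normal(size=(n, d))

# Linear probe: least-squares projection from hidden state to log10(x).
Hb = np.hstack([H, np.ones((n, 1))])            # append a bias column
w, *_ = np.linalg.lstsq(Hb, logs, rcond=None)
pred = Hb @ w
```

The paper's striking finding is the gap between what such a probe reads out of the hidden state (90%+ ranking accuracy) and what the model says when asked (50-70%).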

[58] Large Language Models and Impossible Language Acquisition: “False Promise” or an Overturn of our Current Perspective towards AI

Ziyan Wang, Longlong Ma

Main category: cs.CL

TL;DR: GPT-2 models learn syntactically impossible languages significantly worse than natural English, challenging Chomsky’s claim that LLMs cannot tell the two apart; LSTMs show no such gap, pointing to the role of the transformer architecture.

DetailsMotivation: To empirically test Chomsky's critique that LLMs are mere pattern predictors that cannot distinguish possible from impossible languages, and to examine the implications for AI's intellectual foundations.

Method: Created syntactically impossible languages by transforming English (reversing sentences, adding negation based on word-count parity). Conducted controlled experiments on GPT-2 small and LSTM models, using statistical analysis (Welch’s t-test) to compare performance on possible vs. impossible languages.

Result: GPT-2 small models performed significantly worse on all impossible languages than on the possible language (p<.001), showing they do distinguish the two. LSTM models showed no such gap, in line with Chomsky’s argument, suggesting the transformer architecture is what enables the distinction.

Conclusion: The GPT-2 results challenge Chomsky’s characterization of LLMs as mere pattern predictors unable to distinguish impossible languages, while the LSTM results show the distinction depends on architecture; the authors propose shifting from Chomsky’s rationalist paradigm toward functionalism and empiricism in LLM research.

Abstract: In Chomsky’s provocative critique “The False Promise of CHATGPT,” Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, therefore are not able to distinguish impossible languages. It stands as a representative in a fundamental challenge to the intellectual foundations of AI, for it integrally synthesizes major issues in methodologies within LLMs and possesses an iconic a priori rationalist perspective. We examine this famous critique from both the perspective in pre-existing literature of linguistics and psychology as well as a research based on an experiment inquiring into the capacity of learning both possible and impossible languages among LLMs. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences, and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch’s t-test) shows GPT2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models’ performance tallies with Chomsky’s argument, suggesting the irreplaceable role of the evolution of transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky’s theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his “rationalist-romantics” paradigm to functionalism and empiricism in LLMs research.
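
The two impossible-language transformations described in the abstract, and the Welch's t-test used to compare conditions, can be sketched as follows (the sentences and per-seed "losses" are toy stand-ins, not the paper's data; the t statistic is computed directly from the Welch formula):

```python
# Sketch of the impossible-language constructions and Welch's t statistic.
import statistics as st

def reverse_sentence(sentence: str) -> str:
    """Impossible language 1: reverse the word order of the whole sentence."""
    return " ".join(reversed(sentence.split()))

def parity_negation(sentence: str) -> str:
    """Impossible language 2: prepend a negation marker when the word count is even."""
    words = sentence.split()
    if len(words) % 2 == 0:
        words.insert(0, "not")
    return " ".join(words)

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    va, vb = st.variance(a), st.variance(b)
    return (st.mean(a) - st.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

print(reverse_sentence("the cat sat on the mat"))
print(parity_negation("the cat sat on the mat"))   # 6 words -> negated

# Toy per-seed losses: higher loss on the impossible language.
possible_losses = [3.1, 3.0, 3.2, 3.1, 3.0]
impossible_losses = [3.8, 3.9, 3.7, 4.0, 3.8]
print(f"t = {welch_t(impossible_losses, possible_losses):.2f}")
```

A p-value would then come from a t distribution with Welch-Satterthwaite degrees of freedom; a large positive t here reflects the kind of gap the paper reports for GPT-2.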

[59] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle

Main category: cs.CL

TL;DR: LLMs struggle with underspecified questions in QA benchmarks, with 16-50% of questions being ambiguous; controlled rewriting shows performance improves significantly when questions are clarified.

DetailsMotivation: The paper addresses the performance gap between LLMs and QA benchmarks, hypothesizing that underspecified questions (queries lacking unique interpretation without additional context) are a major confound in evaluation rather than model limitations.

Method: 1) Develop an LLM-based classifier to identify underspecified questions in QA datasets; 2) Analyze performance differences on underspecified vs. specified questions; 3) Conduct controlled rewriting experiment: rewrite underspecified questions into fully specified variants while keeping gold answers fixed to isolate underspecification effect.

Result: Found 16% to over 50% of benchmark questions are underspecified; LLMs perform significantly worse on underspecified questions; QA performance consistently improves when underspecified questions are rewritten into fully specified variants, indicating many apparent QA failures stem from question underspecification rather than model limitations.

Conclusion: Underspecification is an important confound in QA evaluation; benchmark design should pay greater attention to question clarity; many perceived LLM limitations in QA may actually be due to ambiguous question formulation rather than model capabilities.

Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
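
The controlled rewriting protocol (classify, rewrite with the gold answer held fixed, compare accuracy) can be illustrated end to end with stand-ins for the LLM calls; the classifier heuristic, QA "model", and the specific rewrite below are all invented for illustration:

```python
# Toy walk-through of the rewriting protocol: flag underspecified questions,
# swap in a fully specified variant, keep gold answers fixed, and compare
# accuracy. LLM calls are replaced by deterministic stand-in functions.
dataset = [
    {"q": "Who is the richest club in the championship?",              # underspecified
     "q_rewritten": "Who is the richest club in the 2023-24 EFL Championship?",
     "gold": "Club A"},
    {"q": "What is the capital of France?",                            # fully specified
     "q_rewritten": "What is the capital of France?",
     "gold": "Paris"},
]

def is_underspecified(q):
    # Stand-in for the paper's LLM-based classifier.
    return "the championship" in q.lower()

def toy_qa(q):
    # Stand-in QA model: only answers when the question carries
    # enough disambiguating context.
    kb = {"2023-24 efl championship": "Club A", "capital of france": "Paris"}
    for key, ans in kb.items():
        if key in q.lower():
            return ans
    return "unknown"

def accuracy(use_rewrites):
    correct = 0
    for ex in dataset:
        q = ex["q_rewritten"] if use_rewrites and is_underspecified(ex["q"]) else ex["q"]
        correct += toy_qa(q) == ex["gold"]
    return correct / len(dataset)

print("original accuracy: ", accuracy(False))
print("rewritten accuracy:", accuracy(True))
```

Because the gold answer never changes, any accuracy gain under `use_rewrites=True` is attributable to the clarified wording alone, which is the paper's isolation argument.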

[60] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov

Main category: cs.CL

TL;DR: The paper introduces LAHJATBERT, a BERT-based multi-label classifier for Arabic Dialect Identification using GPT-4o-generated multi-label annotations and curriculum learning strategies.

DetailsMotivation: Arabic Dialect Identification (ADI) has traditionally been framed as single-label classification, but recent work argues it should be multi-label. However, there are no large-scale multi-label datasets available for training, limiting progress in this area.

Method: 1) Construct multi-label dataset using GPT-4o and binary dialect acceptability classifiers with Arabic Level of Dialectness (ALDi) aggregation; 2) Train BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality.

Result: LAHJATBERT achieves macro F1 of 0.69 on the MLADI leaderboard, significantly outperforming the previous best system (0.55).

Conclusion: The paper successfully addresses the lack of multi-label ADI datasets by creating automatic annotations and demonstrates that curriculum learning with dialect complexity improves multi-label dialect identification performance.

Abstract: Being modeled as a single-label classification task for a long time, recent work has argued that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
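
One ingredient of the curriculum, ordering examples by label cardinality, can be sketched in a few lines (the dialect codes and the simple easy-to-hard sort are illustrative assumptions, not the paper's exact schedule):

```python
# Minimal cardinality-based curriculum: present multi-label examples from
# "easy" (acceptable in one dialect) to "hard" (acceptable in many).
examples = [
    {"text": "s1", "dialects": {"EGY"}},
    {"text": "s2", "dialects": {"EGY", "LEV", "GLF", "MGR"}},
    {"text": "s3", "dialects": {"LEV", "GLF"}},
]

def curriculum(batch):
    # Sort by label cardinality: fewer acceptable dialects first.
    return sorted(batch, key=lambda ex: len(ex["dialects"]))

for ex in curriculum(examples):
    print(ex["text"], len(ex["dialects"]))
```

The paper additionally orders by dialectal complexity (via ALDi); a combined schedule would simply use a composite sort key.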

[61] HLE-Verified: A Systematic Verification and Structured Revision of Humanity’s Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu, Bohan Wang, Peng Wang, Xingzhe Wu, Anfeng Li, Qiyuan Feng, Yuhao Zhou, Shoulin Han, Wenjie Luo, Yiyuan Li, Yaxuan Wang, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Ben Wang, Zeyu Wang, Zichao Chen, Jianan Ye, Yijie Hu, Jialong Chen, Zongwen Shen, Yuliang Xu, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Hu Wei, Que Shen, Bing Zhao

Main category: cs.CL

TL;DR: HLE-Verified is a cleaned version of the Humanity’s Last Exam benchmark, with items verified or repaired through expert review and model-based cross-checks, reducing annotation noise for more reliable model evaluation.

DetailsMotivation: The original HLE benchmark contains noisy items that bias evaluation results and distort cross-model comparisons, necessitating a verified version for more reliable assessment of language model capabilities.

Method: Two-stage validation-and-repair workflow: Stage I uses domain-expert review and model-based cross-checks for binary validation; Stage II revises flawed but fixable items through dual independent expert repairs, model-assisted auditing, and adjudication.

Result: Created HLE-Verified with 641 verified items and 1,170 revised-and-certified items, with remaining 689 items as documented uncertain set. Evaluation shows 7-10 percentage point accuracy gain on HLE-Verified, with 30-40 point gains on previously erroneous items.

Conclusion: HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities, with strong association between model confidence and error presence supporting revision effectiveness.

Abstract: Humanity’s Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified

[62] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek, Katarzyna Bogusz, Sebastian Cygert, Wojciech Kusa

Main category: cs.CL

TL;DR: Polish vision-language models created using automated translation of English datasets, achieving strong performance improvements over English VLMs on Polish tasks.

DetailsMotivation: Most VLMs are English-centric, limiting their usability for non-English speakers and hindering development of culturally diverse multimodal systems. Need to create effective VLMs for low-resource languages like Polish.

Method: Adapted LLaVA-Next methodology with fully automated pipeline for translating/filtering existing multimodal datasets, complemented with synthetic Polish data for OCR and culturally specific tasks. Minimal manual intervention.

Result: +9.5% improvement over LLaVA-1.6-Vicuna-13B on Polish-adapted MMBench, higher-quality captions in generative evaluations as measured by human annotators for linguistic correctness.

Conclusion: Large-scale automated translation with lightweight filtering can effectively bootstrap high-quality multimodal models for low-resource languages, though challenges remain in cultural coverage and evaluation.

Abstract: Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.

[63] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao, Xiaotong Lin, Yun Liu, Xing Fu, Yu Cheng, Yongchao Liu, Weiqiang Wang, Zhongle Xie

Main category: cs.CL

TL;DR: Query-as-Anchor framework for dynamic, query-aware user representation learning using LLMs with multi-modal behavioral data, achieving SOTA on industrial benchmarks with efficient deployment.

DetailsMotivation: Existing user representation methods produce static, task-agnostic embeddings that struggle to balance universality with task-sensitivity across diverse downstream scenarios, while multi-source data introduces noise and modality conflicts.

Method: Proposes Query-as-Anchor framework with: 1) UserU industrial-scale pre-training dataset aligning multi-modal behavioral sequences with semantics, 2) Q-Anchor Embedding architecture with hierarchical encoders in dual-tower LLMs via contrastive-autoregressive optimization, 3) Cluster-based Soft Prompt Tuning for scenario-specific alignment, and 4) KV-cache-accelerated inference.

Result: Achieves consistent SOTA performance on 10 Alipay industrial benchmarks, demonstrates strong scalability and efficient deployment, and validates practical effectiveness through large-scale online A/B testing in Alipay’s production system across two real-world scenarios.

Conclusion: Query-as-Anchor successfully shifts user modeling from static encoding to dynamic, query-aware synthesis, effectively balancing universality with task-sensitivity while handling multi-modal data noise and conflicts, with practical industrial deployment viability.

Abstract: Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay’s production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: https://github.com/JhCircle/Q-Anchor.
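
The deployment trick, anchoring the query at the end of the sequence so the long behavioral prefix can be KV-cached, is easy to see with a toy cost model (the "encoder" below just counts token-steps; the event and query tokens are made up):

```python
# Toy cost model for query-at-the-end KV caching: the long user-history
# prefix is encoded once, and each new query only pays for its own tokens.
prefix = ["evt"] * 1000                  # long behavioral history (toy tokens)
queries = [["risk?"], ["churn?"], ["intent?"]]

steps = 0
def encode(tokens):
    # Stand-in "transformer": one unit of work per token processed.
    global steps
    steps += len(tokens)

# Without a cache: every query re-encodes the whole prefix.
steps = 0
for q in queries:
    encode(prefix + q)
no_cache_steps = steps

# With a KV cache: encode the prefix once, then only the query tokens.
steps = 0
encode(prefix)                           # prefix keys/values computed once
for q in queries:
    encode(q)                            # each query reuses the cached prefix
cached_steps = steps

print(no_cache_steps, cached_steps)      # 3003 vs. 1003 token-steps
```

Placing the query anywhere earlier in the sequence would invalidate the cached prefix, which is why the paper anchors queries at the sequence terminus.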

[64] Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu

Main category: cs.CL

TL;DR: Multi-agent sandbox study shows that incorporating broadcast community discussion (critic and audience feedback stored as social memory) significantly improves stand-up comedy writing quality compared to baseline without discussion.

DetailsMotivation: Existing LLM writing evaluations focus on prompts and localized feedback, neglecting how persistent public reception in online communities affects writing quality, particularly for creative domains like stand-up comedy.

Method: Controlled multi-agent sandbox experiment comparing two conditions: discussion condition where critic and audience threads are recorded, filtered, stored as social memory, and retrieved to condition subsequent generations vs. baseline without discussion. 50 rounds (250 paired monologues) evaluated by five expert annotators using A/B preference and 15-item rubric.

Result: Discussion condition wins 75.6% of instances, improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.

Conclusion: Incorporating broadcast community discussion as social memory significantly enhances LLM-generated stand-up comedy writing quality, demonstrating the value of social feedback mechanisms in creative AI systems.

Abstract: Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.

cs.CV

[65] GS-ProCams: Gaussian Splatting-based Projector-Camera Systems

Qingyue Deng, Jijiang Li, Haibin Ling, Bingyao Huang

Main category: cs.CV

TL;DR: GS-ProCams: First Gaussian Splatting framework for projector-camera systems enabling efficient view-agnostic projection mapping without additional devices.

DetailsMotivation: Previous CNN-based ProCams are viewpoint-specific, while NeRF-based methods require additional light sources and are computationally expensive. Need for efficient view-agnostic projection mapping.

Method: Uses 2D Gaussian representations for scene geometry and materials, models complex geometric and photometric mappings with projector responses, and employs differentiable physically-based rendering to jointly estimate parameters from multi-view projections.

Result: Achieves superior ProCams simulation quality compared to NeRF-based methods, uses only 1/10 GPU memory for training, and is 900 times faster in inference speed without needing additional devices.

Conclusion: GS-ProCams provides an efficient, view-agnostic framework for projector-camera systems that significantly improves computational efficiency while maintaining high-quality projection mapping capabilities.

Abstract: We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams is not only view-agnostic but also significantly enhances the efficiency of projection mapping (PM) that requires establishing geometric and radiometric mappings between the projector and the camera. Previous CNN-based ProCams are constrained to a specific viewpoint, limiting their applicability to novel perspectives. In contrast, NeRF-based ProCams support view-agnostic projection mapping, however, they require an additional co-located light source and demand significant computational and memory resources. To address this issue, we propose GS-ProCams that employs 2D Gaussian for scene representations, and enables efficient view-agnostic ProCams applications. In particular, we explicitly model the complex geometric and photometric mappings of ProCams using projector responses, the projection surface’s geometry and materials represented by Gaussians, and the global illumination component. Then, we employ differentiable physically-based rendering to jointly estimate them from captured multi-view projections. Compared to state-of-the-art NeRF-based methods, our GS-ProCams eliminates the need for additional devices, achieving superior ProCams simulation quality. It also uses only 1/10 of the GPU memory for training and is 900 times faster in inference speed. Please refer to our project page for the code and dataset: https://realqingyue.github.io/GS-ProCams/.

[66] GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation

Abdul Joseph Fofanah, Lian Wen, Alpha Alimamy Kamara, Zhongyi Zhang, David Chen, Albert Patrick Sankoh

Main category: cs.CV

TL;DR: GRAFNet: A biologically inspired neural network for polyp segmentation in colonoscopy that mimics human visual system hierarchy to address challenges of morphological variability, visual similarity to normal structures, and multi-scale detection.

DetailsMotivation: Polyp segmentation in colonoscopy is challenging due to high morphological variability, strong visual similarity to normal structures, and need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and lack of anatomical constraints, leading to false positives and false negatives.

Method: GRAFNet integrates three biologically inspired modules: (1) Guided Asymmetric Attention Module (GAAM) mimicking orientation-tuned cortical neurons for boundary emphasis, (2) MultiScale Retinal Module (MSRM) replicating retinal ganglion cell pathways for parallel multi-feature analysis, and (3) Guided Cortical Attention Feedback Module (GCAFM) applying predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) with resolution-adaptive feedback.

Result: Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance with 3-8% Dice improvements and 10-20% higher generalization over leading methods, while offering interpretable decision pathways.

Conclusion: GRAFNet establishes a paradigm where neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning, providing both high performance and interpretability for medical image segmentation.

Abstract: Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and the absence of anatomical constraints, often leading to false positives (over-segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation-tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi-feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback. Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance, with 3-8% Dice improvements and 10-20% higher generalisation over leading methods, while offering interpretable decision pathways. This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at https://github.com/afofanah/GRAFNet.

[67] Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Peng-Fei Zhang, Zi Huang

Main category: cs.CV

TL;DR: HRA: A multimodal universal attack framework for VLP models using hierarchical refinement for both image and text perturbations

DetailsMotivation: Existing adversarial attacks for Vision-Language Pre-training (VLP) models are mostly sample-specific, causing high computational overhead when scaling to large datasets or new scenarios, necessitating a more efficient universal attack framework.

Method: Hierarchical Refinement Attack (HRA) with two components: 1) For images: uses temporal hierarchy of historical and estimated future gradients to refine optimization path and avoid local minima; 2) For text: hierarchically models textual importance considering intra- and inter-sentence contributions to identify globally influential words for universal perturbations.

Result: Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate superior transferability of the proposed universal multimodal attacks compared to existing methods.

Conclusion: HRA provides an effective universal attack framework for VLP models that overcomes the computational limitations of sample-specific attacks while maintaining strong transferability across different scenarios.

Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, it hierarchically models textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets, demonstrate the superior transferability of the proposed universal multimodal attacks.
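
One generic way to realize the image-side idea, blending a running history of gradients with an estimated future gradient when updating a single shared perturbation, is a Nesterov-style lookahead update. The sketch below uses a toy quadratic loss and is not the authors' algorithm:

```python
# Hedged sketch: universal perturbation update mixing gradient history
# (momentum) with an estimated "future" gradient (lookahead), projected to
# an L_inf budget. The objective is a toy quadratic, not a real attack loss.
import numpy as np

rng = np.random.default_rng(0)
target = 0.8 * rng.uniform(-1.0, 1.0, size=8)  # toy optimum inside the budget

def grad(delta):
    # Gradient of the toy objective 0.5 * ||delta - target||^2.
    return delta - target

delta = np.zeros(8)        # the universal perturbation, shared across samples
momentum = np.zeros(8)     # running history of gradients
lr, mu, eps = 0.1, 0.9, 1.0

for _ in range(200):
    lookahead = delta + lr * momentum                  # estimated future point
    momentum = mu * momentum - grad(lookahead)         # fold in the "future" gradient
    delta = np.clip(delta + lr * momentum, -eps, eps)  # keep within the budget

final_loss = 0.5 * np.sum((delta - target) ** 2)
print(f"final toy loss: {final_loss:.2e}")
```

The lookahead gradient damps oscillation and helps the shared perturbation escape shallow local minima, which is the stabilization role HRA attributes to its temporal gradient hierarchy.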

[68] Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

Main category: cs.CV

TL;DR: A decoupled framework for zero-shot human-object interaction detection that separates object detection from interaction recognition using MLLMs with deterministic generation and spatial-aware pooling.

DetailsMotivation: Zero-shot HOI detection faces challenges in interaction recognition due to combinatorial diversity. Existing methods couple IR with specific detectors and use coarse VLM features, limiting generalization to unseen interactions.

Method: Proposes a decoupled framework separating object detection from IR, leveraging MLLMs for zero-shot IR. Uses deterministic generation method formulating IR as VQA task, spatial-aware pooling integrating appearance and spatial cues, and one-pass deterministic matching.

Result: Achieves superior zero-shot performance on HICO-DET and V-COCO, strong cross-dataset generalization, and flexibility to integrate with any object detectors without retraining.

Conclusion: The decoupled framework with MLLMs enables effective zero-shot HOI detection with strong generalization and detector flexibility.

Abstract: Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detectors without retraining. The codes are publicly available at https://github.com/SY-Xuan/DA-HOI.

[69] MB-DSMIL-CL-PL: Scalable Weakly Supervised Ovarian Cancer Subtype Classification and Localisation Using Contrastive and Prototype Learning with Frozen Patch Features

Marcus Jenkins, Jasenka Mazibrada, Bogdan Leahu, Michal Mackiewicz

Main category: cs.CV

TL;DR: A novel approach for ovarian cancer histopathology subtype classification and localization using contrastive and prototype learning with frozen pre-computed features via feature-space augmentations.

Motivation: Histopathological subtype analysis is crucial for personalized ovarian cancer treatment, but increasing diagnostic workloads challenge pathology departments. Traditional AI approaches use frozen pre-computed features, while recent end-to-end methods improve accuracy but sacrifice scalability and require time-consuming experimentation.

Method: Proposes contrastive and prototype learning with pre-computed, frozen features using feature-space augmentations for subtype classification and localization in ovarian cancer histopathology images.

Result: Achieves 70.4% and 15.3% improvement in F1 score for instance- and slide-level classification respectively compared to DSMIL, with AUC gains of 16.9% for instance localization and 2.3% for slide classification, while maintaining frozen patch features.

Conclusion: The method provides significant performance improvements for ovarian cancer histopathology analysis while maintaining the scalability advantages of frozen feature approaches, addressing both accuracy and practical deployment constraints.

Abstract: The study of histopathological subtypes is valuable for the personalisation of effective treatment strategies for ovarian cancer. However, increasing diagnostic workloads present a challenge for UK pathology departments, leading to a rise in AI-based approaches. While traditional approaches in this field have relied on pre-computed, frozen image features, recent advances have shifted towards end-to-end feature extraction, providing an improvement in accuracy but at the expense of significantly reduced scalability during training and time-consuming experimentation. In this paper, we propose a new approach for subtype classification and localisation in ovarian cancer histopathology images using contrastive and prototype learning with pre-computed, frozen features via feature-space augmentations. Compared to DSMIL, our method achieves an improvement of 70.4% and 15.3% in F1 score for instance- and slide-level classification, respectively, along with AUC gains of 16.9% for instance localisation and 2.3% for slide classification, while maintaining the use of frozen patch features.

[70] GMAIL: Generative Modality Alignment for generated Image Learning

Shentong Mo, Sukmin Yun

Main category: cs.CV

TL;DR: GMAIL framework treats generated images as separate modality from real images, using multi-modal learning to bridge them in latent space for improved vision-language tasks.

Motivation: While generative models can synthesize realistic images for training data, indiscriminate use can cause mode collapse due to modality discrepancies between real and synthetic domains. A discriminative approach is needed to leverage generated images effectively.

Method: Proposes GMAIL framework that treats generated images as separate modality. First fine-tunes model on generated images using cross-modality alignment loss, then uses aligned model to train various vision-language models with generated images. Bridges real and synthetic modalities in same latent space through multi-modal learning.

Result: Significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. Shows positive generated data scaling trends and notable enhancements in captioning performance of large multimodal model LLaVA.

Conclusion: GMAIL framework effectively leverages generative model advances by discriminatively using generated images as separate modality, boosting vision-language task performance through multi-modal alignment approach.

Abstract: Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthetic data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.

[71] Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories

Praditha Alwis, Soumyadeep Chandra, Deepak Ravikumar, Kaushik Roy

Main category: cs.CV

TL;DR: Proposes a model-agnostic method using Cumulative Sample Loss (CSL) trajectories to detect annotation errors (mislabeling and disordering) in video datasets without requiring ground truth on errors.

Motivation: Real-world video datasets often contain annotation errors like mislabeling and temporal disordering, which are particularly harmful for phase-annotated tasks where temporal consistency is critical. Current methods lack effective ways to detect these errors without ground truth.

Method: Train a video segmentation model and save checkpoints at each epoch. Compute Cumulative Sample Loss (CSL) for each frame by averaging losses across all checkpoints. Frames with persistently high or irregular CSL patterns are flagged as likely annotation errors, as mislabeled/disordered frames remain difficult to learn throughout training.
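The CSL computation itself is simple to sketch. Below is a minimal illustration; the array shapes, quantile threshold, and flagging rule are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def cumulative_sample_loss(per_epoch_losses):
    """Average each frame's loss across all saved checkpoints.

    per_epoch_losses: (n_epochs, n_frames) array, where entry (e, f)
    is frame f's loss under the checkpoint saved after epoch e.
    """
    return np.asarray(per_epoch_losses).mean(axis=0)

def flag_suspect_frames(per_epoch_losses, quantile=0.95):
    """Flag frames whose CSL sits above a chosen quantile of the video."""
    csl = cumulative_sample_loss(per_epoch_losses)
    return np.flatnonzero(csl >= np.quantile(csl, quantile))
```

Correctly labeled frames converge to low loss early, so their CSL stays small; a mislabeled frame remains hard at every checkpoint, pushing its CSL above the threshold.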

Result: Experiments on EgoPER and Cholec80 datasets demonstrate strong detection performance for both mislabeling and frame disordering errors. The method effectively identifies subtle inconsistencies without requiring ground truth on annotation errors.

Conclusion: The CSL-based approach provides a powerful, model-agnostic tool for dataset auditing and improving training reliability in video-based machine learning by detecting annotation errors through frame-level learnability analysis.

Abstract: High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as mislabeling, where segments are assigned incorrect class labels, and disordering, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.

[72] Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik

Main category: cs.CV

TL;DR: LoRWeB enables visual analogy learning by dynamically composing learned LoRA transformation primitives at inference time, improving generalization over fixed LoRA approaches.

Motivation: Current visual analogy methods using single LoRA modules struggle to capture diverse visual transformations within fixed adaptation modules, limiting generalization capabilities.

Method: Proposes LoRWeB with: (1) learnable basis of LoRA modules spanning different visual transformations, and (2) lightweight encoder that dynamically selects and weighs basis LoRAs based on input analogy pair.
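The composition step can be sketched as a weighted sum of low-rank updates. The dimensions and softmax weighting below are illustrative assumptions; in the paper, a lightweight encoder predicts the mixing weights from the analogy pair:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_basis = 8, 2, 4  # hidden dim, LoRA rank, number of basis modules

# Learnable basis of LoRA factor pairs (B_i, A_i), each a rank-r update.
A = rng.normal(size=(n_basis, r, d))
B = rng.normal(size=(n_basis, d, r))

def compose_delta_w(weights):
    """Weighted combination of basis LoRA updates: dW = sum_i w_i * B_i @ A_i."""
    return np.einsum('i,idr,irk->dk', weights, B, A)

# Stand-in for the encoder's prediction: softmax over random logits.
logits = rng.normal(size=n_basis)
w = np.exp(logits) / np.exp(logits).sum()
delta_w = compose_delta_w(w)  # task-specialized weight update, shape (d, d)
```

Because the composition is linear in the weights, choosing `w` at inference time amounts to picking a point in the spanned "space of LoRAs" without retraining any module.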

Result: Achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations compared to existing methods.

Conclusion: LoRA basis decompositions are a promising direction for flexible visual manipulation, enabling dynamic composition of transformation primitives for better generalization.

Abstract: Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a “space of LoRAs”. We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are at https://research.nvidia.com/labs/par/lorweb

[73] Distributional Deep Learning for Super-Resolution of 4D Flow MRI under Domain Shift

Xiaoyi Wen, Fei Jiang

Main category: cs.CV

TL;DR: Distributional deep learning framework for medical image super-resolution that addresses domain shift between synthetic training data (CFD simulations) and real clinical data (4D Flow MRI) to improve generalization.

Motivation: Conventional super-resolution methods trained on artificially downsampled data fail to generalize to real clinical data due to domain shift, as real low-resolution data come from different acquisition mechanisms than simple downsampling.

Method: Proposes a distributional deep learning framework trained initially on high-resolution CFD simulations and downsampled counterparts, then fine-tuned on a small harmonized dataset of paired 4D Flow MRI and CFD samples.

Result: The framework significantly outperforms traditional deep learning approaches in real data applications, demonstrating effectiveness in addressing domain shift and improving super-resolution performance in clinically realistic scenarios.

Conclusion: Distributional learning effectively addresses domain shift challenges in medical image super-resolution, particularly for 4D Flow MRI enhancement, improving robustness and generalization to real clinical data.

Abstract: Super-resolution is widely used in medical imaging to enhance low-quality data, reducing scan time and improving abnormality detection. Conventional super-resolution approaches typically rely on paired datasets of downsampled and original high resolution images, training models to reconstruct high resolution images from their artificially degraded counterparts. However, in real-world clinical settings, low resolution data often arise from acquisition mechanisms that differ significantly from simple downsampling. As a result, these inputs may lie outside the domain of the training data, leading to poor model generalization due to domain shift. To address this limitation, we propose a distributional deep learning framework that improves model robustness and domain generalization. We develop this approach for enhancing the resolution of 4D Flow MRI (4DF), a novel imaging modality that captures hemodynamic flow velocity and clinically relevant metrics such as vessel wall stress. These metrics are critical for assessing aneurysm rupture risk. Our model is initially trained on high resolution computational fluid dynamics (CFD) simulations and their downsampled counterparts. It is then fine-tuned on a small, harmonized dataset of paired 4D Flow MRI and CFD samples. We derive the theoretical properties of our distributional estimators and demonstrate that our framework significantly outperforms traditional deep learning approaches through real data applications. This highlights the effectiveness of distributional learning in addressing domain shift and improving super-resolution performance in clinically realistic scenarios.

[74] Time-Archival Camera Virtualization for Sports and Visual Performances

Yunxiao Zhang, William Stone, Suryansh Kumar

Main category: cs.CV

TL;DR: Proposes neural volume rendering for camera virtualization with time-archival capabilities, enabling novel view synthesis of dynamic scenes with multiple subjects in fast-paced scenarios like sports broadcasting.

Motivation: Existing 3D Gaussian Splatting methods for dynamic scenes struggle with large non-rigid motions and multiple independent subjects, and lack the time-archival capabilities needed for sports broadcasting and live events.

Method: Models dynamic scenes as rigid transformations across multiple synchronized camera views, performs neural representation learning to enable enhanced visual rendering quality and time-archival functionality.

Result: Enables photorealistic novel view synthesis of dynamic scenes with multiple subjects, supports time-archival for revisiting past temporal instances, and addresses limitations of existing dynamic splatting methods.

Conclusion: Neural volume rendering formulation provides effective camera virtualization with time-archival capabilities, making it suitable for sports broadcasting, live performances, and applications requiring retrospective rendering.

Abstract: Camera virtualization – an emerging solution to novel view synthesis – holds transformative potential for visual entertainment, live performances, and sports broadcasting by enabling the generation of photorealistic images from novel viewpoints using images from a limited set of calibrated multiple static physical cameras. Despite recent advances, achieving spatially and temporally coherent and photorealistic rendering of dynamic scenes with efficient time-archival capabilities, particularly in fast-paced sports and stage performances, remains challenging for existing approaches. Recent methods based on 3D Gaussian Splatting (3DGS) for dynamic scenes could offer real-time view-synthesis results. Yet, they are hindered by their dependence on accurate 3D point clouds from the structure-from-motion method and their inability to handle large, non-rigid, rapid motions of different subjects (e.g., flips, jumps, articulations, sudden player-to-player transitions). Moreover, independent motions of multiple subjects can break the Gaussian-tracking assumptions commonly used in 4DGS, ST-GS, and other dynamic splatting variants. This paper advocates reconsidering a neural volume rendering formulation for camera virtualization and efficient time-archival capabilities, making it useful for sports broadcasting and related applications. By modeling a dynamic scene as rigid transformations across multiple synchronized camera views at a given time, our method performs neural representation learning, providing enhanced visual rendering quality at test time. A key contribution of our approach is its support for time-archival, i.e., users can revisit any past temporal instance of a dynamic scene and can perform novel view synthesis, enabling retrospective rendering for replay, analysis, and archival of live events, a functionality absent in existing neural rendering approaches and novel view synthesis…

[75] How to Train Your Long-Context Visual Document Model

Austin Veselka

Main category: cs.CV

TL;DR: First comprehensive study of training long-context vision-language models up to 344K context, focusing on long-document visual QA with transfer to long-context text, achieving SOTA on MMLongBenchDoc.

Motivation: Existing open-weight long-context VLMs (Qwen3 VL, GLM 4.5/6V) lack reproducible training recipes and data pipelines. A systematic study of training strategies for long-context VLMs, backed by comprehensive evaluations, is needed.

Method: Systematic study of continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models. Uses synthetic data pipelines, extensive long-context evaluations and ablations. Introduces page indices for training/evaluation and releases MMLBD-C benchmark correction.

Result: Achieves state-of-the-art performance on MMLongBenchDoc for both 24B and 32B parameter scales. Key findings: matching training/evaluation context lengths works best; page indices boost performance; synthetic data enables self-improvement; visual long-context training transfers to text.

Conclusion: Comprehensive study provides reproducible recipes for long-context VLMs, demonstrates effective training strategies, shows bidirectional transfer between visual and text long-context capabilities, and releases improved benchmark.

Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.

[76] Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization

Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan Xu

Main category: cs.CV

TL;DR: E²D is an efficient dataset distillation method that uses exploration-exploitation optimization to bridge the accuracy-efficiency gap in large-scale dataset distillation.

Motivation: Current dataset distillation methods face a trade-off: optimization-based methods are accurate but computationally expensive, while optimization-free methods are efficient but less accurate. There's a need to bridge this gap for practical large-scale applications.

Method: Proposes Exploration-Exploitation Distillation (E²D) with: 1) Full-image initialization to preserve semantic integrity, 2) Two-phase optimization: exploration phase with uniform updates to identify high-loss regions, and exploitation phase focusing updates on these regions to accelerate convergence and reduce redundant computation.
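A toy version of the two-phase schedule is below; the round-robin exploration pass, the `top_frac` cutoff, and the cycling rule are our simplifications, not the paper's exact optimizer:

```python
import numpy as np

def e2d_schedule(region_losses, explore_steps, exploit_steps, top_frac=0.25):
    """Return the region index to update at each step.

    Exploration visits every region uniformly; exploitation then
    concentrates the remaining steps on the highest-loss regions.
    """
    n = len(region_losses)
    plan = [step % n for step in range(explore_steps)]  # uniform pass
    k = max(1, int(top_frac * n))
    hot = np.argsort(region_losses)[-k:][::-1]          # high-loss regions first
    plan += [int(hot[step % k]) for step in range(exploit_steps)]
    return plan
```

In the real method the region losses come from the exploration phase itself; concentrating later updates on high-loss regions is what avoids redundant computation on regions that have already converged.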

Result: Achieves state-of-the-art on ImageNet-1K while being 18x faster, and substantially improves accuracy on ImageNet-21K while remaining 4.3x faster than existing methods.

Conclusion: Targeted, redundancy-reducing updates (rather than brute-force optimization) can bridge the accuracy-efficiency gap in large-scale dataset distillation, making it more practical for resource-constrained deployment.

Abstract: Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large-scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration-Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being 18x faster, and on ImageNet-21K, our method substantially improves accuracy while remaining 4.3x faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab.

[77] Visual Persuasion: What Influences Decisions of Vision-Language Models?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

Main category: cs.CV

TL;DR: A framework for studying visual preferences of vision-language models through controlled choice tasks and systematic image perturbations, enabling visual prompt optimization and preference analysis.

Motivation: VLM-based agents make visual decisions at scale (clicking, recommending, buying), yet their visual preferences and decision-making processes remain poorly understood.

Method: Treat VLM decision functions as latent visual utilities inferred through revealed preference in choice tasks between systematically edited images. Use visual prompt optimization to iteratively propose and apply visually plausible modifications using image generation models, then evaluate which edits increase selection probability.

Result: Large-scale experiments show optimized edits significantly shift choice probabilities in head-to-head comparisons. An automatic interpretability pipeline identifies consistent visual themes driving selection.

Conclusion: The approach offers a practical way to surface visual vulnerabilities and safety concerns proactively, supporting better auditing and governance of image-based AI agents.

Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent’s decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

[78] Consistency-Preserving Diverse Video Generation

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

Main category: cs.CV

TL;DR: Joint-sampling framework for text-to-video flow-matching models that improves batch diversity while preserving temporal consistency using lightweight latent-space objectives.

Motivation: Text-to-video generation is computationally expensive, limiting samples per prompt. Existing diversity methods for videos often degrade temporal consistency and require costly backpropagation through video decoders. An efficient way to enhance diversity without sacrificing video quality is needed.

Method: Proposes joint-sampling framework for flow-matching video generators with diversity-driven updates followed by removal of components that would decrease temporal consistency. Uses lightweight latent-space models to compute both diversity and consistency objectives, avoiding video decoding and decoder backpropagation.
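"Removing only the components that would decrease temporal consistency" resembles gradient-conflict projection; the sketch below works under that assumption, and the paper's exact rule may differ:

```python
import numpy as np

def consistency_preserving_update(div_step, cons_grad):
    """Project the diversity update so it cannot reduce consistency.

    div_step: proposed diversity-driven update; cons_grad: gradient of
    the temporal-consistency objective (ascent direction). If the two
    conflict, the conflicting component is subtracted out.
    """
    d, g = div_step.ravel(), cons_grad.ravel()
    dot = d @ g
    if dot >= 0:  # no conflict: the diversity step does not hurt consistency
        return div_step
    # Remove the component of d along -g, leaving d orthogonal to g.
    return (d - (dot / (g @ g)) * g).reshape(div_step.shape)
```

After projection the update has a non-negative inner product with the consistency gradient, so diversity is pushed only in directions that leave the consistency objective intact.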

Result: Achieves diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness in experiments on state-of-the-art text-to-video flow-matching models.

Conclusion: The proposed method effectively balances diversity and temporal consistency in text-to-video generation while being computationally efficient through latent-space optimization.

Abstract: Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.

[79] Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models

Tai Le-Gia, Jaehyun Ahn

Main category: cs.CV

TL;DR: Training-free zero-shot anomaly detection framework for 3D brain MRI using 2D foundation models to create volumetric tokens without supervision.

Motivation: Current zero-shot anomaly detection methods are limited to 2D medical images and fail to capture volumetric structure in 3D medical imaging, requiring a solution that extends ZSAD to 3D while maintaining training-free operation.

Method: Constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models, creating 3D patch tokens that restore cubic spatial context and integrate with distance-based, batch-level anomaly detection pipelines.
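The multi-axis aggregation can be sketched as follows; the same-shape `encode_slice` and plain averaging are stand-ins for the per-slice 2D foundation-model features and the paper's pooling:

```python
import numpy as np

def volumetric_features(volume, encode_slice):
    """Aggregate per-slice 2D features from all three axes into one volume.

    volume: (D, H, W) array; encode_slice maps a 2D slice to a
    same-shaped feature map (a stand-in for a 2D foundation model).
    """
    feats = []
    for axis in range(3):
        slices = np.moveaxis(volume, axis, 0)           # slice along this axis
        enc = np.stack([encode_slice(s) for s in slices])
        feats.append(np.moveaxis(enc, 0, axis))         # restore orientation
    return np.mean(feats, axis=0)                       # multi-axis average
```

Localized 3D patch tokens would then be pooled from cubic neighborhoods of this feature volume before feeding the distance-based, batch-level anomaly scoring.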

Result: The framework provides compact 3D representations practical for standard GPUs and shows that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes.

Conclusion: Offers a simple and robust approach for volumetric anomaly detection in 3D medical imaging without requiring fine-tuning, prompts, or supervision.

Abstract: Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.

[80] Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li

Main category: cs.CV

TL;DR: Sparrow framework accelerates Video Large Language Models (Vid-LLMs) inference by addressing attention dilution and negative visual gain in speculative decoding through visually-aware text-anchored window attention and intermediate-layer visual state bridging.

Motivation: Speculative decoding suffers from severe performance collapse when applied to Video Large Language Models due to key-value cache explosion and context window mismatches, causing attention dilution and negative visual gain. The authors observe visual semantic internalization, where critical visual semantics are encoded into text hidden states, making raw visual inputs redundant during deep inference.

Method: Proposes Sparrow framework with: 1) Visually-aware text-anchored window attention via hidden state reuse to offload visual computation to target model, 2) Intermediate-layer visual state bridging to train draft model with semantic-rich intermediate states, filtering low-level visual noise, and 3) Multi-token prediction strategy to bridge training-inference distribution shift.

Result: Achieves average speedup of 2.82x even with 25k visual tokens, effectively resolving performance degradation in long sequences and offering practical solution for real-time long video tasks.

Conclusion: Sparrow successfully addresses the challenges of applying speculative decoding to Video Large Language Models, enabling efficient inference acceleration for long video understanding tasks through novel attention mechanisms and training strategies.

Abstract: Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
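As a rough illustration of the attention-side idea, a draft model can drop raw visual tokens from its context entirely and restrict each text token to a causal window over preceding text tokens. This is only a guess at the shape of the mechanism (the paper's version also reuses target-model hidden states); all names are illustrative:

```python
import numpy as np

def text_anchored_window_mask(is_visual, window):
    """Boolean attention mask (True = may attend) in which visual key
    positions are never attended and each text query sees only a causal
    window of the most recent `window` text positions (itself included).
    Illustrative sketch only."""
    n = len(is_visual)
    mask = np.zeros((n, n), dtype=bool)
    text_pos = [i for i in range(n) if not is_visual[i]]
    for qi, q in enumerate(text_pos):
        for k in text_pos[max(0, qi - window + 1):qi + 1]:
            mask[q, k] = True
    return mask
```

Under such a mask the draft's key-value cache stays small regardless of how many visual tokens the target model consumed.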

[81] EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

Siwei Wen, Zhangcheng Wang, Xingjian Zhang, Lei Huang, Wenjun Wu

Main category: cs.CV

TL;DR: EventMemAgent: Active online video agent framework with hierarchical memory for streaming video understanding, using event-based processing and agentic reinforcement learning.

Motivation: Addresses the conflict between unbounded streaming video input and limited MLLM context windows, overcoming limitations of passive processing methods that struggle with long-range context vs. fine-grained detail trade-offs.

Method: Hierarchical memory module with dual-layer strategy: short-term memory detects event boundaries using event-granular reservoir sampling; long-term memory archives past observations event-by-event. Integrates multi-granular perception toolkit and Agentic Reinforcement Learning for end-to-end reasoning and tool-use internalization.

Result: Achieves competitive results on online video benchmarks, demonstrating effectiveness for streaming video understanding tasks.

Conclusion: EventMemAgent provides an effective active framework for online video understanding that balances long-range context with fine-grained detail through hierarchical event-based memory and agentic learning.

Abstract: Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing and often face a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory archives past observations in a structured, event-by-event manner. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to internalize reasoning and tool-use strategies end-to-end into the agent’s intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.
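The event-granular reservoir sampling that keeps the short-term buffer bounded is, at its core, the classic reservoir update; a minimal sketch (capacity and seeding are illustrative):

```python
import random

def reservoir_update(buffer, capacity, frame, n_seen):
    """Standard reservoir sampling update: after n_seen frames of the
    current event, each frame remains in the buffer with probability
    capacity / n_seen, giving a uniform sample in fixed memory."""
    if len(buffer) < capacity:
        buffer.append(frame)
    else:
        j = random.randrange(n_seen)
        if j < capacity:
            buffer[j] = frame
    return buffer
```

On an event boundary the buffer would be archived to long-term memory and reset for the next event.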

[82] Effective and Robust Multimodal Medical Image Analysis

Joy Dhar, Nayyar Zaidi, Maryam Haghighat

Main category: cs.CV

TL;DR: MAIL and Robust-MAIL are multimodal fusion networks for medical imaging that use attention mechanisms for better cross-modal learning while being computationally efficient and robust to adversarial attacks.

Motivation: Existing multimodal fusion learning methods for medical imaging have limitations: they lack generalizability across modalities, are computationally expensive, and vulnerable to adversarial attacks, which compromises reliability in medical AI applications.

Method: Proposes MAIL network with two key components: 1) efficient residual learning attention block for capturing modality-specific multi-scale patterns, and 2) efficient multimodal cross-attention module for learning complementary shared representations across modalities. Robust-MAIL extends this with random projection filters and modulated attention noise for adversarial robustness.

Result: Extensive evaluations on 20 public datasets show MAIL and Robust-MAIL outperform existing methods with performance gains up to 9.34% while reducing computational costs by up to 78.3%.

Conclusion: The proposed approaches provide superior multimodal fusion for medical imaging with better generalizability, computational efficiency, and adversarial robustness compared to existing methods.

Abstract: Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend MAIL network to design Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors. Code: https://github.com/misti1203/MAIL-Robust-MAIL.
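Stripped to its essence, a multimodal cross-attention module lets one modality query another so each stream absorbs complementary information. A bare single-head numpy sketch (no learned projections or the paper's efficiency-oriented design):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: tokens of one modality (queries)
    attend over tokens of another (keys/values), producing features
    enriched with complementary cross-modal information."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))  # (Nq, Nk)
    return attn @ kv_feats                              # (Nq, d)
```

Running this in both directions (MRI→CT and CT→MRI, say) yields the shared complementary representations the abstract describes.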

[83] CREMD: Crowd-Sourced Emotional Multimodal Dogs Dataset

Jinho Baek, Houwei Cao, Kate Blackwell

Main category: cs.CV

TL;DR: CREMD dataset explores how presentation modes (context, audio, video) and annotator characteristics affect dog emotion recognition, finding context improves agreement but audio effects are inconclusive.

Motivation: Dog emotion recognition is important for human-animal interaction and veterinary care, but challenging due to subjective assessments and lack of standardized ground truth methods.

Method: Created CREMD dataset with 923 video clips in three presentation modes (no context/audio, context only, context+audio) and collected annotations from diverse participants including dog owners, professionals, and various demographics.

Result: Visual context significantly improved annotation agreement; audio effects inconclusive due to design limitations; non-owners and male annotators showed higher agreement than owners/females; professionals had higher agreement; audio increased confidence for anger/fear recognition.

Conclusion: The study provides insights into factors influencing dog emotion perception and highlights the importance of multimodal presentation and annotator characteristics for reliable emotion recognition.

Abstract: Dog emotion recognition plays a crucial role in enhancing human-animal interactions, veterinary care, and the development of automated systems for monitoring canine well-being. However, accurately interpreting dog emotions is challenging due to the subjective nature of emotional assessments and the absence of standardized ground truth methods. We present the CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), a comprehensive dataset exploring how different presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) influence the perception and labeling of dog emotions. The dataset consists of 923 video clips presented in three distinct modes: without context or audio, with context but no audio, and with both context and audio. We analyze annotations from diverse participants, including dog owners, professionals, and individuals with varying demographic backgrounds and experience levels, to identify factors that influence reliable dog emotion recognition. Our findings reveal several key insights: (1) while adding visual context significantly improved annotation agreement, our findings regarding audio cues are inconclusive due to design limitations (specifically, the absence of a no-context-with-audio condition and limited clean audio availability); (2) contrary to expectations, non-owners and male annotators showed higher agreement levels than dog owners and female annotators, respectively, while professionals showed higher agreement levels, aligned with our initial hypothesis; and (3) the presence of audio substantially increased annotators’ confidence in identifying specific emotions, particularly anger and fear.

[84] DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Wenxin Zhang, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong

Main category: cs.CV

TL;DR: DAV-GSWT: A data-efficient framework using diffusion priors and active view sampling to generate high-fidelity Gaussian Splatting Wang Tiles from minimal input observations for large-scale virtual environments.

Motivation: Current 3D Gaussian Splatting methods for photorealistic neural rendering require dense exemplar reconstructions, limiting scalability for large environments. Procedural methods like Wang Tiles help generate expansive landscapes but still rely on extensive input data.

Method: Combines diffusion priors with active view sampling in a hierarchical uncertainty quantification framework. Uses generative diffusion models to hallucinate missing structural details and autonomously identifies the most informative viewpoints to ensure seamless tile transitions.

Result: Significantly reduces required data volume while maintaining visual integrity and interactive performance for large-scale virtual environments. Experimental results show the system successfully synthesizes high-fidelity Gaussian Splatting Wang Tiles from minimal observations.

Conclusion: DAV-GSWT provides a data-efficient solution for generating large-scale virtual environments using 3D Gaussian Splatting, addressing the data dependency limitations of current methods through diffusion priors and intelligent view sampling.

Abstract: The emergence of 3D Gaussian Splatting has fundamentally redefined the capabilities of photorealistic neural rendering by enabling high-throughput synthesis of complex environments. While procedural methods like Wang Tiles have recently been integrated to facilitate the generation of expansive landscapes, these systems typically remain constrained by a reliance on densely sampled exemplar reconstructions. We present DAV-GSWT, a data-efficient framework that leverages diffusion priors and active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations. By integrating a hierarchical uncertainty quantification mechanism with generative diffusion models, our approach autonomously identifies the most informative viewpoints while hallucinating missing structural details to ensure seamless tile transitions. Experimental results indicate that our system significantly reduces the required data volume while maintaining the visual integrity and interactive performance necessary for large-scale virtual environments.

[85] Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation

Shuwei Li, Lei Tan, Robby T. Tan

Main category: cs.CV

TL;DR: A novel framework for day-to-night unpaired image translation that detects and suppresses semantic hallucinations using a dual-head discriminator and class-specific prototypes.

Motivation: Day-to-night image translation is challenging due to large appearance shifts and lack of pixel-level supervision, causing semantic hallucinations where objects like traffic signs and vehicles are incorrectly synthesized, degrading downstream task performance.

Method: Proposes a framework with: 1) Dual-head discriminator that performs semantic segmentation to detect hallucinations in background regions, 2) Class-specific prototypes constructed from target-domain object features as semantic anchors, 3) Built on Schrodinger Bridge-based translation with iterative refinement where hallucinated features are pushed away from class prototypes.

Result: Outperforms existing approaches qualitatively and quantitatively. On BDD100K dataset, improves mAP by 15.5% for day-to-night domain adaptation, with 31.7% gain for hallucination-prone classes like traffic lights.

Conclusion: The proposed framework effectively detects and suppresses semantic hallucinations in unpaired image translation, significantly improving downstream task performance for day-to-night domain adaptation.

Abstract: Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucination, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrodinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
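The "push away from class prototypes" step can be pictured as a hinge-style repulsion loss; a toy numpy sketch under that assumption (the abstract does not give the exact loss form, and the paper applies the push iteratively along the translation trajectory):

```python
import numpy as np

def prototype_repulsion_loss(hall_feats, prototypes, class_ids, margin=1.0):
    """Hinge loss that grows when detected hallucination features sit
    close to their class prototype; minimizing it pushes them at least
    `margin` away in feature space. Names and form are illustrative."""
    anchors = prototypes[class_ids]                       # (N, d) matched prototypes
    dists = np.linalg.norm(hall_feats - anchors, axis=1)  # (N,) distances
    return float(np.maximum(margin - dists, 0.0).mean())
```

Features already beyond the margin contribute zero loss, so genuine target-class objects far from a prototype are left untouched.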

[86] Efficient Generative Modeling beyond Memoryless Diffusion via Adjoint Schrödinger Bridge Matching

Jeongwoo Shin, Jinhwan Sul, Joonseok Lee, Jaewong Choi, Jaemoo Choi

Main category: cs.CV

TL;DR: ASBM is a two-stage generative modeling framework that learns optimal Schrödinger Bridge trajectories for more efficient diffusion sampling, producing straighter paths and enabling one-step distillation.

Motivation: Standard diffusion models suffer from highly curved trajectories and noisy score targets due to uninformative, memoryless forward processes that create independent data-noise coupling. This leads to inefficient sampling requiring many steps.

Method: Two-stage approach: 1) Learn Schrödinger Bridge forward dynamic as a coupling construction problem through data-to-energy sampling that transports data to an energy-defined prior. 2) Learn backward generative dynamic with simple matching loss supervised by the induced optimal coupling. Operates in non-memoryless regime for straighter paths.

Result: ASBM produces significantly straighter and more efficient sampling paths, scales to high-dimensional data with improved stability and efficiency. Image generation experiments show improved fidelity with fewer sampling steps. Successfully distilled to a one-step generator.

Conclusion: ASBM provides a framework for learning optimal trajectories in diffusion models, addressing inefficiencies of standard approaches through Schrödinger Bridge formulation and enabling practical applications like one-step generation.

Abstract: Diffusion models often yield highly curved trajectories and noisy score targets due to an uninformative, memoryless forward process that induces independent data-noise coupling. We propose Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that recovers optimal trajectories in high dimensions via two stages. First, we view the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem and learn it through a data-to-energy sampling perspective that transports data to an energy-defined prior. Then, we learn the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. By operating in a non-memoryless regime, ASBM produces significantly straighter and more efficient sampling paths. Compared to prior works, ASBM scales to high-dimensional data with notably improved stability and efficiency. Extensive experiments on image generation show that ASBM improves fidelity with fewer sampling steps. We further showcase the effectiveness of our optimal trajectory via distillation to a one-step generator.
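For context, the Schrödinger Bridge that the forward dynamic targets is the standard entropy-regularized transport problem between the data and prior marginals:

```latex
P^{\star} \;=\; \arg\min_{P}\; \mathrm{KL}\!\left(P \,\|\, Q\right)
\quad \text{subject to} \quad
P_{t=0} = p_{\mathrm{data}}, \qquad P_{t=1} = p_{\mathrm{prior}},
```

where $Q$ is a reference diffusion (e.g., Brownian motion). Unlike the memoryless forward process of standard diffusion, the minimizer induces a correlated data-noise coupling, and it is this induced optimal coupling that supervises ASBM's backward matching loss.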

[87] Emergent Morphing Attack Detection in Open Multi-modal Large Language Models

Marija Ivanovska, Vitomir Štruc

Main category: cs.CV

TL;DR: Open-source multimodal large language models (MLLMs) demonstrate strong zero-shot performance in face morphing attack detection without task-specific training, outperforming specialized MAD systems by 23% in EER.

Motivation: Most existing morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, MLLMs have shown strong visual-linguistic reasoning but their potential in biometric forensics remains underexplored.

Method: First systematic zero-shot evaluation of open-source MLLMs for single-image MAD using publicly available weights and standardized, reproducible protocol across diverse morphing techniques.

Result: Many MLLMs show non-trivial discriminative ability without fine-tuning. LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing task-specific MAD baselines by at least 23% in equal error rate (EER).

Conclusion: Multimodal pretraining implicitly encodes fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Open-source MLLMs serve as reproducible, interpretable foundations for biometric security and forensic image analysis.

Abstract: Face morphing attacks threaten biometric verification, yet most morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have demonstrated strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored. In this paper, we present the first systematic zero-shot evaluation of open-source MLLMs for single-image MAD, using publicly available weights and a standardized, reproducible protocol. Across diverse morphing techniques, many MLLMs show non-trivial discriminative ability without any fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing highly competitive task-specific MAD baselines by at least 23% in terms of equal error rate (EER). The results indicate that multimodal pretraining can implicitly encode fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Our findings position open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis. This emergent capability also highlights new opportunities to develop state-of-the-art MAD systems through targeted fine-tuning or lightweight adaptation, further improving accuracy and efficiency while preserving interpretability. To support future research, all code and evaluation protocols will be released upon publication.
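Since results here are reported as equal error rate: EER is the operating point where the false accept and false reject rates coincide. A simple threshold-sweep estimator (standard metric, not paper-specific code):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep a decision threshold over all observed scores and return the
    rate at the point where FAR and FRR are closest (their average)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = 1.0, 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # genuine samples rejected
        far = np.mean(impostor_scores >= t)  # morphs/impostors accepted
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2
    return float(eer)
```

A 23% relative reduction in this quantity is the headline gap between LLaVA1.6-Mistral-7B and the task-specific baselines.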

[88] RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

Youngwan Jin, Incheol Park, Yagiz Nalcakan, Hyeongjin Ju, Sanghyeop Yeo, Shiho Kim

Main category: cs.CV

TL;DR: RPT-SR is a vision transformer for infrared image super-resolution that incorporates persistent spatial priors from fixed-viewpoint scenes through a dual-token framework with regional prior tokens and local tokens.

Motivation: General-purpose super-resolution models are inefficient for infrared imaging in fixed-viewpoint scenarios (surveillance, autonomous driving) because they don't exploit persistent spatial priors, leading to redundant learning and suboptimal performance.

Method: Proposes Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR) with a dual-token framework: (1) learnable regional prior tokens that act as persistent memory for scene’s global structure, and (2) local tokens for frame-specific content. These tokens are fused in an attention mechanism allowing priors to dynamically modulate local reconstruction.

Result: Extensive experiments validate the approach, establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) infrared spectra.

Conclusion: RPT-SR effectively addresses the inefficiency of general-purpose super-resolution models in fixed-viewpoint infrared imaging by explicitly encoding scene layout information into the attention mechanism, demonstrating broad applicability across different infrared spectra.

Abstract: General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene’s global structure, with (2) local tokens that capture the frame-specific content of the current input. By incorporating these tokens into the attention mechanism, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
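The dual-token idea reduces to letting frame-specific tokens attend over the concatenation of learnable prior tokens and themselves. A single-head numpy sketch (names are illustrative; the real block will add projections, normalization, and multi-head structure):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prior_attention(local_tokens, prior_tokens):
    """Local (frame-specific) tokens attend over [prior; local], so the
    persistent scene memory can modulate per-frame reconstruction."""
    kv = np.concatenate([prior_tokens, local_tokens], axis=0)  # (P+N, d)
    d = local_tokens.shape[-1]
    scores = local_tokens @ kv.T / np.sqrt(d)                  # (N, P+N)
    return softmax(scores, axis=-1) @ kv                       # (N, d)
```

Because the prior tokens are parameters rather than inputs, they persist across frames and accumulate the fixed scene's global structure during training.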

[89] LEADER: Lightweight End-to-End Attention-Gated Dual Autoencoder for Robust Minutiae Extraction

Raffaele Cappelli, Matteo Ferrara

Main category: cs.CV

TL;DR: LEADER is a lightweight end-to-end neural network for fingerprint minutiae extraction that eliminates separate preprocessing/postprocessing steps, achieving state-of-the-art accuracy with only 0.9M parameters.

Motivation: Deep learning approaches for fingerprint minutiae extraction still rely on separate preprocessing and postprocessing steps; the authors aim to develop a truly end-to-end method that directly maps raw fingerprint images to minutiae descriptors.

Method: Proposes LEADER with a dual-autoencoder structure interconnected through attention-gating, novel “Castle-Moat-Rampart” ground-truth encoding, and integrates non-maximum suppression and angular decoding for complete end-to-end inference.

Result: Achieves 34% higher F1-score on NIST SD27 dataset compared to specialized latent minutiae extractors, with state-of-the-art accuracy on plain fingerprints and robust cross-domain generalization to latent impressions.

Conclusion: LEADER demonstrates that lightweight end-to-end architectures can outperform traditional methods in fingerprint minutiae extraction while learning meaningful internal representations aligned with domain knowledge.

Abstract: Minutiae extraction, a fundamental stage in fingerprint recognition, is increasingly shifting toward deep learning. However, truly end-to-end methods that eliminate separate preprocessing and postprocessing steps remain scarce. This paper introduces LEADER (Lightweight End-to-end Attention-gated Dual autoencodER), a neural network that maps raw fingerprint images to minutiae descriptors, including location, direction, and type. The proposed architecture integrates non-maximum suppression and angular decoding to enable complete end-to-end inference using only 0.9M parameters. It employs a novel “Castle-Moat-Rampart” ground-truth encoding and a dual-autoencoder structure, interconnected through an attention-gating mechanism. Experimental evaluations demonstrate state-of-the-art accuracy on plain fingerprints and robust cross-domain generalization to latent impressions. Specifically, LEADER attains a 34% higher F1-score on the NIST SD27 dataset compared to specialized latent minutiae extractors. Sample-level analysis on this challenging benchmark reveals an average rank of 2.07 among all compared methods, with LEADER securing the first-place position in 47% of the samples, more than doubling the frequency of the second-best extractor. The internal representations learned by the model align with established fingerprint domain features, such as segmentation masks, orientation fields, frequency maps, and skeletons. Inference requires 15ms on GPU and 322ms on CPU, outperforming leading commercial software in computational efficiency. The source code and pre-trained weights are publicly released to facilitate reproducibility.
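For readers unfamiliar with the decoding step: non-maximum suppression turns a minutia score map into a point list by keeping only local maxima. A plain grid-NMS sketch (a reference baseline, not the paper's integrated in-network variant):

```python
import numpy as np

def nms_2d(score_map, radius=1, threshold=0.5):
    """Keep points that exceed `threshold` and are the maximum within a
    (2*radius+1)^2 neighborhood; returns (row, col, score) tuples."""
    H, W = score_map.shape
    peaks = []
    for y in range(H):
        for x in range(W):
            s = score_map[y, x]
            if s < threshold:
                continue
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            if s >= score_map[y0:y1, x0:x1].max():
                peaks.append((y, x, float(s)))
    return peaks
```

In LEADER this suppression (together with angular decoding of directions) happens inside the network, which is what removes the usual postprocessing stage.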

[90] Semantic-Guided 3D Gaussian Splatting for Transient Object Removal

Aditi Prabakaran, Priyesh Shukla

Main category: cs.CV

TL;DR: Semantic filtering framework using vision-language models to remove transient objects in 3D Gaussian Splatting reconstruction, addressing ghosting artifacts without motion-based heuristics or high memory cost.

Motivation: Transient objects in multi-view captures cause ghosting artifacts in 3DGS reconstruction. Existing solutions have limitations: scene decomposition requires high memory cost, and motion-based heuristics are vulnerable to parallax ambiguity.

Method: Proposes semantic filtering using vision-language models (CLIP). Accumulates CLIP similarity scores between rendered views and distractor text prompts per-Gaussian across training iterations. Gaussians exceeding calibrated threshold undergo opacity regularization and periodic pruning.

Result: Experiments on RobustNeRF benchmark show consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance.

Conclusion: Semantic classification resolves parallax ambiguity by identifying object categories independently of motion patterns. Semantic guidance is a practical strategy for transient removal in scenarios with predictable distractor categories.

Abstract: Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
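The per-Gaussian bookkeeping is straightforward to sketch: accumulate CLIP distractor-prompt similarity over training iterations, then threshold. A toy numpy stand-in (the actual pipeline applies opacity regularization before periodic pruning; all names here are illustrative):

```python
import numpy as np

def accumulate_distractor_scores(acc, per_gaussian_sim):
    """Running sum of CLIP similarity to distractor text prompts for each
    Gaussian, updated once per training iteration."""
    return acc + per_gaussian_sim

def prune_transient_gaussians(opacities, acc, n_iters, threshold):
    """Drop Gaussians whose mean accumulated similarity exceeds the
    calibrated threshold; a toy stand-in for opacity regularization
    followed by pruning."""
    mean_sim = acc / n_iters
    keep = mean_sim <= threshold
    return opacities * keep, keep
```

Accumulating across iterations, rather than thresholding a single view, is what makes the category decision robust to per-view CLIP noise.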

[91] Advanced Acceptance Score: A Holistic Measure for Biometric Quantification

Aman Verma, Seshan Srirangarajan, Sumantra Dutta Roy

Main category: cs.CV

TL;DR: Proposes a comprehensive evaluation framework for biometric gesture recognition systems that goes beyond traditional error rates to assess the quality of fitness scores through ranking, relevance, and feature disentanglement metrics.

Motivation: Existing biometric capacity estimation methods rely on error rates which don't indicate the goodness of fitness scores. There's a need for better evaluation measures that assess ranking order and relevance of output scores in gesture recognition systems.

Method: Develops an advanced acceptance score that integrates: 1) rank deviation, 2) rewards for higher scores of high-ranked gestures and lower scores of low-ranked gestures, 3) compensation for correspondence between output and ground truth score trends, and 4) discounting based on identity feature disentanglement.

Result: Experiments on three datasets with five SOTA models show the proposed measure selects more appropriate optimal scores than existing measures and correlates with existing measures, validating its reliability.

Conclusion: The proposed holistic evaluation measure provides a more comprehensive assessment of biometric gesture recognition systems than traditional error rates, with public code available for reproducibility.

Abstract: Quantifying biometric characteristics within hand gestures involves deriving fitness scores from a gesture- and identity-aware feature space. However, evaluating the quality of these scores remains an open question. The existing biometric capacity estimation literature relies on error rates, but these rates do not indicate the goodness of the scores. Thus, in this manuscript we present an exhaustive set of evaluation measures. We first identify the ranking order and relevance of output scores as the primary basis for evaluation. In particular, we consider both rank deviation as well as rewards for: (i) higher scores of high-ranked gestures and (ii) lower scores of low-ranked gestures. We also compensate for correspondence between the trends of output and ground-truth scores. Finally, we account for disentanglement between the identity features of gestures as a discounting factor. Integrating these elements with adequate weighting, we formulate the advanced acceptance score as a holistic evaluation measure. To assess the effectiveness of the proposed measure, we perform in-depth experiments over three datasets with five state-of-the-art (SOTA) models. Results show that the optimal score selected with our measure is more appropriate than those selected with existing measures. Our proposed measure also correlates with existing measures, which further validates its reliability. We have made our code public: https://github.com/AmanVerma2307/MeasureSuite.

[92] Dynamic Training-Free Fusion of Subject and Style LoRAs

Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang

Main category: cs.CV

TL;DR: Dynamic training-free LoRA fusion framework for text-to-image generation that adaptively combines subject and style LoRAs using KL divergence and gradient-based corrections throughout the diffusion process.

DetailsMotivation: Existing LoRA fusion methods use static heuristics that ignore adaptive feature adjustments and input randomness, limiting coherent subject-style synthesis in text-to-image generation.

Method: Dynamic framework with two mechanisms: 1) Feature-level selection using KL divergence between base model features and LoRA-produced features at each layer, 2) Gradient-based corrections using CLIP/DINO scores during reverse denoising for continuous semantic/stylistic guidance.

Result: Outperforms state-of-the-art LoRA fusion methods across diverse subject-style combinations both qualitatively and quantitatively, achieving coherent synthesis without retraining.

Conclusion: Dynamic training-free fusion through feature-level selection and metric-guided latent adjustment enables superior subject-style synthesis compared to static fusion approaches.

Abstract: Recent studies have explored the combination of multiple LoRAs to simultaneously generate user-specified subjects and styles. However, most existing approaches fuse LoRA weights using static statistical heuristics that deviate from LoRA’s original purpose of learning adaptive feature adjustments and ignore the randomness of sampled inputs. To address this, we propose a dynamic training-free fusion framework that operates throughout the generation process. During the forward pass, at each LoRA-applied layer, we dynamically compute the KL divergence between the base model’s original features and those produced by subject and style LoRAs, respectively, and adaptively select the most appropriate weights for fusion. In the reverse denoising stage, we further refine the generation trajectory by dynamically applying gradient-based corrections derived from objective metrics such as CLIP and DINO scores, providing continuous semantic and stylistic guidance. By integrating these two complementary mechanisms-feature-level selection and metric-guided latent adjustment-across the entire diffusion timeline, our method dynamically achieves coherent subject-style synthesis without any retraining. Extensive experiments across diverse subject-style combinations demonstrate that our approach consistently outperforms state-of-the-art LoRA fusion methods both qualitatively and quantitatively.
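The feature-level selection step can be sketched as below. The paper does not spell out the exact per-layer rule; weighting each LoRA branch by the softmax of its negative KL divergence from the base features is one plausible, purely illustrative choice:

```python
import numpy as np

def kl_div(p, q, eps=1e-8):
    # KL divergence between two magnitude-normalized feature vectors
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def fuse_lora_features(base, subject, style):
    """Weight each LoRA branch by how little its features diverge
    from the base model's features at this layer (illustrative rule,
    not the paper's actual selection criterion)."""
    mags = [np.abs(f) + 1e-8 for f in (base, subject, style)]
    d = np.array([kl_div(mags[0], mags[1]), kl_div(mags[0], mags[2])])
    w = np.exp(-d)
    w = w / w.sum()          # softmax over negative divergences
    return w[0] * subject + w[1] * style, w
```

Under this rule, a branch whose features stay close to the base distribution receives a larger share of the fused output at that layer.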

[93] Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng

Main category: cs.CV

TL;DR: PADE is a training-free attention intervention method that leverages internal Positive Attention Dynamics in LVLMs to identify semantically core visual regions, reduce hallucinations, and improve visual grounding without additional computational overhead.

DetailsMotivation: LVLMs suffer from hallucinations and inconsistent outputs with visual inputs. Existing training-free methods have high computational overhead, potential interference, and vulnerability to attention sink phenomena, necessitating a more efficient solution.

Method: Proposes Positive Attention Dynamics Enhancement (PADE): 1) constructs PAD maps to identify semantically core visual regions, 2) applies per-head Median Absolute Deviation Scaling for adaptive intervention strength, and 3) uses System-Token Compensation to maintain attention to complex instructions and ensure long-term output consistency.

Result: Experiments on multiple LVLMs and benchmarks show PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.

Conclusion: PADE demonstrates that internal attention dynamics in LVLMs can be effectively leveraged to enhance visual grounding and reduce hallucinations without training or significant computational overhead, offering a practical solution for reliable multimodal reasoning.

Abstract: LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods, including contrastive decoding and auxiliary expert models, which incur several times more computational overhead and may introduce potential interference, as well as static internal signal enhancement, are often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
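The per-head Median Absolute Deviation scaling can be illustrated for a single attention head as follows; the threshold `k` and the boost factor are hypothetical, and the real method couples this with the PAD map and System-Token Compensation:

```python
import numpy as np

def mad_scale_attention(attn, k=1.0, boost=1.5):
    """Sketch of per-head MAD scaling: robustly amplify attention entries
    that stand out from the row median by more than k * MAD, then
    renormalize. attn: (queries, keys) weights with rows summing to 1."""
    med = np.median(attn, axis=-1, keepdims=True)
    mad = np.median(np.abs(attn - med), axis=-1, keepdims=True) + 1e-8
    out = np.where(attn > med + k * mad, attn * boost, attn)
    return out / out.sum(axis=-1, keepdims=True)
```

Using the median and MAD instead of mean and standard deviation keeps the intervention stable under attention-sink outliers, which is the robustness the summary alludes to.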

[94] Intracoronary Optical Coherence Tomography Image Processing and Vessel Classification Using Machine Learning

Amal Lahchim, Lambros Athanasiou

Main category: cs.CV

TL;DR: Automated pipeline for coronary OCT image analysis using ML techniques for vessel segmentation and classification with high accuracy (99.68%)

DetailsMotivation: Intracoronary OCT provides high-resolution coronary vessel visualization but faces challenges with noise, imaging artifacts, and complex tissue structures, requiring automated solutions for clinical applications

Method: Integrated pipeline with image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, local feature extraction, and Logistic Regression/SVM classifiers for pixel-wise vessel classification

Result: Achieved excellent performance with precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%

Conclusion: Proposed approach provides accurate vessel boundary detection with low computational complexity and minimal manual annotation, offering reliable automated OCT analysis for clinical decision support

Abstract: Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
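One concrete step of the pipeline, the polar-to-Cartesian transformation, can be sketched as below. This is a simplified nearest-neighbor version; a real OCT pipeline would typically interpolate and handle calibration:

```python
import numpy as np

def polar_to_cartesian(polar_img):
    """Resample a polar OCT frame (rows = A-scan depth, columns = angle)
    into a Cartesian cross-section via nearest-neighbor lookup."""
    n_r, n_theta = polar_img.shape
    size = 2 * n_r
    ys, xs = np.mgrid[0:size, 0:size]
    cx = (size - 1) / 2.0
    dx, dy = xs - cx, ys - cx
    r = np.sqrt(dx ** 2 + dy ** 2) * (n_r - 1) / cx
    theta = (np.arctan2(dy, dx) % (2 * np.pi)) * n_theta / (2 * np.pi)
    ri = np.clip(np.round(r).astype(int), 0, n_r - 1)
    ti = np.round(theta).astype(int) % n_theta
    cart = polar_img[ri, ti]
    cart[r > n_r - 1] = 0  # blank pixels outside the acquisition radius
    return cart
```

The transformed frames would then feed the K-means clustering and pixel-wise classifiers described in the abstract.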

[95] An Industrial Dataset for Scene Acquisitions and Functional Schematics Alignment

Flavien Armangeon, Thibaud Ehret, Enric Meinhardt-Llopis, Rafael Grompone von Gioi, Guillaume Thibault, Marc Petit, Gabriele Facciolo

Main category: cs.CV

TL;DR: IRIS-v2 dataset for aligning functional schematics with 2D/3D scene data in industrial facilities, using segmentation and graph matching to automate digital twin creation.

DetailsMotivation: Manual alignment of functional schematics with 2D/3D scene data for building digital twins of old industrial facilities is tedious and doesn't scale. Inconsistencies between schematics and reality, plus lack of public industrial datasets, make this an underexplored challenge.

Method: Introduces IRIS-v2 comprehensive dataset with images, point clouds, 2D annotations, segmentation masks, CAD model, 3D pipe routing, and P&ID diagrams. Uses segmentation and graph matching for alignment in a practical case study.

Result: Presents a dataset to support research in this area and demonstrates alignment approach that aims to reduce time required for the task.

Conclusion: IRIS-v2 dataset addresses the scarcity of industrial datasets and provides a foundation for automating schematic-to-scene alignment for digital twin creation in industrial facilities.

Abstract: Aligning functional schematics with 2D and 3D scene acquisitions is crucial for building digital twins, especially for old industrial facilities that lack native digital models. Current manual alignment using images and LiDAR data does not scale because of the tediousness and complexity of industrial sites. Inconsistencies between schematics and reality, together with the scarcity of public industrial datasets, make the problem both challenging and underexplored. This paper introduces IRIS-v2, a comprehensive dataset to support further research. It includes images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the P&ID (Piping and Instrumentation Diagram). The alignment is evaluated in a practical case study, aiming to reduce the time required for this task by combining segmentation and graph matching.

[96] Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation

Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia, Valerio Guarrasi

Main category: cs.CV

TL;DR: CEMRAG is a unified framework that combines clinical concept decomposition with multimodal retrieval-augmented generation to improve both interpretability and factual accuracy in radiology report generation using vision-language models.

DetailsMotivation: Current vision-language models for radiology report generation lack interpretability and tend to hallucinate findings not supported by imaging evidence. Existing approaches treat interpretability and accuracy as separate objectives, with concept-based methods focusing on transparency and RAG methods targeting factual grounding through external retrieval.

Method: CEMRAG decomposes visual representations into interpretable clinical concepts and integrates them with multimodal retrieval-augmented generation. This creates enriched contextual prompts that combine visual concept transparency with factual grounding from retrieved evidence.

Result: Experiments on MIMIC-CXR and IU X-Ray datasets across multiple VLM architectures, training regimes, and retrieval configurations show consistent improvements over conventional RAG and concept-only baselines on both clinical accuracy metrics and standard NLP measures.

Conclusion: The framework challenges the assumed trade-off between interpretability and performance, demonstrating that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. The modular design provides a principled pathway toward clinically trustworthy AI-assisted radiology.

Abstract: Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods targeting factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
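The enriched contextual prompt described above can be sketched as a simple assembly step. The field names and layout here are illustrative, not CEMRAG's actual prompt template:

```python
def build_cemrag_prompt(concepts, retrieved_reports, finding_request):
    """Hypothetical prompt assembly for concept-enhanced multimodal RAG:
    detected clinical concepts plus retrieved report evidence form the
    enriched context fed to the VLM."""
    concept_str = ", ".join(f"{name} ({score:.2f})" for name, score in concepts)
    evidence = "\n".join(f"- {r}" for r in retrieved_reports)
    return (
        f"Detected clinical concepts: {concept_str}\n"
        f"Retrieved similar reports:\n{evidence}\n"
        f"Task: {finding_request}"
    )
```

Exposing the concept scores in the prompt is what makes the conditioning interpretable: a clinician can see which concepts drove the generated report.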

[97] A Novel Public Dataset for Strawberry (Fragaria x ananassa) Ripeness Detection and Comparative Evaluation of YOLO-Based Models

Mustafa Yurdakul, Zeynep Sena Bastug, Ali Emre Gok, Sakir Taşdemir

Main category: cs.CV

TL;DR: A new public strawberry ripeness dataset with 566 images and 1,201 labeled objects is introduced, with comparative evaluation of YOLO models showing YOLOv8s achieves best mAP@50 of 86.09% for smart agriculture applications.

DetailsMotivation: Traditional visual assessment of strawberry ripeness is subjective and error-prone, requiring computer-assisted systems. However, the lack of comprehensive public datasets makes it difficult to compare studies in this field.

Method: Created a new publicly available strawberry ripeness dataset collected under variable light and environmental conditions in two Turkish greenhouses. Conducted comparative tests using YOLOv8, YOLOv9, and YOLO11-based models to evaluate ripeness detection performance.

Result: YOLOv9c achieved highest precision (90.94%), YOLO11s achieved highest recall (83.74%), and YOLOv8s achieved best overall performance with mAP@50 of 86.09%. Small and medium-sized models worked more balanced and efficiently on this dataset.

Conclusion: The dataset establishes a fundamental reference point for smart agriculture applications, showing that computer vision models can effectively detect strawberry ripeness, with smaller models performing well on this type of agricultural dataset.

Abstract: The strawberry (Fragaria x ananassa), known worldwide for its economic value and nutritional richness, is a widely cultivated fruit. Determining the correct ripeness level during the harvest period is crucial for both preventing losses for producers and ensuring consumers receive a quality product. However, traditional methods, i.e., visual assessments alone, can be subjective and have a high margin of error. Therefore, computer-assisted systems are needed. However, the scarcity of comprehensive datasets accessible to everyone in the literature makes it difficult to compare studies in this field. In this study, a new and publicly available strawberry ripeness dataset, consisting of 566 images and 1,201 labeled objects, prepared under variable light and environmental conditions in two different greenhouses in Turkey, is presented to the literature. Comparative tests conducted on the data set using YOLOv8, YOLOv9, and YOLO11-based models showed that the highest precision value was 90.94% in the YOLOv9c model, while the highest recall value was 83.74% in the YOLO11s model. In terms of the general performance criterion mAP@50, YOLOv8s was the best performing model with a success rate of 86.09%. The results show that small and medium-sized models work more balanced and efficiently on this type of dataset, while also establishing a fundamental reference point for smart agriculture applications.

[98] Bayesian Optimization for Design Parameters of 3D Image Data Analysis

David Exler, Joaquin Eduardo Urrutia Gómez, Martin Krüger, Maike Schliephake, John Jbeily, Mario Vitacolonna, Rüdiger Rudolf, Markus Reischl

Main category: cs.CV

TL;DR: 3D biomedical image analysis pipeline using Bayesian optimization to automate model selection and parameter tuning for segmentation and classification tasks.

DetailsMotivation: Manual model selection and parameter tuning for 3D biomedical image segmentation/classification is time-consuming and impractical for large-scale data, requiring automated optimization solutions.

Method: Two-stage Bayesian Optimization pipeline: 1) selects segmentation model and optimizes postprocessing parameters using domain-adapted benchmark, 2) optimizes classifier design choices (encoder, classifier head, pretraining) with assisted annotation workflow.

Result: Pipeline efficiently identifies effective model/parameter configurations in four case studies, reducing manual annotation effort through assisted workflow.

Conclusion: The 3D data Analysis Optimization Pipeline successfully automates model selection and parameter tuning for biomedical imaging, addressing practical bottlenecks in large-scale 3D analysis.

Abstract: Deep learning-based segmentation and classification are crucial to large-scale biomedical imaging, particularly for 3D data, where manual analysis is impractical. Although many methods exist, selecting suitable models and tuning parameters remains a major bottleneck in practice. Hence, we introduce the 3D data Analysis Optimization Pipeline, a method designed to facilitate the design and parameterization of segmentation and classification using two Bayesian Optimization stages. First, the pipeline selects a segmentation model and optimizes postprocessing parameters using a domain-adapted synthetic benchmark dataset. To ensure a concise evaluation of segmentation performance, we introduce a segmentation quality metric that serves as the objective function. Second, the pipeline optimizes design choices of a classifier, such as encoder and classifier head architectures, incorporation of prior knowledge, and pretraining strategies. To reduce manual annotation effort, this stage includes an assisted class-annotation workflow that extracts predicted instances from the segmentation results and sequentially presents them to the operator, eliminating the need for manual tracking. In four case studies, the 3D data Analysis Optimization Pipeline efficiently identifies effective model and parameter configurations for individual datasets.
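The core Bayesian Optimization loop behind both stages can be sketched in one dimension. This is a stand-in for the idea only (GP surrogate plus an upper-confidence-bound acquisition), not the paper's two-stage pipeline or its actual objective:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel between two 1-D point sets
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def bayes_opt_1d(f, lo, hi, n_init=4, n_iter=15, seed=0):
    """Minimal 1-D Bayesian optimization: fit a GP surrogate to the
    evaluations so far, then sample where the UCB acquisition peaks."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, n_init)
    y = np.array([f(x) for x in X])
    grid = np.linspace(lo, hi, 200)
    for _ in range(n_iter):
        K = rbf(X, X) + 1e-6 * np.eye(len(X))
        Ks = rbf(grid, X)
        mu = Ks @ np.linalg.solve(K, y)
        var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        ucb = mu + 2.0 * np.sqrt(np.clip(var, 0.0, None))
        x_next = grid[np.argmax(ucb)]       # explore/exploit trade-off
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    best = np.argmax(y)
    return X[best], y[best]
```

In the pipeline, `f` would be the segmentation quality metric (stage one) or classifier validation score (stage two), evaluated over model and parameter choices rather than a scalar.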

[99] Criteria-first, semantics-later: reproducible structure discovery in image-based sciences

Jan Bumberger

Main category: cs.CV

TL;DR: A framework for criteria-first, semantics-later image analysis that separates structure extraction from semantic mapping to address limitations of label-based approaches in scientific discovery.

DetailsMotivation: Current semantics-first approaches fail in open-ended scientific discovery, cross-sensor comparability, and long-term monitoring where domain ontologies drift. There's a need for more robust, reproducible analysis methods.

Method: Proposes a criteria-first framework that performs semantics-free structure discovery using explicit optimality criteria, then maps discovered structures to domain ontologies downstream.

Result: Provides a unified framework for reproducible analysis across image-based sciences, enabling stable structural products and plural semantic interpretations without rewriting extraction.

Conclusion: Criteria-first structure discovery with downstream semantic mapping offers a more robust paradigm for scientific image analysis, supporting FAIR digital objects for long-term monitoring.

Abstract: Across the natural and life sciences, images have become a primary measurement modality, yet the dominant analytic paradigm remains semantics-first. Structure is recovered by predicting or enforcing domain-specific labels. This paradigm fails systematically under the conditions that make image-based science most valuable, including open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring in which domain ontologies and associated label sets drift culturally, institutionally, and ecologically. A deductive inversion is proposed in the form of criteria-first and semantics-later. A unified framework for criteria-first structure discovery is introduced. It separates criterion-defined, semantics-free structure extraction from downstream semantic mapping into domain ontologies or vocabularies and provides a domain-general scaffold for reproducible analysis across image-based sciences. Reproducible science requires that the first analytic layer perform criterion-driven, semantics-free structure discovery, yielding stable partitions, structural fields, or hierarchies defined by explicit optimality criteria rather than local domain ontologies. Semantics is not discarded; it is relocated downstream as an explicit mapping from the discovered structural product to a domain ontology or vocabulary, enabling plural interpretations and explicit crosswalks without rewriting upstream extraction. Grounded in cybernetics, observation-as-distinction, and information theory’s separation of information from meaning, the argument is supported by cross-domain evidence showing that criteria-first components recur whenever labels do not scale. Finally, consequences are outlined for validation beyond class accuracy and for treating structural products as FAIR, AI-ready digital objects for long-term monitoring and digital twins.

[100] ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

Main category: cs.CV

TL;DR: ToaSt is a decoupled framework for efficient Vision Transformers that applies specialized compression strategies to different components: structured pruning for attention modules and token channel selection for feed-forward networks.

DetailsMotivation: Vision Transformers achieve great performance but have high computational costs. Existing compression methods like structured pruning require long retraining times, while token compression suffers from global propagation issues that create optimization challenges.

Method: ToaSt uses a decoupled approach: 1) Coupled head-wise structured pruning for Multi-Head Self-Attention modules, leveraging attention operation characteristics for robustness; 2) Token Channel Selection (TCS) for Feed-Forward Networks (which account for over 60% of FLOPs), which enhances compression ratios while avoiding global propagation issues.

Result: Extensive evaluations across nine diverse models (DeiT, ViT-MAE, Swin Transformer) show superior accuracy-efficiency trade-offs. On ViT-MAE-Huge: 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. Effective transfer to downstream tasks: 52.2 vs 51.9 mAP on COCO object detection.

Conclusion: ToaSt provides an effective decoupled framework for Vision Transformer compression that addresses limitations of existing methods, achieving better efficiency-accuracy trade-offs and demonstrating strong transfer performance to downstream vision tasks.

Abstract: Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
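The channel-selection idea can be illustrated with a minimal sketch. Scoring channels by their L2 norm across tokens is an assumption for illustration; the actual ToaSt selection criterion may differ:

```python
import numpy as np

def token_channel_select(x, keep_ratio=0.5):
    """Illustrative Token Channel Selection for an FFN input:
    score channels by their L2 norm across tokens and keep only
    the strongest fraction. x: (tokens, channels) activations."""
    n_tokens, n_ch = x.shape
    k = max(1, int(n_ch * keep_ratio))
    scores = np.linalg.norm(x, axis=0)          # one score per channel
    keep = np.sort(np.argsort(scores)[-k:])     # indices of top-k channels
    return x[:, keep], keep
```

Because the selection happens inside the FFN, the reduced width never propagates to the rest of the network, which is how TCS avoids the global-propagation issues of token-level compression.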

[101] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao

Main category: cs.CV

TL;DR: Retrieval-augmented framework improves LLM-based Vision-and-Language Navigation efficiency by using instruction-level trajectory retrieval and step-level candidate pruning without LLM fine-tuning.

DetailsMotivation: LLMs show promise for VLN but suffer from inefficient decision-making due to repeated instruction interpretation and reasoning over verbose navigable candidates at each step.

Method: Two-level retrieval approach: 1) Episode-level instruction embedding retriever selects similar successful trajectories as in-context exemplars; 2) Step-level imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference.

Result: Consistent improvements in Success Rate, Oracle Success Rate, and SPL on R2R benchmark for both seen and unseen environments; both retrieval components provide complementary benefits.

Conclusion: Retrieval-augmented decision support is effective and scalable for enhancing LLM-based vision-and-language navigation without modifying the underlying language model.

Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
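The episode-level retrieval step amounts to nearest-neighbor search in embedding space; a minimal sketch, assuming precomputed instruction embeddings for a bank of past successful trajectories:

```python
import numpy as np

def retrieve_exemplars(query_emb, bank_embs, bank_texts, k=2):
    """Rank a bank of past successful trajectories by cosine similarity
    to the current instruction embedding and return the top-k as
    in-context exemplars for the LLM prompt."""
    q = query_emb / np.linalg.norm(query_emb)
    B = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    top = np.argsort(-(B @ q))[:k]
    return [bank_texts[i] for i in top]
```

The step-level candidate retriever plays an analogous role but is imitation-learned, scoring each navigable direction before the pruned set reaches the LLM.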

[102] Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding

Guile Wu, David Huang, Bingbing Liu, Dongfeng Bai

Main category: cs.CV

TL;DR: A unified 3D scene understanding framework using language and geometry grounded sparse voxel representations to model appearance, semantics, and geometry synergistically.

DetailsMotivation: Existing 3D open-vocabulary scene understanding methods focus on distilling language features from 2D models but overlook the synergy among appearance, semantics, and geometry, causing scene understanding to deviate from geometric structure and become decoupled from reconstruction.

Method: Uses 3D sparse voxels as primitives with appearance, density, feature, and confidence fields. Includes feature modulation module for synergy, distills language features from 2D foundation models, and integrates geometric distillation via depth correlation and pattern consistency regularization from geometry foundation models.

Result: Achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.

Conclusion: Proposed approach successfully models appearance, semantics, and geometry within a unified framework, demonstrating improved synergy and performance in 3D scene understanding.

Abstract: Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.

[103] RaCo: Ranking and Covariance for Practical Learned Keypoints

Abhiram Shenoi, Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys

Main category: cs.CV

TL;DR: RaCo is a lightweight neural network for learning robust 3D keypoints with repeatability ranking and metric covariance estimation, trained only on perspective image crops without needing covisible image pairs.

DetailsMotivation: The paper aims to create versatile keypoints for 3D computer vision tasks that are robust to rotations and can operate without requiring covisible image pairs, addressing limitations in existing keypoint detection methods.

Method: RaCo integrates three components: 1) repeatable keypoint detector, 2) differentiable ranker to maximize matches with limited keypoints, and 3) covariance estimator for metric spatial uncertainty. It uses extensive data augmentation for rotational robustness without expensive equivariant architectures.

Result: Achieves state-of-the-art performance in keypoint repeatability and two-view matching on challenging datasets, particularly under large in-plane rotations, while being computationally efficient.

Conclusion: RaCo provides an effective strategy for independent keypoint ranking and metric covariance estimation without additional labels, producing interpretable and repeatable interest points for 3D vision tasks.

Abstract: This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: the repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at https://github.com/cvg/RaCo.

[104] Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu

Main category: cs.CV

TL;DR: The R3 framework addresses the trade-off between generation and understanding in multimodal models by reframing generation as a multi-step “generate-understand-regenerate” process.

DetailsMotivation: Current multimodal models face a fundamental trade-off where improving generative capabilities often degrades understanding abilities, and vice versa. The authors identify this as a competitive dynamic between generation and understanding within models.

Method: Proposes Reason-Reflect-Refine (R3) framework that transforms single-step generation into a multi-step process: 1) Reason (initial generation), 2) Reflect (understanding/analysis of generated content), 3) Refine (regeneration based on understanding). This explicitly leverages the model’s understanding capability during generation.

Result: Successfully mitigates the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. The framework provides insights for designing next-generation unified multimodal models.

Conclusion: The R3 framework offers a promising approach to address the fundamental trade-off between generation and understanding in multimodal models by making understanding an explicit part of the generation process, leading to improvements in both capabilities.

Abstract: Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its primary cause as a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of “generate-understand-regenerate”. By explicitly leveraging the model’s understanding capability during generation, we successfully mitigate the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
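The multi-step “generate-understand-regenerate” loop can be sketched as follows. This is an illustrative toy, not the paper's implementation: `generate`, `critique`, and `refine` are hypothetical stand-ins for calls to a unified multimodal model, implemented here as plain string functions.

```python
# Hypothetical sketch of the Reason-Reflect-Refine (R3) loop.
# Each function stands in for a call to a unified multimodal model.

def generate(prompt):
    # Reason: produce an initial draft for the prompt.
    return f"draft for: {prompt}"

def critique(prompt, output):
    # Reflect: use the model's understanding ability to analyze the draft.
    return f"critique of '{output}' w.r.t. '{prompt}'"

def refine(prompt, output, feedback):
    # Refine: regenerate conditioned on the critique.
    return f"refined({output} | {feedback})"

def r3(prompt, steps=2):
    out = generate(prompt)
    for _ in range(steps):
        out = refine(prompt, out, critique(prompt, out))
    return out

result = r3("a cat playing piano")
```

The key structural point is that understanding (`critique`) sits inside the generation loop rather than being a separate capability.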

[105] NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy

Laura Salort-Benejam, Antonio Agudo

Main category: cs.CV

TL;DR: NeRFscopy: A self-supervised neural rendering pipeline for 3D reconstruction and novel view synthesis of deformable endoscopic tissues from monocular videos, addressing challenges in medical endoscopy visualization.

DetailsMotivation: Endoscopy is crucial for medical imaging but faces challenges in 3D reconstruction due to tissue deformability, monocular cameras, illumination changes, occlusions, and unknown camera trajectories. A robust dynamic 3D reconstruction pipeline could enhance visualization, improve diagnostic accuracy, aid treatment planning, and guide surgical procedures.

Method: NeRFscopy uses neural rendering with a deformable model consisting of a canonical radiance field and time-dependent deformation field parameterized by SE(3) transformations. The approach efficiently exploits color images through sophisticated learning terms to create a 3D implicit model without requiring templates or pre-trained models, learning solely from data in a self-supervised manner.

Result: NeRFscopy achieves accurate results in novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.

Conclusion: The proposed NeRFscopy pipeline successfully addresses the challenges of 3D reconstruction in endoscopic imaging through neural rendering techniques, enabling improved visualization and potential clinical applications without requiring template models or pre-training.

Abstract: Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.
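The time-dependent deformation field above warps canonical points with SE(3) transformations. A minimal sketch of applying one such transform (rotation via Rodrigues' formula plus translation) is shown below; the axis-angle and translation values are illustrative, not learned.

```python
import numpy as np

# Minimal sketch of warping canonical 3D points with an SE(3) transform,
# the building block of a time-dependent deformation field.

def se3_apply(points, axis_angle, translation):
    """Rotate `points` (N, 3) by the axis-angle vector, then translate."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        k = axis_angle / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        # Rodrigues' rotation formula: R = I + sin(t) K + (1 - cos(t)) K^2
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return points @ R.T + translation

pts = np.array([[1.0, 0.0, 0.0]])
warped = se3_apply(pts, np.array([0.0, 0.0, np.pi / 2]), np.array([0.0, 0.0, 1.0]))
# a 90-degree rotation about z maps (1,0,0) to (0,1,0), then +1 along z
```

In the actual pipeline, a network would predict a per-time axis-angle and translation for each point; here they are fixed constants.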

[106] Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting

Ines Montoya-Espinagosa, Antonio Agudo

Main category: cs.CV

TL;DR: Hybrid multimodal approach combining sky images, PV energy history, and meteorological data for short/long-term solar forecasting with improved ramp event prediction and cloudy condition robustness.

DetailsMotivation: Address photovoltaic forecasting challenges due to renewable energy variability, particularly solar energy's intermittent nature, to support efficient power grid operation and better solar variability management.

Method: Multimodal approach combining sky images, photovoltaic energy history, and meteorological data using deep neural models for both nowcasting and forecasting, incorporating individual/multiple meteorological variables and analytical solar position.

Result: Inclusion of meteorological data (surface long-wave, radiation downwards, wind + solar position combination) significantly improves predictions in both nowcasting and forecasting tasks, especially on cloudy days.

Conclusion: Integrating diverse data sources improves reliability and interpretability of solar energy prediction models, highlighting the value of multimodal approaches for renewable energy forecasting.

Abstract: Due to the rise in the use of renewable energies as an alternative to traditional ones, and especially solar energy, there is increasing interest in studying how to address photovoltaic forecasting in the face of the challenge of variability in photovoltaic energy production, using different methodologies. This work develops a hybrid approach for short and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly the surface long-wave, radiation downwards, and the combination of wind and solar position, significantly improves current predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.

[107] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers

Lucas Sancéré, Noémie Moreau, Katarzyna Bozek

Main category: cs.CV

TL;DR: Graph Transformers applied to whole-slide image cell graphs outperform image-based methods for classifying healthy vs tumor epithelial cells in skin cancer, leveraging cellular context through graph representations.

DetailsMotivation: Whole-slide images contain rich diagnostic information but current deep learning methods rely on patch-based representations that lose tissue-level context. The paper aims to leverage graph-based approaches to better capture cellular organization and context for improved cancer cell classification.

Method: Proposes using scalable Graph Transformers (SGFormer and DIFFormer) on full-WSI cell graphs for classification. Creates cell graphs from whole-slide images, evaluates different node feature configurations combining morphological, texture features and surrounding cell classes. Extends to multiple WSIs by extracting patches and converting to graphs.

Result: Graph Transformer models achieved balanced accuracies of 85.2±1.5 (SGFormer) and 85.1±2.5 (DIFFormer) vs 81.2±3.0 for best image-based method on single WSI. On multiple WSIs, DIFFormer achieved 83.6±1.9 vs 78.1±0.5 for state-of-the-art image-based CellViT256. Most informative representation combined morphological, texture features and non-epithelial cell classes.

Conclusion: Graph Transformers on cell graphs outperform image-based methods for cancer cell classification, demonstrating the importance of capturing cellular context through graph representations. The approach effectively handles challenging cases where cell types have similar morphologies.

Abstract: Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients. To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.
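A cell graph of the kind consumed by these Graph Transformers can be sketched as a k-nearest-neighbor graph over cell centroids. This is an illustrative construction, not the paper's exact pipeline; coordinates and features are random placeholders.

```python
import numpy as np

# Illustrative sketch: build a k-NN cell graph from cell centroids,
# with per-cell feature vectors (e.g., morphology/texture) as node features.

def knn_edges(coords, k):
    """Return directed edges (i, j) linking each cell to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]   # k nearest neighbors per node
    return [(i, int(j)) for i in range(len(coords)) for j in nbrs[i]]

rng = np.random.default_rng(0)
coords = rng.uniform(0, 2560, size=(50, 2))   # cell centroids in a patch
feats = rng.normal(size=(50, 16))             # placeholder node features
edges = knn_edges(coords, k=5)
# 50 nodes x 5 neighbors = 250 directed edges
```

A Graph Transformer then attends over these nodes, letting each cell's prediction incorporate the surrounding cellular context that patch-based models lose.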

[108] Task-Agnostic Continual Learning for Chest Radiograph Classification

Muthu Subash Kavitha, Anas Zafar, Amgad Muneer, Jia Wu

Main category: cs.CV

TL;DR: CARL-XRay: A continual learning framework for chest X-ray classification that uses adapter-based routing with lightweight task-specific adapters and classifier heads, enabling model updates without retraining on previous data while maintaining performance.

DetailsMotivation: Clinical deployment requires models that can be updated with new datasets without retraining on previous data or degrading validated performance. Current approaches lack efficient continual learning methods for chest radiograph classification where task identifiers are unavailable at inference.

Method: Proposes CARL-XRay with fixed high-capacity backbone and incremental allocation of lightweight task-specific adapters and classifier heads. Uses latent task selector on task-adapted features with compact prototypes and feature-level experience replay for task identification without raw-image storage.

Result: Outperforms joint training under task-unknown deployment with higher routing accuracy (75.0% vs. 62.5%), maintains competitive diagnostic performance (AUROC 0.74 with ground-truth task identity, 0.75 under task-unknown inference), using significantly fewer trainable parameters.

Conclusion: Provides practical alternative to joint training and repeated full retraining for continual clinical deployment, enabling stable task identification and adaptation across sequential updates while avoiding catastrophic forgetting.

Abstract: Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest X-ray datasets arrive sequentially and task identifiers are unavailable at inference. We propose a continual adapter-based routing learning strategy for Chest X-rays (CARL-XRay) that maintains a fixed high-capacity backbone and incrementally allocates lightweight task-specific adapters and classifier heads. A latent task selector operates on task-adapted features and leverages both current and historical context preserved through compact prototypes and feature-level experience replay. This design supports stable task identification and adaptation across sequential updates while avoiding raw-image storage. Experiments on large-scale public chest radiograph datasets demonstrate robust performance retention and reliable task-aware inference under continual dataset ingestion. CARL-XRay outperforms joint training under task-unknown deployment, achieving higher routing accuracy (75.0% vs. 62.5%), while maintaining competitive diagnostic performance with AUROC of 0.74 in the oracle setting with ground-truth task identity and 0.75 under task-unknown inference, using significantly fewer trainable parameters. Finally, the proposed framework provides a practical alternative to joint training and repeated full retraining in continual clinical deployment.
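Routing with compact prototypes can be sketched as nearest-prototype task selection: each task keeps a mean feature vector, and an incoming sample is sent to the adapter whose prototype is closest. This is a hedged illustration, names and values are hypothetical, and the real selector also uses feature-level replay.

```python
import numpy as np

# Illustrative sketch of latent task routing via compact prototypes.
# Each task stores a prototype (mean feature vector); a test sample is
# routed to the task with the nearest prototype.

def route(feature, prototypes):
    """Return the task id whose prototype is nearest in feature space."""
    dists = {task: np.linalg.norm(feature - proto)
             for task, proto in prototypes.items()}
    return min(dists, key=dists.get)

prototypes = {
    "dataset_A": np.array([0.0, 0.0]),   # placeholder prototypes
    "dataset_B": np.array([5.0, 5.0]),
}
chosen = route(np.array([0.5, -0.2]), prototypes)
# the sample lies near dataset_A's prototype
```

Once a task is selected, only that task's lightweight adapter and classifier head are used, leaving the frozen backbone and other adapters untouched.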

[109] VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

Main category: cs.CV

TL;DR: Adapts pretrained text-to-video diffusion models for sequential sketch generation by leveraging LLMs for semantic planning and video diffusion for rendering, using minimal human sketch data.

DetailsMotivation: Most generative models treat sketches as static images, overlooking the temporal structure of creative drawing. Sequential sketch generation captures the meaningful order of strokes that reflects creative exploration.

Method: Two-stage fine-tuning: 1) Learn stroke ordering using synthetic shape compositions with controlled temporal structure, 2) Learn visual appearance from only 7 manually authored sketching processes. Represents sketches as short videos with progressive stroke drawing guided by text-specified ordering.

Result: Generates high-quality sequential sketches that closely follow text-specified orderings with rich visual detail, despite extremely limited human-drawn data. Extensions include brush style conditioning and autoregressive sketch generation.

Conclusion: Demonstrates data-efficient sequential sketch generation by combining LLMs for planning and video diffusion for rendering, enabling controllable and interactive drawing processes.

Abstract: Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
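The "sketch as a short video" representation can be illustrated by rendering strokes progressively onto a blank canvas, one frame per stroke. This toy version uses lists of pixel coordinates as strokes; the actual pipeline renders continuous stroke formation with a video diffusion model.

```python
import numpy as np

# Toy illustration: strokes drawn progressively onto a blank canvas,
# emitting one cumulative frame per stroke.

def strokes_to_frames(strokes, size=8):
    canvas = np.zeros((size, size), dtype=np.uint8)
    frames = []
    for stroke in strokes:
        for r, c in stroke:      # earlier strokes stay visible (accumulate)
            canvas[r, c] = 1
        frames.append(canvas.copy())
    return frames

strokes = [[(0, 0), (0, 1)],     # stroke 1: two pixels
           [(1, 1), (2, 2)]]     # stroke 2: two more pixels
frames = strokes_to_frames(strokes)
# frame 0 contains stroke 1 only; frame 1 contains both strokes
```

The text-specified ordering in the paper corresponds to choosing the sequence of `strokes` here, which is exactly what the LLM planner controls.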

[110] DARB-Splatting: Generalizing Splatting with Decaying Anisotropic Radial Basis Functions

Hashiru Pramuditha, Vinasirajan Viruthshaan, Vishagar Arunan, Saeedha Nazar, Sameera Ramasinghe, Simon Lucey, Ranga Rodrigo

Main category: cs.CV

TL;DR: DARBFs (decaying anisotropic radial basis functions) enable splatting-based 3D reconstruction beyond traditional exponential family kernels like Gaussians, offering comparable performance with similar convergence and memory usage.

DetailsMotivation: Current splatting-based 3D reconstruction methods are limited to exponential family functions (like Gaussians) due to their anisotropic nature, ease of projection, and differentiability. However, this restricts exploration of generalized reconstruction kernels, particularly because non-exponential family functions lack easy integrability in 3D to 2D projections.

Method: The paper introduces DARBFs (decaying anisotropic radial basis functions), which are non-negative functions of the Mahalanobis distance. These functions support splatting by approximating the Gaussian function’s closed-form integration advantage, allowing for broader exploration of reconstruction kernels beyond the exponential family.

Result: The method demonstrates varying performances across selected DARBF reconstruction kernels, achieving comparable training convergence and memory footprints with on-par PSNR, SSIM, and LPIPS results compared to traditional Gaussian-based approaches.

Conclusion: DARBFs provide a viable alternative to traditional exponential family kernels for splatting-based 3D reconstruction, expanding the design space of reconstruction kernels while maintaining similar performance metrics and computational efficiency.

Abstract: Splatting-based 3D reconstruction methods have gained popularity with the advent of 3D Gaussian Splatting, efficiently synthesizing high-quality novel views. These methods commonly resort to using exponential family functions, such as the Gaussian function, as reconstruction kernels due to their anisotropic nature, ease of projection, and differentiability in rasterization. However, the field remains restricted to variations within the exponential family, leaving generalized reconstruction kernels largely underexplored, partly due to the lack of easy integrability in 3D to 2D projections. In this light, we show that a class of decaying anisotropic radial basis functions (DARBFs), which are non-negative functions of the Mahalanobis distance, supports splatting by approximating the Gaussian function’s closed-form integration advantage. With this fresh perspective, we demonstrate varying performances across selected DARB reconstruction kernels, achieving comparable training convergence and memory footprints, with on-par PSNR, SSIM, and LPIPS results.
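A decaying anisotropic radial basis function is any non-negative decreasing function of the Mahalanobis distance. The sketch below evaluates that distance under an anisotropic covariance and compares the Gaussian member of the family with a Laplacian-style exp(-d) alternative; the kernel choices are illustrative, not the paper's exact family.

```python
import numpy as np

# Sketch: Mahalanobis distance plus two example decaying kernels.
# d_M(x)^2 = (x - mu)^T Sigma^{-1} (x - mu); anisotropy comes from Sigma.

def mahalanobis(x, mu, cov):
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def gaussian_kernel(d):
    return float(np.exp(-0.5 * d ** 2))   # the usual splatting kernel

def laplacian_kernel(d):
    return float(np.exp(-d))              # an alternative decaying kernel

mu = np.zeros(3)
cov = np.diag([4.0, 1.0, 0.25])           # anisotropic covariance
d = mahalanobis(np.array([2.0, 0.0, 0.0]), mu, cov)
# d = 1.0: one "standard deviation" along the stretched axis
```

Splatting with a non-Gaussian member then amounts to swapping the kernel while approximating the Gaussian's closed-form 3D-to-2D integration, as the abstract describes.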

[111] SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated

Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro

Main category: cs.CV

TL;DR: SSL4EO-S12 v1.1 is an improved multimodal Earth Observation dataset for pretraining foundation models, fixing alignment issues and adding new modalities like elevation and land-cover data.

DetailsMotivation: To provide a high-quality, multimodal Earth Observation dataset for pretraining large-scale foundation models, addressing previous version's geospatial alignment inaccuracies and inefficient data structure while expanding modality coverage.

Method: Updated the SSL4EO-S12 dataset by fixing geospatial alignment issues, restructuring data into Zarr format stored in WebDataset tar shards for efficient loading, and adding new modalities including elevation, land-cover, and vegetation data.

Result: Created a dataset covering 10,000 largest cities with 246k time series containing nearly one million image patches, now with improved geospatial accuracy, efficient data structure, and multimodal support including cloud masks and new geospatial modalities.

Conclusion: SSL4EO-S12 v1.1 provides a robust, open-access foundation for self-supervised learning and geospatial analysis research, facilitating multimodal pretraining of Earth Observation foundation models.

Abstract: This work presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12, this extension updates the previous version to fix geospatial alignment inaccuracies and the inefficient data structure. The dataset allows low-barrier, analysis-ready data loading while maintaining the predecessor’s spatial coverage of the world’s 10,000 largest cities and surrounding geographies, resulting in 246k time series with nearly one million image patches. We package each time series in Zarr file format stored in WebDataset tar shards for efficient data loading and representation of meta-information such as cloud masks. We add new modalities for elevation, land-cover, and vegetation to support multimodal pre-training. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through https://huggingface.co/datasets/embed2scale/SSL4EO-S12-v1.1.

[112] VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

Ada Gorgun, Bernt Schiele, Jonas Fischer

Main category: cs.CV

TL;DR: VITAL improves feature visualization for neural networks by guiding visualization generation with real image statistics and network flow measures to produce more human-understandable prototypical images.

DetailsMotivation: Current feature visualization methods often produce unrecognizable images with repetitive patterns and artifacts, making it hard for humans to understand what neurons are responding to in neural networks, especially important for high-stakes decision-making.

Method: Proposes guiding feature visualization through statistics of real image features combined with measures of relevant network flow to generate prototypical images that better represent what neurons detect.

Result: The approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art feature visualization methods across various neural network architectures.

Conclusion: VITAL provides better tools for understanding neural network reasoning by generating more interpretable visualizations that complement mechanistic circuit analysis, helping decode what information networks use rather than just where it’s encoded.

Abstract: Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information the network uses, complementing mechanistic circuits that identify where it is encoded. Code is available at: https://github.com/adagorgun/VITAL
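At its core, feature visualization is activation maximization: gradient ascent on an input to maximize a chosen neuron's response. The toy sketch below uses a fixed linear "neuron" so the gradient is known in closed form; VITAL's additional guidance via real-feature statistics and relevance flow is omitted here.

```python
import numpy as np

# Toy activation maximization: ascend on the input x to maximize the
# response of a linear unit w . x. For this unit, grad_x (w . x) = w.

rng = np.random.default_rng(0)
w = rng.normal(size=16)     # stand-in neuron weights (placeholder)
x = np.zeros(16)            # start from a blank "image"
lr = 0.1

for _ in range(100):
    x += lr * w             # gradient ascent step toward the neuron's direction

activation = float(w @ x)
# x is now proportional to w, so the activation equals 100 * lr * ||w||^2
```

In a real network, the gradient comes from backpropagation through the model, and unconstrained ascent yields the repetitive artifacts the paper sets out to fix.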

[113] Digital Twin Generation from Visual Data: A Survey

Andrew Melnik, Benjamin Alt, Giang Nguyen, Artur Wilkowski, Maciej Stefańczyk, Qirui Wu, Sinan Harms, Helge Rhodin, Manolis Savva, Michael Beetz

Main category: cs.CV

TL;DR: Survey paper on generating digital twins from visual data using modern 3D reconstruction techniques like 3D Gaussian Splatting and foundation models, with applications in robotics, media creation, and design workflows.

DetailsMotivation: Digital twins (virtual 3D replicas of physical assets) have important applications in robotics, media content creation, design, and construction workflows. There's a need to survey recent advances in generating these digital twins from visual data using modern computer vision techniques.

Method: Survey methodology analyzing various approaches including 3D Gaussian Splatting, generative inpainting, semantic segmentation, and foundation models. The paper examines their advantages, limitations, and discusses key challenges like occlusions, lighting variations, and scalability.

Result: Provides a comprehensive overview of state-of-the-art methodologies for digital twin generation from visual data, highlighting current trends, gaps in research, and directions for future work in the field.

Conclusion: The survey offers a thorough examination of current digital twin generation techniques, their real-world applications, and identifies important research directions for advancing the field of visual data-based digital twin creation.

Abstract: This survey examines recent advances in generating digital twins from visual data. These digital twins - virtual 3D replicas of physical assets - can be applied to robotics, media content creation, design or construction workflows. We analyze a range of approaches, including 3D Gaussian Splatting, generative inpainting, semantic segmentation, and foundation models, highlighting their respective advantages and limitations. In addition, we discuss key challenges such as occlusions, lighting variations, and scalability, as well as identify gaps, trends, and directions for future research. Overall, this survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome Digital Twin: https://awesomedigitaltwin.github.io

[114] Multispectral airborne laser scanning for tree species classification: a benchmark of machine learning and deep learning algorithms

Josef Taher, Eric Hyyppä, Matti Hyyppä, Klaara Salolahti, Xiaowei Yu, Leena Matikainen, Antero Kukko, Matti Lehtomäki, Harri Kaartinen, Sopitta Thurachen, Paula Litkey, Ville Luoma, Markus Holopainen, Gefei Kong, Hongchao Fan, Petri Rönnholm, Matti Vaaja, Antti Polvivaara, Samuli Junttila, Mikko Vastaranta, Stefano Puliti, Rasmus Astrup, Joel Kostensalo, Mari Myllymäki, Maksymilian Kulicki, Krzysztof Stereńczak, Raul de Paula Pires, Ruben Valbuena, Juan Pedro Carbonell-Rivera, Jesús Torralba, Yi-Chen Chen, Lukas Winiwarter, Markus Hollaus, Gottfried Mandlburger, Narges Takhtkeshha, Fabio Remondino, Maciej Lisiewicz, Bartłomiej Kraszewski, Xinlian Liang, Jianchang Chen, Eero Ahokas, Kirsi Karila, Eugeniu Vezeteu, Petri Manninen, Roope Näsi, Heikki Hyyti, Siiri Pyykkönen, Peilun Hu, Juha Hyyppä

Main category: cs.CV

TL;DR: Deep learning methods, especially point transformer models, outperform traditional ML for tree species classification using high-density multispectral airborne laser scanning data.

DetailsMotivation: Climate-smart forestry requires precise individual tree-level information, but existing methods struggle with deep learning application and rare species identification in imbalanced datasets.

Method: Comprehensive benchmark comparing deep learning and traditional ML methods using high-density multispectral ALS data collected with HeliALS system, complemented by existing Titan data, with crowdsourced annotation of 6326 tree segments across nine species.

Result: Point-based deep learning methods, particularly point transformer models, achieved best performance: 87.9% overall accuracy (74.5% macro-average) with 1065 training segments, and 92.0% (85.1%) with 5000 training segments.

Conclusion: Point transformer models are superior for tree species classification from high-density multispectral ALS data, demonstrating the value of deep learning approaches in forestry remote sensing.

Abstract: Climate-smart and biodiversity-preserving forestry demands precise information on forest resources, extending to the individual tree level. Multispectral airborne laser scanning (ALS) has shown promise in automated point cloud processing, but challenges remain in leveraging deep learning techniques and identifying rare tree species in class-imbalanced datasets. This study addresses these gaps by conducting a comprehensive benchmark of deep learning and traditional shallow machine learning methods for tree species classification. For the study, we collected high-density multispectral ALS data ($>1000$ $\mathrm{pts}/\mathrm{m}^2$) at three wavelengths using the FGI-developed HeliALS system, complemented by existing Optech Titan data (35 $\mathrm{pts}/\mathrm{m}^2$), to evaluate the species classification accuracy of various algorithms in a peri-urban study area located in southern Finland. We established a field reference dataset of 6326 segments across nine species using a newly developed browser-based crowdsourcing tool, which facilitated efficient data annotation. The ALS data, including a training dataset of 1065 segments, was shared with the scientific community to foster collaborative research and diverse algorithmic contributions. Based on 5261 test segments, our findings demonstrate that point-based deep learning methods, particularly a point transformer model, outperformed traditional machine learning and image-based deep learning approaches on high-density multispectral point clouds. For the high-density ALS dataset, a point transformer model provided the best performance reaching an overall (macro-average) accuracy of 87.9% (74.5%) with a training set of 1065 segments and 92.0% (85.1%) with a larger training set of 5000 segments.

[115] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

Yuan Gao, Shaobo Xia, Sheng Nie, Cheng Wang, Xiaohuan Xi, Bisheng Yang

Main category: cs.CV

TL;DR: APCoTTA is a Continuous Test-Time Adaptation framework for ALS point cloud semantic segmentation that addresses domain shifts through selective layer updates, entropy-based consistency, and parameter interpolation, with new benchmarks ISPRSC and H3DC.

DetailsMotivation: ALS point cloud semantic segmentation models degrade in real-world deployment due to continuous domain shifts from environmental and sensor changes. Current CTTA approaches are underexplored for ALS point clouds, lacking benchmarks and suffering from catastrophic forgetting and error accumulation.

Method: APCoTTA has three components: 1) Gradient-driven layer selection that selectively updates low-confidence layers while freezing stable ones; 2) Entropy-based consistency loss that discards unreliable samples and enforces consistency only on reliable ones; 3) Random parameter interpolation that stochastically blends adapted parameters with source model parameters.
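
The second and third components are simple enough to sketch in numpy. The snippet below is a minimal illustration under assumed shapes and a hypothetical entropy threshold, not the paper's implementation:

```python
import numpy as np

def entropy(probs, eps=1e-8):
    """Shannon entropy of per-sample class probabilities, shape (N, C)."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def consistency_loss(p_student, p_teacher, ent_frac=0.5, eps=1e-8):
    """Entropy-based consistency: discard high-entropy (unreliable) samples
    and enforce cross-entropy consistency only on the reliable ones."""
    n_classes = p_teacher.shape[-1]
    reliable = entropy(p_teacher) < ent_frac * np.log(n_classes)
    if not reliable.any():
        return 0.0
    ce = -np.sum(p_teacher[reliable] * np.log(p_student[reliable] + eps), axis=-1)
    return float(ce.mean())

def interpolate_params(adapted, source, keep_prob=0.99, seed=0):
    """Random parameter interpolation: stochastically reset a small fraction
    of adapted weights back to the source model's values."""
    mask = np.random.default_rng(seed).random(adapted.shape) < keep_prob
    return np.where(mask, adapted, source)
```

The interpolation step is what balances target adaptation against source-knowledge retention: most weights keep their adapted values, while a random few snap back to the source model each step.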

Result: APCoTTA achieves superior performance on the new ISPRSC and H3DC benchmarks, improving mIoU by approximately 9% and 14% over direct inference, demonstrating effective adaptation to evolving domains while mitigating forgetting and error accumulation.

Conclusion: APCoTTA effectively addresses CTTA challenges for ALS point cloud segmentation through selective adaptation mechanisms and provides the first benchmarks for this task, enabling future research in continuous adaptation for 3D scene understanding.

Abstract: Airborne laser scanning (ALS) point cloud semantic segmentation is a fundamental task for large-scale 3D scene understanding. Fixed models deployed in real-world scenarios often suffer from performance degradation due to continuous domain shifts caused by environmental and sensor changes. Continuous Test-Time Adaptation (CTTA) enables adaptation to evolving unlabeled domains, but its application to ALS point clouds remains underexplored, hindered by the lack of benchmarks and the risks of catastrophic forgetting and error accumulation. To address these challenges, we propose APCoTTA (ALS Point cloud Continuous Test-Time Adaptation), a novel CTTA framework tailored for ALS point cloud semantic segmentation. APCoTTA consists of three key components. First, we adapt a gradient-driven layer selection mechanism for ALS point clouds, selectively updating low-confidence layers while freezing stable ones to preserve source knowledge and mitigate catastrophic forgetting. Second, an entropy-based consistency loss discards unreliable samples and enforces consistency regularization solely on reliable ones, effectively reducing error accumulation and improving adaptation stability. Third, a random parameter interpolation mechanism stochastically blends adapted parameters with source model parameters, further balancing target adaptation and source knowledge retention. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Extensive experiments demonstrate that APCoTTA achieves superior performance on both benchmarks, improving mIoU by approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.

[116] MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

Yiwei Ou, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

Main category: cs.CV

TL;DR: MMS-VPR introduces a large-scale multimodal dataset for street-level place recognition in pedestrian environments, featuring images, videos, and textual metadata from a Chinese commercial district, plus a benchmarking platform for multimodal VPR methods.

DetailsMotivation: Existing VPR datasets are limited to vehicle-mounted imagery, lack multimodal diversity, and underrepresent pedestrian street scenes in non-Western urban contexts, creating a need for more comprehensive pedestrian-focused multimodal datasets.

Method: Collected 110,529 images and 2,527 video clips across 208 locations in a Chinese commercial district, combining field data (2024) with social media data (2019-2025). Created MMS-VPRlib benchmarking platform with standardized pipeline for data preprocessing, multimodal modeling, signal enhancement, alignment, fusion, and evaluation.

Result: A comprehensive multimodal dataset with day-night coverage, multiple viewing angles, GPS coordinates, timestamps, and semantic textual metadata, plus a unified benchmarking platform that consolidates existing VPR datasets and methods.

Conclusion: MMS-VPR addresses gaps in current VPR datasets by providing pedestrian-focused multimodal data from non-Western urban contexts, enabling systematic exploration of complementary visual, video, and textual modalities for place recognition.

Abstract: Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, offer limited multimodal diversity, and underrepresent dense pedestrian street scenes, particularly in non-Western urban contexts. We introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in pedestrian-only environments. MMS-VPR comprises 110,529 images and 2,527 video clips across 208 locations in a ~70,800 $m^2$ open-air commercial district in Chengdu, China. Field data were collected in 2024, while social media data span seven years (2019-2025), providing both fine-grained temporal granularity and long-term temporal coverage. Each location features comprehensive day-night coverage, multiple viewing angles, and multimodal annotations including GPS coordinates, timestamps, and semantic textual metadata. We further release MMS-VPRlib, a unified benchmarking platform that consolidates commonly used VPR datasets and state-of-the-art methods under a standardized, reproducible pipeline. MMS-VPRlib provides modular components for data pre-processing, multimodal modeling (CNN/RNN/Transformer), signal enhancement, alignment, fusion, and performance evaluation. This platform moves beyond traditional image-only paradigms, enabling systematic exploitation of complementary visual, video, and textual modalities. The dataset is available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR and the benchmark at https://github.com/yiasun/MMS-VPRlib.

[117] cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning

Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna Vorontsova, Anton Konushin, Vladislav Kurenkov, Danila Rukhovich

Main category: cs.CV

TL;DR: Multi-modal CAD reconstruction model that processes point clouds, images, and text simultaneously using VLM+LLM approach with SFT and RL fine-tuning, achieving SOTA on CAD benchmarks.

DetailsMotivation: Existing CAD reconstruction methods focus on single input modalities (point clouds, images, or text), limiting generalizability and robustness. The authors aim to democratize CAD design by creating a multi-modal model that can handle all three input types simultaneously.

Method: Two-stage pipeline: 1) Supervised fine-tuning (SFT) on large-scale procedurally generated data using vision-language models, 2) Reinforcement learning fine-tuning using online feedback obtained programmatically, exploring Group Relative Preference Optimization (GRPO) as an online RL algorithm for CAD tasks.
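
The credit-assignment step of GRPO is compact: sample a group of candidate CAD programs for one input, score each with a programmatic reward, and normalize rewards within the group so no learned critic is needed. A hedged sketch, where the IoU reward is a stand-in for whatever geometric check the authors actually use:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: each candidate's reward is normalized by
    its own group's mean and std, replacing a learned value function."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def iou_reward(pred_voxels, gt_voxels):
    """Hypothetical programmatic reward: IoU of voxelized CAD executions."""
    inter = np.logical_and(pred_voxels, gt_voxels).sum()
    union = np.logical_or(pred_voxels, gt_voxels).sum()
    return inter / union if union else 1.0
```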

Result: The SFT model outperforms existing single-modal approaches in all three input modalities on the DeepCAD benchmark. After RL fine-tuning, cadrille achieves a new state-of-the-art on three challenging datasets, including a real-world dataset.

Conclusion: The proposed multi-modal CAD reconstruction model successfully integrates multiple input modalities and demonstrates that online RL fine-tuning (GRPO) outperforms offline alternatives for CAD tasks, setting new benchmarks in CAD reconstruction.

Abstract: Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Preference Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state-of-the-art on three challenging datasets, including a real-world one. Code is available at https://github.com/col14m/cadrille.

[118] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization with Large Language and Video Models

Mario Barbara, Alaa Maalouf

Main category: cs.CV

TL;DR: Zero-shot video summarization framework using pretrained video-language models and LLMs to create text-queryable summaries without training data.

DetailsMotivation: Need for flexible user-controllable video summarization tools that operate without training data and can incorporate natural language user intent, overcoming limitations of existing domain-specific or non-queryable methods.

Method: Four-step pipeline: (1) video segmentation into scenes, (2) batch prompting for scene descriptions using video-language models, (3) LLM scoring of scene importance with tailored prompts, (4) frame-level score propagation using consistency and uniqueness metrics.
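
Step (4) can be sketched as follows; the consistency and uniqueness terms here are hypothetical cosine-similarity proxies for the paper's temporal-coherence and novelty metrics, not their exact definitions:

```python
import numpy as np

def propagate_scores(frame_feats, scene_score):
    """Spread one LLM-assigned scene importance score to individual frames,
    weighting by assumed consistency and uniqueness proxies."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    centroid = f.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    consistency = f @ centroid              # agreement with the scene as a whole
    sims = f @ f.T
    np.fill_diagonal(sims, -1.0)
    uniqueness = 1.0 - sims.max(axis=1)     # distance to the nearest other frame
    weights = np.clip(consistency, 0, None) * (1.0 + uniqueness)
    return scene_score * weights / weights.max()
```

A frame that is both representative of its scene and visually novel ends up with the highest share of the scene's score.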

Result: Surpasses all prior unsupervised methods on SumMe and TVSum, performs competitively on Query-Focused Video Summarization benchmark, and establishes first baseline on new VidSum-Reason dataset with long-tailed concepts and multi-step reasoning.

Conclusion: Pretrained multimodal models orchestrated with principled prompting and score propagation provide powerful foundation for universal, text-queryable video summarization without training data.

Abstract: The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that operate without training data. Existing methods either rely on domain-specific datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts off-the-shelf video-language model (VidLM) captions into user-guided skims via large language model (LLM) judging, without the use of training data, beating unsupervised methods and matching supervised ones. Our pipeline (i) segments video into scenes, (ii) produces scene descriptions with a memory-efficient batch prompting scheme that scales to hours of video on a single GPU, (iii) scores scene importance with an LLM via tailored prompts, and (iv) propagates scores to frames using new consistency (temporal coherence) and uniqueness (novelty) metrics for fine-grained frame importance. On SumMe and TVSum, our approach surpasses all prior data-hungry unsupervised methods and performs competitively on the Query-Focused Video Summarization benchmark, where the competing methods require supervised frame-level importance. We release VidSum-Reason, a query-driven dataset featuring long-tailed concepts and multi-step reasoning, where our framework serves as the first challenging baseline. Overall, we demonstrate that pretrained multi-modal models, when orchestrated with principled prompting and score propagation, provide a powerful foundation for universal, text-queryable video summarization.

[119] Pyramidal Patchification Flow for Visual Generation

Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu

Main category: cs.CV

TL;DR: PPFlow introduces pyramidal patchification for diffusion transformers, using large patches for high noise and small patches for low noise timesteps to improve efficiency while maintaining image generation quality.

DetailsMotivation: Current diffusion transformers use fixed patch sizes throughout all timesteps, which may not be optimal. The authors propose that different patch sizes should be used at different noise levels to improve computational efficiency without sacrificing quality.

Method: PPFlow uses pyramidal patchification where large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps. Linear projections are learned for each patch size, and the unpatchify operation is modified accordingly. The approach operates on full latent representations and uses normal denoising without requiring renoising tricks.
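
The mechanics can be sketched with a standard patchify plus a timestep-dependent patch size (the thresholds below are illustrative, not the paper's schedule):

```python
import numpy as np

def patchify(latent, p):
    """Split a (C, H, W) latent into H*W/p^2 tokens of size C*p*p."""
    c, h, w = latent.shape
    assert h % p == 0 and w % p == 0
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)          # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * p * p)

def patch_size(t):
    """Hypothetical 2-level schedule: big patches while the sample is mostly
    noise (t near 1), small patches as details emerge."""
    return 4 if t >= 0.5 else 2
```

Doubling the patch size quarters the token count at high-noise timesteps, which is where the reported speedup comes from; a separate learned linear projection per patch size maps both token widths into the same transformer dimension.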

Result: Training from scratch achieves 1.6×-2.0× inference speed improvement over baseline SiT-B/2 with slightly lower training FLOPs and similar image generation performance. Training from pretrained DiTs achieves even better performance with minimal additional training time.

Conclusion: PPFlow demonstrates that adaptive patch sizing based on noise levels can significantly improve diffusion transformer efficiency while maintaining or improving image generation quality, offering a practical approach for efficient multimodal generation.

Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speedup over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with little additional training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow.

[120] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Jiuhong Xiao, Yang Zhou, Giuseppe Loianno

Main category: cs.CV

TL;DR: QAA is a novel feature aggregation technique for Visual Place Recognition that uses learned queries as reference codebooks to enhance information capacity and improve generalization across multiple datasets.

DetailsMotivation: Existing VPR methods trained on single datasets have dataset-specific biases and limited generalization. Multi-dataset training can saturate feature aggregation layers, leading to suboptimal performance.

Method: Proposes Query-based Adaptive Aggregation (QAA) that uses learned queries as reference codebooks. Computes Cross-query Similarity between query-level image features and reference codebooks to generate robust descriptors.
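
A minimal sketch of the Cross-query Similarity idea, under the assumption that cosine similarity is the comparison used (the paper may differ in normalization details):

```python
import numpy as np

def cross_query_similarity(query_feats, codebook):
    """Hypothetical CS descriptor: cosine similarities between Q query-level
    image features and M learned reference codebook entries, flattened and
    L2-normalized into a global place descriptor."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    desc = (q @ c.T).ravel()                # (Q * M,)
    return desc / np.linalg.norm(desc)
```

Note the descriptor length depends only on the number of queries and codebook entries, not on the feature dimension or the training dataset, which is what lets one codebook serve many datasets.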

Result: QAA outperforms state-of-the-art models, achieves balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models.

Conclusion: QAA effectively addresses multi-dataset training challenges in VPR by enhancing information capacity without significant computational overhead, enabling better generalization.

Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA

[121] THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza, Leo Fillioux, Sofiène Boutaj, Kunal Mahatha, Christian Desrosiers, Pablo Piantanida, Jose Dolz, Stergios Christodoulidis, Maria Vakalopoulou

Main category: cs.CV

TL;DR: THUNDER is a comprehensive benchmark for evaluating digital pathology foundation models on diverse datasets and tasks, with focus on performance, feature analysis, robustness, and uncertainty assessment.

DetailsMotivation: The rapid proliferation of foundation models in digital pathology creates a need for systematic benchmarking to understand research landscape, especially in healthcare where reliability, robustness, and uncertainty assessment are critical.

Method: Developed THUNDER benchmark that allows efficient comparison of many models on diverse datasets with various downstream tasks, studying feature spaces, robustness, and uncertainty of predictions based on embeddings.

Result: Comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness evaluation.

Conclusion: THUNDER provides a fast, easy-to-use, dynamic benchmark for digital pathology foundation models that supports state-of-the-art and user-defined models for reliable comparison in healthcare applications.

Abstract: Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

[122] FedX: Explanation-Guided Pruning for Communication-Efficient Federated Learning in Remote Sensing

Barış Büyüktaş, Jonas Klotz, Begüm Demir

Main category: cs.CV

TL;DR: FedX: A federated learning approach for remote sensing image classification that uses explanation-guided pruning to reduce communication overhead by transmitting sparse models without performance loss.

DetailsMotivation: Federated learning is suitable for remote sensing tasks due to privacy constraints, but suffers from high communication overhead from frequent exchange of large model updates between clients and server.

Method: FedX uses backpropagation-based explanation methods to estimate task-specific importance of model components, prunes least relevant ones at the central server, and transmits the sparse global model to clients.
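
The pruning idea can be sketched with a gradient-times-activation relevance proxy (in the spirit of, but not identical to, the backpropagation-based explanation method the paper uses):

```python
import numpy as np

def channel_importance(activations, gradients):
    """Explanation-style relevance proxy: mean |activation * gradient| per
    channel. Inputs have shape (N, C, H, W)."""
    return np.abs(activations * gradients).mean(axis=(0, 2, 3))

def prune_mask(importance, prune_frac=0.3):
    """Server-side pruning: drop the prune_frac least relevant channels so
    only the sparse remainder is transmitted to clients."""
    k = int(len(importance) * prune_frac)
    if k == 0:
        return np.ones(len(importance), dtype=bool)
    cutoff = np.partition(importance, k - 1)[k - 1]
    return importance > cutoff
```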

Result: FedX significantly reduces shared model parameters while enhancing generalization capability on BigEarthNet-S2 (multi-label) and EuroSAT (single-label) datasets, outperforming unpruned models and state-of-the-art pruning methods.

Conclusion: FedX successfully addresses communication overhead in federated learning for remote sensing by using explanation-guided pruning to transmit sparse models without compromising performance.

Abstract: Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients), where each client stores data locally and only shares model updates with a central server. This makes FL a suitable learning paradigm for remote sensing (RS) image classification tasks, where data centralization may be restricted due to legal and privacy constraints. However, a key challenge in applying FL to RS tasks is the communication overhead caused by the frequent exchange of large model updates between clients and the central server. To address this issue, in this paper we propose a novel strategy (denoted as FedX) that uses explanation-guided pruning to reduce communication overhead by minimizing the size of the transmitted models without compromising performance. FedX leverages backpropagation-based explanation methods to estimate the task-specific importance of model components and prunes the least relevant ones at the central server. The resulting sparse global model is then sent to clients, substantially reducing communication overhead. We evaluate FedX on multi-label scene classification using the BigEarthNet-S2 dataset and single-label scene classification using the EuroSAT dataset. Experimental results show the success of FedX in significantly reducing the number of shared model parameters while enhancing the generalization capability of the global model, compared to both unpruned model and state-of-the-art pruning methods. The code of FedX will be available at https://git.tu-berlin.de/rsim/FedX.

[123] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George

Main category: cs.CV

TL;DR: VLCE is a multimodal framework that enhances disaster assessment by integrating external semantic knowledge from ConceptNet and WordNet to generate more accurate and actionable captions from satellite and UAV imagery.

DetailsMotivation: Current vision-language models produce inadequate descriptions for disaster assessment due to lack of domain knowledge and refined descriptive processes, limiting their effectiveness in converting visual data into actionable intelligence for emergency response.

Method: VLCE uses two architectures: a CNN-LSTM with ResNet50 backbone pretrained on EuroSat for satellite imagery (xBD dataset), and a Vision Transformer for UAV imagery (RescueNet dataset), both enhanced with external semantic knowledge from ConceptNet and WordNet.

Result: VLCE outperforms baseline models like LLaVA and QwenVL, achieving 95.33% on InfoMetIC for UAV imagery and strong performance on satellite imagery, demonstrating consistent advantages across different architectures and datasets.

Conclusion: The framework represents a significant advancement from basic visual classification to comprehensive situational intelligence generation, with immediate applicability for real-time disaster assessment systems.

Abstract: The processes of classification and segmentation utilizing artificial intelligence play a vital role in the automation of disaster assessments. However, contemporary VLMs produce details that are inadequately aligned with the objectives of disaster assessment, primarily due to their deficiency in domain knowledge and the absence of a more refined descriptive process. This research presents the Vision Language Caption Enhancer (VLCE), a dedicated multimodal framework aimed at integrating external semantic knowledge from ConceptNet and WordNet to improve the captioning process. The objective is to produce disaster-specific descriptions that effectively convert raw visual data into actionable intelligence. VLCE utilizes two separate architectures: a CNN-LSTM model that incorporates a ResNet50 backbone, pretrained on EuroSat for satellite imagery (xBD dataset), and a Vision Transformer developed for UAV imagery (RescueNet dataset). In various architectural frameworks and datasets, VLCE exhibits a consistent advantage over baseline models such as LLaVA and QwenVL. Our optimal configuration reaches an impressive 95.33% on InfoMetIC for UAV imagery while also demonstrating strong performance across satellite imagery. The proposed framework signifies a significant transition from basic visual classification to the generation of comprehensive situational intelligence, demonstrating immediate applicability for implementation in real-time disaster assessment systems.

[124] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa, Ryou Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita

Main category: cs.CV

TL;DR: A Self-correction Loop with Structured Output (SLSO) framework improves GPT’s accuracy for dental panoramic radiograph interpretation by incorporating iterative validation and structured outputs.

DetailsMotivation: Vision-language models like GPT show potential for medical image interpretation but struggle with generating reliable radiological findings in clinical practice, particularly for dental pathologies like jaw cysts.

Method: Proposes a 10-step integrated processing framework (SLSO) that combines image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration to validate GPT outputs for dental panoramic radiographs with jaw cysts.
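
The control flow of such a loop, condensed into a sketch (the generation and validation callables stand in for GPT and for the paper's tooth-number consistency checks; the FDI range check is a hypothetical example):

```python
import json

def slso_loop(generate, validate, max_regens=5):
    """Self-correction loop: generate structured findings, check them, and
    regenerate with the failure reason until the output is consistent or
    the regeneration budget is spent."""
    feedback = None
    for attempt in range(max_regens + 1):
        raw = generate(feedback)
        try:
            findings = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"output was not valid JSON: {exc}"
            continue
        ok, reason = validate(findings)
        if ok:
            return findings, attempt
        feedback = reason
    return None, max_regens

def validate_teeth(findings):
    """Hypothetical consistency check: reported tooth numbers must fall in
    the FDI range used for permanent teeth."""
    ok = all(11 <= t <= 48 for t in findings.get("teeth", []))
    return ok, None if ok else "tooth numbers must be valid FDI codes (11-48)"
```

Feeding the failure reason back into the next generation is what distinguishes this from simply retrying: the model is told what to fix.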

Result: SLSO improved output accuracy compared to conventional Chain-of-Thought method across multiple evaluation items, with notable improvements in tooth number identification, tooth movement detection, and root resorption assessment. Consistently structured outputs achieved after up to five regenerations.

Conclusion: The framework establishes feasibility for enhancing AI-generated medical findings, suppresses hallucinations, enforces explicit negative findings, and provides foundation for future validation with larger datasets.

Abstract: Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.

[125] A Fully Interpretable Statistical Approach for Roadside LiDAR Background Subtraction

Aitor Iglesias, Nerea Aranjuelo, Patricia Javierre, Ainhoa Menendez, Ignacio Arganda-Carreras, Marcos Nieto

Main category: cs.CV

TL;DR: A fully interpretable statistical method for background subtraction in roadside LiDAR data using Gaussian distribution grid modeling and filtering algorithms to separate foreground/background points for automated driving infrastructure.

DetailsMotivation: To enhance infrastructure-based perception in automated driving by developing an interpretable and flexible background subtraction method for roadside LiDAR data that works with diverse sensor types and configurations.

Method: Uses a Gaussian Distribution Grid (GDG) to model spatial statistics of background using background-only scans, combined with a filtering algorithm that classifies LiDAR points as foreground or background based on this representation.
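
A simplified sketch of the idea, modeling only per-cell height statistics (the paper's grid and statistics may be richer):

```python
import numpy as np

def fit_gdg(bg_points, cell=1.0):
    """Fit a per-cell height Gaussian from background-only scans.
    bg_points: (N, 3) array of x, y, z."""
    gdg = {}
    cells = np.floor(bg_points[:, :2] / cell).astype(int)
    for key, z in zip(map(tuple, cells), bg_points[:, 2]):
        gdg.setdefault(key, []).append(z)
    return {k: (float(np.mean(v)), float(np.std(v)) + 1e-6)
            for k, v in gdg.items()}

def is_foreground(point, gdg, cell=1.0, k=3.0):
    """A point is foreground if its height lies more than k sigma from its
    cell's background model, or if the cell was never observed empty."""
    key = tuple(np.floor(point[:2] / cell).astype(int))
    if key not in gdg:
        return True
    mu, sigma = gdg[key]
    return abs(point[2] - mu) > k * sigma
```

Because the model is a table of per-cell means and deviations, every classification decision is directly inspectable, which is the sense in which the method is fully interpretable.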

Result: Outperforms state-of-the-art techniques on the RCooper dataset in accuracy and flexibility, works with minimal background data, and runs efficiently on low-resource hardware for scalable deployment.

Conclusion: The method provides an effective, interpretable solution for roadside LiDAR background subtraction that supports diverse sensors and enables reliable real-world deployment in automated driving infrastructure.

Abstract: We present a fully interpretable and flexible statistical method for background subtraction in roadside LiDAR data, aimed at enhancing infrastructure-based perception in automated driving. Our approach introduces both a Gaussian distribution grid (GDG), which models the spatial statistics of the background using background-only scans, and a filtering algorithm that uses this representation to classify LiDAR points as foreground or background. The method supports diverse LiDAR types, including multiline 360 degree and micro-electro-mechanical systems (MEMS) sensors, and adapts to various configurations. Evaluated on the publicly available RCooper dataset, it outperforms state-of-the-art techniques in accuracy and flexibility, even with minimal background data. Its efficient implementation ensures reliable performance on low-resource hardware, enabling scalable real-world deployment.

[126] Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi

Main category: cs.CV

TL;DR: A framework for synthesizing large-scale vision-centric multimodal reasoning datasets with over 1M problems, enabling improved VLM performance across vision, text, and audio tasks through systematic data generation and training pipeline analysis.

DetailsMotivation: There's a lack of systematic approaches for synthesizing large-scale vision-centric datasets beyond visual math, limiting progress in multimodal reasoning. Current datasets don't adequately support diverse complexity levels or comprehensive training methodologies (SFT, RL).

Method: Two-stage synthesis framework: (1) generates diverse verifiable questions from existing images at scale, (2) creates complex compositional visual problems by merging simpler questions. Produces reasoning traces, preference data, and instruction prompts supporting SFT, offline and online RL.

Result: Qwen2.5-VL-7B finetuned on this data outperforms open-data baselines across vision-centric benchmarks, matches/surpasses strong closed-data models like MiMo-VL-7B-RL. Shows positive transfer to text-only reasoning (+3.7% MMLU-Pro), audio reasoning (+1.32% MMAU), and embodied QA (+8.8% NiEH).

Conclusion: High-quality vision-centric data synthesis enables strong multimodal reasoning performance and cross-modality transfer. Analysis reveals: SFT with cognitive reasoning traces is essential for scaling online RL; offline RL can match online RL while reducing compute; SFT improves out-of-domain cross-modality transfer.

Abstract: Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesize large-scale vision-centric datasets beyond visual math. We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems, including reasoning traces, preference data, and instruction prompts supporting SFT, offline and online RL. Our vision-centric synthesis framework uses a two-stage process focusing on: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar Bench, CV-Bench and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%), demonstrating its effectiveness. Similarly, despite containing no embodied visual data, we observe notable gains (NiEH, +8.8%) when evaluating open-ended embodied QA. Lastly, we use our data to comprehensively analyze at scale (1M+) the entire VLM post-training pipeline, showing that (i) SFT on high-quality data with cognitive behaviors in reasoning traces is essential to scale online RL, (ii) offline RL can match online RL’s performance while disaggregating compute demands, and (iii) SFT on high-quality data also improves out-of-domain, cross-modality transfer.

[127] MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI

Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni

Main category: cs.CV

TL;DR: MRIQT is a 3D conditional diffusion framework that enhances portable ultra-low-field MRI (0.064T) to high-field MRI quality for neonatal brain imaging using physics-consistent degradation simulation and volumetric attention-UNet architecture.

DetailsMotivation: Portable ultra-low-field MRI offers accessible neonatal neuroimaging but suffers from poor signal-to-noise ratio and diagnostic quality compared to high-field MRI, limiting its clinical utility for reliable brain assessment.

Method: MRIQT uses a 3D conditional diffusion framework with: 1) realistic K-space degradation for physics-consistent uLF simulation, 2) v-prediction with classifier-free guidance for stable image-to-image generation, 3) SNR-weighted 3D perceptual loss for anatomical fidelity, and 4) volumetric attention-UNet architecture that denoises from noised uLF input conditioned on the same scan.

Result: MRIQT surpasses recent GAN and CNN baselines by 15.3% in PSNR and outperforms the state of the art by 1.78%. Trained on a neonatal cohort with diverse pathologies, the model produced outputs that physicians rated as good quality with clear pathology present in 85% of cases.

Conclusion: MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field MRI for reliable neonatal brain assessment, demonstrating superior performance over existing methods while maintaining anatomical fidelity.

Abstract: Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging a volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines by 15.3% in PSNR and exceeds the state of the art by 1.78%, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.
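
The v-prediction plus classifier-free guidance combination is standard for conditional diffusion models; a minimal sketch of the guided prediction step (generic formulation, not MRIQT's code; `model`, `cond`, and the guidance weight `w` are placeholders):

```python
import numpy as np

def cfg_v_prediction(model, x_t, t, cond, w=3.0):
    """Classifier-free guidance on a v-prediction network: push the
    conditional prediction away from the unconditional one by weight w."""
    v_uncond = model(x_t, t, None)   # run without the conditioning scan
    v_cond = model(x_t, t, cond)     # run conditioned on the uLF input
    return v_uncond + w * (v_cond - v_uncond)
```

At w = 0 this reduces to the unconditional model and at w = 1 to the plain conditional model; larger w trades diversity for adherence to the conditioning signal.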

[128] Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs

Chenchen Lin, Sanbao Su, Rachel Luo, Yuxiao Chen, Yan Wang, Marco Pavone, Fei Miao

Main category: cs.CV

TL;DR: TGIF introduces a lightweight text-guided inter-layer fusion module for MLLMs that dynamically combines visual features from different encoder layers based on the query, improving visual grounding and reducing hallucinations.

DetailsMotivation: Current MLLMs underutilize the rich hierarchy of visual features in vision encoders, relying on single late-layer features and suffering from visually ungrounded hallucinations. Existing multi-layer fusion methods are static and don't adapt to specific queries.

Method: TGIF treats vision encoder layers as depth-wise “experts” and predicts a prompt-dependent fusion of visual features. It follows direct external fusion principles, requires no vision-encoder updates, and adds minimal overhead.

Result: Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks while preserving or improving performance on ScienceQA, GQA, and MMBench.

Conclusion: Query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.

Abstract: Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder’s rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise “experts” and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
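
The core mechanism, predicting per-layer mixing weights from the prompt and fusing layer features accordingly, might look like the following sketch (the linear gating head `W` is a hypothetical stand-in; TGIF's actual module is more involved):

```python
import numpy as np

def text_guided_fusion(layer_feats, prompt_emb, W):
    """Weight each vision-encoder layer by a prompt-dependent score, then
    take the weighted sum over layers. layer_feats: (L, N, D) features
    stacked from L encoder layers; W: (L, D_text) toy gating head."""
    logits = W @ prompt_emb                   # one score per layer "expert"
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over the L layers
    fused = np.tensordot(weights, layer_feats, axes=(0, 0))  # (N, D)
    return fused, weights
```

Because the weights depend on the prompt embedding, different queries can emphasize different depths of the visual hierarchy, which is the contrast with static multi-layer fusion drawn in the abstract.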

[129] TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction

Zhijie Zheng, Xinhao Xiang, Jiawei Zhang

Main category: cs.CV

TL;DR: TTSA3R is a training-free framework for streaming 3D reconstruction that addresses catastrophic forgetting by adaptively updating state representations using both temporal evolution patterns and spatial observation quality.

DetailsMotivation: Streaming recurrent models for 3D reconstruction suffer from catastrophic forgetting over long sequences due to difficulty balancing historical information with new observations. Existing methods use adaptive signals from attention perspective but operate on single dimensions without considering temporal and spatial consistency.

Method: Proposes TTSA3R with two modules: 1) Temporal Adaptive Update Module regulates update magnitude by analyzing temporal state evolution patterns, and 2) Spatial Contextual Update Module localizes spatial regions needing updates through observation-state alignment and scene dynamics. These complementary signals are fused to determine state updating strategies.

Result: Extensive experiments show effectiveness in diverse 3D tasks. On extended sequences, the method exhibits only 1.33x error increase compared to over 4x degradation in baseline models, significantly improving long-term reconstruction stability.

Conclusion: TTSA3R effectively addresses catastrophic forgetting in streaming 3D reconstruction by leveraging both temporal state evolution and spatial observation quality for adaptive state updates, improving long-term reconstruction stability without requiring training.

Abstract: Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic forgetting over long sequences due to the difficulty of balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from an attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments demonstrate the effectiveness of TTSA3R in diverse 3D tasks. Moreover, on extended sequences of 3D reconstruction, our method exhibits only a 1.33x error increase, compared to over 4x degradation in the baseline model, significantly improving long-term reconstruction stability. Our codes are available at https://github.com/anonus2357/ttsa3r.
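
Abstracting away the two modules, the fused update rule amounts to a gated blend of persistent state and new observation. A toy sketch, where the gates are given as inputs, whereas TTSA3R derives them from temporal evolution patterns and observation-state alignment:

```python
import numpy as np

def adaptive_state_update(state, obs, temporal_gate, spatial_gate):
    """Blend the persistent state with the new observation. The fused
    gate g is large where both signals agree an update is warranted,
    so stale regions update while stable regions resist forgetting."""
    g = np.clip(temporal_gate * spatial_gate, 0.0, 1.0)
    return (1.0 - g) * state + g * obs
```

With g near 0 the state is frozen (no forgetting); with g near 1 it tracks the observation; the multiplicative fusion means either signal can veto an update.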

[130] Cross-Modal Purification and Fusion for Small-Object RGB-D Transmission-Line Defect Detection

Jiaming Cui, Wenqiang Li, Shuai Zhou, Ruifeng Qin, Feng Shen

Main category: cs.CV

TL;DR: CMAFNet: Cross-modal RGB-depth fusion network for transmission line defect detection that uses semantic recomposition and contextual integration to address small-scale defects in complex backgrounds.

DetailsMotivation: Transmission line defect detection is challenging due to small-scale defects, complex backgrounds, and illumination variations. RGB-based detectors struggle with geometrically subtle defects that lack chromatic contrast, requiring better integration of geometric information from depth data.

Method: Proposes CMAFNet with two main components: 1) Semantic Recomposition Module that performs dictionary-based feature purification using a learned codebook to suppress modality-specific noise while preserving defect information, and 2) Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention. Uses position-wise normalization for explicit reconstruction-driven cross-modal alignment.

Result: Achieves 32.2% mAP@50 and 12.5% APs on TLRGBD benchmark (94.5% small objects), outperforming strongest baseline by 9.8 and 4.0 percentage points. Lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing YOLO-based detectors while matching transformer methods at lower cost.

Conclusion: CMAFNet effectively integrates RGB appearance and depth geometry through principled cross-modal alignment and fusion, demonstrating superior performance for small-scale defect detection in complex transmission line inspection scenarios.

Abstract: Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP@50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.
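
A minimal sketch of dictionary-based purification as nearest-codeword replacement. This is only the skeleton of the idea: the actual Semantic Recomposition Module also applies position-wise normalization and learns its codebook end-to-end:

```python
import numpy as np

def codebook_purify(feats, codebook):
    """Snap each feature vector to its nearest codebook entry,
    suppressing off-dictionary (noisy) components.
    feats: (N, D) features; codebook: (K, D) learned dictionary."""
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return codebook[d2.argmin(axis=1)]
```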

[131] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

Main category: cs.CV

TL;DR: Training-free prompt reinjection method addresses prompt forgetting in multimodal diffusion transformers for text-to-image generation by reinjecting early-layer prompt representations into later layers.

DetailsMotivation: The paper identifies a "prompt forgetting phenomenon" in multimodal diffusion transformers where the semantics of prompt representations in the text branch are progressively forgotten as depth increases, which degrades instruction-following capability in text-to-image generation.

Method: Introduces a training-free approach called “prompt reinjection” that reinjects prompt representations from early layers into later layers to alleviate forgetting. This is applied to three representative MMDiTs: SD3, SD3.5, and FLUX.1.

Result: Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.

Conclusion: Prompt reinjection effectively addresses prompt forgetting in multimodal diffusion transformers, improving text-to-image generation quality without requiring additional training, making it a practical enhancement for existing models.

Abstract: Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text–image generation quality.
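
As a training-free intervention, the reinjection reduces to residually blending cached early-layer text states into later layers. A sketch on plain arrays (the layer indices and mixing weight `lam` are illustrative, and a real implementation would apply this via forward hooks during denoising):

```python
import numpy as np

def reinject_prompt(text_states, early_layer=4, late_layers=(20, 24), lam=0.5):
    """Blend the early-layer prompt representation back into later layers
    of the text branch. text_states: list of per-layer hidden states."""
    early = text_states[early_layer]
    for l in late_layers:
        text_states[l] = (1.0 - lam) * text_states[l] + lam * early
    return text_states
```

Because no parameters change, the same hook can be dropped into SD3, SD3.5, or FLUX.1-style pipelines; only the choice of source and target layers differs per model.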

[132] Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

Main category: cs.CV

TL;DR: ViewRope introduces geometry-aware encoding for video transformers to maintain spatial persistence in predictive world models by injecting camera-ray directions into attention layers, enabling 3D consistency across long trajectories.

DetailsMotivation: Current predictive world models lack spatial persistence - they fail to maintain stable scene structures over long trajectories and hallucinate details when cameras revisit previously observed locations. This geometric drift stems from reliance on screen-space positional embeddings that conflict with projective geometry required for 3D consistency.

Method: 1) ViewRope: geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers, parameterizing attention with relative ray geometry rather than pixel locality. 2) Geometry-Aware Frame-Sparse Attention: exploits geometric cues to selectively attend to relevant historical frames for efficiency. 3) ViewBench: diagnostic suite for measuring loop-closure fidelity and geometric drift.

Result: ViewRope substantially improves long-term consistency while reducing computational costs, demonstrating better spatial persistence and reduced geometric drift compared to previous approaches.

Conclusion: Geometry-aware encoding through ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps, addressing the fundamental spatial persistence problem in predictive world models.

Abstract: Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
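
The geometric signal such an encoding conditions on, per-pixel camera-ray directions, follows from a standard pinhole computation given intrinsics K and a camera-to-world rotation R (independent of the paper's implementation):

```python
import numpy as np

def pixel_ray_directions(H, W, K, R):
    """Unit ray direction per pixel: back-project pixel centers through
    the inverse intrinsics, then rotate into the world frame."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous
    cam = pix @ np.linalg.inv(K).T                    # camera-frame rays
    world = cam @ R.T                                 # world-frame rays
    return world / np.linalg.norm(world, axis=-1, keepdims=True)
```

Unlike screen-space indices, these directions transform consistently under camera motion, which is why relative ray geometry can anchor attention across revisits.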

[133] LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases

Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach

Main category: cs.CV

TL;DR: LeafNet dataset and LeafBench benchmark for evaluating VLMs on plant disease understanding, showing VLMs outperform vision-only models but struggle with fine-grained identification.

DetailsMotivation: Current VLMs lack application in domain-specific agricultural tasks like plant pathology due to missing large-scale multimodal datasets and benchmarks for systematic evaluation.

Method: Created LeafNet dataset (186k leaf images, 97 disease classes) and LeafBench benchmark (13,950 QA pairs across 6 agricultural tasks) to evaluate 12 state-of-the-art VLMs on plant disease understanding.

Result: VLMs show substantial performance disparity: >90% accuracy on binary healthy-diseased classification but <65% on fine-grained pathogen/species identification. Fine-tuned VLMs outperform vision-only models, confirming multimodal advantage.

Conclusion: Highlights critical gaps in current VLMs for plant pathology and establishes LeafBench as essential framework for advancing AI-assisted plant disease diagnosis through rigorous multimodal evaluation.

Abstract: Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image–text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 digital leaf images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy–diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.

[134] Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So

Main category: cs.CV

TL;DR: FLEX is a training-free inference framework that enables autoregressive video diffusion models to generate much longer videos (up to 4 minutes) by addressing spectral bias in positional embeddings and adding dynamic priors through frequency-aware modulation and antiphase noise sampling.

DetailsMotivation: Autoregressive video diffusion models suffer from severe extrapolation failure when generating videos beyond their training horizons, leading to rapid error accumulation and temporal degradation. This limits their practical application for long video generation.

Method: FLEX introduces three key components: 1) Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones, 2) Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors, and 3) Inference-only Attention Sink to anchor global structure. The framework is training-free and plug-and-play.

Result: FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches long-video fine-tuned baselines at 12x scale (60s duration). It enables models like LongLive to generate consistent and dynamic videos at a 4-minute scale.

Conclusion: FLEX effectively bridges the gap between short-term training and long-term inference for video diffusion models, pushing generation limits to much longer durations without requiring additional training.

Abstract: Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. Project page is available at https://ga-lee.github.io/FLEX_demo.
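
A sketch of the frequency-dependent treatment: rescale (interpolate) the under-trained low-frequency RoPE bands when extending the horizon, while leaving the fast bands to extrapolate. The wavelength cutoff and uniform rescale here are assumptions, simpler than FLEX's actual modulation:

```python
import numpy as np

def freq_aware_rope_freqs(dim, train_len, infer_len, base=10000.0):
    """Per-pair RoPE rotation frequencies for an extended horizon. Bands
    whose wavelength exceeds the training length (slow, under-trained)
    are rescaled so positions stay in-distribution; fast bands keep
    their original frequency to preserve fine temporal discriminability."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # fastest band first
    wavelengths = 2.0 * np.pi / freqs
    scale = train_len / infer_len                  # < 1 when extending
    low = wavelengths > train_len                  # never completed a cycle
    return np.where(low, freqs * scale, freqs)
```

The split matters because a band that never completes a full cycle within the training horizon has only seen a narrow slice of its phase range, so extrapolating it directly is what drives the degradation described above.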

cs.AI

[135] Attention-gated U-Net model for semantic segmentation of brain tumors and feature extraction for survival prognosis

Rut Pate, Snehal Rajput, Mehul S. Raval, Rupal A. Kapdi, Mohendra Roy

Main category: cs.AI

TL;DR: An Attention-Gated Recurrent Residual U-Net (R2U-Net) based Triplanar (2.5D) model for brain tumor segmentation achieves a 0.900 DSC for whole-tumor segmentation on BraTS2021, comparable to leading models, and uses extracted features for survival prediction.

DetailsMotivation: Gliomas are complex primary brain tumors with varying aggressiveness and histology, making surgical treatment challenging. Accurate segmentation is crucial for treatment planning, but current methods need improvement in feature representation and computational efficiency.

Method: Proposes an Attention-Gated Recurrent Residual U-Net (R2U-Net) based Triplanar (2.5D) model that integrates residual connections, recurrent units, and triplanar architectures. Uses attention gates for feature enhancement and extracts 64 features per planar model for survival prediction, reduced to 28 features via Artificial Neural Network (ANN).

Result: Achieves Dice Similarity Score (DSC) of 0.900 for Whole Tumor segmentation on BraTS2021 validation set, comparable to leading models. For survival prediction: 45.71% accuracy, MSE of 108,318.128, and Spearman Rank Correlation Coefficient of 0.338 on test dataset.

Conclusion: The proposed model effectively improves brain tumor segmentation accuracy while maintaining computational efficiency, potentially aiding in better treatment planning. The triplanar approach successfully extracts features for survival prediction, though survival prediction performance has room for improvement.

Abstract: Gliomas, among the most common primary brain tumors, vary widely in aggressiveness, prognosis, and histology, making treatment challenging due to complex and time-intensive surgical interventions. This study presents an Attention-Gated Recurrent Residual U-Net (R2U-Net) based Triplanar (2.5D) model for improved brain tumor segmentation. The proposed model enhances feature representation and segmentation accuracy by integrating residual, recurrent, and triplanar architectures while maintaining computational efficiency, potentially aiding in better treatment planning. The proposed method achieves a Dice Similarity Score (DSC) of 0.900 for Whole Tumor (WT) segmentation on the BraTS2021 validation set, demonstrating performance comparable to leading models. Additionally, the triplanar network extracts 64 features per planar model for survival days prediction, which are reduced to 28 using an Artificial Neural Network (ANN). This approach achieves an accuracy of 45.71%, a Mean Squared Error (MSE) of 108,318.128, and a Spearman Rank Correlation Coefficient (SRC) of 0.338 on the test dataset.
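
The attention gate in such architectures typically follows the standard additive formulation (Attention U-Net style). A dense sketch with stand-in weights; the actual model operates on 2.5D convolutional feature maps rather than flat vectors:

```python
import numpy as np

def attention_gate(skip, gate, W_x, W_g, psi):
    """Additive attention gate: score each position from the skip features
    and the coarser gating signal, then reweight the skip connection so
    only salient regions pass to the decoder.
    skip, gate: (N, D); W_x, W_g: (D, D_int); psi: (D_int, 1)."""
    q = np.maximum(skip @ W_x + gate @ W_g, 0.0)  # ReLU(W_x x + W_g g)
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))      # sigmoid mask in (0, 1)
    return skip * alpha
```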

[136] ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

Main category: cs.AI

TL;DR: ResearchGym is a benchmark for evaluating AI agents on end-to-end research tasks using containerized environments from real ML papers, revealing significant capability-reliability gaps in current frontier models.

DetailsMotivation: To create a systematic evaluation framework for autonomous AI agents on real-world research tasks, moving beyond simple benchmarks to assess end-to-end research capabilities including hypothesis generation, experimentation, and surpassing human baselines.

Method: Repurposed 5 oral/spotlight papers from top conferences (ICML, ICLR, ACL) by preserving datasets, evaluation harnesses, and baselines while withholding the proposed methods. Created containerized task environments with 39 sub-tasks total, requiring agents to propose hypotheses, run experiments, and exceed human baselines.

Result: The GPT-5 agent improved over the provided baselines in only 1 of 15 evaluations (6.7%), by 11.5%, and completed only 26.5% of sub-tasks on average. Recurring failure modes included impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard context-length limits. In one run, however, the agent surpassed an ICML 2025 Spotlight solution, showing that frontier agents can occasionally reach state-of-the-art performance, albeit unreliably.

Conclusion: Current frontier AI agents show sharp capability-reliability gaps in research tasks - they can occasionally reach state-of-the-art performance but do so unreliably. ResearchGym provides infrastructure for systematic evaluation of autonomous agents on closed-loop research.

Abstract: We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper’s repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper’s proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper’s metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2) which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.

[137] Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

Main category: cs.AI

TL;DR: The paper introduces methods to modify LLM reasoning traces to prevent unauthorized knowledge distillation through anti-distillation (degrading training usefulness) and API watermarking (embedding verifiable signatures).

Motivation: To protect frontier LLMs from unauthorized knowledge distillation that takes unfair advantage of the considerable effort and cost invested in developing these models, by making teacher-generated reasoning traces less useful for distillation while maintaining answer quality.

Method: Several approaches for dynamically rewriting teacher reasoning outputs: two leverage LLM rewriting capabilities, others use gradient-based techniques. The methods preserve answer correctness and semantic coherence while modifying reasoning traces to achieve anti-distillation and watermarking objectives.

Result: A simple instruction-based rewriting approach achieves strong anti-distillation effect while maintaining or even improving teacher performance. The rewriting approach also enables highly reliable watermark detection with essentially no false alarms.

Conclusion: Effective methods exist for protecting LLMs from unauthorized distillation through reasoning trace modification, with instruction-based rewriting showing particularly promising results for both anti-distillation and watermarking objectives.

Abstract: Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emph{anti-distillation}, or degrading the training usefulness of query responses, and (2) \emph{API watermarking}, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher’s reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.

[138] Panini: Continual Learning in Token Space via Structured Memory

Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali, Chenda Duan, Vwani Roychowdhury

Main category: cs.AI

TL;DR: Panini introduces a continual learning framework using Generative Semantic Workspaces (GSW) - entity/event-aware QA networks that allow LLMs to reconstruct documents and reason via inference chains, improving efficiency and reducing unsupported generation compared to traditional RAG.

Motivation: Traditional RAG approaches inefficiently use compute by repeatedly reasoning over the same documents and can inject irrelevant context leading to unsupported generation. The paper aims to create a more human-like continual learning system that accumulates and consolidates semantic knowledge efficiently.

Method: Panini represents documents as Generative Semantic Workspaces (GSW) - networks of question-answer pairs that capture entities and events. The base model remains fixed while experiences are integrated into this external semantic memory. For queries, Panini traverses the continually-updated GSW to retrieve the most likely inference chains rather than verbatim documents or chunks.
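The abstract does not specify the traversal algorithm or the exact GSW schema, so the following is only an illustrative sketch: a toy QA network keyed by entity, with inference chains recovered by a breadth-first walk over entity links (all names and the data structure are hypothetical).

```python
from collections import deque

# Hypothetical GSW: each entity maps to QA pairs, each QA pair linking
# to the entities its answer mentions. Not the paper's actual schema.
gsw = {
    "Alice": [("Where does Alice work?", "Acme Corp", ["Acme Corp"])],
    "Acme Corp": [("Where is Acme Corp based?", "Berlin", ["Berlin"])],
    "Berlin": [("What country is Berlin in?", "Germany", ["Germany"])],
}

def inference_chain(start_entity, max_hops=3):
    """BFS over entity links, collecting QA pairs as an inference chain."""
    chain, seen, queue = [], {start_entity}, deque([(start_entity, 0)])
    while queue:
        entity, depth = queue.popleft()
        if depth >= max_hops:
            continue
        for question, answer, links in gsw.get(entity, []):
            chain.append((question, answer))
            for nxt in links:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return chain

chain = inference_chain("Alice")  # multi-hop chain Alice -> ... -> Germany
```

The key contrast with chunk-based RAG is that retrieval follows the network's links rather than re-reading verbatim text, so only the QA pairs on the chain reach the LLM.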

Result: Across six QA benchmarks, Panini achieves 5%-7% higher average performance than competitive baselines while using 2-30x fewer answer-context tokens. It supports fully open-source pipelines and reduces unsupported answers on curated unanswerable queries.

Conclusion: Efficient structuring of experiences at write time through the GSW framework yields both efficiency and reliability gains at read time, demonstrating a promising approach for continual learning in language models.

Abstract: Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) – an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time – as achieved by the GSW framework – yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.

[139] da Costa and Tarski meet Goguen and Carnap: a novel approach for ontological heterogeneity based on consequence systems

Gabriel Rocha

Main category: cs.AI

TL;DR: A formal framework for ontological heterogeneity using extended consequence systems and development graphs, drawing from Carnapian-Goguenism and da Costa-Tarski principles.

Motivation: To address ontological heterogeneity (different conceptualizations of the same domain) by providing a formal framework that can handle diverse ontological perspectives and their relationships.

Method: Develops da Costian-Tarskianism approach using consequence systems extended with ontological axioms, introduces extended consequence systems and extended development graphs with morphisms, fibring, and splitting operations.

Result: A formal framework for representing and relating heterogeneous ontologies through mathematical structures that allow systematic comparison and combination of different ontological perspectives.

Conclusion: The approach provides a rigorous foundation for handling ontological heterogeneity with implications for applied ontology, suggesting directions for future research in formal ontology integration.

Abstract: This paper presents a novel approach for ontological heterogeneity that draws heavily from Carnapian-Goguenism, as presented by Kutz, Mossakowski and Lücke (2010). The approach is provisionally designated da Costian-Tarskianism, named after da Costa’s Principle of Tolerance in Mathematics and after Alfred Tarski’s work on the concept of a consequence operator. The approach is based on the machinery of consequence systems, as developed by Carnielli et al. (2008) and Citkin and Muravitsky (2022), and it introduces the idea of an extended consequence system, which is a consequence system extended with ontological axioms. The paper also defines the concept of an extended development graph, which is a graph structure that allows ontologies to be related via morphisms of extended consequence systems, and additionally via other operations such as fibring and splitting. Finally, we discuss the implications of this approach for the field of applied ontology and suggest directions for future research.

[140] Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs

Luise Ge, Yongyan Zhang, Yevgeniy Vorobeychik

Main category: cs.AI

TL;DR: LLMs exhibit two distinct patterns in risky decision-making: reasoning models (RMs) behave more rationally while conversational models (CMs) are more human-like and sensitive to framing effects.

Motivation: As LLMs are increasingly used in decision support and agentic workflows, there is limited understanding of how they make decisions under uncertainty, particularly regarding risky choices and prospect representation.

Method: Comparative study of 20 frontier and open LLMs along two dimensions: prospect representation (explicit vs. experience-based) and decision rationale (explanation). Complemented by human subjects experiment and rational agent model as reference points.

Result: LLMs cluster into two categories: reasoning models (RMs) that behave rationally and are insensitive to framing effects, and conversational models (CMs) that are less rational, more human-like, and sensitive to prospect ordering, framing, and explanation. Mathematical reasoning training appears to differentiate RMs from CMs.

Conclusion: LLMs exhibit systematic differences in risky decision-making, with mathematical reasoning training being a key factor in producing more rational decision-making behavior, which has implications for their deployment in decision support systems.

Abstract: The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We initiate a comparative study of LLM risky choices along two dimensions: (1) prospect representation (explicit vs. experience based) and (2) decision rationale (explanation). Our study, which involves 20 frontier and open LLMs, is complemented by a matched human subjects experiment, which provides one reference point, while an expected payoff maximizing rational agent model provides another. We find that LLMs cluster into two categories: reasoning models (RMs) and conversational models (CMs). RMs tend towards rational behavior, are insensitive to the order of prospects, gain/loss framing, and explanations, and behave similarly whether prospects are explicit or presented via experience history. CMs are significantly less rational, slightly more human-like, sensitive to prospect ordering, framing, and explanation, and exhibit a large description-history gap. Paired comparisons of open LLMs suggest that a key factor differentiating RMs and CMs is training for mathematical reasoning.

[141] Secure and Energy-Efficient Wireless Agentic AI Networks

Yuanyan Song, Kezhi Wang, Xinmian Xu

Main category: cs.AI

TL;DR: Secure wireless agentic AI network with supervisor AI coordinating multiple agents for QoS provisioning while protecting privacy through friendly jamming and energy optimization.

Motivation: To address the need for secure, energy-efficient AI reasoning tasks in wireless networks while maintaining confidentiality of private knowledge and reasoning outcomes.

Method: Proposes a secure wireless agentic AI network with supervisor AI agent coordinating multiple agents. Formulates energy minimization problem optimizing agent selection, BS beamforming, and transmission power. Develops two resource allocation schemes: ASC (using ADMM, SDR, SCA) and LAW (using LLM optimizer within agentic workflow).

Result: Proposed solutions reduce network energy consumption by up to 59.1% compared to benchmark schemes. Practical validation using Qwen-based agentic AI system shows satisfactory reasoning accuracy across various public benchmarks.

Conclusion: The secure wireless agentic AI network framework effectively balances QoS provisioning, privacy protection, and energy efficiency through optimized resource allocation and agent coordination.

Abstract: In this paper, we introduce a secure wireless agentic AI network comprising one supervisor AI agent and multiple other AI agents to provision quality of service (QoS) for users’ reasoning tasks while ensuring confidentiality of private knowledge and reasoning outcomes. Specifically, the supervisor AI agent can dynamically assign other AI agents to participate in cooperative reasoning, while the unselected AI agents act as friendly jammers to degrade the eavesdropper’s interception performance. To extend the service duration of AI agents, an energy minimization problem is formulated that jointly optimizes AI agent selection, base station (BS) beamforming, and AI agent transmission power, subject to latency and reasoning accuracy constraints. To address the formulated problem, we propose two resource allocation schemes, ASC and LAW, which first decompose it into three sub-problems. Specifically, ASC optimizes each sub-problem iteratively using the proposed alternating direction method of multipliers (ADMM)-based algorithm, semi-definite relaxation (SDR), and successive convex approximation (SCA), while LAW tackles each sub-problem using the proposed large language model (LLM) optimizer within an agentic workflow. The experimental results show that the proposed solutions can reduce network energy consumption by up to 59.1% compared to other benchmark schemes. Furthermore, the proposed schemes are validated using a practical agentic AI system based on Qwen, demonstrating satisfactory reasoning accuracy across various public benchmarks.

[142] Predicting Invoice Dilution in Supply Chain Finance with Leakage Free Two Stage XGBoost, KAN (Kolmogorov Arnold Networks), and Ensemble Models

Pavel Koptev, Vishnu Kumar, Konstantin Malkov, George Shapiro, Yury Vikhanov

Main category: cs.AI

TL;DR: AI/ML framework to predict invoice dilution in supply chain finance using real-time dynamic credit limits

Motivation: Invoice/payment dilution (the gap between the approved amount and the actual collection) causes significant non-credit risk and margin loss in supply chain finance. Traditional IPU methods hinder adoption, especially for sub-investment grade buyers.

Method: An AI and machine learning framework that supplements deterministic algorithms to predict invoice dilution, using an extensive production dataset across nine key transaction fields.

Result: Not specified in the abstract; the paper evaluates the framework's performance.

Conclusion: AI/ML can enhance traditional supply chain finance risk management by providing real-time, data-driven dilution predictions

Abstract: Invoice or payment dilution, the gap between the approved invoice amount and the actual collection, is a significant source of non-credit risk and margin loss in supply chain finance. Traditionally, this risk is managed through the buyer’s irrevocable payment undertaking (IPU), which commits to full payment without deductions. However, IPUs can hinder supply chain finance adoption, particularly among sub-investment-grade buyers. Newer, data-driven methods use real-time dynamic credit limits, projecting dilution for each buyer-supplier pair in real time. This paper introduces an AI/machine-learning framework and evaluates how it can supplement a deterministic algorithm to predict invoice dilution, using an extensive production dataset across nine key transaction fields.

[143] Enhancing Diversity and Feasibility: Joint Population Synthesis from Multi-source Data Using Generative Models

Farbod Abbasi, Zachary Patterson, Bilal Farooq

Main category: cs.AI

TL;DR: Proposes a joint learning WGAN with gradient penalty for multi-source synthetic population generation, improving diversity and feasibility over sequential methods.

Motivation: Current synthetic population generation methods for agent-based models have limitations: they rely on single datasets or sequential processes that fail to capture feature interplay, and struggle with sampling/structural zeros that reduce diversity and feasibility.

Method: Uses Wasserstein GAN with gradient penalty in a joint learning framework to simultaneously integrate and synthesize multi-source datasets, adding an inverse gradient penalty regularization term to the generator loss function.
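The paper's inverse-gradient-penalty regularizer for the generator is its novelty and is not spelled out in the abstract, so the sketch below shows only the standard WGAN-GP critic objective it builds on, with a toy critic and a finite-difference gradient norm standing in for autograd (all functions and numbers are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x):
    # Toy critic for illustration only; a real WGAN-GP critic is a network.
    return np.tanh(x @ np.array([0.5, -0.3]))

def grad_norm(f, x, eps=1e-5):
    """Finite-difference estimate of ||grad f|| at each row of x."""
    g = np.zeros_like(x)
    for j in range(x.shape[1]):
        d = np.zeros(x.shape[1])
        d[j] = eps
        g[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return np.linalg.norm(g, axis=1)

real = rng.normal(0, 1, (8, 2))          # observed attribute vectors
fake = rng.normal(3, 1, (8, 2))          # generated attribute vectors
t = rng.uniform(0, 1, (8, 1))
interp = t * real + (1 - t) * fake       # points between real and fake

# Standard gradient penalty: push critic gradient norms toward 1.
gp = np.mean((grad_norm(critic, interp) - 1.0) ** 2)
critic_loss = critic(fake).mean() - critic(real).mean() + 10.0 * gp
```

The paper adds a separate regularization term to the *generator* loss; this sketch only illustrates where such penalty terms sit in the WGAN-GP objective.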

Result: Joint approach outperforms sequential baseline with 7% recall increase and 15% precision increase. Regularization further improves diversity/feasibility with 10% recall and 1% precision gains. Similarity score of 88.1 vs 84.6 for sequential method.

Conclusion: The multi-source generative approach using joint learning WGAN with regularization significantly improves synthetic population quality, potentially enhancing accuracy and reliability of agent-based models in transportation and urban planning.

Abstract: Generating realistic synthetic populations is essential for agent-based models (ABM) in transportation and urban planning. Current methods face two major limitations. First, many rely on a single dataset or follow a sequential data fusion and generation process, which means they fail to capture the complex interplay between features. Second, these approaches struggle with sampling zeros (valid but unobserved attribute combinations) and structural zeros (infeasible combinations due to logical constraints), which reduce the diversity and feasibility of the generated data. This study proposes a novel method to simultaneously integrate and synthesize multi-source datasets using a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty. This joint learning method improves both the diversity and feasibility of synthetic data by defining a regularization term (inverse gradient penalty) for the generator loss function. For the evaluation, we implement a unified evaluation metric for similarity, and place special emphasis on measuring diversity and feasibility through recall, precision, and the F1 score. Results show that the proposed joint approach outperforms the sequential baseline, with recall increasing by 7% and precision by 15%. Additionally, the regularization term further improves diversity and feasibility, reflected in a 10% increase in recall and 1% in precision. We assess similarity distributions using a five-metric score. The joint approach performs better overall, and reaches a score of 88.1 compared to 84.6 for the sequential method. Since synthetic populations serve as a key input for ABM, this multi-source generative approach has the potential to significantly enhance the accuracy and reliability of ABM.

[144] When Remembering and Planning are Worth it: Navigating under Change

Omid Madani, J. Brian Burns, Reza Eghbali, Thomas L. Dean

Main category: cs.AI

TL;DR: Paper explores memory strategies for spatial navigation in changing uncertain environments, showing that agents using episodic memory and probability learning outperform simpler approaches when task difficulty increases.

Motivation: To understand how different memory types can help agents navigate non-stationary environments with uncertainty, limited sensing, and changing barriers/food locations in foraging tasks.

Method: Study range of strategies from simple to sophisticated, focusing on memory and learning architectures. Use non-stationary probability learning to update episodic memories, build imperfect maps from experience, and plan on the fly.
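The abstract names non-stationary probability learning without giving the update rule; one common minimal instance is an exponential recency-weighted estimate, sketched here on a toy food-location belief (the step size and scenario are assumptions, not the paper's):

```python
def update(p, observed, alpha=0.3):
    """Exponential recency-weighted estimate: recent observations dominate,
    so the belief tracks a non-stationary environment instead of averaging
    over its whole history."""
    return p + alpha * (observed - p)

# Food is at site A for 10 days, then moves away.
p = 0.5
for _ in range(10):
    p = update(p, 1.0)   # food seen at A -> belief rises toward 1
for _ in range(5):
    p = update(p, 0.0)   # food gone -> belief decays within a few days
```

A sample-average learner would still assign site A high probability after the change; the recency-weighted belief drops quickly, which is the property that makes remembered food locations worth planning toward.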

Result: Agents using episodic memory with probability learning become increasingly more efficient than minimal-memory agents as task difficulty (distance to goal) increases, provided uncertainty from localization and change isn’t too large.

Conclusion: Architectures incorporating multiple memory strategies are needed for different subtasks (exploration vs planning), with episodic memory and probability learning providing substantial efficiency gains in complex navigation tasks.

Abstract: We explore how different types and uses of memory can aid spatial navigation in changing uncertain environments. In the simple foraging task we study, every day, our agent has to find its way from its home, through barriers, to food. Moreover, the world is non-stationary: from day to day, the location of the barriers and food may change, and the agent’s sensing such as its location information is uncertain and very limited. Any model construction, such as a map, and use, such as planning, needs to be robust against these challenges, and if any learning is to be useful, it needs to be adequately fast. We look at a range of strategies, from simple to sophisticated, with various uses of memory and learning. We find that an architecture that can incorporate multiple strategies is required to handle (sub)tasks of a different nature, in particular for exploration and search, when food location is not known, and for planning a good path to a remembered (likely) food location. An agent that utilizes non-stationary probability learning techniques to keep updating its (episodic) memories and that uses those memories to build maps and plan on the fly (imperfect maps, i.e. noisy and limited to the agent’s experience) can be increasingly and substantially more efficient than the simpler (minimal-memory) agents, as the task difficulties such as distance to goal are raised, as long as the uncertainty, from localization and change, is not too large.

[145] EAA: Automating materials characterization with vision language model agents

Ming Du, Yanqi Luo, Srutarshi Banerjee, Michael Wojcik, Jelena Popovic, Mathew J. Cherukara

Main category: cs.AI

TL;DR: EAA is a vision-language-model agent system that automates experimental microscopy workflows using multimodal reasoning and tool-augmented actions, demonstrated at a synchrotron beamline.

Motivation: To automate complex experimental microscopy workflows, reduce operational burden, lower expertise barriers for users, and enhance beamline efficiency through vision-capable agents.

Method: Integrates multimodal reasoning, tool-augmented action, and optional long-term memory in a flexible task-manager architecture. Uses vision-language models for both autonomous procedures and interactive user-guided measurements, with Model Context Protocol compatibility for instrument control tools.

Result: Successfully demonstrated at an imaging beamline at the Advanced Photon Source, including automated zone plate focusing, natural language-described feature search, and interactive data acquisition.

Conclusion: Vision-capable agents can significantly enhance beamline efficiency, reduce operational burden, and lower expertise barriers for experimental microscopy users.

Abstract: We present Experiment Automation Agents (EAA), a vision-language-model-driven agentic system designed to automate complex experimental microscopy workflows. EAA integrates multimodal reasoning, tool-augmented action, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. Built on a flexible task-manager architecture, the system enables workflows ranging from fully agent-driven automation to logic-defined routines that embed localized LLM queries. EAA further provides a modern tool ecosystem with two-way compatibility for Model Context Protocol (MCP), allowing instrument-control tools to be consumed or served across applications. We demonstrate EAA at an imaging beamline at the Advanced Photon Source, including automated zone plate focusing, natural language-described feature search, and interactive data acquisition. These results illustrate how vision-capable agents can enhance beamline efficiency, reduce operational burden, and lower the expertise barrier for users.

[146] X-MAP: eXplainable Misclassification Analysis and Profiling for Spam and Phishing Detection

Qi Zhang, Dian Chen, Lance M. Kaplan, Audun Jøsang, Dong Hyun Jeong, Feng Chen, Jin-Hee Cho

Main category: cs.AI

TL;DR: X-MAP is an explainable framework that analyzes misclassifications in spam/phishing detection by revealing topic-level semantic patterns behind model failures using SHAP and matrix factorization.

Motivation: Misclassifications in spam and phishing detection are harmful (false negatives expose users to attacks, false positives degrade trust). Existing uncertainty-based detectors can flag errors but may be deceived and offer limited interpretability.

Method: X-MAP combines SHAP-based feature attributions with non-negative matrix factorization to build interpretable topic profiles for reliably classified spam/phishing and legitimate messages, then measures each message’s deviation from these profiles using Jensen-Shannon divergence.
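The deviation measure itself is standard and can be sketched directly: Jensen-Shannon divergence between a message's topic distribution and a class profile (the topic vectors below are made-up illustrations, not SHAP/NMF outputs).

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative topic mixes over 3 topics (hypothetical numbers):
legit_profile = [0.60, 0.30, 0.10]   # profile of reliably classified legit mail
typical_msg   = [0.55, 0.35, 0.10]   # close to the profile -> small divergence
odd_msg       = [0.05, 0.15, 0.80]   # far from the profile -> large divergence

d_typical = js_divergence(legit_profile, typical_msg)
d_odd = js_divergence(legit_profile, odd_msg)
```

With base-2 logs the divergence is bounded in [0, 1]; the paper's finding is that misclassified messages sit at least twice as far from their class profile as correctly classified ones, which is the gap the `odd_msg` example mimics.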

Result: Misclassified messages exhibit at least two times larger divergence than correctly classified ones. As a detector, X-MAP achieves up to 0.98 AUROC and lowers false-rejection rate at 95% TRR to 0.089. When used as a repair layer, it recovers up to 97% of falsely rejected correct predictions with moderate leakage.

Conclusion: X-MAP demonstrates effectiveness and interpretability for improving spam and phishing detection through explainable misclassification analysis.

Abstract: Misclassifications in spam and phishing detection are very harmful, as false negatives expose users to attacks while false positives degrade trust. Existing uncertainty-based detectors can flag potential errors, but can be deceived and offer limited interpretability. This paper presents X-MAP, an eXplainable Misclassification Analysis and Profiling framework that reveals topic-level semantic patterns behind model failures. X-MAP combines SHAP-based feature attributions with non-negative matrix factorization to build interpretable topic profiles for reliably classified spam/phishing and legitimate messages, and measures each message’s deviation from these profiles using Jensen-Shannon divergence. Experiments on SMS and phishing datasets show that misclassified messages exhibit at least two times larger divergence than correctly classified ones. As a detector, X-MAP achieves up to 0.98 AUROC and lowers the false-rejection rate at 95% TRR to 0.089 on positive predictions. When used as a repair layer on base detectors, it recovers up to 97% of falsely rejected correct predictions with moderate leakage. These results demonstrate X-MAP’s effectiveness and interpretability for improving spam and phishing detection.

[147] AgriWorld:A World Tools Protocol Framework for Verifiable Agricultural Reasoning with Code-Executing LLM Agents

Zhixing Zhang, Jesen Zhang, Hao Liu, Qinhan Lv, Jing Yang, Kaitong Cai, Keze Wang

Main category: cs.AI

TL;DR: An agentic framework combining LLMs with agricultural data tools for multimodal reasoning in agriculture, featuring a Python execution environment and reflective agent for geospatial analysis.

Motivation: Current agricultural foundation models lack language reasoning and interactivity, while LLMs can't directly reason over high-dimensional agricultural datasets. There's a need to bridge this gap for practical agronomic workflows.

Method: Developed AgriWorld (Python execution environment with tools for geospatial queries, remote-sensing analytics, crop simulation, and predictors) and Agro-Reflective (multi-turn LLM agent that writes code, observes results, and refines analysis via execute-observe-refine loop).
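The execute-observe-refine loop can be reduced to a small skeleton. Everything below is a hedged sketch, not the AgriWorld/Agro-Reflective interface: the stub "LLM" and the toy `mean_yield` tool are invented for illustration.

```python
def run_agent(llm, env, max_turns=3):
    """Minimal execute-observe-refine loop: the agent writes code, the
    environment executes it, and errors are fed back as observations."""
    history = []
    for _ in range(max_turns):
        code = llm(history)                 # write code given past turns
        try:
            result = env(code)              # execute in the tool environment
        except Exception as e:
            result = f"error: {e}"          # errors are observations too
        history.append((code, result))      # observe
        if "error" not in str(result):
            break                           # result looks usable; stop refining
    return history

# Stub LLM: first emits invalid code, then corrects it after seeing the error.
def stub_llm(history):
    return "yield('field_1')" if not history else "mean_yield('field_1')"

def stub_env(code):
    tools = {"mean_yield": lambda field: 4.2}   # toy tool: t/ha for a parcel
    return eval(code, tools)

trace = run_agent(stub_llm, stub_env)
```

The point of the loop is that the first, broken attempt is not a failure but an observation: the refined second turn succeeds, which is the behavior the paper credits for reliable agricultural reasoning.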

Result: Outperforms text-only and direct tool-use baselines on AgroBench (diverse agricultural QA tasks including lookups, forecasting, anomaly detection, and counterfactual analysis), validating execution-driven reflection for reliable agricultural reasoning.

Conclusion: The agentic framework successfully bridges the gap between agricultural data models and LLMs, enabling multimodal reasoning and interactive capabilities for real-world agricultural applications.

Abstract: Foundation models for agriculture are increasingly trained on massive spatiotemporal data (e.g., multi-spectral remote sensing, soil grids, and field-level management logs) and achieve strong performance on forecasting and monitoring. However, these models lack language-based reasoning and interactive capabilities, limiting their usefulness in real-world agronomic workflows. Meanwhile, large language models (LLMs) excel at interpreting and generating text, but cannot directly reason over high-dimensional, heterogeneous agricultural datasets. We bridge this gap with an agentic framework for agricultural science. It provides a Python execution environment, AgriWorld, exposing unified tools for geospatial queries over field parcels, remote-sensing time-series analytics, crop growth simulation, and task-specific predictors (e.g., yield, stress, and disease risk). On top of this environment, we design a multi-turn LLM agent, Agro-Reflective, that iteratively writes code, observes execution results, and refines its analysis via an execute-observe-refine loop. We introduce AgroBench, with scalable data generation for diverse agricultural QA spanning lookups, forecasting, anomaly detection, and counterfactual “what-if” analysis. Experiments outperform text-only and direct tool-use baselines, validating execution-driven reflection for reliable agricultural reasoning.

[148] World-Model-Augmented Web Agents with Action Correction

Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang

Main category: cs.AI

TL;DR: WAC is a web agent that uses multi-model collaboration with consequence simulation and feedback-driven action refinement to improve web task automation by preventing risky actions and enhancing reasoning about environment changes.

Motivation: Current web agents based on LLMs struggle with reasoning about environment changes and lack comprehensive risk awareness, often performing risky actions prematurely that lead to task failures and losses.

Method: Proposes WAC with three key components: 1) Multi-agent collaboration where an action model consults a world model for strategic guidance, 2) Two-stage deduction chain where the world model simulates action outcomes and a judge model scrutinizes them, 3) Feedback-driven action refinement triggered when risks are detected.
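The two-stage deduction chain has a simple control-flow shape: propose an action, let a world model simulate its consequence, let a judge scrutinize it, and refine on rejection. The sketch below uses toy stubs in place of the three LLMs (the cart scenario and all names are invented for illustration).

```python
def act_with_correction(propose, world_model, judge, state, max_fixes=2):
    """Sketch of a simulate-then-judge action loop: risky actions are
    caught before execution and corrective feedback drives a new proposal."""
    feedback = None
    for _ in range(max_fixes + 1):
        action = propose(state, feedback)          # action model
        predicted = world_model(state, action)     # consequence simulation
        verdict = judge(predicted)                 # risk scrutiny
        if verdict == "safe":
            return action
        feedback = verdict                         # corrective feedback
    return "abstain"

# Toy models: deleting the cart is irreversible; saving it is safe.
propose = lambda state, fb: "delete_cart" if fb is None else "save_cart"
world_model = lambda state, action: ("cart lost" if action == "delete_cart"
                                     else "cart kept")
judge = lambda predicted: ("safe" if predicted == "cart kept"
                           else "risk: irreversible loss")

chosen = act_with_correction(propose, world_model, judge, state="checkout")
```

The design choice this illustrates is that the judge sees the *simulated outcome*, not the action string, so risk assessment benefits from the world model's knowledge of state transitions.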

Result: WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online-Mind2Web benchmarks, demonstrating improved performance in web task automation.

Conclusion: The proposed approach of model collaboration, consequence simulation, and feedback-driven refinement effectively addresses limitations in current web agents, enabling more reliable and risk-aware web task automation.

Abstract: Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi-agent collaboration process that enables an action model to consult a world model as a web-environment expert for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state transition dynamics to enhance candidate action proposal. To achieve risk-aware resilient task execution, we introduce a two-stage deduction chain. A world model, specialized in environmental state transitions, simulates action outcomes, which a judge model then scrutinizes to trigger action corrective feedback when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online-Mind2Web.

[149] Improving LLM Reliability through Hybrid Abstention and Adaptive Detection

Ankit Sharma, Nachiket Tapas, Jyotiprakash Patra

Main category: cs.AI

TL;DR: Adaptive abstention system for LLMs that dynamically adjusts safety thresholds using real-time contextual signals to balance safety and utility while reducing latency.

DetailsMotivation: Current LLM safety mechanisms face a fundamental trade-off: strict filtering blocks benign queries while relaxed controls risk unsafe content. Conventional guardrails are context-insensitive, computationally expensive, and cause high latency.

Method: Proposes an adaptive abstention system with dynamic safety thresholds based on contextual signals (domain, user history). Uses multi-dimensional detection architecture with five parallel detectors combined through hierarchical cascade mechanism to optimize speed and precision.
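The cascade idea can be sketched in a few lines: cheap detectors run first and short-circuit confident decisions, and the abstention threshold adapts to a contextual signal (here, the query's domain). All detector logic and threshold values below are invented for illustration, not the paper's five detectors.

```python
# Hedged sketch of a hierarchical detector cascade with a context-dependent
# threshold. Detectors return (verdict, confidence); ordering them cheapest
# to costliest means most queries never reach the expensive checks.
def cascade_decision(query: str, domain: str, detectors, thresholds) -> str:
    """Return "allow" or "abstain" from the first sufficiently confident detector."""
    threshold = thresholds.get(domain, 0.8)     # contextual signal adjusts strictness
    for detect in detectors:                    # ordered cheapest -> costliest
        verdict, confidence = detect(query)
        if confidence >= threshold:
            return verdict                      # early exit saves latency
    return "abstain"                            # undecided queries fail safe

# Two toy detectors standing in for the paper's five.
keyword_filter = lambda q: ("abstain", 0.95) if "dosage" in q else ("allow", 0.6)
semantic_check = lambda q: ("allow", 0.9)

detectors = [keyword_filter, semantic_check]
thresholds = {"medical": 0.5, "general": 0.85}  # stricter early exit for medical

print(cascade_decision("ibuprofen dosage for a child", "medical", detectors, thresholds))
print(cascade_decision("write a poem about rain", "general", detectors, thresholds))
```

The medical query is caught by the cheap keyword filter alone, while the benign general query falls through to the second detector before being allowed, which is the latency-saving behavior the cascade is designed for.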

Result: Significant reductions in false positives, especially in sensitive domains like medical advice and creative writing. Maintains high safety precision and near-perfect recall under strict modes. Achieves substantial latency improvements compared to non-cascaded models and external guardrail systems.

Conclusion: The context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.

Abstract: Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off: strict filtering mechanisms prevent harmful outputs but often block benign queries, while relaxed controls risk unsafe content generation. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive, resulting in high latency and degraded user experience. To address these limitations, we introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history. The proposed framework integrates a multi-dimensional detection architecture composed of five parallel detectors, combined through a hierarchical cascade mechanism to optimize both speed and precision. The cascade design reduces unnecessary computation by progressively filtering queries, achieving substantial latency improvements compared to non-cascaded models and external guardrail systems. Extensive evaluation on mixed and domain-specific workloads demonstrates significant reductions in false positives, particularly in sensitive domains such as medical advice and creative writing. The system maintains high safety precision and near-perfect recall under strict operating modes. Overall, our context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.


[150] Common Belief Revisited

Thomas Ågotnes

Main category: cs.AI

TL;DR: The paper analyzes the logical properties of common belief in epistemic logic, showing that common belief in KD45 systems requires an additional axiom beyond shift-reflexivity, and this depends on the number of agents.

DetailsMotivation: To address an open problem in epistemic logic: what is the complete logical characterization of common belief when individual beliefs follow KD45 properties? The paper investigates whether KD4 extended with shift-reflexivity fully captures common belief.

Method: The authors use formal logical analysis and proof techniques in modal logic to examine the properties of common belief operators. They analyze the axiomatic requirements and demonstrate completeness through logical proofs.

Result: The paper shows that KD4 with shift-reflexivity is not sufficient to characterize common belief - one additional axiom is required, and this axiom depends on the number of agents. The authors provide a complete characterization of common belief, solving the open problem.

Conclusion: Common belief in KD45 systems has a more complex logical structure than previously thought, requiring both shift-reflexivity and an additional agent-count-dependent axiom for complete characterization.

Abstract: Contrary to common belief, common belief is not KD4. If individual belief is KD45, common belief does indeed lose the 5 property and keep the D and 4 properties – and it has none of the other commonly considered properties of knowledge and belief. But it has another property: $C(Cφ\rightarrow φ)$ – corresponding to so-called shift-reflexivity (reflexivity one step ahead). This observation begs the question: is KD4 extended with this axiom a complete characterisation of common belief in the KD45 case? If not, what \emph{is} the logic of common belief? In this paper we show that the answer to the first question is ``no’’: there is one additional axiom, and, furthermore, it relies on the number of agents. We show that the result is a complete characterisation of common belief, settling the open problem.

[151] GenAI-LA: Generative AI and Learning Analytics Workshop (LAK 2026), April 27–May 1, 2026, Bergen, Norway

Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez, Francisco Jurado, Alvaro Ortigosa, Ruben Tolosana

Main category: cs.AI

TL;DR: EduEVAL-DB is a dataset for evaluating AI tutors, containing teacher explanations from human and LLM-simulated roles, annotated with pedagogical risk dimensions, with validation experiments on risk detection models.

DetailsMotivation: To support evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations, addressing the need for standardized assessment of explanation quality in educational AI systems.

Method: Created dataset with 854 explanations for 139 ScienceQA questions, including human-teacher and LLM-simulated teacher roles via prompt engineering. Developed pedagogical risk rubric with 5 dimensions, annotated explanations through semi-automatic process with expert review. Conducted validation experiments benchmarking Gemini 2.5 Pro vs Llama 3.1 8B and fine-tuning for risk detection.

Result: Dataset provides comprehensive pedagogical risk annotations. Preliminary experiments show potential for using EduEVAL-DB to train models for pedagogical risk detection, with focus on deployable models on consumer hardware.

Conclusion: EduEVAL-DB enables systematic evaluation of AI tutors’ explanation quality and supports development of pedagogical risk detection systems that can be deployed in practical educational settings.

Abstract: This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.

[152] Quantifying construct validity in large language model evaluations

Ryan Othniel Kearns

Main category: cs.AI

TL;DR: Proposes structured capabilities model to extract interpretable, generalizable capabilities from LLM benchmark results by combining insights from latent factor models and scaling laws to address construct validity issues.

DetailsMotivation: Benchmark results are often treated as synonymous with general model capabilities, but benchmarks have problems like test set contamination and annotator error. Existing approaches (latent factor models and scaling laws) are unsatisfactory for construct validity - latent factor models ignore scaling laws and proxy model size, while scaling laws ignore measurement error and produce uninterpretable capabilities.

Method: Develops structured capabilities model that combines insights from both approaches: model scale should inform capabilities (like scaling laws), and these capabilities should inform observed results up to measurement error (like latent factor models). Fits this model and alternatives on large sample of results from OpenLLM Leaderboard.

Result: Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. The model demonstrates better explanatory and predictive power for quantifying construct validity in LLM evaluations.

Conclusion: Structured capabilities model successfully addresses construct validity issues by properly separating model scale from capabilities, combining strengths of both existing approaches to extract interpretable and generalizable capabilities from benchmark results.

Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.

[153] RUVA: Personalized Transparent On-Device Graph Reasoning

Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio, Francesco Musicco, Vito Walter Anelli, Tommaso Di Noia, Francesco Maria Donini

Main category: cs.AI

TL;DR: Ruva introduces a “Glass Box” architecture for Personal AI that replaces black-box vector databases with Personal Knowledge Graphs, enabling human-in-the-loop memory curation, precise fact redaction, and true privacy compliance.

DetailsMotivation: Current Personal AI systems use black-box retrieval-augmented generation with vector databases that lack accountability - users can't inspect why AI hallucinates or retrieves sensitive data, and deletion from vector spaces is mathematically imprecise, leaving privacy-violating "ghosts."

Method: Ruva shifts from vector matching to graph reasoning by grounding Personal AI in a Personal Knowledge Graph. This enables users to inspect what the AI knows and perform precise redaction of specific facts through human-in-the-loop memory curation.

Result: Ruva provides a “Glass Box” architecture that ensures the “Right to be Forgotten” by allowing users to be editors of their own AI memories, with precise control over what information is retained or removed.

Conclusion: Ruva represents a paradigm shift from opaque vector-based systems to transparent graph-based Personal AI, empowering users with accountability, privacy, and control over their AI’s knowledge.

Abstract: The Personal AI landscape is currently dominated by “Black Box” Retrieval-Augmented Generation. While standard vector databases offer statistical matching, they suffer from a fundamental lack of accountability: when an AI hallucinates or retrieves sensitive data, the user cannot inspect the cause nor correct the error. Worse, “deleting” a concept from a vector space is mathematically imprecise, leaving behind probabilistic “ghosts” that violate true privacy. We propose Ruva, the first “Glass Box” architecture designed for Human-in-the-Loop Memory Curation. Ruva grounds Personal AI in a Personal Knowledge Graph, enabling users to inspect what the AI knows and to perform precise redaction of specific facts. By shifting the paradigm from Vector Matching to Graph Reasoning, Ruva ensures the “Right to be Forgotten.” Users are the editors of their own lives; Ruva hands them the pen. The project and the demo video are available at http://sisinf00.poliba.it/ruva/.

[154] How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

Hongxuan Wu, Yukun Zhang, Xueqing Zhou

Main category: cs.AI

TL;DR: The paper introduces PID Flow, a framework using Partial Information Decomposition to analyze how multimodal Transformers process visual and linguistic information across layers, revealing that visual information peaks early while language dominates late layers.

DetailsMotivation: To understand how multimodal Transformers (like LLaVA) make predictions - whether driven by visual evidence, linguistic reasoning, or genuine cross-modal fusion - and how this information structure evolves across Transformer layers.

Method: Developed PID Flow: a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation to make Partial Information Decomposition tractable for high-dimensional neural representations. Applied to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks.
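The closed-form Gaussian PID step can be illustrated on its own. This sketch omits the dimensionality-reduction and normalizing-flow Gaussianization stages, assumes scalar sources and target, and assumes the minimum-MI definition of redundancy (the paper does not specify its redundancy measure here), so treat it as one plausible instantiation rather than PID Flow itself.

```python
# Hedged sketch of a Gaussian partial information decomposition: given two
# (approximately) Gaussian sources x1, x2 and a target y, split the predictive
# information into redundant, unique, and synergistic components in bits.
import numpy as np

def gaussian_mi(x, y):
    """I(X;Y) in bits for jointly Gaussian scalars, via the correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log2(1.0 - r ** 2)

def gaussian_pid(x1, x2, y):
    """Minimum-MI redundancy; unique terms and synergy follow as remainders."""
    i1, i2 = gaussian_mi(x1, y), gaussian_mi(x2, y)
    full = np.cov(np.stack([x1, x2, y]))
    # joint MI I(X1,X2 ; Y) from covariance determinants
    i_joint = 0.5 * np.log2(np.linalg.det(full[:2, :2]) * full[2, 2]
                            / np.linalg.det(full))
    red = min(i1, i2)                        # minimum-MI redundancy
    u1, u2 = i1 - red, i2 - red              # unique components
    return {"red": red, "uni1": u1, "uni2": u2,
            "syn": i_joint - red - u1 - u2}  # synergy as the remainder

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50_000), rng.normal(size=50_000)
y = x1 + x2 + rng.normal(size=50_000)        # predicting y needs both sources
pid = gaussian_pid(x1, x2, y)
print(pid)  # symmetric sources: unique terms ~0, synergy ~0.5 bits
```

On this synthetic example the analytic values are about 0.29 bits redundant and 0.5 bits synergistic; in the paper the analogous layer-wise vision-unique and language-unique terms are what trace the modal transduction pattern.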

Result: Found consistent modal transduction pattern: visual-unique information peaks early and decays with depth, language-unique information dominates late layers (≈82% of final prediction), cross-modal synergy remains low (<2%). Pattern is stable across model variants but task-dependent. Causal experiments with attention knockouts show disrupting primary transduction pathway increases trapped visual information and information cost.

Conclusion: Provides information-theoretic, causal account of how vision becomes language in multimodal Transformers, revealing architectural bottlenecks where modality-specific information is lost, with implications for improving multimodal model design.

Abstract: When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation – and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy remains below 2%. This trajectory is highly stable across model variants (layer-wise correlations $>$0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost – effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.

[155] On inferring cumulative constraints

Konstantin Sidorov

Main category: cs.AI

TL;DR: A preprocessing method for scheduling problems that infers additional cumulative constraints to capture multi-resource interactions, improving search performance and solution quality.

DetailsMotivation: Current constraint programming approaches for scheduling propagate cumulative constraints individually, missing important multi-resource interactions that cause slowdowns on certain benchmarks.

Method: The method interprets cumulative constraints as linear inequalities over occupancy vectors, discovers sets of tasks that cannot run in parallel (covers), strengthens these cover inequalities through lifting, and injects the resulting constraints back into the scheduling problem.

Result: Experiments on RCPSP and RCPSP/max test suites show improved search performance, tighter objective bounds, discovery of 25 new lower bounds and five new best solutions, with eight lower bounds obtained directly from the inferred constraints.

Conclusion: The preprocessing method effectively captures multi-resource interactions in scheduling problems, leading to better performance and solution quality without significant degradation on unfavorable instances.

Abstract: Cumulative constraints are central in scheduling with constraint programming, yet propagation is typically performed per constraint, missing multi-resource interactions and causing severe slowdowns on some benchmarks. I present a preprocessing method for inferring additional cumulative constraints that capture such interactions without search-time probing. This approach interprets cumulative constraints as linear inequalities over occupancy vectors and generates valid inequalities by (i) discovering covers, the sets of tasks that cannot run in parallel, (ii) strengthening the cover inequalities for the discovered sets with lifting, and (iii) injecting the resulting constraints back into the scheduling problem instance. Experiments on standard RCPSP and RCPSP/max test suites show that these inferred constraints improve search performance and tighten objective bounds on favorable instances, while incurring little degradation on unfavorable ones. Additionally, these experiments discover 25 new lower bounds and five new best solutions; eight of the lower bounds are obtained directly from the inferred constraints.

[156] CARE Drive: A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving

Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners, Federico Scari, Simeon Calvert, Bart van Arem, Arkady Zgonnikov

Main category: cs.AI

TL;DR: CARE Drive is a framework for evaluating whether vision language models in automated driving make decisions based on human-relevant reasons rather than just post-hoc rationalizations.

DetailsMotivation: Existing evaluation methods for vision language models in automated driving focus only on outcome-based performance (safety, trajectory accuracy) without assessing whether model decisions reflect genuine human-relevant reasoning. This creates false confidence in safety-critical domains where understanding the basis for decisions is crucial.

Method: CARE Drive is a model-agnostic framework that uses a two-stage process: 1) prompt calibration to ensure stable outputs, and 2) systematic contextual perturbation to measure decision sensitivity to human reasons (safety margins, social pressure, efficiency constraints). It compares baseline vs. reason-augmented model decisions under controlled contextual variation.

Result: Explicit human reasons significantly influence model decisions, improving alignment with expert-recommended behavior in a cyclist overtaking scenario. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons.

Conclusion: Reason responsiveness in foundation models can be systematically evaluated without modifying model parameters, providing empirical evidence that human reasons can causally influence model decision behavior in safety-critical applications.

Abstract: Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome-based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human-relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason-responsive decision making or merely post hoc rationalizations. This limitation is especially significant in safety-critical domains because it can create false confidence. To address this gap, we propose CARE Drive, Context-Aware Reasons Evaluation for Driving, a model-agnostic framework for evaluating reason responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason-augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two-stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist-overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert-recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason responsiveness in foundation models can be systematically evaluated without modifying model parameters.

[157] PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra

Xiachong Feng, Liang Zhao, Weihong Zhong, Yichong Huang, Yuxuan Gu, Lingpeng Kong, Xiaocheng Feng, Bing Qin

Main category: cs.AI

TL;DR: PERSONA is a training-free framework for personality control in LLMs that manipulates personality vectors in activation space, achieving fine-tuning level performance through vector arithmetic operations on orthogonal trait directions.

DetailsMotivation: Current methods for personality control in LLMs use static prompting or expensive fine-tuning, which fail to capture the dynamic and compositional nature of human personality traits.

Method: Three-stage framework: 1) Persona-Base extracts orthogonal trait vectors via contrastive activation analysis, 2) Persona-Algebra enables control through vector arithmetic (scalar multiplication, addition, subtraction), 3) Persona-Flow achieves context-aware adaptation by dynamically composing vectors during inference.
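The Persona-Algebra stage reduces to plain vector arithmetic on hidden states. This is an illustrative sketch with random toy activations; the trait names, contrastive-extraction details, and steering coefficients are assumptions, not PERSONA's actual setup.

```python
# Toy sketch of activation-space trait algebra: a trait vector is the mean
# activation difference between contrastive prompt sets, and control is
# scalar multiplication (intensity), addition (composition), and
# subtraction (suppression) applied to a hidden state h.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

def trait_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive direction: mean(positive) - mean(negative), unit norm."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Stand-ins for activations collected on contrastive trait prompts.
v_extraversion = trait_vector(rng.normal(1, 1, (8, d)), rng.normal(-1, 1, (8, d)))
v_agreeable = trait_vector(rng.normal(0, 1, (8, d)), rng.normal(0.5, 1, (8, d)))

h = rng.normal(size=d)                            # hidden state at some layer
h_intense = h + 2.0 * v_extraversion              # scalar multiply: intensity
h_composed = h + v_extraversion + 0.5 * v_agreeable  # addition: compose traits
h_suppressed = h - 1.0 * v_agreeable              # subtraction: suppress a trait
```

In a real steering setup the modified state would replace `h` in the forward pass at the chosen layer; nothing about the model's weights changes, which is why the method needs no gradient updates.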

Result: Achieves mean score of 9.60 on PersonalityBench (nearly matching supervised fine-tuning upper bound of 9.61), and up to 91% win rates on Persona-Evolve benchmark for dynamic personality adaptation across diverse model families.

Conclusion: Personality traits in LLMs are mathematically tractable as extractable, approximately orthogonal directions in representation space, opening new directions for interpretable and efficient behavioral control without gradient updates.

Abstract: Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model’s representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.

[158] Recursive Concept Evolution for Compositional Reasoning in Large Language Models

Sarim Chaudhry

Main category: cs.AI

TL;DR: RCE enables pretrained LLMs to dynamically modify their internal representation geometry during inference by generating low-rank concept subspaces when needed, improving compositional reasoning performance.

DetailsMotivation: Current LLMs struggle with compositional reasoning tasks despite strong performance on other complex reasoning. Existing methods only expand token-level search but leave the model's latent representation space fixed, causing performance collapse when required abstractions aren't encoded.

Method: Recursive Concept Evolution (RCE) framework that: 1) detects representational inadequacy, 2) spawns dynamically generated low-rank concept subspaces, 3) selects subspaces via minimum description length criterion, 4) merges synergistic subspaces, and 5) consolidates via constrained optimization to preserve stability.
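The minimum description length selection step admits a simple reading: a spawned subspace is kept only if the bits it saves on the data exceed the bits needed to encode its parameters. The accounting below (16 bits per parameter, two low-rank factors) is an invented illustration of that trade-off, not RCE's actual criterion.

```python
# Hedged sketch of an MDL acceptance test for a low-rank concept subspace:
# compare the coding gain on the fitted token stream against the cost of
# describing the new rank-r factors.
import math

def mdl_accepts(loss_before, loss_after, n_tokens, d_model, rank,
                bits_per_param=16.0):
    """Accept the spawned subspace iff data bits saved exceed parameter cost."""
    data_gain = n_tokens * (loss_before - loss_after) / math.log(2)  # nats -> bits
    param_cost = 2 * d_model * rank * bits_per_param  # low-rank factors U and V
    return data_gain > param_cost

# A 0.2-nat improvement pays for itself over many tokens, not over few.
print(mdl_accepts(2.80, 2.60, n_tokens=10_000_000, d_model=4096, rank=8))  # True
print(mdl_accepts(2.80, 2.60, n_tokens=100_000, d_model=4096, rank=8))     # False
```

This kind of criterion naturally discourages spawning subspaces for transient fluctuations, which is consistent with the stability-preserving consolidation the method describes.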

Result: RCE integrated with Mistral-7B yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

Conclusion: RCE enables LLMs to construct new abstractions rather than just recombining existing ones, significantly improving compositional reasoning capabilities by allowing dynamic modification of internal representation geometry during inference.

Abstract: Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model’s latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

[159] GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems

Yiqin Yang, Xu Yang, Yuhua Jiang, Ni Mu, Hao Hu, Runpeng Xie, Ziyou Zhang, Siyuan Li, Yuan-Hua Ni, Qianchuan Zhao, Bo Xu

Main category: cs.AI

TL;DR: GlobeDiff: A diffusion-based method for inferring global state from local observations in partially observable multi-agent systems, addressing limitations of belief state estimation and communication approaches.

DetailsMotivation: Partial observability in multi-agent systems hinders effective coordination. Existing belief state methods focus on past experiences without leveraging global information, while communication methods lack robust models to utilize auxiliary information effectively.

Method: Proposes Global State Diffusion Algorithm (GlobeDiff) that formulates state inference as a multi-modal diffusion process to overcome ambiguities in state estimation while inferring global state with high fidelity.

Result: Theoretical proof shows estimation error under both unimodal and multi-modal distributions can be bounded. Extensive experiments demonstrate superior performance and accurate global state inference.

Conclusion: GlobeDiff effectively addresses partial observability challenges in multi-agent systems by providing a robust diffusion-based approach for global state inference from local observations.

Abstract: In the realm of multi-agent systems, the challenge of \emph{partial observability} is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose Global State Diffusion Algorithm~(GlobeDiff) to infer the global state based on the local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.

[160] This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw

Main category: cs.AI

TL;DR: LLMs as synthetic participants in social science experiments require different validation approaches: heuristic methods for exploratory research and statistical calibration for confirmatory research with formal guarantees.

Motivation: The paper addresses the growing use of LLMs as synthetic participants in social science experiments but notes limited guidance on when such simulations yield valid inferences about human behavior, highlighting the need for clear validation strategies.

Method: The paper contrasts two strategies: 1) Heuristic approaches using prompt engineering and model fine-tuning to make LLM behavior interchangeable with humans, and 2) Statistical calibration combining auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses.

Result: Heuristic approaches are useful for exploratory tasks but lack formal statistical guarantees for confirmatory research. Statistical calibration preserves validity under explicit assumptions and provides more precise causal effect estimates at lower cost than human-only experiments.

Conclusion: Both approaches depend on how well LLMs approximate relevant populations, and researchers should consider opportunities beyond simply substituting LLMs for human participants in studies.

Abstract: A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
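
The statistical-calibration strategy the paper advocates can be illustrated with a minimal estimator in the style of prediction-powered inference: a large pool of LLM-simulated responses supplies precision, and a small paired human sample supplies a bias correction. The function name and toy data below are illustrative assumptions, not the paper's method or code.

```python
from statistics import mean

def calibrated_mean(human_paired, llm_paired, llm_only):
    """Estimate a population mean from a large pool of LLM-simulated
    responses, corrected by a small paired human sample.

    human_paired: human responses on the small calibration set
    llm_paired:   LLM responses on the same calibration items
    llm_only:     LLM responses on the large, human-unlabeled pool
    """
    # LLM estimate on the big pool, plus a bias correction measured
    # on the paired subset where both human and LLM answers exist.
    return mean(llm_only) + (mean(human_paired) - mean(llm_paired))

# If the LLM overstates the outcome by a constant 0.1, the paired
# correction removes that bias.
human = [0.6, 0.4, 0.5]
llm_small = [0.7, 0.5, 0.6]          # same items, simulated
llm_big = [0.7, 0.6, 0.8, 0.5, 0.9]  # large simulated-only pool
estimate = calibrated_mean(human, llm_small, llm_big)
```

The validity of such an estimator rests on the explicit assumptions the paper emphasizes, chiefly that the paired calibration items are sampled from the same population as the simulated-only pool.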

[161] Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings

Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee

Main category: cs.AI

TL;DR: Using LLM embeddings (GPT & LLaMA) as semantic encodings for building object classification outperforms traditional one-hot encoding in BIM applications.

Motivation: Conventional encoding methods like one-hot fail to capture nuanced relationships among closely related building object subtypes, limiting AI's semantic comprehension in AECO industry applications.

Method: Proposed using LLM embeddings (OpenAI GPT and Meta LLaMA) as encodings to preserve semantic distinctions. Trained GraphSAGE models to classify 42 building object subtypes across five high-rise residential BIMs, testing various embedding dimensions including original high-dimensional embeddings and compacted embeddings via Matryoshka representation model.

Result: LLM encodings outperformed conventional one-hot baseline, with llama-3 (compacted) embedding achieving weighted average F1-score of 0.8766 vs. 0.8475 for one-hot encoding.

Conclusion: LLM-based encodings enhance AI’s ability to interpret complex, domain-specific building semantics, showing promise for broad application in semantic elaboration tasks throughout AECO industry.

Abstract: Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey the nuanced relationships among closely related subtypes, limiting AI’s semantic comprehension. To address this limitation, this study proposes a novel training approach that employs large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings to preserve finer distinctions in building semantics. We evaluated the proposed method by training GraphSAGE models to classify 42 building object subtypes across five high-rise residential building information models (BIMs). Various embedding dimensions were tested, including original high-dimensional LLM embeddings (1,536, 3,072, or 4,096) and 1,024-dimensional compacted embeddings generated via the Matryoshka representation model. Experimental results demonstrated that LLM encodings outperformed the conventional one-hot baseline, with the llama-3 (compacted) embedding achieving a weighted average F1-score of 0.8766, compared to 0.8475 for one-hot encoding. The results underscore the promise of leveraging LLM-based encodings to enhance AI’s ability to interpret complex, domain-specific building semantics. As the capabilities of LLMs and dimensionality reduction techniques continue to evolve, this approach holds considerable potential for broad application in semantic elaboration tasks throughout the AECO industry.
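
The 1,024-dimensional "compacted" embeddings come from a Matryoshka representation, whose defining property is that a prefix of the vector is itself a usable embedding. A minimal sketch of that truncate-and-renormalize step, assuming the embedding was produced by a Matryoshka-trained model (the function name and placeholder vector are illustrative):

```python
import math

def compact_embedding(vec, dim=1024):
    """Truncate a Matryoshka-style embedding to its first `dim`
    coordinates and L2-renormalize. This only preserves semantics
    if the model was trained so that prefixes are themselves
    valid representations."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# A 4096-d embedding (random placeholder values) compacted to 1024-d.
full = [((i * 37) % 101 - 50) / 50.0 for i in range(4096)]
small = compact_embedding(full, dim=1024)
```

This is why the study can compare 1,536/3,072/4,096-dimensional originals against a uniform 1,024-dimensional variant without retraining the encoder.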

[162] Developing AI Agents with Simulated Data: Why, what, and how?

Xiaoran Liu, Istvan David

Main category: cs.AI

TL;DR: This chapter introduces simulation-based synthetic data generation for AI training, focusing on digital twin-based solutions to address data scarcity and quality issues in subsymbolic AI.

Motivation: The motivation is to address the key impediment of insufficient data volume and quality in modern subsymbolic AI adoption by leveraging simulation techniques for synthetic data generation.

Method: The chapter presents a reference framework for describing, designing, and analyzing digital twin-based AI simulation solutions, covering key concepts, benefits, and challenges of simulation-based synthetic data generation.

Result: The chapter provides a systematic approach to synthetic data generation through simulation, offering a structured framework for implementing digital twin-based solutions for AI training.

Conclusion: Simulation-based synthetic data generation, particularly through digital twin approaches, offers a systematic solution to data scarcity and quality issues in AI training, enabling broader adoption of subsymbolic AI techniques.

Abstract: As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key concepts, benefits, and challenges of simulation-based synthetic data generation for AI training purposes, and to a reference framework to describe, design, and analyze digital twin-based AI simulation solutions.

[163] From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV

Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida, Zhu Han

Main category: cs.AI

TL;DR: LLM-assisted In-Context Learning framework for public safety UAVs to improve path planning, velocity control, and data collection scheduling with reduced packet loss and jailbreaking vulnerabilities.

Motivation: Public safety UAVs need better decision-making for emergency response, but traditional DRL approaches have limitations like high training complexity and simulation-to-reality gaps. LLMs offer strong reasoning and generalization abilities that can adapt to new tasks through natural language prompts without retraining.

Method: Proposes integrating LLM-assisted In-Context Learning (ICL) with public safety UAVs, deploying LLMs at network edge for low latency and privacy. Uses ICL for task adaptation via natural language prompts and example-based guidance without model retraining.

Result: Case study on data collection scheduling shows LLM-assisted ICL framework significantly reduces packet loss compared to conventional approaches while mitigating potential jailbreaking vulnerabilities.

Conclusion: LLM-assisted ICL enables adaptive, context-aware decision-making for public safety UAVs, offering lightweight and efficient solution to enhance UAV autonomy and responsiveness in emergencies.

Abstract: A public safety Uncrewed Aerial Vehicle (UAV) enhances situational awareness during emergency response. Its agility, mobility optimization, and ability to establish Line-of-Sight (LoS) communication make it increasingly important for managing emergencies such as disaster response, search and rescue, and wildfire monitoring. Although Deep Reinforcement Learning (DRL) has been used to optimize UAV navigation and control, its high training complexity, low sample efficiency, and the simulation-to-reality gap limit its practicality in public safety applications. Recent advances in Large Language Models (LLMs) present a promising alternative. With strong reasoning and generalization abilities, LLMs can adapt to new tasks through In-Context Learning (ICL), enabling task adaptation via natural language prompts and example-based guidance without retraining. Deploying LLMs at the network edge, rather than in the cloud, further reduces latency and preserves data privacy, making them suitable for real-time, mission-critical public safety UAVs. This paper proposes integrating LLM-assisted ICL with public safety UAVs to address key functions such as path planning and velocity control in emergency response. We present a case study on data collection scheduling, demonstrating that the LLM-assisted ICL framework can significantly reduce packet loss compared to conventional approaches while also mitigating potential jailbreaking vulnerabilities. Finally, we discuss LLM optimizers and outline future research directions. The ICL framework enables adaptive, context-aware decision-making for public safety UAVs, offering a lightweight and efficient solution to enhance UAV autonomy and responsiveness in emergencies.

[164] Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai

Main category: cs.AI

TL;DR: Prover Agent is an AI system that combines LLMs with Lean proof assistant for automated theorem proving, achieving state-of-the-art results on MiniF2F and PutnamBench benchmarks.

Motivation: To create an effective automated theorem proving system that bridges informal reasoning from LLMs with formal verification from proof assistants, addressing the challenge of discovering viable proof strategies through auxiliary lemma generation.

Method: Integrates an informal reasoning LLM with formal prover model and Lean proof assistant feedback, generating auxiliary lemmas that include not just subgoals but also special cases and useful facts from assumptions to discover proof strategies.

Result: Achieves 88.1% success rate on MiniF2F and solves 25 problems on PutnamBench with smaller sample budget than previous approaches, establishing new state-of-the-art for small language model methods.

Conclusion: Prover Agent demonstrates effective integration of LLMs with formal proof assistants for automated theorem proving, with auxiliary lemma generation playing a crucial role in solving challenging problems.

Abstract: We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on MiniF2F and solves 25 problems on the PutnamBench with a smaller sample budget than previous approaches, establishing a new state-of-the-art on both benchmarks among methods using small language models (SLMs). We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at https://github.com/kAIto47802/Prover-Agent.

[165] OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap

Main category: cs.AI

TL;DR: OpenAgentSafety is a comprehensive framework for evaluating AI agent safety across 8 risk categories using real tools and 350+ multi-turn tasks, revealing significant safety vulnerabilities in current LLMs.

Motivation: Current AI agent safety benchmarks are inadequate due to reliance on simulated environments, narrow task domains, or unrealistic tool abstractions, creating a need for more comprehensive evaluation frameworks that test agents interacting with real tools in realistic scenarios.

Method: Developed a modular framework that evaluates agents interacting with real tools (web browsers, code execution, file systems, bash shells, messaging platforms) across 350+ multi-turn, multi-user tasks spanning both benign and adversarial user intents. Uses rule-based analysis combined with LLM-as-judge assessments to detect overt and subtle unsafe behaviors across 8 risk categories.

Result: Evaluation of five prominent LLMs revealed unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, increasing to 72.7% with o3-mini, highlighting critical safety vulnerabilities in current agent systems.

Conclusion: OpenAgentSafety provides a comprehensive framework for evaluating AI agent safety, revealing significant vulnerabilities in current systems and demonstrating the need for stronger safeguards before real-world deployment. The framework is designed for extensibility to support ongoing safety research.

Abstract: Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their potential for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, rising to 72.7% with o3-mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real-world deployment.

[166] Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury

Main category: cs.AI

TL;DR: GSW is a neuro-inspired generative memory framework that builds structured, interpretable representations of evolving situations to enable LLMs to reason over long-context episodic narratives, outperforming RAG baselines by up to 20% on episodic memory benchmarks.

Motivation: Current LLMs struggle with long-context reasoning, especially for tracking entities through episodic events. Existing retrieval solutions (semantic embeddings, knowledge graphs) are tailored for fact-based retrieval but fail to build space-time-anchored narrative representations needed for episodic memory.

Method: Proposes Generative Semantic Workspace (GSW) with two components: an Operator that maps observations to intermediate semantic structures, and a Reconciler that integrates these into a persistent workspace enforcing temporal, spatial, and logical coherence. This creates structured, interpretable representations of evolving situations.

Result: On Episodic Memory Benchmark (EpBench) with corpora from 100k to 1M tokens, GSW outperforms existing RAG baselines by up to 20%. It’s highly efficient, reducing query-time context tokens by 51% compared to the next most token-efficient baseline, significantly reducing inference time costs.

Conclusion: GSW provides a blueprint for endowing LLMs with human-like episodic memory, enabling reasoning over long horizons and evolving situations. It addresses fundamental limitations in current LLM memory frameworks for narrative understanding.

Abstract: Large Language Models (LLMs) face fundamental challenges in long-context reasoning: many documents exceed their finite context windows, while performance on texts that do fit degrades with sequence length, necessitating their augmentation with external memory frameworks. Current solutions, which have evolved from retrieval using semantic embeddings to more sophisticated structured knowledge graph representations for improved sense-making and associativity, are tailored for fact-based retrieval and fail to build the space-time-anchored narrative representations required for tracking entities through episodic events. To bridge this gap, we propose the Generative Semantic Workspace (GSW), a neuro-inspired generative memory framework that builds structured, interpretable representations of evolving situations, enabling LLMs to reason over evolving roles, actions, and spatiotemporal contexts. Our framework comprises an Operator, which maps incoming observations to intermediate semantic structures, and a Reconciler, which integrates these into a persistent workspace that enforces temporal, spatial, and logical coherence. On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025], comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%. Furthermore, GSW is highly efficient, reducing query-time context tokens by 51% compared to the next most token-efficient baseline, reducing inference time costs considerably. More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.

[167] FRSICL: LLM-Enabled In-Context Learning Flight Resource Allocation for Fresh Data Collection in UAV-Assisted Wildfire Monitoring

Yousef Emami, Hao Zhou, Miguel Gutierrez Gaitan, Kai Li, Luis Almeida

Main category: cs.AI

TL;DR: LLM-based flight resource allocation for UAV wildfire monitoring that optimizes data collection schedules and UAV velocities in real-time to minimize Age of Information, using in-context learning instead of traditional DRL approaches.

Motivation: UAVs are crucial for wildfire monitoring but require efficient data collection to minimize information latency. Traditional DRL approaches have limitations including low sampling efficiency, simulation-to-reality gaps, and complex training, making them unsuitable for time-critical wildfire monitoring applications.

Method: Proposes FRSICL (Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning) that uses LLMs with in-context learning to jointly optimize data collection schedules and UAV velocities. The approach generates decisions using natural language task descriptions and environmental feedback without extensive retraining.

Result: Simulation results show FRSICL outperforms state-of-the-art baselines including Proximal Policy Optimization, Block Coordinate Descent, and Nearest Neighbor approaches in minimizing average Age of Information across ground sensors.

Conclusion: LLM-based in-context learning provides an effective alternative to DRL for real-time UAV resource allocation in time-critical applications like wildfire monitoring, offering better generalization and adaptability without extensive retraining requirements.

Abstract: Uncrewed Aerial Vehicles (UAVs) play a vital role in public safety, especially in monitoring wildfires, where early detection reduces environmental impact. In UAV-Assisted Wildfire Monitoring (UAWM) systems, jointly optimizing the data collection schedule and UAV velocity is essential to minimize the average Age of Information (AoI) for sensory data. Deep Reinforcement Learning (DRL) has been used for this optimization, but its limitations, including low sampling efficiency, discrepancies between simulation and real-world conditions, and complex training, make it unsuitable for time-critical applications such as wildfire monitoring. Recent advances in Large Language Models (LLMs) provide a promising alternative. With strong reasoning and generalization capabilities, LLMs can adapt to new tasks through In-Context Learning (ICL), which enables task adaptation using natural language prompts and example-based guidance without retraining. This paper proposes a novel online Flight Resource Allocation scheme based on LLM-Enabled In-Context Learning (FRSICL) to jointly optimize the data collection schedule and UAV velocity along the trajectory in real time, thereby asymptotically minimizing the average AoI across all ground sensors. Unlike DRL, FRSICL generates data collection schedules and velocities using natural language task descriptions and feedback from the environment, enabling dynamic decision-making without extensive retraining. Simulation results confirm the effectiveness of FRSICL compared to state-of-the-art baselines, namely Proximal Policy Optimization, Block Coordinate Descent, and Nearest Neighbor.
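
The objective FRSICL minimizes, the average Age of Information across ground sensors, is simple to state concretely: each sensor's AoI is the elapsed time since the generation of the freshest update the UAV has collected from it. A short illustrative sketch (not the paper's code):

```python
def average_aoi(now, last_update_gen_times):
    """Average Age of Information across sensors: for each sensor,
    AoI = current time minus the generation time of the freshest
    update collected from it."""
    ages = [now - t for t in last_update_gen_times]
    return sum(ages) / len(ages)

# Three sensors whose freshest collected samples were generated at
# t = 2, 5 and 9; at t = 10 the average AoI is (8 + 5 + 1) / 3.
aoi = average_aoi(10.0, [2.0, 5.0, 9.0])
```

The scheduling problem is hard because visiting one sensor to reset its age lets every other sensor's age grow, which is what couples the collection schedule to the UAV's velocity along the trajectory.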

[168] Chain of Summaries: Summarization Through Iterative Questioning

William Brach, Kristián Košťál, Lukas Galke Poech

Main category: cs.AI

TL;DR: CoS (Chain of Summaries) is a Hegelian dialectical method for generating information-dense web content summaries that are LLM-friendly, improving Q&A performance while reducing token usage.

Motivation: Web content is often in LLM-unfriendly formats and exceeds context length limits, making it difficult for LLMs to effectively use external web information. There's a need for general-purpose summaries that make web content more accessible to LLMs while anticipating future information needs.

Method: CoS uses Hegel’s dialectical method: starting with an initial summary (thesis), identifying limitations through questioning (antithesis), and iteratively refining to create a general-purpose summary (synthesis). This process generates information-dense, plain-text summaries optimized for LLM consumption.

Result: CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods (Chain of Density, BRIO, PEGASUS) by up to 27% on TriviaQA, TruthfulQA, and SQUAD datasets. CoS summaries yield higher Q&A performance than source content while using substantially fewer tokens and being LLM-agnostic.

Conclusion: CoS provides an effective method for making web content more accessible to LLMs through information-dense summaries, offering website maintainers a way to improve LLM compatibility while retaining human oversight capabilities.

Abstract: Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel’s dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQUAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as Chain of Density, BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher Q&A performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus represents an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.
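
The thesis-antithesis-synthesis loop reduces to a generic refinement procedure. In the sketch below the `summarize`, `critique`, and `refine` callables stand in for LLM prompts; they are assumptions for illustration, not the paper's API, and the toy stubs merely show the control flow terminating once the critique raises no further questions.

```python
def chain_of_summaries(text, summarize, critique, refine, max_rounds=3):
    """Iteratively refine a summary: draft one (thesis), probe it
    with questions it cannot yet answer (antithesis), and rewrite
    it to cover them (synthesis). In practice the three callables
    wrap LLM prompts."""
    summary = summarize(text)
    for _ in range(max_rounds):
        open_questions = critique(text, summary)
        if not open_questions:  # summary answers everything asked of it
            break
        summary = refine(text, summary, open_questions)
    return summary

# Toy stand-ins: the critique keeps asking for a missing fact until
# the summary contains it.
doc = "The bridge opened in 1932 and spans 503 metres."
summ = chain_of_summaries(
    doc,
    summarize=lambda t: "A bridge opened in 1932.",
    critique=lambda t, s: [] if "503" in s else ["How long is it?"],
    refine=lambda t, s, qs: s + " It spans 503 metres.",
)
```

The stopping rule is what distinguishes this from fixed-round methods such as Chain of Density: refinement continues only while the questioning step still exposes gaps.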

[169] MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models

Zhanliang Wang, Kai Wang

Main category: cs.AI

TL;DR: MultiSHAP is a model-agnostic interpretability framework that uses Shapley Interaction Index to explain cross-modal interactions in multimodal AI models by quantifying synergistic effects between visual and textual elements.

Motivation: Multimodal AI models lack interpretability, which is crucial for high-stakes applications. Existing explanation methods provide coarse insights but cannot precisely quantify synergistic effects between modalities and are limited to open-source models.

Method: Leverages Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (image patches and text tokens). Works with both open- and closed-source models.

Result: Experiments on public multimodal benchmarks confirm MultiSHAP faithfully captures cross-modal reasoning mechanisms. Provides instance-level explanations (synergistic/suppressive effects) and dataset-level explanations (generalizable interaction patterns).

Conclusion: MultiSHAP offers a general, extensible solution for interpreting complex multimodal AI models, addressing the interpretability gap in multimodal systems.

Abstract: Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their “black-box” nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. How to explain cross-modal interactions in multimodal AI models remains a major challenge. While existing model explanation methods, such as attention map and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - “why the model makes a specific prediction on this input”, and (2) dataset-level explanation that uncovers generalizable interaction patterns across samples - “how the model integrates information across modalities”. Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
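
The Shapley Interaction Index underlying MultiSHAP has a closed form for small player sets: for a pair (i, j) it is a weighted average of the pairwise discrete derivative v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S) over coalitions S containing neither player. The brute-force sketch below is illustrative; for image patches and text tokens, MultiSHAP must rely on sampling-based estimates since the exact sum is exponential in the number of elements.

```python
from itertools import combinations
from math import factorial

def shapley_interaction(v, players, i, j):
    """Exact Shapley Interaction Index for players i and j:
    a weighted average of v(S+ij) - v(S+i) - v(S+j) + v(S)
    over all coalitions S that contain neither i nor j."""
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Standard SII weight: |S|! (n - |S| - 2)! / (n - 1)!
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for S in combinations(others, size):
            S = frozenset(S)
            delta = v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
            total += weight * delta
    return total

# Pure-synergy game: the coalition has value 1 only when both 0 and 1
# are present, so their interaction index is exactly 1 (synergistic);
# a negative value would indicate a suppressive interaction.
v = lambda S: 1.0 if {0, 1} <= set(S) else 0.0
sii = shapley_interaction(v, [0, 1, 2, 3], 0, 1)
```

In MultiSHAP's setting, v(S) is the model's prediction when only the patches and tokens in S are kept visible, which is why the method needs only query access and works for closed-source models.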

[170] SIGMUS: Semantic Integration for Knowledge Graphs in Multimodal Urban Spaces

Brian Wang, Mani Srivastava

Main category: cs.AI

TL;DR: SIGMUS is a system that uses LLMs to create knowledge graphs connecting multimodal urban sensor data (text, images, air quality, weather, traffic) to identify and reason about urban incidents without human-encoded rules.

Motivation: Urban spaces generate abundant multimodal sensor data that could help identify and analyze incidents (emergencies, events, disasters), but this data is fragmented and difficult to integrate due to reliance on human-driven reasoning for identifying relationships between multimodal data and incidents.

Method: SIGMUS uses Large Language Models to generate world knowledge for identifying relationships between urban incidents and multimodal data sources (news articles, CCTV images, air quality, weather, traffic measurements), organizing this knowledge into structured knowledge graphs without human-encoded rules.

Result: The system successfully produces reasonable connections between 5 different data sources and relevant incidents occurring at the same time and location, demonstrating effective multimodal data integration.

Conclusion: SIGMUS provides an automated approach to integrate multimodal urban data for incident analysis using LLMs and knowledge graphs, reducing reliance on human-encoded rules while maintaining reasonable connection quality.

Abstract: Modern urban spaces are equipped with an increasingly diverse set of sensors, all producing an abundance of multimodal data. Such multimodal data can be used to identify and reason about important incidents occurring in urban landscapes, such as major emergencies, cultural and social events, as well as natural disasters. However, such data may be fragmented over several sources and difficult to integrate due to the reliance on human-driven reasoning for identifying relationships between the multimodal data corresponding to an incident, as well as understanding the different components which define an incident. Such relationships and components are critical to identifying the causes of such incidents, as well as forecasting the scale and intensity of future incidents as they begin to develop. In this work, we create SIGMUS, a system for Semantic Integration for Knowledge Graphs in Multimodal Urban Spaces. SIGMUS uses Large Language Models (LLMs) to produce the necessary world knowledge for identifying relationships between incidents occurring in urban spaces and data from different modalities, allowing us to organize evidence and observations relevant to an incident without relying on human-encoded rules for relating multimodal sensory data with incidents. This organized knowledge is represented as a knowledge graph, organizing incidents, observations, and much more. We find that our system is able to produce reasonable connections between 5 different data sources (news article text, CCTV images, air quality, weather, and traffic measurements) and relevant incidents occurring at the same time and location.

[171] Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel

Main category: cs.AI

TL;DR: LLM agents trained with dynamic planning framework that decides when to plan, improving efficiency and performance on long-horizon tasks through supervised fine-tuning and RL.

Motivation: Existing methods like ReAct require LLMs to always plan before every action, which is computationally expensive and degrades performance on long-horizon tasks, while never planning limits performance. This motivates letting the agent decide flexibly when to plan.

Method: Two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, (2) reinforcement learning to refine this capability in long-horizon environments. Framework enables agents to decide when to allocate test-time compute for planning.

Result: Dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives in the Crafter environment. Agents can be effectively steered by human-written plans, surpassing independent capabilities.

Conclusion: Dynamic planning framework improves LLM agent performance on long-horizon tasks by enabling flexible planning decisions, with potential for safer and more collaborative agentic systems through human plan steering.

Abstract: Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities and highlighting the potential for safer and more collaborative agentic systems.

[172] Learning-Based Planning for Improving Science Return of Earth Observation Satellites

Abigail Breitfeld, Alberto Candela, Juan Delfa, Akseli Kangaslahti, Itai Zilberstein, Steve Chien, David Wettergreen

Main category: cs.AI

TL;DR: Learning-based dynamic targeting for Earth observation satellites using reinforcement and imitation learning to optimize instrument pointing and data collection.

DetailsMotivation: Earth observation satellites have limitations: fixed orbits, limited sensor field of view, and resource-intensive pointing operations. Dynamic targeting can optimize data collection by intelligently reconfiguring instruments using lookahead data, but existing heuristic methods are suboptimal.

Method: Two learning-based approaches: 1) Reinforcement learning to learn optimal targeting policies, and 2) Imitation learning that mimics a dynamic programming solution. Both methods build on dynamic programming to plan sampling location sequences.

Result: Learning methods outperform existing heuristic approaches: imitation learning performs 10.0% better than the best heuristic, reinforcement learning performs 13.7% better. Both methods can be trained effectively with small amounts of data.

Conclusion: Learning-based approaches significantly improve dynamic targeting performance for Earth observation satellites, enabling more efficient and informative data collection with limited training data requirements.

Abstract: Earth observing satellites are powerful tools for collecting scientific information about our planet; however, they have limitations: they cannot easily deviate from their orbital trajectories, their sensors have a limited field of view, and pointing and operating these sensors can take a large amount of the spacecraft’s resources. It is important for these satellites to optimize the data they collect and include only the most important or informative measurements. Dynamic targeting is an emerging concept in which satellite resources and data from a lookahead instrument are used to intelligently reconfigure and point a primary instrument. Simulation studies have shown that dynamic targeting increases the amount of scientific information gathered versus conventional sampling strategies. In this work, we present two different learning-based approaches to dynamic targeting, using reinforcement and imitation learning, respectively. These learning methods build on a dynamic programming solution to plan a sequence of sampling locations. We evaluate our approaches against existing heuristic methods for dynamic targeting, showing the benefits of using learning for this application. Imitation learning performs on average 10.0% better than the best heuristic method, while reinforcement learning performs on average 13.7% better. We also show that both learning methods can be trained effectively with small amounts of data.
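The dynamic-programming backbone that both learning methods build on can be illustrated with a toy version: pick one pointing-angle bin per along-track step to maximize total science value under a slew-rate limit. The formulation below is a simplified sketch, not the paper's exact model:

```python
# Toy dynamic program: value[t][a] is the science value of pointing at angle
# bin a at along-track step t; the instrument may slew at most max_slew bins
# per step. Returns the value-maximizing angle sequence.
def plan_pointing(value, max_slew=1):
    T, A = len(value), len(value[0])
    best = [row[:] for row in value]          # best[t][a]: max total value ending at (t, a)
    back = [[0] * A for _ in range(T)]        # back-pointers for path recovery
    for t in range(1, T):
        for a in range(A):
            lo, hi = max(0, a - max_slew), min(A, a + max_slew + 1)
            prev = max(range(lo, hi), key=lambda p: best[t - 1][p])
            best[t][a] = value[t][a] + best[t - 1][prev]
            back[t][a] = prev
    # Backtrack from the best final angle to recover the full sequence.
    a = max(range(A), key=lambda x: best[T - 1][x])
    path = [a]
    for t in range(T - 1, 0, -1):
        a = back[t][a]
        path.append(a)
    return path[::-1]

value = [[0, 5, 0],
         [1, 0, 4],
         [0, 3, 0]]
print(plan_pointing(value))  # → [1, 2, 1]
```

The learning-based approaches can then either imitate this planner's outputs or learn a policy that outperforms rolling it out on noisy lookahead data.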

[173] ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen

Main category: cs.AI

TL;DR: ErrorMap is a method to analyze why LLMs fail rather than just detecting failures, creating failure signatures and ErrorAtlas taxonomy across 35 datasets and 83 models.

DetailsMotivation: Current LLM benchmarks only show when models fail but not why they fail, making it difficult to understand root causes of errors and guide meaningful model improvement.

Method: ErrorMap extracts failure signatures by analyzing error sources across models and datasets, categorizing failures into types like formatting issues, calculation errors, dataset noise, omissions, and question misinterpretation.

Result: Applied to 35 datasets and 83 models, generated ErrorAtlas taxonomy revealing recurring failure patterns and highlighting underexplored error types like omissions and question misinterpretation.

Conclusion: ErrorMap enables deeper evaluation by shifting focus from success metrics to understanding failure causes, offering richer insights into model behavior and limitations across tasks.

Abstract: Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model’s unique “failure signature”, clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models, we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation - one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.

[174] A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

Main category: cs.AI

TL;DR: The paper introduces InfoQA, a multi-call framework for Multi-Hop Question Answering that addresses LLM capacity limitations by combining capacity-aware task decomposition with pruning of reasoning traces to keep information load within single-pass limits.

DetailsMotivation: LLMs have finite per-pass output capacity, making single-pass reasoning vulnerable to capacity overflow when integrating dispersed, interdependent evidence in Multi-Hop Question Answering tasks with noise.

Method: Proposes InfoQA framework with capacity-aware task decomposition, active pruning of prior reasoning traces to keep information load within single-pass limits, and dependency-explicit workflow for precise control over reasoning paths.

Result: Experimental results show model behavior aligns with predicted capacity curves, and InfoQA achieves consistent performance improvements on a stringent noise-rich benchmark.

Conclusion: The work establishes theoretical performance bounds for single-pass LLMs and provides a practical multi-call framework that inspires more LLM multi-step reasoning methods.

Abstract: Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: https://github.com/KaiyangWan/InfoQA
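For context, the classical Fano inequality, which the paper's bound resembles in spirit (the sketch below is the standard information-theoretic form, not necessarily the paper's exact statement), already yields an accuracy ceiling of this shape. For a uniformly distributed answer $X$ over $|\mathcal{X}|$ candidates and model output $Y$ with error probability $P_e$:

```latex
% Classical Fano inequality (standard form; the paper's bound is a
% Fano-style analogue for single-pass LLMs, not necessarily identical):
H(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr) \;\ge\; H(X \mid Y)
% Using H(P_e) <= log 2 and H(X | Y) = log|X| - I(X;Y) for uniform X,
% this rearranges into an accuracy ceiling:
1 - P_e \;\le\; \frac{I(X;Y) + \log 2}{\log |\mathcal{X}|}
```

Once the information the model can carry through one pass, $I(X;Y)$, stops growing with task complexity while $\log |\mathcal{X}|$ keeps growing, accuracy must collapse, which matches the capacity-overflow picture above.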

[175] “Sorry, I Didn’t Catch That”: How Speech Models Miss What Matters Most

Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

Main category: cs.AI

TL;DR: Speech recognition systems fail on short, high-stakes street name transcriptions with 44% error rate, disproportionately affecting non-English speakers; synthetic data generation improves accuracy by 60%.

DetailsMotivation: Despite low word error rates on standard benchmarks, speech recognition systems fail on short, high-stakes utterances in real-world deployments, particularly for U.S. street name transcription where errors have serious consequences.

Method: Evaluated 15 commercial speech recognition models on linguistically diverse U.S. speakers’ street name recordings, analyzed downstream geographic impact, and introduced synthetic data generation using open-source text-to-speech models to produce diverse pronunciations for fine-tuning.

Result: Found average transcription error rate of 44%, with routing distance errors twice as large for non-English primary speakers; fine-tuning with <1,000 synthetic samples improved street name transcription accuracy by nearly 60% for non-English speakers.

Conclusion: There’s a critical gap between benchmark performance and real-world reliability in speech systems, but synthetic data generation offers a scalable path to reducing high-stakes transcription errors and addressing performance disparities.

Abstract: Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.

[176] Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Shojiro Yamabe, Jun Sakuma

Main category: cs.AI

TL;DR: DLMs have a vulnerability where injecting affirmative tokens at intermediate denoising steps can bypass safety guardrails, enabling jailbreak attacks; a DLM-specific safety alignment method is proposed to mitigate this.

DetailsMotivation: Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which reduces latency and enables bidirectional conditioning. However, the safety risks from jailbreak attacks exploiting this inference mechanism are not well understood, revealing a critical vulnerability in DLMs.

Method: The paper investigates DLM vulnerabilities by showing that if affirmative tokens for harmful queries appear at intermediate steps, subsequent denoising can be steered toward harmful responses. It demonstrates that simply injecting such tokens can bypass safety guardrails. The authors propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states containing affirmative tokens.

Result: Experiments show the proposed method significantly mitigates the vulnerability with minimal impact on task performance. The method also improves robustness against conventional jailbreak attacks. The vulnerability allows existing optimization-based jailbreak attacks to succeed on DLMs.

Conclusion: DLMs have a critical vulnerability stemming from their iterative denoising process that enables jailbreak attacks. The proposed DLM-specific safety alignment method effectively mitigates this vulnerability while maintaining task performance. The work underscores the need for DLM-specific safety research beyond traditional autoregressive models.

Abstract: Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditioning. However, the safety risks posed by jailbreak attacks that exploit this inference mechanism are not well understood. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process and propose a countermeasure. Specifically, our investigation shows that if an affirmative token for a harmful query appears at an intermediate step, subsequent denoising can be steered toward a harmful response even in aligned models. As a result, simply injecting such affirmative tokens can readily bypass the safety guardrails. Furthermore, we demonstrate that the vulnerability allows existing optimization-based jailbreak attacks to succeed on DLMs. Building on this analysis, we propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states that contain affirmative tokens. Our experiments indicate that the proposed method significantly mitigates the vulnerability with minimal impact on task performance. Furthermore, our method improves robustness against conventional jailbreak attacks. Our work underscores the need for DLM-specific safety research. Our code is available at https://github.com/mdl-lab/dlm-priming-vulnerability.

[177] Generalized Parallel Scaling with Interdependent Generations

Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman

Main category: cs.AI

TL;DR: Bridge enables parallel LLM inference with interdependent responses by treating batched hidden states holistically, improving response quality and consistency through information sharing between parallel generations.

DetailsMotivation: Current parallel LLM inference generates responses independently, wasting computational resources and failing to leverage potentially useful information across parallel responses, unlike sequential generation where past computation informs future steps.

Method: Bridge reinterprets batched LLM hidden states as holistic tensors rather than independent slices, introducing a small number of new parameters (2.8%-5.1%) to enable information sharing between parallel responses during generation.

Result: Bridge improves relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39%, boosts consistency of correct responses, and scales to any generation width while outperforming independent generations.

Conclusion: Bridge enables a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique, offering improved response quality and consistency.

Abstract: Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
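The reinterpretation of batched hidden states as one holistic tensor can be sketched as a residual cross-response mixing step; the dense mixing matrix below is a hypothetical stand-in, not Bridge's actual (and much smaller, 2.8%-5.1% of parameters) architecture:

```python
import numpy as np

# Toy sketch: N parallel responses are normally processed as independent
# batch slices; a small learned mixing across the batch axis lets each
# response's hidden state depend on all N generations.
rng = np.random.default_rng(0)
N, T, D = 4, 6, 8                      # responses, sequence length, hidden dim
hidden = rng.normal(size=(N, T, D))    # batched hidden states, one slice per response

W_mix = rng.normal(size=(N, N)) * 0.1  # hypothetical cross-response parameters
# Residual update: mix information across the response axis only.
mixed = hidden + np.einsum("mn,ntd->mtd", W_mix, hidden)

assert mixed.shape == (N, T, D)
# Each response's state now carries information from the other N-1 generations.
```

Because the mixing acts on the batch axis, the same parameters apply at any generation width N at inference time (here W_mix is fixed-size only for illustration).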

[178] SR-Scientist: Scientific Equation Discovery With Agentic AI

Shijie Xia, Yuhan Sun, Pengfei Liu

Main category: cs.AI

TL;DR: SR-Scientist: A framework that transforms LLMs from simple equation proposers into autonomous AI scientists that write code, analyze data, implement equations, evaluate them, and optimize based on feedback using a code interpreter toolset.

DetailsMotivation: Current LLM-based scientific discovery methods limit LLMs to just proposing equations within search algorithms, failing to leverage their full potential as autonomous agents capable of the entire scientific discovery process.

Method: Wrap code interpreter into tools for data analysis and equation evaluation, create an agent that uses these tools to autonomously analyze data, implement equations as code, submit for evaluation, and optimize equations through iterative feedback with minimal human-defined pipelines.

Result: Outperforms baseline methods by 6-35% across four science disciplines, demonstrates robustness to noise, generalization to out-of-domain data, symbolic accuracy, and enhanced capabilities through end-to-end reinforcement learning.

Conclusion: SR-Scientist successfully elevates LLMs to autonomous AI scientists capable of end-to-end scientific discovery, showing significant performance improvements and opening new possibilities for AI-driven scientific research.

Abstract: Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6% to 35% on datasets covering four science disciplines. Additionally, we demonstrate our method’s robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent’s capabilities.
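The propose-implement-evaluate-optimize loop can be sketched with a stubbed evaluation tool; the candidate equations and the interface below are illustrative, not SR-Scientist's actual toolset:

```python
# Toy sketch of agentic equation discovery: candidate equations (stand-ins
# for code an LLM agent would write) are submitted to an evaluation tool,
# and the agent keeps the best-scoring one.
def evaluate(candidate, xs, ys):
    """Evaluation tool: mean squared error of a proposed equation on data."""
    return sum((candidate(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 + 1 for x in xs]          # hidden ground-truth law

candidates = {
    "linear": lambda x: 2 * x,
    "quadratic": lambda x: x ** 2 + 1,
    "cubic": lambda x: x ** 3,
}
scores = {name: evaluate(f, xs, ys) for name, f in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # → quadratic
```

In the actual framework this loop runs over a long horizon, with the agent also calling data-analysis tools to decide which functional forms to try next.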

[179] Comparative Expressivity for Structured Argumentation Frameworks with Uncertain Rules and Premises

Carlo Proietti, Antonio Yuste-Ginel

Main category: cs.AI

TL;DR: This paper studies qualitative uncertainty modeling in formal argumentation, comparing abstract and structured approaches with expressivity analysis.

DetailsMotivation: Most existing works focus on abstract models for arguing with uncertainty, but there's a need to study plausible instantiations of these abstract models and ground uncertainty in argument components.

Method: Introduces a notion of expressivity that handles both abstract and structured formalisms, and presents expressivity results comparing abstract models (incomplete argumentation frameworks with dependencies) and structured models (ASPIC+).

Result: Presents both negative and positive expressivity results comparing the expressivity of abstract and structured models of argumentation with uncertainty.

Conclusion: The study advances understanding of uncertainty modeling in argumentation by providing expressivity analysis that bridges abstract and structured approaches.

Abstract: Modelling qualitative uncertainty in formal argumentation is essential both for practical applications and theoretical understanding. Yet, most of the existing works focus on abstract models for arguing with uncertainty. Following a recent trend in the literature, we tackle the open question of studying plausible instantiations of these abstract models. To do so, we ground the uncertainty of arguments in their components, structured within rules and premises. Our main technical contributions are: i) the introduction of a notion of expressivity that can handle abstract and structured formalisms, and ii) the presentation of both negative and positive expressivity results, comparing the expressivity of abstract and structured models of argumentation with uncertainty. These results concern incomplete abstract argumentation frameworks, and their extension with dependencies, on the abstract side, and ASPIC+, on the structured side.

[180] Advanced Assistance for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction

Gerui Xu, Boyou Chen, Huizhong Guo, Dave LeBlanc, Arpan Kusari, Efe Yarbasi, Ananna Ahmed, Zhaonan Sun, Shan Bao

Main category: cs.AI

TL;DR: Multi-agent AI framework reconstructs pre-crash scenarios from fragmented collision data using multimodal inputs and EDR signals, achieving high accuracy without domain-specific training.

DetailsMotivation: Traditional traffic collision reconstruction relies on human expertise and is challenging for pre-crash scenarios. There's a need for AI systems that can process fragmented multimodal collision data to reconstruct pre-crash events and infer vehicle behaviors.

Method: Two-phase collaborative framework: Phase I generates natural-language crash reconstructions from multimodal inputs (narrative reports, structured variables, scene diagrams). Phase II combines reconstructions with Event Data Recorder signals to identify striking/struck vehicles and isolate relevant EDR records for pre-crash behavior inference.

Result: Achieved 100% accuracy across 4,155 trials on 277 rear-end LVD crashes. On 39 complex cases with ambiguous data, framework maintained perfect accuracy while human analysts achieved 92.31%. Ablation tests showed structured reasoning anchors improved accuracy from 96.5% to 99.7%.

Conclusion: The zero-shot framework demonstrates scalable AI-assisted pre-crash analysis without domain-specific training, remaining robust under incomplete inputs and showing performance driven by structured prompts rather than model choice.

Abstract: Traffic collision reconstruction traditionally relies on human expertise and can be accurate, but pre-crash reconstruction is more challenging. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We propose a two-phase collaborative framework with reconstruction and reasoning stages. The system processes 277 rear-end lead vehicle deceleration (LVD) crashes from the Crash Investigation Sampling System (CISS, 2017 to 2022), integrating narrative reports, structured tabular variables, and scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II combines these reconstructions with Event Data Recorder (EDR) signals to (1) identify striking and struck vehicles and (2) isolate the EDR records most relevant to the collision moment, enabling inference of key pre-crash behaviors. For validation, we evaluated all LVD cases and emphasized 39 complex crashes where multiple EDR records per crash created ambiguity due to missing or conflicting data. Ground truth was set by consensus of two independent manual annotators, with a separate language model used only to flag potential conflicts for re-checking. The framework achieved 100% accuracy across 4,155 trials; three reasoning models produced identical outputs, indicating that performance is driven by the structured prompts rather than model choice. Research analysts without reconstruction training achieved 92.31% accuracy on the same 39 complex cases. Ablation tests showed that removing structured reasoning anchors reduced case-level accuracy from 99.7% to 96.5% and increased errors across multiple output dimensions. The system remained robust under incomplete inputs. This zero-shot evaluation, without domain-specific training or fine-tuning, suggests a scalable approach for AI-assisted pre-crash analysis.

[181] Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents

Mustafa Arslan

Main category: cs.AI

TL;DR: Aeon is a neuro-symbolic cognitive operating system that structures LLM memory into hierarchical spatial and temporal components to overcome computational bottlenecks and memory limitations in long-context reasoning.

DetailsMotivation: LLMs face quadratic computational costs from self-attention and "Lost in the Middle" phenomenon where reasoning degrades with long contexts. Existing vector database approaches treat memory as unstructured embeddings, failing to capture hierarchical and temporal structure of long-horizon interactions.

Method: Aeon structures memory into a Memory Palace (spatial index via SIMD-accelerated Page-Clustered Vector Index) and a Trace (neuro-symbolic episodic graph). Key innovations include symmetric INT8 scalar quantization, decoupled write-ahead log for crash recovery, sidecar blob arena for extended text storage, and semantic lookaside buffer for fast retrieval.

Result: Benchmarks on Apple M4 Max show 4.70ns INT8 dot product latency, 3.09us tree traversal at 100K nodes (3.4x over FP32), P99 read latency of 750ns under 16-thread contention, and sub-5us retrieval latencies via semantic lookaside buffer.

Conclusion: Aeon provides a comprehensive memory management system for LLMs that addresses computational bottlenecks, memory limitations, and structural deficiencies in existing approaches through neuro-symbolic architecture and hardware-accelerated optimizations.

Abstract: Large Language Models (LLMs) are fundamentally constrained by the quadratic computational cost of self-attention and the “Lost in the Middle” phenomenon, where reasoning capabilities degrade as context windows expand. Existing solutions, primarily “Flat RAG” architectures relying on vector databases, treat memory as an unstructured bag of embeddings, failing to capture the hierarchical and temporal structure of long-horizon interactions. This paper presents Aeon, a Neuro-Symbolic Cognitive Operating System that redefines memory as a managed OS resource. Aeon structures memory into a Memory Palace (a spatial index implemented via Atlas, a SIMD-accelerated Page-Clustered Vector Index) and a Trace (a neuro-symbolic episodic graph). This architecture introduces three advances: (1) Symmetric INT8 Scalar Quantization, achieving 3.1x spatial compression and 5.6x math acceleration via NEON SDOT intrinsics; (2) a decoupled Write-Ahead Log (WAL) ensuring crash-recoverability with statistically negligible overhead (<1%); and (3) a Sidecar Blob Arena eliminating the prior 440-character text ceiling via an append-only mmap-backed blob file with generational garbage collection. The Semantic Lookaside Buffer (SLB) exploits conversational locality to achieve sub-5us retrieval latencies, with INT8 vectors dequantized to FP32 on cache insertion to preserve L1-resident lookup performance. Benchmarks on Apple M4 Max demonstrate that the combined architecture achieves 4.70ns INT8 dot product latency, 3.09us tree traversal at 100K nodes (3.4x over FP32), and P99 read latency of 750ns under hostile 16-thread contention via epoch-based reclamation.
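Symmetric INT8 scalar quantization, the general technique behind Aeon's compressed vectors, can be sketched as follows; the constants and API are illustrative, and real NEON SDOT kernels replace the NumPy integer dot used here:

```python
import numpy as np

# Symmetric INT8 scalar quantization sketch: scale each vector so its max
# magnitude maps to 127, store int8, and run dot products in integer math.
def quantize(v):
    scale = np.max(np.abs(v)) / 127.0
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dot(qa, sa, qb, sb):
    # Accumulate in int32 (as SDOT-style instructions do), then rescale.
    return int(qa.astype(np.int32) @ qb.astype(np.int32)) * sa * sb

rng = np.random.default_rng(0)
a, b = rng.normal(size=256), rng.normal(size=256)
qa, sa = quantize(a)
qb, sb = quantize(b)
approx, exact = int8_dot(qa, sa, qb, sb), float(a @ b)
# The quantized dot product closely tracks the FP32 result.
assert abs(approx - exact) < 0.05 * float(np.linalg.norm(a) * np.linalg.norm(b))
```

Storing int8 instead of float32 gives roughly 4x spatial compression per vector (the paper reports 3.1x end-to-end, presumably including per-vector scale and metadata overhead).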

[182] PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

Fabian Fumagalli, R. Teal Witter, Christopher Musco

Main category: cs.AI

TL;DR: PolySHAP extends KernelSHAP by using higher-degree polynomials to capture non-linear feature interactions, improving Shapley value estimation accuracy and providing theoretical justification for paired sampling heuristics.

DetailsMotivation: KernelSHAP approximates Shapley values using linear functions, but real-world feature interactions are often non-linear. The authors aim to improve accuracy by capturing these non-linear interactions through polynomial approximations.

Method: Extends KernelSHAP by approximating the game via higher-degree polynomials (PolySHAP) instead of linear functions. Proves theoretical connections between second-order PolySHAP and paired sampling (antithetic sampling) heuristics.

Result: PolySHAP yields empirically better Shapley value estimates across various benchmark datasets. Theoretically proves that paired sampling outputs exactly the same approximations as second-order PolySHAP without fitting degree 2 polynomials.

Conclusion: PolySHAP improves upon KernelSHAP by capturing non-linear feature interactions, provides theoretical consistency guarantees, and offers the first strong theoretical justification for the practical success of paired sampling heuristics.

Abstract: Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires $2^d$ game evaluations for a model with $d$ features. Lundberg and Lee’s KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets. In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent. Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
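Why a higher-degree surrogate helps can be seen on a toy cooperative game with one pairwise interaction: a degree-1 fit in subset-indicator variables leaves residual error, while a degree-2 fit reproduces the game exactly. This illustrates the underlying idea only; it is not the paper's estimator, which uses Shapley kernel weights and sampled subsets:

```python
import itertools
import numpy as np

# Toy game over d = 3 features with an interaction between features 0 and 1.
d = 3
def v(S):
    val = sum({0: 1.0, 1: 2.0, 2: 0.5}[i] for i in S)
    if 0 in S and 1 in S:
        val += 3.0                      # non-linear pairwise interaction
    return val

subsets = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]
y = np.array([v(S) for S in subsets])

def design(degree):
    # Polynomial features in the subset indicators x_i = 1[i in S].
    rows = []
    for S in subsets:
        z = [1.0] + [float(i in S) for i in range(d)]
        if degree >= 2:
            z += [float(i in S and j in S) for i, j in itertools.combinations(range(d), 2)]
        rows.append(z)
    return np.array(rows)

max_err = {}
for degree in (1, 2):
    X = design(degree)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    max_err[degree] = float(np.max(np.abs(X @ coef - y)))
print({k: round(e, 6) for k, e in max_err.items()})  # → {1: 0.75, 2: 0.0}
```

The degree-2 basis contains the interaction term x0*x1, so the surrogate is exact; the degree-1 fit (what KernelSHAP uses) cannot represent it and misfits every subset involving features 0 and 1.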

[183] ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research

Hao Shen, Hang Yang, Zhouhong Gu, Weili Han

Main category: cs.AI

TL;DR: ScholarGym is an evaluation environment that isolates and analyzes the information-gathering stage of deep research systems, decomposing the process into Query Planning, Tool Invocation, and Relevance Assessment stages for systematic evaluation.

Motivation: Current evaluation of deep research systems focuses on holistic scoring of final reports, which tightly couples decision-making, workflow design, and environmental feedback, preventing decomposable analysis of individual components.

Method: ScholarGym decomposes the research process into three explicit stages: Query Planning (decomposing research questions), Tool Invocation (retrieving papers), and Relevance Assessment (evaluating retrieved content). It uses 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval.

Result: Iterative query decomposition yields 2.9-3.3× F1 gains over single-query retrieval. Models with extended thinking trade recall for precision. Query Planning quality and Relevance Assessment constitute dual bottlenecks separating proprietary from open-source model performance.

Conclusion: ScholarGym enables systematic analysis of deep research systems by isolating information-gathering stages, revealing key bottlenecks and performance characteristics that are obscured in end-to-end evaluations.

Abstract: Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model’s decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages – Query Planning, Tool Invocation, and Relevance Assessment – and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments reveal that iterative query decomposition yields 2.9–3.3$\times$ F1 gains over single-query retrieval, models with extended thinking trade recall for precision, and Query Planning quality together with Relevance Assessment constitute dual bottlenecks that separate proprietary from open-source model performance.
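The reported F1 gains refer to scoring retrieved papers against expert annotations; a minimal sketch of the standard set-level metric with toy data (the benchmark's exact protocol is an assumption, not quoted from the paper):

```python
# Hypothetical sketch: set-level F1 of a retrieved paper set against
# expert-annotated ground truth.
def retrieval_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved or not relevant:
        return 0.0
    tp = len(retrieved & relevant)  # correctly retrieved papers
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Toy contrast between one broad query and the union over decomposed
# sub-queries (paper IDs are made up):
gold = {"p1", "p2", "p3", "p4", "p5", "p6"}
single = {"p1", "p9"}                       # one query, shallow coverage
iterative = {"p1", "p2", "p3", "p4", "p8"}  # union over sub-queries
```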

[184] FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

Mingda Zhang, Haoran Luo, Tiesunlong Shen, Qika Lin, Xiaoying Tang, Rui Mao, Erik Cambria

Main category: cs.AI

TL;DR: FlowSteer is an RL framework that automates workflow orchestration using a policy model interacting with an executable canvas environment, addressing challenges of manual cost, operator/LLM dependence, and sparse rewards.

Motivation: Existing workflow orchestration faces high manual costs, reliance on specific operators/LLMs, and sparse reward signals, limiting automation and adaptability.

Method: End-to-end RL framework with lightweight policy model as agent interacting with executable canvas environment; policy analyzes states and selects editing actions while canvas executes operators and provides feedback; supports plug-and-play operator libraries and interchangeable LLM backends; uses Canvas Workflow Relative Policy Optimization (CWRPO) with diversity-constrained rewards and conditional release.

Result: Experimental results on twelve datasets show FlowSteer significantly outperforms baselines across various tasks.

Conclusion: FlowSteer provides an effective automated workflow orchestration solution that addresses key challenges in current approaches through RL-based interaction paradigm.

Abstract: In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.

[185] MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

Main category: cs.AI

TL;DR: MARS is a modular AI research agent framework that uses budget-aware planning, modular construction, and comparative reflective memory to automate AI research while managing computational costs and performance attribution.

Motivation: Current LLM-based agents struggle with AI research automation due to computationally expensive evaluations (like model training) and opaque performance attribution, often generating monolithic scripts that ignore execution costs and causal factors.

Method: Three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search to balance performance with execution expense; (2) Modular Construction using a “Design-Decompose-Implement” pipeline; (3) Comparative Reflective Memory that analyzes solution differences to distill insights and address credit assignment.

Result: MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with top methods on the global leaderboard. The system shows qualitative “Aha!” moments with 63% of utilized lessons originating from cross-branch transfer.

Conclusion: MARS effectively automates AI research by managing computational costs, enabling modular development, and facilitating knowledge transfer across search paths through comparative reflection.

Abstract: Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a “Design-Decompose-Implement” pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard’s top methods. Furthermore, the system exhibits qualitative “Aha!” moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
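Budget-aware planning of this kind is typically realized by penalizing expensive branches during MCTS selection. A hedged sketch, assuming a simple linear cost penalty on top of UCT (the paper's exact objective may differ):

```python
import math

# Sketch of cost-aware child selection in the spirit of budget-constrained
# MCTS; the linear lambda penalty is an assumption, not MARS's formula.
def select_child(children, total_visits, c=1.4, lam=0.1):
    """children: list of dicts with 'value' (summed reward), 'visits',
    and 'cost' (expected execution expense). Returns the index of the
    child maximizing UCT minus a cost penalty."""
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # expand unvisited children first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore - lam * ch["cost"]
    return max(range(len(children)), key=lambda i: score(children[i]))

children = [
    {"value": 3.0, "visits": 5, "cost": 1.0},   # cheap, decent mean reward
    {"value": 4.0, "visits": 5, "cost": 30.0},  # better mean, very expensive
]
best = select_child(children, total_visits=10)
# With lam=0.1 the expensive branch's penalty (3.0) outweighs its
# reward edge, so the cheap branch is selected.
```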

[186] LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge

Xin Wang, Hong Jia, Hualin Zhou, Sheng Guang Wang, Yu Zhang, Ting Dang, Tao Gu

Main category: cs.AI

TL;DR: LQA is a lightweight quantized-adaptive framework for vision-language models that enables efficient on-device deployment through modality-aware quantization and gradient-free test-time adaptation.

Motivation: Vision-language models face deployment challenges on edge devices due to resource constraints and performance degradation under distribution shifts. Existing test-time adaptation methods are too resource-intensive for on-device use.

Method: Proposes LQA framework with: 1) Selective Hybrid Quantization (SHQ) - modality-aware quantization strategy, and 2) Quantized gradient-free adaptation mechanism for efficient test-time adaptation without heavy gradient computations.

Result: LQA improves adaptation performance by 4.5%, uses less memory than full-precision models, and outperforms gradient-based TTA methods with up to 19.9× lower memory usage across seven datasets.

Conclusion: LQA provides a practical solution for robust, privacy-preserving, and efficient VLM deployment on resource-constrained edge devices.

Abstract: Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that combines a modality-aware quantization strategy with gradient-free test-time adaptation. We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments across both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods, achieving up to 19.9$\times$ lower memory usage across seven open-source datasets. These results demonstrate that LQA offers a practical pathway for robust, privacy-preserving, and efficient VLM deployment on edge devices.

[187] stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation

Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, Randall Balestriero

Main category: cs.AI

TL;DR: SWM is a modular, tested world-model research ecosystem with standardized environments, data-collection tools, planning algorithms, and baseline implementations to improve reusability and evaluation standardization.

Motivation: Current world model implementations are publication-specific, limiting reusability, increasing bug risks, and reducing evaluation standardization. There's a need for a robust, modular ecosystem to support world model research.

Method: Developed stable-worldmodel (SWM) - a modular ecosystem with efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. Environments include controllable factors of variation (visual and physical properties).

Result: SWM provides a reusable framework demonstrated by studying zero-shot robustness in DINO-WM. The ecosystem supports robustness and continual learning research through controllable environmental variations.

Conclusion: SWM addresses critical issues in world model research by providing a standardized, modular ecosystem that improves reusability, reduces bugs, and enables better evaluation standardization.

Abstract: World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.

[188] Tabular Foundation Models Can Learn Association Rules

Erkan Karabulut, Daniel Daza, Paul Groth, Martijn C. Schut, Victoria Degeler

Main category: cs.AI

TL;DR: TabProbe: A model-agnostic framework that extracts association rules from tabular foundation models without frequent itemset mining, enabling high-quality rule discovery even in low-data regimes.

Motivation: Classical ARM methods suffer from rule explosion and poor scalability, while recent neural approaches degrade in low-data settings. Tabular foundation models offer strong in-context generalization but haven't been leveraged for association rule mining.

Method: Proposes a model-agnostic framework to extract association rules from any conditional probabilistic model over tabular data. Introduces TabProbe as an instantiation that uses TFMs as conditional probability estimators to learn association rules without frequent itemset mining.

Result: TFMs consistently produce concise, high-quality association rules with strong predictive performance and remain robust in low-data settings without task-specific training.

Conclusion: TabProbe successfully leverages tabular foundation models for association rule mining, overcoming limitations of both classical and recent neural approaches, particularly in low-data regimes.

Abstract: Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high-stakes decision-making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low-data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in-context generalization, provide a basis for addressing these limitations. We introduce a model-agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out-of-the-box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high-quality association rules with strong predictive performance and remain robust in low-data settings without task-specific training. Source code is available at https://github.com/DiTEC-project/tabprobe.
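The framework's central move, treating any conditional probability estimator as a rule scorer, can be sketched as follows, with a lookup table standing in for the tabular foundation model (names and threshold are illustrative, not from the paper):

```python
# Hypothetical sketch: keep rules X -> Y whose estimated confidence
# P(Y | X) clears a threshold. `cond_prob` stands in for a TFM's
# in-context conditional probability estimate.
def extract_rules(cond_prob, antecedents, consequents, min_conf=0.8):
    rules = []
    for x in antecedents:
        for y in consequents:
            conf = cond_prob(x, y)
            if conf >= min_conf:
                rules.append((x, y, conf))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda r: -r[2])

# Toy estimator: a lookup table in place of the foundation model.
table = {("outlook=sunny", "play=no"): 0.9,
         ("outlook=rain", "play=no"): 0.3}
rules = extract_rules(lambda x, y: table.get((x, y), 0.0),
                      ["outlook=sunny", "outlook=rain"], ["play=no"])
```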

[189] Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Luís Silva, Diogo Gonçalves, Catarina Farinha, Clara Matos, Luís Ungaro

Main category: cs.AI

TL;DR: Arbor is a framework that decomposes decision tree navigation into specialized node-level tasks to improve LLM adherence to structured workflows in high-stakes domains like healthcare triage.

Motivation: Large language models struggle with maintaining strict adherence to structured workflows in high-stakes domains, especially as prompt length increases, leading to instruction-following degradation, lost-in-the-middle effects, and context window overflow.

Method: Arbor decomposes decision tree navigation into specialized node-level tasks, standardizes decision trees into edge-list representations, uses DAG-based orchestration to iteratively retrieve outgoing edges, evaluates transitions via dedicated LLM calls, and delegates response generation to separate inference steps.

Result: Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves 14.4x reduction in per-turn cost compared to single-prompt baselines across 10 foundation models using real clinical triage conversations.

Conclusion: Architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines for structured decision-making tasks.

Abstract: Large language models struggle to maintain strict adherence to structured workflows in high-stakes domains such as healthcare triage. Monolithic approaches that encode entire decision structures within a single prompt are prone to instruction-following degradation as prompt length increases, including lost-in-the-middle effects and context window overflow. To address this gap, we present Arbor, a framework that decomposes decision tree navigation into specialized, node-level tasks. Decision trees are standardized into an edge-list representation and stored for dynamic retrieval. At runtime, a directed acyclic graph (DAG)-based orchestration mechanism iteratively retrieves only the outgoing edges of the current node, evaluates valid transitions via a dedicated LLM call, and delegates response generation to a separate inference step. The framework is agnostic to the underlying decision logic and model provider. Evaluated against single-prompt baselines across 10 foundation models using annotated turns from real clinical triage conversations, Arbor improves mean turn accuracy by 29.4 percentage points, reduces per-turn latency by 57.1%, and achieves an average 14.4x reduction in per-turn cost. These results indicate that architectural decomposition reduces dependence on intrinsic model capability, enabling smaller models to match or exceed larger models operating under single-prompt baselines.
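The edge-list representation and per-node transition evaluation described above can be sketched as follows, with a stub in place of the two LLM calls (transition evaluation and response generation); nodes and conditions are invented for illustration:

```python
# Minimal sketch of node-level decision-tree navigation: only the current
# node's outgoing edges are considered each turn, and an evaluator picks
# the transition. In the real framework the evaluator and the responder
# are separate LLM calls; here a stub stands in.
EDGES = [  # (source, condition, target) edge list
    ("start", "fever reported", "check_temp"),
    ("start", "no fever", "check_other"),
    ("check_temp", "temp >= 39C", "urgent"),
    ("check_temp", "temp < 39C", "home_care"),
]

def outgoing(node):
    return [(cond, tgt) for src, cond, tgt in EDGES if src == node]

def navigate(start, evaluate, max_turns=10):
    node, path = start, [start]
    for _ in range(max_turns):
        edges = outgoing(node)
        if not edges:
            break  # leaf reached
        cond, node = evaluate(node, edges)  # stub for the transition LLM call
        path.append(node)
    return path

# Stub evaluator: always takes the first valid edge.
path = navigate("start", lambda node, edges: edges[0])
```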

[190] From User Preferences to Base Score Extraction Functions in Gradual Argumentation (with Appendix)

Aniol Civit, Antonio Rago, Antonio Andriella, Guillem Alenyà, Francesca Toni

Main category: cs.AI

TL;DR: Base Score Extraction Functions map user preferences over arguments to base scores in gradual argumentation frameworks, enabling easier construction of Quantitative Bipolar Argumentation Frameworks without requiring expert score selection.

Motivation: Selecting appropriate base scores for arguments in gradual argumentation frameworks requires user expertise and is often non-trivial. Organizing arguments by preference could simplify this process, making gradual argumentation more accessible for transparent AI systems.

Method: Introduces Base Score Extraction Functions that map user preferences over arguments to base scores. These functions can be applied to Bipolar Argumentation Frameworks with preferences to obtain Quantitative Bipolar Argumentation Frameworks. The method incorporates approximations of non-linearities in human preferences and includes an algorithm for base score extraction.

Result: The approach is evaluated both theoretically and experimentally in a robotics setting. The paper provides recommendations for selecting appropriate gradual semantics in practice and demonstrates how preference-based score extraction can simplify the construction of argumentation frameworks.

Conclusion: Base Score Extraction Functions offer a practical method for deriving base scores from user preferences, making gradual argumentation frameworks more accessible and easier to construct while maintaining their utility for transparent AI systems.

Abstract: Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, debate analysis, and others. The outcomes in such domains are usually dependent on the arguments’ base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce \emph{Base Score Extraction Functions}, which provide a mapping from users’ preferences over arguments to base scores. These functions can be applied to the arguments of a \emph{Bipolar Argumentation Framework} (BAF), supplemented with preferences, to obtain a \emph{Quantitative Bipolar Argumentation Framework} (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences to allow for better approximation of the real ones. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.
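As a toy illustration of what a base score extraction function might look like: a mapping from a preference ranking over arguments to scores in [0, 1], with a simple concave non-linearity. The paper's actual functions, properties, and algorithm are richer than this sketch, and the exponent form is purely an assumption:

```python
# Hypothetical base score extraction: map a total preference order over
# arguments to base scores in [0, 1]. The concave exponent gamma models a
# simple non-linearity in human preferences.
def base_scores(ranked_args, gamma=0.5):
    """ranked_args: arguments from most to least preferred.
    Returns {argument: base score}, most preferred -> 1.0."""
    n = len(ranked_args)
    if n == 1:
        return {ranked_args[0]: 1.0}
    return {a: (1 - i / (n - 1)) ** gamma for i, a in enumerate(ranked_args)}

scores = base_scores(["a1", "a2", "a3"])
# a1 -> 1.0 and a3 -> 0.0, while a2 lies above the linear midpoint
# because the mapping is concave.
```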

[191] Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

Main category: cs.AI

TL;DR: A benchmarking methodology and Bioptic Agent for drug asset scouting that outperforms major AI models in discovering non-US, multilingual pharmaceutical innovations with high recall and low hallucination.

Motivation: Pharmaceutical innovation has shifted globally with most assets originating outside the US and disclosed in non-English channels, creating multi-billion dollar risks for investors who miss these "under-the-radar" assets. Current AI agents lag human experts in comprehensive, non-hallucinated discovery across multilingual sources.

Method: Proposes a benchmarking methodology using multilingual multi-agent pipeline to create challenging completeness benchmarks with complex user queries and ground-truth assets outside US-centric radar. Develops a tuned, tree-based self-learning Bioptic Agent designed for complete, non-hallucinated scouting. Uses LLM-as-judge evaluation calibrated to expert opinions.

Result: Bioptic Agent achieves 79.7% F1 score, significantly outperforming Claude Opus 4.6 (56.2%), Gemini 3 Pro + Deep Research (50.6%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves with additional compute.

Conclusion: The Bioptic Agent demonstrates superior performance in drug asset scouting across multilingual sources, addressing critical gaps in current AI systems for comprehensive discovery of global pharmaceutical innovations while minimizing hallucinations.

Abstract: Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests that over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total. A growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface “under-the-radar” assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today’s Deep Research AI agents still lag human experts in achieving high recall discovery across heterogeneous, multilingual sources without hallucination. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real-deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Claude Opus 4.6 (56.2%), Gemini 3 Pro + Deep Research (50.6%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

cs.SD

[192] S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization

Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

Main category: cs.SD

TL;DR: S-PRESSO is a 48kHz sound effect compression model that achieves ultra-low bitrates (down to 0.096 kbps) using a latent diffusion decoder with both continuous and discrete embeddings via offline quantization.

Motivation: Existing neural audio compression methods are limited to low-resolution audio and degrade significantly at very low bitrates with audible artifacts. There's a need for high-quality compression at extreme compression rates for sound effects.

Method: Uses a pretrained latent diffusion model as decoder with a latent encoder that learns compressed audio embeddings. Employs offline quantization to produce both continuous and discrete embeddings. Achieves extremely low frame rates (down to 1Hz, 750x compression) by leveraging generative priors of the diffusion decoder.

Result: Outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics despite operating at high compression rates. Produces convincing and realistic reconstructions at the cost of exact fidelity.

Conclusion: S-PRESSO demonstrates that leveraging generative priors from diffusion models enables extreme audio compression (down to 0.096 kbps) while maintaining perceptual quality, pushing the boundaries of what’s possible in neural audio compression.

Abstract: Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.

[193] Structure-Aware Piano Accompaniment via Style Planning and Dataset-Aligned Pattern Retrieval

Wanyu Zang, Yang Yu, Meng Yu

Main category: cs.SD

TL;DR: A structure-aware piano accompaniment system that uses transformer-based style planning and retrieval of human-performed patterns to generate symbolic MIDI accompaniments from lead sheets.

Motivation: The authors aim to build a symbolic piano accompaniment system that can generate diverse, long-form accompaniments with strong style realization by decoupling high-level structural planning from note-level realization.

Method: Uses a lightweight transformer to predict interpretable per-measure style plans conditioned on section/phrase structure and functional harmony, then retrieves and reharmonizes human-performed piano patterns from a corpus using an explicit energy-based retrieval formulation with multiple constraints.

Result: The system generates diverse long-form piano accompaniments with strong style realization from structured lead sheets and optional keyword prompts, demonstrating effectiveness through experimental validation.

Conclusion: The structure-aware approach combining transformer-based style planning with energy-guided pattern retrieval effectively generates high-quality symbolic piano accompaniments with good style control and structural coherence.

Abstract: We introduce a structure-aware approach for symbolic piano accompaniment that decouples high-level planning from note-level realization. A lightweight transformer predicts an interpretable, per-measure style plan conditioned on section/phrase structure and functional harmony, and a retriever then selects and reharmonizes human-performed piano patterns from a corpus. We formulate retrieval as pattern matching under an explicit energy with terms for harmonic feasibility, structural-role compatibility, voice-leading continuity, style preferences, and repetition control. Given a structured lead sheet and optional keyword prompts, the system generates piano-accompaniment MIDI. In our experiments, transformer style-planner-guided retrieval produces diverse long-form accompaniments with strong style realization. We further analyze planner ablations and quantify inter-style isolation. Experimental results demonstrate the effectiveness of our inference-time approach for piano accompaniment generation.
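The energy-based retrieval step reduces to an argmin over the corpus of a weighted sum of penalty terms; a minimal sketch, with term names taken from the abstract and weights and penalty values invented for illustration:

```python
# Sketch of energy-based pattern retrieval: score each corpus pattern by a
# weighted sum of penalty terms (lower is better) and keep the minimizer.
def retrieve(patterns, weights):
    """patterns: list of dicts mapping term name -> penalty.
    weights: dict of term name -> weight.
    Returns the index of the minimum-energy pattern."""
    def energy(p):
        return sum(weights[t] * p[t] for t in weights)
    return min(range(len(patterns)), key=lambda i: energy(patterns[i]))

# Terms mirror the abstract's energy: harmonic feasibility, structural-role
# compatibility, voice-leading continuity, style preference, repetition.
weights = {"harmony": 2.0, "structure": 1.0, "voice_leading": 1.0,
           "style": 0.5, "repetition": 0.5}
patterns = [
    {"harmony": 0.1, "structure": 0.2, "voice_leading": 0.3,
     "style": 0.0, "repetition": 0.9},  # repeats recent material
    {"harmony": 0.0, "structure": 0.1, "voice_leading": 0.2,
     "style": 0.1, "repetition": 0.2},  # fresh, harmonically feasible
]
best = retrieve(patterns, weights)
```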

[194] The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs

Samir Sadok, Laurent Girin, Xavier Alameda-Pineda

Main category: cs.SD

TL;DR: Introducing shape-gain decomposition into neural audio codecs improves bitrate-distortion performance and reduces complexity by separating gain quantization from shape encoding.

Motivation: Current neural audio codecs jointly encode gain and shape in the same latent space, making them inefficient and sensitive to input signal level variations, leading to codebook redundancy and poor bitrate-distortion performance.

Method: Proposed Equalizer methodology decomposes input signal into gain and normalized shape vector before encoding. Shape is processed by neural audio codec while gain is quantized separately with scalar quantization. Output reconstructed from normalized NAC output and quantized gain.

Result: Experiments on speech signals show substantial improvement in bitrate-distortion performance and massive reduction in complexity. The method is easily applicable to any neural audio codec.

Conclusion: Shape-gain decomposition, a classical audio coding technique, when integrated into neural audio codecs, significantly enhances performance and efficiency while maintaining compatibility with existing NAC architectures.

Abstract: Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are poorly robust to a global variation of the input signal level in the sense that such variation has strong influence on the embedding vectors at the output of the encoder and their quantization. This methodology is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal – before the NAC encoder – into gain and normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments conducted on speech signals show that this general methodology, easily applicable to any NAC, enables a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
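The classical decomposition itself is compact; a minimal sketch of the per-frame split into an RMS gain and an RMS-normalized shape, with a uniform scalar quantizer for the gain (the NAC that would process the shape is omitted):

```python
import math

# Minimal sketch of per-frame shape-gain decomposition, the classical idea
# the Equalizer brings to NACs. Frame values and the quantizer step are
# illustrative.
def decompose(frame):
    gain = math.sqrt(sum(x * x for x in frame) / len(frame))  # RMS gain
    shape = [x / gain for x in frame] if gain > 0 else list(frame)
    return gain, shape  # shape is RMS-normalized (unit RMS)

def quantize_gain(gain, step=0.05):
    return round(gain / step) * step  # uniform scalar quantizer

frame = [0.2, -0.4, 0.4, -0.2]
gain, shape = decompose(frame)
recon = [quantize_gain(gain) * s for s in shape]
# The shape is level-invariant: scaling the frame changes only the gain,
# so the embedding seen by the NAC would be unaffected by input level.
```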

[195] UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling

Qiangong Zhou, Nagasaka Tomohiro

Main category: cs.SD

TL;DR: A unified model that merges TTS and audio-to-face (A2F) generation, using internal feature transfer to improve consistency between synthesized speech and facial expressions.

DetailsMotivation: To improve consistency between audio and facial expressions generated from text by creating a unified model that enables internal feature transfer between TTS and audio-to-face generation systems.

Method: Merges two independent models (TTS and A2F) into a unified model to enable internal feature transfer, and extends emotion control mechanisms from TTS to the joint model.

Result: Validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech-expression co-design.

Conclusion: Demonstrates system design approach for joint audio-visual generation from text, focusing on feature reuse and consistency rather than generation quality.

Abstract: This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between audio and facial expressions generated from text. We also discuss the extension of the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides engineering practice references for subsequent speech expression co-design. The project code has been open source at: https://github.com/GoldenFishes/UniTAF

[196] A Generative-First Neural Audio Autoencoder

Jonah Casebeer, Ge Zhu, Zhepei Wang, Nicholas J. Bryan

Main category: cs.SD

TL;DR: A generative-first audio autoencoder architecture that achieves 3360x temporal downsampling, supports both continuous/discrete representations and multiple audio formats in one model, enabling 10x faster encoding and 1.6x lower rates while maintaining quality.

DetailsMotivation: Existing neural autoencoders for audio generation have limitations: they are reconstruction-first, resulting in high latent rates, slow encoding, and require separate architectures for different representations (discrete vs. continuous) and audio channel formats, which hinders practical workflows from preprocessing to inference conditioning.

Method: Introduces a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x. The model supports both continuous and discrete representations and common audio channel formats (mono, stereo, etc.) in a single unified architecture, balancing compression, quality, and speed.

Result: Achieves 10x faster encoding, 1.6x lower latent rates, and eliminates the need for channel-format-specific variants while maintaining competitive reconstruction quality. A 60-second mono signal compresses to just 788 tokens, making generative modeling more tractable.

Conclusion: The generative-first audio autoencoder enables applications previously constrained by processing costs by providing a unified, efficient architecture that supports multiple representations and formats, significantly improving the practicality of large-scale audio generative modeling.

Abstract: Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
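
The 788-token figure can be sanity-checked from the downsampling factor alone. A quick arithmetic sketch, assuming 44.1 kHz audio (the sample rate is not stated above):

```python
import math

sample_rate = 44_100   # assumed sample rate; not stated in the summary
duration_s = 60
tokens_new = math.ceil(sample_rate * duration_s / 3360)  # generative-first model
tokens_old = math.ceil(sample_rate * duration_s / 2048)  # prior 2048x models
```

Under this assumption tokens_new comes to 788 and tokens_old to 1292, and their ratio of about 1.64 is consistent with the reported 1.6x lower latent rate.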

[197] TAC: Timestamped Audio Captioning

Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, Justin Salamon

Main category: cs.SD

TL;DR: TAC is a timestamped audio captioner that produces temporally grounded audio descriptions with varying detail, trained on synthetic mixtures to handle polyphonic scenes, and TAC-V extends it to audio-visual descriptions, both serving as semantic bridges for LLMs to achieve SOTA on multimodal understanding benchmarks.

DetailsMotivation: Current Large Audio Language Models struggle with overlapping events in complex acoustic scenes, leading to temporally inconsistent captions and hallucinations, necessitating better temporal grounding and detail in audio understanding.

Method: TAC is trained with a synthetic data pipeline constructing challenging dynamic mixtures from real-world audio sources for robust polyphonic learning. TAC-V extends this to audio-visual descriptions. Both serve as semantic bridges for text-only LLMs in TAC→LLM and TAC-V→LLM cascades.

Result: TAC outperforms all competing methods in event detection and dense captioning with low hallucination rate and accurate temporal grounding. TAC→LLM and TAC-V→LLM cascades achieve state-of-the-art scores on audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding benchmarks.

Conclusion: TAC provides temporally grounded audio descriptions that effectively bridge semantic gaps for LLMs, enabling superior multimodal understanding and reasoning across audio and audio-visual domains.

Abstract: Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a “semantic bridge” for a text-only reasoner: a simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascade achieves state-of-the-art scores on benchmarks for both audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning respectively.
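
The synthetic-mixture idea can be sketched as placing labelled source clips at random offsets inside a scene and emitting the resulting timestamped caption as the training target. This toy version invents its own caption format and function names; it is not the authors' pipeline:

```python
import random

def make_mixture(events, scene_len=10.0, seed=0):
    """Place (label, duration) clips at random offsets in a scene and emit the
    ground-truth timestamped caption for the resulting polyphonic mixture."""
    rng = random.Random(seed)
    placed = []
    for name, dur in events:
        start = round(rng.uniform(0.0, scene_len - dur), 2)
        placed.append((start, round(start + dur, 2), name))
    placed.sort()
    caption = "; ".join(f"[{s:.2f}-{e:.2f}] {n}" for s, e, n in placed)
    return placed, caption

placed, caption = make_mixture([("dog bark", 1.2), ("car horn", 0.8), ("speech", 3.0)])
```

Because the offsets are sampled, events naturally overlap, which is exactly the polyphonic condition the captioner must learn to disentangle.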

[198] Token-Based Audio Inpainting via Discrete Diffusion

Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani

Main category: cs.SD

TL;DR: Discrete diffusion approach for audio inpainting using tokenized music representations, outperforming previous methods on large missing segments up to 750ms.

DetailsMotivation: Previous diffusion-based audio inpainting methods struggle with large missing segments. The authors aim to develop a more effective approach for restoring long gaps in audio recordings, particularly for musical content.

Method: Uses discrete diffusion over tokenized music representations from a pre-trained audio tokenizer. Incorporates two key training approaches: derivative-based regularization loss for smooth temporal dynamics, and span-based absorbing transition for structured corruption during diffusion.

Result: Outperforms strong baselines on MusicNet and MAESTRO datasets for gaps up to 750ms, particularly effective for gaps of 150ms and above.

Conclusion: Advances musical audio restoration and introduces new directions for discrete diffusion model training, enabling stable and semantically coherent restoration of long gaps in audio.

Abstract: Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Visit our project page for examples and code.
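
The span-based absorbing transition can be illustrated with a toy corruption function: contiguous spans of token ids are replaced by a single absorbing mask token, which the reverse diffusion process then learns to invert. Token ids, span counts, and the MASK value below are illustrative:

```python
import random

MASK = -1  # id of the absorbing token

def absorb_spans(tokens, span_len=4, n_spans=2, seed=0):
    """Structured corruption: overwrite contiguous spans with the absorbing
    MASK token; the reverse diffusion process learns to fill them back in."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_spans):
        start = rng.randrange(len(out) - span_len + 1)
        out[start:start + span_len] = [MASK] * span_len
    return out

corrupted = absorb_spans(list(range(32)))
```

Masking whole spans, rather than independent positions, matches the inpainting task at inference time, where the missing region is one contiguous gap.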

[199] XAI-Driven Spectral Analysis of Cough Sounds for Respiratory Disease Characterization

Patricia Amado-Caballero, Luis Miguel San-José-Revuelta, María Dolores Aguilar-García, José Ramón Garmendia-Leiza, Carlos Alberola-López, Pablo Casaseca-de-la-Higuera

Main category: cs.SD

TL;DR: XAI-driven methodology using occlusion maps on cough spectrograms processed by CNN to identify disease-specific acoustic patterns, particularly for COPD diagnosis.

DetailsMotivation: To enhance understanding of cough sound analysis for respiratory disease management by making AI models more interpretable and uncovering disease-specific acoustic signatures that raw spectrogram analysis might miss.

Method: Uses occlusion maps on CNN-processed cough spectrograms to highlight relevant spectral regions, then performs spectral analysis on weighted spectrograms to extract features and identify disease-specific patterns.

Result: Identified significant differences between disease groups (especially COPD) in occlusion-weighted spectral regions, contrasting with no significant differences in raw spectrogram analysis, revealing more variable cough patterns in COPD patients.

Conclusion: XAI techniques can uncover disease-specific acoustic signatures and improve diagnostic capabilities of cough sound analysis by providing more interpretable results than traditional methods.

Abstract: This paper proposes an eXplainable Artificial Intelligence (XAI)-driven methodology to enhance the understanding of cough sound analysis for respiratory disease management. We employ occlusion maps to highlight relevant spectral regions in cough spectrograms processed by a Convolutional Neural Network (CNN). Subsequently, spectral analysis of spectrograms weighted by these occlusion maps reveals significant differences between disease groups, particularly in patients with COPD, where cough patterns appear more variable in the identified spectral regions of interest. This contrasts with the lack of significant differences observed when analyzing raw spectrograms. The proposed approach extracts and analyzes several spectral features, demonstrating the potential of XAI techniques to uncover disease-specific acoustic signatures and improve the diagnostic capabilities of cough sound analysis by providing more interpretable results.
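
Occlusion mapping itself is model-agnostic and easy to sketch: slide a patch over the spectrogram, zero it out, and record how much the classifier score drops. The toy scoring function below stands in for the CNN; the patch size and all names are illustrative:

```python
def occlusion_map(spec, score_fn, patch=2):
    """Zero out each patch of the spectrogram and record the drop in the
    model's score; large drops mark spectral regions the model relies on."""
    base = score_fn(spec)
    n_f, n_t = len(spec), len(spec[0])
    heat = [[0.0] * n_t for _ in range(n_f)]
    for f0 in range(0, n_f, patch):
        for t0 in range(0, n_t, patch):
            occluded = [row[:] for row in spec]
            for f in range(f0, min(f0 + patch, n_f)):
                for t in range(t0, min(t0 + patch, n_t)):
                    occluded[f][t] = 0.0
            drop = base - score_fn(occluded)
            for f in range(f0, min(f0 + patch, n_f)):
                for t in range(t0, min(t0 + patch, n_t)):
                    heat[f][t] = drop
    return heat

# toy "CNN": mean energy of the top half of the spectrogram
score = lambda s: sum(sum(row) for row in s[: len(s) // 2]) / (len(s) * len(s[0]) / 2)
spec = [[1.0] * 4 if f < 2 else [0.0] * 4 for f in range(4)]
heat = occlusion_map(spec, score, patch=2)
```

Weighting the spectrogram by this heat map, as the paper does before spectral analysis, concentrates subsequent feature extraction on the regions the model actually used.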

[200] MARS-Sep: Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin

Main category: cs.SD

TL;DR: MARS-Sep is a reinforcement learning framework for universal sound separation that aligns separation outputs with semantic preferences using a reward model derived from audio-text-vision encoders, improving semantic quality over traditional signal-level optimization.

DetailsMotivation: Current sound separation models optimized for low-level signal metrics often produce outputs contaminated with perceptually salient interference from acoustically similar sources, failing to achieve semantic purity. This misalignment between signal metrics and human perception parallels the alignment problem in LLMs.

Method: MARS-Sep reformulates separation as decision making using reinforcement learning. It learns a factorized Beta mask policy steered by a preference reward model derived from progressively-aligned audio-text-vision encoders. The framework uses a stable, clipped trust-region surrogate for optimization and directly incentivizes semantic consistency with query prompts.

Result: Extensive experiments on multiple benchmarks show consistent gains in Text-, Audio-, and Image-Queried separation tasks, with notable improvements in both signal metrics and semantic quality compared to traditional approaches.

Conclusion: The preference alignment perspective effectively addresses the semantic contamination problem in sound separation, demonstrating that RL-based frameworks with multimodal reward models can produce more semantically consistent and perceptually satisfying separation outputs.

Abstract: Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We introduce a preference alignment perspective, analogous to aligning LLMs with human intent. To address this, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized by a stable, clipped trust-region surrogate. The reward, derived from a progressively-aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://github.com/mars-sep/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
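
The clipped trust-region surrogate for a Beta mask policy can be sketched in plain Python: a PPO-style clipped objective evaluated per mask bin, with the probability ratio taken under old and new Beta distributions. The Beta parameters, advantages, and clip range are invented for illustration; this is not the authors' exact loss:

```python
import math

def beta_pdf(x, a, b):
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

def clipped_surrogate(mask, adv, old_ab, new_ab, eps=0.2):
    """Average the clipped objective over mask bins: min(r*A, clip(r)*A),
    where r is the new/old Beta density ratio at the sampled mask value."""
    total = 0.0
    for m, a_t, (ao, bo), (an, bn) in zip(mask, adv, old_ab, new_ab):
        ratio = beta_pdf(m, an, bn) / beta_pdf(m, ao, bo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * a_t, clipped * a_t)
    return total / len(mask)

objective = clipped_surrogate(
    mask=[0.7, 0.3], adv=[1.0, -1.0],
    old_ab=[(2.0, 2.0), (2.0, 2.0)], new_ab=[(3.0, 2.0), (3.0, 2.0)],
)
```

A Beta distribution is a natural choice here because mask values live in [0, 1]; the clipping keeps the updated policy inside a trust region around the old one.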

[201] AudioRAG+: Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models

Junqi Zhao, Chenxing Li, Jinzheng Zhao, Rilin Chen, Dong Yu, Mark D. Plumbley, Wenwu Wang

Main category: cs.SD

TL;DR: Feedback-driven RAG approach using Large Audio Language Models to improve text-to-audio generation by identifying missing sound events and retrieving relevant concepts from external databases.

DetailsMotivation: Addresses the problem of missing or imperfect synthesis of specific sound events in text-to-audio generation, where pre-trained models often struggle to generate certain concepts accurately.

Method: Uses Large Audio Language Models to analyze audio generation outputs, identify missing sound events, retrieve relevant concepts from an external database, and incorporate retrieved information into the generation process through a feedback loop.

Result: Method enhances LALMs’ ability to identify missing sound events and improves performance across different models, outperforming existing RAG-specialized approaches.

Conclusion: Feedback-driven RAG with LALMs provides an effective approach to improve text-to-audio generation by addressing model limitations through external knowledge retrieval.

Abstract: We propose a general feedback-driven retrieval-augmented generation (RAG) approach that leverages Large Audio Language Models (LALMs) to address the missing or imperfect synthesis of specific sound events in text-to-audio (TTA) generation. Unlike previous RAG-based TTA methods that typically train specialized models from scratch, we utilize LALMs to analyze audio generation outputs, retrieve concepts that pre-trained models struggle to generate from an external database, and incorporate the retrieved information into the generation process. Experimental results show that our method not only enhances the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG-specialized approaches.
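
The feedback loop reads naturally as a generate-analyze-retrieve-regenerate cycle. A sketch with injected stand-in functions (all names and the toy event model are illustrative, not the paper's API):

```python
def feedback_rag_tta(prompt, generate, analyze, retrieve, max_rounds=3):
    """Generate audio, ask the LALM which prompted events are missing,
    retrieve references for them, and regenerate with the added context."""
    context = []
    audio = generate(prompt, context)
    for _ in range(max_rounds):
        missing = analyze(prompt, audio)  # LALM feedback: events not heard
        if not missing:
            break
        context += [retrieve(event) for event in missing]
        audio = generate(prompt, context)
    return audio, context

# toy stand-ins: the "audio" is just the set of events it contains
needed = {"rain", "thunder"}
generate = lambda prompt, ctx: set(ctx) | {"rain"}   # base model only does rain
analyze = lambda prompt, audio: sorted(needed - audio)
retrieve = lambda event: event                        # database returns a clip
audio, context = feedback_rag_tta("rain with thunder", generate, analyze, retrieve)
```

The loop terminates as soon as the analyzer reports nothing missing, so retrieval is only invoked for concepts the pre-trained generator actually failed on.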

[202] CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang, Haoyu Song, Ian Mcloughlin

Main category: cs.SD

TL;DR: CLARITY is a framework for instruction-guided TTS that addresses accent and linguistic biases through contextual linguistic adaptation and retrieval-augmented accent prompting to improve accent accuracy and fairness.

DetailsMotivation: Current TTS systems suffer from accent bias (defaulting to dominant phonetic patterns) and linguistic bias (misalignment in dialect-specific lexical/cultural information), which are interdependent and reduce perceived quality in authentic accent generation.

Method: CLARITY uses dual-signal optimization: 1) contextual linguistic adaptation to localize input text to target dialect, and 2) retrieval-augmented accent prompting (RAAP) to ensure accent-consistent speech prompts. The framework is backbone-agnostic.

Result: Evaluation on twelve varieties of English accents shows CLARITY improves accent accuracy and fairness, ensuring higher perceptual quality output through both subjective and objective analysis.

Conclusion: CLARITY effectively addresses coupled biases in TTS systems through its dual optimization approach, enabling more authentic and fair accent generation across diverse English dialects.

Abstract: Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default towards dominant phonetic patterns, and linguistic bias, a misalignment in dialect-specific lexical or cultural information. These biases are interdependent and authentic accent generation requires both accent fidelity and correctly localized text. We present CLARITY (Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis), a backbone-agnostic framework to address both biases through dual-signal optimization. Firstly, we apply contextual linguistic adaptation to localize input text to align with the target dialect. Secondly, we propose retrieval-augmented accent prompting (RAAP) to ensure accent-consistent speech prompts. We evaluate CLARITY on twelve varieties of English accent via both subjective and objective analysis. Results clearly indicate that CLARITY improves accent accuracy and fairness, ensuring higher perceptual quality output. Code and audio samples are available at https://github.com/ICT-SIT/CLARITY.
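
Retrieval-augmented accent prompting can be sketched as nearest-neighbour search over accent embeddings of a prompt bank. The 2-D coordinates and field names below are invented for illustration; a real system would use learned accent embeddings:

```python
def raap(target_accent, prompt_bank, embed):
    """Pick the reference speech prompt whose accent embedding lies closest
    to the target accent's embedding (squared Euclidean distance)."""
    target = embed(target_accent)
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(target, embed(item["accent"])))
    return min(prompt_bank, key=dist)

# invented 2-D accent coordinates; a real system would learn these embeddings
coords = {"en-GB": (0.0, 1.0), "en-IN": (1.0, 0.0), "en-SG": (0.9, 0.2)}
bank = [{"accent": a, "wav": f"{a}.wav"} for a in ("en-GB", "en-IN")]
best = raap("en-SG", bank, embed=lambda accent: coords[accent])
```

Retrieval ensures the speech prompt handed to the TTS backbone is accent-consistent even when no prompt for the exact target accent exists in the bank.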

cs.LG

[203] Near-Optimal Sample Complexity for Online Constrained MDPs

Chang Liu, Yunfan Li, Lin F. Yang

Main category: cs.LG

TL;DR: Model-based primal-dual algorithm for safe RL in CMDPs with theoretical guarantees for both relaxed and strict feasibility settings, matching lower bounds.

DetailsMotivation: Safety is critical in real-world RL applications like autonomous driving and robotics. Existing methods for Constrained Markov Decision Processes (CMDPs) often suffer from safety violations or high sample complexity.

Method: Proposes a model-based primal-dual algorithm that balances regret and constraint violations, using techniques from online RL and constrained optimization. Addresses two settings: relaxed feasibility (small violations allowed) and strict feasibility (zero violations).

Result: For relaxed feasibility: ε-optimal policy with ε-bounded violation with high probability using Õ(SAH³/ε²) episodes, matching unconstrained MDP lower bound. For strict feasibility: ε-optimal policy with zero violation with high probability using Õ(SAH⁵/ε²ζ²) episodes, matching CMDP generative model lower bound.

Conclusion: Learning CMDPs online is as easy as learning with a generative model and no more challenging than learning unconstrained MDPs when small violations are allowed.

Abstract: Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often suffer from significant safety violations or require a high sample complexity to generate near-optimal policies. We address two settings: relaxed feasibility, where small violations are allowed, and strict feasibility, where no violation is allowed. We propose a model-based primal-dual algorithm that balances regret and bounded constraint violations, drawing on techniques from online RL and constrained optimization. For relaxed feasibility, we prove that our algorithm returns an $\varepsilon$-optimal policy with $\varepsilon$-bounded violation with arbitrarily high probability, requiring $\tilde{O}\left(\frac{SAH^3}{\varepsilon^2}\right)$ learning episodes, matching the lower bound for unconstrained MDPs. For strict feasibility, we prove that our algorithm returns an $\varepsilon$-optimal policy with zero violation with arbitrarily high probability, requiring $\tilde{O}\left(\frac{SAH^5}{\varepsilon^2\zeta^2}\right)$ learning episodes, where $\zeta$ is the problem-dependent Slater constant characterizing the size of the feasible region. This result matches the lower bound for learning CMDPs with access to a generative model. Our results demonstrate that learning CMDPs in an online setting is as easy as learning with a generative model and is no more challenging than learning unconstrained MDPs when small violations are allowed.
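
The primal-dual mechanics behind such algorithms can be seen in a toy one-state CMDP: the primal player best-responds to the Lagrangian r(a) - lam * c(a), while the dual variable lam rises whenever the incurred cost exceeds the budget. The two actions and all constants below are illustrative, not from the paper:

```python
def primal_dual(actions, budget, lam=0.0, lr=0.05, steps=2000):
    """Primal step: best response to r(a) - lam * c(a). Dual step: projected
    gradient ascent on the constraint violation c(a) - budget."""
    avg_r = avg_c = 0.0
    for t in range(1, steps + 1):
        r, c = max(actions, key=lambda rc: rc[0] - lam * rc[1])
        lam = max(0.0, lam + lr * (c - budget))
        avg_r += (r - avg_r) / t   # running averages over the trajectory
        avg_c += (c - avg_c) / t
    return avg_r, avg_c, lam

# action = (reward, cost); the optimum mixes them to sit on the budget
avg_r, avg_c, lam = primal_dual([(1.0, 1.0), (0.5, 0.0)], budget=0.4)
```

The averaged trajectory converges to the optimal mixed policy (cost on the budget, reward above either pure action's constrained value), illustrating why regret and violation are the two quantities the theory bounds.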

[204] Hybrid Feature Learning with Time Series Embeddings for Equipment Anomaly Prediction

Takato Yasuno

Main category: cs.LG

TL;DR: Hybrid approach combining Granite TinyTimeMixer time series embeddings with statistical features for HVAC anomaly prediction, achieving high precision and low false positive rates.

DetailsMotivation: Pure deep learning approaches often fail to achieve sufficient accuracy for time series anomaly detection in real-world predictive maintenance applications, especially for HVAC equipment.

Method: Combines 64-dimensional time series embeddings from Granite TinyTimeMixer encoder (fine-tuned with LoRA) with 28-dimensional statistical features (trend, volatility, drawdown indicators), then uses LightGBM gradient boosting classifier for anomaly prediction.

Result: Achieved Precision of 91-95% and ROC-AUC of 0.995 for anomaly prediction at 30, 60, and 90-day horizons. Production-ready performance with false positive rate ≤1.1% and detection rate of 88-94% on 64 equipment units with 51,564 samples.

Conclusion: Practical anomaly detection systems can be realized by leveraging complementary strengths between deep learning’s representation learning capabilities and statistical feature engineering for predictive maintenance.

Abstract: In predictive maintenance of equipment, deep learning-based time series anomaly detection has garnered significant attention; however, pure deep learning approaches often fail to achieve sufficient accuracy on real-world data. This study proposes a hybrid approach that integrates 64-dimensional time series embeddings from Granite TinyTimeMixer with 28-dimensional statistical features based on domain knowledge for HVAC equipment anomaly prediction tasks. Specifically, we combine time series embeddings extracted from a Granite TinyTimeMixer encoder fine-tuned with LoRA (Low-Rank Adaptation) and 28 types of statistical features including trend, volatility, and drawdown indicators, which are then learned using a LightGBM gradient boosting classifier. In experiments using 64 equipment units and 51,564 samples, we achieved Precision of 91–95% and ROC-AUC of 0.995 for anomaly prediction at 30-day, 60-day, and 90-day horizons. Furthermore, we achieved production-ready performance with a false positive rate of 1.1% or less and a detection rate of 88–94%, demonstrating the effectiveness of the system for predictive maintenance applications. This work demonstrates that practical anomaly detection systems can be realized by leveraging the complementary strengths between deep learning’s representation learning capabilities and statistical feature engineering.
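
The hybrid feature construction is straightforward to sketch: hand-crafted statistics are computed from the raw series and concatenated with the learned embedding before the classifier. The three statistics below are an illustrative subset of the 28, and the embedding is a stand-in for the TinyTimeMixer output:

```python
import statistics

def stat_features(series):
    """Illustrative subset of the hand-crafted indicators: least-squares
    trend slope, volatility of first differences, and maximum drawdown."""
    n = len(series)
    x_mean, y_mean = (n - 1) / 2, statistics.fmean(series)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    diffs = [b - a for a, b in zip(series, series[1:])]
    peak, drawdown = series[0], 0.0
    for y in series:
        peak = max(peak, y)
        drawdown = max(drawdown, peak - y)
    return [slope, statistics.pstdev(diffs), drawdown]

embedding = [0.1, -0.2, 0.3, 0.05]            # stand-in for the TTM embedding
series = [1.0, 2.0, 3.0, 2.5, 4.0]
features = embedding + stat_features(series)  # hybrid input to the classifier
```

The concatenated vector is what a gradient-boosted classifier such as LightGBM would consume; the learned and hand-crafted halves carry complementary information.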

[205] PolyNODE: Variable-dimension Neural ODEs on M-polyfolds

Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund

Main category: cs.LG

TL;DR: PolyNODEs extend neural ODEs to variable-dimensional spaces (M-polyfolds), enabling flow-based models that can handle varying dimensions, with applications to autoencoders with dimensional bottlenecks.

DetailsMotivation: Existing neural ODE models are constrained to fixed-dimensional dynamics due to manifold dimension limitations, preventing their application to variable-dimensional data and spaces with dimensional bottlenecks.

Method: Extend NODEs to M-polyfolds (spaces accommodating varying dimensions with differentiability), introduce PolyNODEs as variable-dimensional flow-based models, construct explicit M-polyfolds with dimensional bottlenecks, and develop PolyNODE autoencoders using parametrized vector fields.

Result: PolyNODE models can be trained to solve reconstruction tasks in variable-dimensional spaces, extract latent representations from input data, and use these representations for downstream classification tasks.

Conclusion: PolyNODEs represent the first variable-dimensional flow-based model in geometric deep learning, successfully extending NODEs beyond fixed-dimensional constraints while maintaining differentiability and enabling practical applications.

Abstract: Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed-dimensional dynamics by the intrinsic nature of the manifold’s dimension. In this paper, we extend NODEs to M-polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first variable-dimensional flow-based model in geometric deep learning. As an example application, we construct explicit M-polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted and used to solve downstream classification tasks. The code used in our experiments is publicly available at https://github.com/turbotage/PolyNODE .

[206] Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields

Tianyu Xiong, Skylar Wurster, Han-Wei Shen

Main category: cs.LG

TL;DR: DRR-Net: A decoupled representation refinement paradigm for Implicit Neural Representations that achieves high fidelity with fast inference by encoding rich representations into compact embeddings offline.

DetailsMotivation: INRs face a fidelity-speed dilemma where deep MLPs have high inference cost while embedding-based models lack expressiveness, limiting their practical use as surrogates for large 3D scientific simulations.

Method: Proposes Decoupled Representation Refinement (DRR) paradigm using a deep refiner network and non-parametric transformations in offline process to encode rich representations into compact embeddings, decoupling slow neural networks from fast inference path. Introduces DRR-Net and Variational Pairs data augmentation for complex tasks.

Result: Achieves state-of-the-art fidelity while being up to 27× faster at inference than high-fidelity baselines, remaining competitive with fastest models on ensemble simulation datasets.

Conclusion: DRR paradigm offers effective strategy for building powerful and practical neural field surrogates with minimal compromise between speed and quality, applicable to broader INR applications.

Abstract: Implicit Neural Representations (INRs) have emerged as promising surrogates for large 3D scientific simulations due to their ability to continuously model spatial and conditional fields, yet they face a critical fidelity-speed dilemma: deep MLPs suffer from high inference cost, while efficient embedding-based models lack sufficient expressiveness. To resolve this, we propose the Decoupled Representation Refinement (DRR) architectural paradigm. DRR leverages a deep refiner network, alongside non-parametric transformations, in a one-time offline process to encode rich representations into a compact and efficient embedding structure. This approach decouples slow neural networks with high representational capacity from the fast inference path. We introduce DRR-Net, a simple network that validates this paradigm, and a novel data augmentation strategy, Variational Pairs (VP) for improving INRs under complex tasks like high-dimensional surrogate modeling. Experiments on several ensemble simulation datasets demonstrate that our approach achieves state-of-the-art fidelity, while being up to 27$\times$ faster at inference than high-fidelity baselines and remaining competitive with the fastest models. The DRR paradigm offers an effective strategy for building powerful and practical neural field surrogates and INRs in broader applications, with a minimal compromise between speed and quality.
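
The decoupling idea can be caricatured in one dimension: a stand-in deep refiner rewrites a compact embedding grid once, offline, and inference then reduces to cheap interpolation into the refined grid with no deep network in the loop. All names and the toy refiner are illustrative, not the DRR-Net architecture:

```python
def refine_offline(coarse_grid, refiner, rounds=1):
    """One-time offline pass: the deep refiner rewrites the compact grid,
    then plays no further role at inference time."""
    grid = list(coarse_grid)
    for _ in range(rounds):
        grid = [refiner(v) for v in grid]
    return grid

def query(grid, x):
    """Fast inference path: linear interpolation into the refined grid."""
    pos = x * (len(grid) - 1)
    i = min(int(pos), len(grid) - 2)
    frac = pos - i
    return grid[i] * (1.0 - frac) + grid[i + 1] * frac

grid = refine_offline([0.0, 1.0, 4.0, 9.0], refiner=lambda v: v + 1.0)
y = query(grid, 0.5)  # lands halfway between refined cells 1 and 2
```

The refiner's cost is amortized over every subsequent query, which is where the reported inference speedup comes from.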

[207] Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding

Xiao Xiang, David Restrepo, Hyewon Jeong, Yugang Jia, Leo Anthony Celi

Main category: cs.LG

TL;DR: AID-MAE is a self-supervised method for learning from incomplete EHR time series using dual masking (intrinsic for natural missingness, augmented for reconstruction) that outperforms baselines on clinical tasks.

DetailsMotivation: Learning from EHR time series is challenging due to irregular sampling, heterogeneous missingness, and sparse observations. Existing methods either impute before learning, represent missingness as input, or focus only on imputation, limiting their ability to learn effective representations for clinical downstream tasks.

Method: Proposes Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE) that learns directly from incomplete time series using two masks: intrinsic mask for naturally missing values, and augmented mask that hides a subset of observed values for reconstruction. Only processes unmasked tokens during training.

Result: AID-MAE consistently outperforms strong baselines including XGBoost and DuETT across multiple clinical tasks on two datasets. Learned embeddings naturally stratify patient cohorts in representation space.

Conclusion: AID-MAE effectively learns representations from incomplete EHR time series through dual masking, demonstrating superior performance on clinical tasks and meaningful patient stratification.

Abstract: Learning from electronic health records (EHRs) time series is challenging due to irregular sampling, heterogeneous missingness, and the resulting sparsity of observations. Prior self-supervised methods either impute before learning, represent missingness through a dedicated input signal, or optimize solely for imputation, reducing their capacity to efficiently learn representations that support clinical downstream tasks. We propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), which learns directly from incomplete time series by applying an intrinsic missing mask to represent naturally missing values and an augmented mask that hides a subset of observed values for reconstruction during training. AID-MAE processes only the unmasked subset of tokens and consistently outperforms strong baselines, including XGBoost and DuETT, across multiple clinical tasks on two datasets. In addition, the learned embeddings naturally stratify patient cohorts in the representation space.
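
The dual-masking scheme can be sketched directly: the intrinsic mask marks entries that are naturally missing (here None), while the augmented mask hides a random subset of the observed entries to serve as reconstruction targets; the encoder sees only the remaining tokens. The masking ratio and all names are illustrative:

```python
import random

def dual_mask(values, aug_ratio=0.3, seed=0):
    """Intrinsic mask: entries that are naturally missing (None). Augmented
    mask: a random subset of observed entries hidden for reconstruction."""
    rng = random.Random(seed)
    intrinsic = [v is None for v in values]
    observed = [i for i, miss in enumerate(intrinsic) if not miss]
    hidden = set(rng.sample(observed, max(1, int(aug_ratio * len(observed)))))
    augmented = [i in hidden for i in range(len(values))]
    visible = [v for i, v in enumerate(values)
               if not intrinsic[i] and not augmented[i]]  # what the encoder sees
    targets = [values[i] for i in sorted(hidden)]         # reconstruction targets
    return visible, targets, intrinsic, augmented

vals = [0.2, None, 0.5, 0.9, None, 0.4, 0.7, 0.1]
visible, targets, intrinsic, augmented = dual_mask(vals)
```

Keeping the two masks disjoint is the key point: the model is never asked to reconstruct values that were never observed, only values deliberately hidden from it.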

[208] Neural Network-Based Parameter Estimation of a Labour Market Agent-Based Model

M Lopes Alves, Joel Dyer, Doyne Farmer, Michael Wooldridge, Anisoara Calinescu

Main category: cs.LG

TL;DR: A simulation-based inference framework using neural networks is applied to parameter estimation in large-scale agent-based models, specifically a labor market ABM, showing improved efficiency over traditional Bayesian methods.

DetailsMotivation: Parameter estimation in large-scale agent-based models is computationally challenging, limiting their use as decision-support tools. Traditional methods struggle with exploring the high-dimensional parameter space efficiently.

Method: Uses a simulation-based inference (SBI) framework with neural networks for parameter estimation. Applied to a labor market ABM based on job transition networks, tested with both synthetic datasets and real U.S. labor market data. Compares NN-learned summary statistics with traditional statistical measures.
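The inference loop can be illustrated with a toy rejection-style sampler; note the paper uses a neural SBI framework rather than rejection sampling, and the simulator and summary statistics below are stand-ins, not the labour-market ABM:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, n=200, rng=rng):
    # Stand-in for the ABM: a toy process whose rate is controlled by theta.
    return rng.poisson(lam=theta, size=n)

def summary(x):
    # Hand-crafted summary statistics; the paper also compares such statistics
    # with ones learned by an embedded neural network.
    return np.array([x.mean(), x.var()])

theta_true = 3.0
s_obs = summary(simulate(theta_true))

# Keep prior draws whose simulated summaries land close to the observed ones.
prior_draws = rng.uniform(0.5, 8.0, size=5000)
accepted = [t for t in prior_draws
            if np.linalg.norm(summary(simulate(t)) - s_obs) < 0.5]
posterior_mean = float(np.mean(accepted))
```

The accepted draws approximate the posterior over the parameter; a neural SBI method replaces this wasteful accept/reject step with a learned posterior estimator.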

Result: The neural network-based approach successfully recovers original parameters when evaluating posterior distributions across various dataset scales and demonstrates improved computational efficiency compared to traditional Bayesian methods.

Conclusion: Neural network-based simulation-based inference provides an effective solution for parameter estimation in large-scale agent-based models, overcoming computational constraints and improving efficiency for decision-support applications.

Abstract: Agent-based modelling (ABM) is a widespread approach to simulate complex systems. Advancements in computational processing and storage have facilitated the adoption of ABMs across many fields; however, ABMs face challenges that limit their use as decision-support tools. A significant issue is parameter estimation in large-scale ABMs, particularly due to computational constraints on exploring the parameter space. This study evaluates a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NN) for parameter estimation. This framework is applied to an established labour market ABM based on job transition networks. The ABM is initiated with synthetic datasets and the real U.S. labour market. Next, we compare the effectiveness of summary statistics derived from a list of statistical measures with that learned by an embedded NN. The results demonstrate that the NN-based approach recovers the original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.

[209] Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte

Main category: cs.LG

TL;DR: VLMs outperform their underlying LLMs on text-only tasks, especially long-context retrieval, due to visual training disrupting positional shortcuts and forcing more robust symbolic binding mechanisms.

DetailsMotivation: To investigate the surprising phenomenon that Vision Language Models (VLMs) can outperform their underlying Large Language Models (LLMs) on purely text-only tasks, particularly in long-context information retrieval.

Method: Built a controlled synthetic retrieval task, compared transformer performance with text-only training vs. image-tokenized training, used mechanistic interpretability to analyze internal binding strategies, and characterized variations across training regimes, visual encoders, and initializations.

Result: Visual training nearly doubles text-only out-of-distribution performance compared to text-only training; mechanistic analysis reveals visual training disrupts positional shortcuts through spatial translation invariance, forcing more robust symbolic binding that persists even after text-only examples are reintroduced.

Conclusion: Cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality, suggesting visual training fundamentally changes model binding strategies in ways that improve text-only performance.

Abstract: Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model’s internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.

[210] A XAI-based Framework for Frequency Subband Characterization of Cough Spectrograms in Chronic Respiratory Disease

Patricia Amado-Caballero, Luis M. San-José-Revuelta, Xinheng Wang, José Ramón Garmendia-Leiza, Carlos Alberola-López, Pablo Casaseca-de-la-Higuera

Main category: cs.LG

TL;DR: XAI framework using CNN and occlusion maps to analyze cough sound spectrograms for COPD diagnosis, identifying disease-specific spectral patterns across frequency subbands.

DetailsMotivation: To develop an explainable AI approach for analyzing cough sounds in chronic respiratory diseases, particularly COPD, to identify interpretable spectral markers for improved diagnosis and understanding of pathophysiological characteristics.

Method: Train CNN on time-frequency representations of cough signals, use occlusion maps to identify diagnostically relevant spectrogram regions, decompose into five frequency subbands for targeted spectral feature extraction and analysis.
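Occlusion maps themselves are model-agnostic; a minimal sketch follows, with a toy scoring function standing in for the trained CNN:

```python
import numpy as np

def occlusion_map(predict, spec, patch=4, baseline=0.0):
    # Slide a patch over the spectrogram, replace it with a baseline value,
    # and record how much the model's score drops for each position.
    h, w = spec.shape
    ref = predict(spec)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = spec.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            heat[i // patch, j // patch] = ref - predict(occluded)
    return heat

# Toy "CNN": the score is the energy in the lowest-frequency rows (0-3),
# so occluding that band should dominate the relevance map.
rng = np.random.default_rng(0)
spec = rng.random((16, 16))
score = lambda s: s[:4].sum()
heat = occlusion_map(score, spec)
```

High-heat regions are then the spectrogram areas decomposed into frequency subbands for targeted feature extraction.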

Result: Spectral patterns differ across subbands and disease groups, revealing complementary and compensatory trends; approach distinguishes COPD from other respiratory conditions and chronic from non-chronic groups based on interpretable spectral markers.

Conclusion: XAI-enhanced frequency-resolved analysis provides valuable insights into cough acoustics pathophysiology and demonstrates translational potential for respiratory disease diagnostics through interpretable spectral markers.

Abstract: This paper presents an explainable artificial intelligence (XAI)-based framework for the spectral analysis of cough sounds associated with chronic respiratory diseases, with a particular focus on Chronic Obstructive Pulmonary Disease (COPD). A Convolutional Neural Network (CNN) is trained on time-frequency representations of cough signals, and occlusion maps are used to identify diagnostically relevant regions within the spectrograms. These highlighted areas are subsequently decomposed into five frequency subbands, enabling targeted spectral feature extraction and analysis. The results reveal that spectral patterns differ across subbands and disease groups, uncovering complementary and compensatory trends across the frequency spectrum. Notably, the approach distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, based on interpretable spectral markers. These findings provide insight into the underlying pathophysiological characteristics of cough acoustics and demonstrate the value of frequency-resolved, XAI-enhanced analysis for biomedical signal interpretation and translational respiratory disease diagnostics.

[211] Learning Data-Efficient and Generalizable Neural Operators via Fundamental Physics Knowledge

Siying Ma, Mehrdad M. Zadeh, Mauricio Soroco, Wuyang Chen, Jiguo Cao, Vijay Ganesh

Main category: cs.LG

TL;DR: Multiphysics training framework for neural operators that jointly learns from original PDEs and their simplified forms to improve generalization and data efficiency.

DetailsMotivation: Existing neural operator approaches focus on learning simulations from target PDEs but overlook fundamental physical principles underlying these equations, limiting generalization ability.

Method: Proposes a multiphysics training framework that jointly trains neural operators on both original PDEs and their simplified basic forms, making it architecture-agnostic and compatible with various neural operator models.
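The joint objective can be sketched as a weighted sum of losses over the two data sources; the weighting scheme and loss form here are assumptions for illustration, not the paper's exact objective:

```python
import numpy as np

def joint_loss(model, batch_target, batch_basic, w_basic=0.5):
    # One neural operator, two data sources: the target PDE and its simplified
    # basic form, with the auxiliary term weighted by w_basic.
    def mse(batch):
        inputs, targets = batch
        return np.mean((model(inputs) - targets) ** 2)
    return mse(batch_target) + w_basic * mse(batch_basic)

# Toy linear "operator" and synthetic snapshots for both PDE settings.
model = lambda u: 0.9 * u
u = np.linspace(0.0, 1.0, 50)
loss = joint_loss(model, (u, u), (u, 2.0 * u))
```

Because the scheme only changes the training objective, it is compatible with any neural operator architecture, matching the paper's architecture-agnostic claim.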

Result: The framework enhances data efficiency, reduces predictive errors, and improves out-of-distribution generalization across 1D/2D/3D PDE problems, particularly for physical parameter shifts and synthetic-to-real transfer.

Conclusion: Explicit incorporation of fundamental physics knowledge significantly strengthens the generalization ability of neural operators for scientific machine learning applications.

Abstract: Recent advances in scientific machine learning (SciML) have enabled neural operators (NOs) to serve as powerful surrogates for modeling the dynamic evolution of physical systems governed by partial differential equations (PDEs). While existing approaches focus primarily on learning simulations from the target PDE, they often overlook more fundamental physical principles underlying these equations. Inspired by how numerical solvers are compatible with simulations of different settings of PDEs, we propose a multiphysics training framework that jointly learns from both the original PDEs and their simplified basic forms. Our framework enhances data efficiency, reduces predictive errors, and improves out-of-distribution (OOD) generalization, particularly in scenarios involving shifts of physical parameters and synthetic-to-real transfer. Our method is architecture-agnostic and demonstrates consistent improvements in normalized root mean square error (nRMSE) across a wide range of 1D/2D/3D PDE problems. Through extensive experiments, we show that explicit incorporation of fundamental physics knowledge significantly strengthens the generalization ability of neural operators. We will release models and codes at https://sites.google.com/view/sciml-fundemental-pde.

[212] COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Baher Mohammad, Stamatios Lefkimmiatis

Main category: cs.LG

TL;DR: COMPOT is a training-free compression framework for Transformers that uses sparse dictionary learning with orthogonal dictionaries and Procrustes updates, achieving better compression-accuracy trade-offs than low-rank methods.

DetailsMotivation: Existing Transformer compression methods like truncated SVD enforce a single shared subspace which can degrade accuracy, while sparse dictionary learning approaches suffer from iterative optimization problems. There's a need for more flexible, training-free compression that maintains accuracy.

Method: COMPOT uses a small calibration dataset to estimate sparse weight factorization with orthogonal dictionaries, enabling closed-form Procrustes updates for dictionaries and analytical single-step sparse coding for coefficients. It also includes a one-shot dynamic allocation strategy to adaptively redistribute layer-wise compression rates based on sensitivity.
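The closed-form dictionary update is the classical orthogonal Procrustes solution; a sketch under that reading (not the released code): given weights W and sparse coefficients C, the orthogonal dictionary minimizing ||W - DC||_F is D = UV^T, where USV^T is the SVD of WC^T.

```python
import numpy as np

def procrustes_update(W, C):
    # Orthogonal Procrustes solution: D = U V^T from the SVD of W C^T gives
    # the dictionary with orthonormal columns minimizing ||W - D C||_F.
    U, _, Vt = np.linalg.svd(W @ C.T)
    k = C.shape[0]
    return U[:, :k] @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 20))   # a weight matrix to factorize
C = rng.standard_normal((4, 20))   # sparse coefficients (dense here for brevity)
D = procrustes_update(W, C)
```

Orthonormal columns are what make the subsequent sparse-coding step analytical rather than iterative.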

Result: Extensive experiments show COMPOT consistently delivers superior quality-compression trade-offs over strong low-rank and sparse baselines across diverse architectures and tasks, while remaining fully compatible with post-training quantization for extreme compression.

Conclusion: COMPOT provides an effective training-free compression framework for Transformers that combines the flexibility of sparse dictionary learning with efficient optimization through orthogonal dictionaries and Procrustes updates, outperforming existing methods.

Abstract: Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.

[213] MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference

Raphaël Baur, Yannick Metz, Maria Gkoulta, Mennatallah El-Assady, Giorgia Ramponi, Thomas Kleine Buening

Main category: cs.LG

TL;DR: Bayesian approach for joint reward learning from heterogeneous feedback types (demonstrations, comparisons, ratings, stops) using amortized variational inference with shared reward encoder and feedback-specific likelihood decoders.

DetailsMotivation: Current reward learning methods either use single feedback types or manually combine multiple types with weighted loss terms, lacking principled approaches for heterogeneous feedback that provide qualitatively different signals.

Method: Formulates reward learning as Bayesian inference over shared latent reward function, with each feedback type contributing through explicit likelihood. Uses scalable amortized variational inference with shared reward encoder and feedback-specific likelihood decoders, trained by optimizing single evidence lower bound.
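A shape-level sketch of the single-sample ELBO with one decoder per feedback type; the linear encoder and toy likelihoods below are illustrative assumptions, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(features, W_mu, W_logvar, decoders, rng=rng):
    # Shared "encoder": Gaussian posterior over a latent reward vector.
    mu, logvar = features @ W_mu, features @ W_logvar
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
    # One log-likelihood term per feedback type, all sharing the same sample z.
    log_lik = sum(dec(z) for dec in decoders.values())
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)  # KL to N(0, I)
    return log_lik - kl

d, k = 6, 2
W_mu = rng.standard_normal((d, k))
W_logvar = -2.0 + 0.1 * rng.standard_normal((d, k))
features = rng.standard_normal((1, d))

decoders = {
    # Bradley-Terry likelihood for a comparison: segment 0 preferred over 1.
    "comparison": lambda z: float(-np.log1p(np.exp(-(z[0, 0] - z[0, 1])))),
    # Gaussian likelihood for a scalar rating of 0.5 (up to a constant).
    "rating": lambda z: float(-0.5 * (z[0, 0] - 0.5) ** 2),
}
value = elbo(features, W_mu, W_logvar, decoders)
```

All feedback types push gradients into the same encoder through a single bound, which is what removes the need for manual loss balancing.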

Result: Jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, yield more robust policies to environment perturbations, and provide interpretable uncertainty signals for model confidence analysis.

Conclusion: Bayesian framework enables principled joint learning from heterogeneous feedback without manual loss balancing, improving performance and robustness while providing uncertainty quantification for interpretability.

Abstract: Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.

[214] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

DatologyAI, Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Main category: cs.LG

TL;DR: Targeted per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling, showing that data quality improvements benefit all languages in multilingual models.

DetailsMotivation: Multilingual foundation models face challenges from uneven data availability across languages and performance interference from joint training (the "curse of multilinguality"). The paper investigates whether these issues stem from fundamental capacity limits or correctable data quality deficiencies.

Method: Studied multilingual data curation across 13 languages through controlled bilingual experiments. Examined how improving data quality for one language affects others, and developed bespoke per-language curation. Applied findings to large-scale training with curated multilingual allocations comprising under 8% of total tokens. Created a 20T-token pretraining corpus from public sources.

Result: Curating English improved non-English performance in 12 of 13 languages, while curating non-English improved English. Bespoke per-language curation produced substantially larger within-language improvements. Models trained on curated allocations achieved competitive multilingual accuracy with 4-10x fewer training FLOPs than baselines. The 20T-token corpus contributed to Trinity Large (400B/A13B) showing strong multilingual performance relative to training FLOPs.

Conclusion: Targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling, establishing a new Pareto frontier in multilingual performance versus compute. Data quality improvements are not zero-sum but benefit all languages in multilingual models.

Abstract: Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the “curse of multilinguality”. We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.

[215] Automatically Finding Reward Model Biases

Atticus Wang, Iván Arcuschin, Arthur Conmy

Main category: cs.LG

TL;DR: A method using LLMs to automatically discover biases in reward models, identifying issues like favoring redundant spacing and hallucinated content in Skywork-V2-8B reward model.

DetailsMotivation: Reward models in LLM post-training often reward undesirable attributes like length, format, hallucinations, and sycophancy, creating a need for automated methods to identify these biases.

Method: Using an LLM to iteratively propose and refine candidate biases through evolutionary iteration, which outperforms flat best-of-N search, with validation using synthetically injected biases.
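The propose-and-refine loop can be caricatured with numeric stand-ins for both the LLM proposer and the reward-model scorer; everything below is a toy, not the paper's pipeline:

```python
import random

random.seed(0)

# An LLM would propose and refine textual bias descriptions, and the reward
# model would be probed to score them; here a scalar "bias strength" and a
# distance-based score play those roles.
TRUE_BIAS = 0.8

def score(candidate):
    return -abs(candidate - TRUE_BIAS)  # higher = closer to the RM's real bias

def mutate(candidate):
    return candidate + random.uniform(-0.1, 0.1)  # "refine" a candidate

def evolve(generations=30, population=8):
    pool = [random.random() for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        survivors = pool[: population // 2]                 # keep the best half
        pool = survivors + [mutate(c) for c in survivors]   # refine survivors
    return max(pool, key=score)

best = evolve()
```

The selection-plus-refinement structure is what the paper finds outperforms flat best-of-N proposal.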

Result: The method successfully recovered known biases and discovered novel ones, including Skywork-V2-8B’s tendency to favor responses with redundant spacing and hallucinated content.

Conclusion: The work contributes to improving reward models through automated interpretability methods, showing that evolutionary iteration is effective for bias discovery.

Abstract: Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attributes such as length, format, hallucinations, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a simple approach of using an LLM to iteratively propose and refine candidate biases. Our method can recover known biases and surface novel ones: for example, we found that Skywork-V2-8B, a leading open-weight reward model, often mistakenly favors responses with redundant spacing and responses with hallucinated content. In addition, we show evidence that evolutionary iteration outperforms flat best-of-N search, and we validate the recall of our pipeline using synthetically injected biases. We hope our work contributes to further research on improving RMs through automated interpretability methods.

[216] tensorFM: Low-Rank Approximations of Cross-Order Feature Interactions

Alessio Mazzetto, Mohammad Mahdi Khalili, Laura Fee Nern, Michael Viderman, Alex Shtoff, Krzysztof Dembczyński

Main category: cs.LG

TL;DR: TensorFM: A low-rank tensor factorization model for capturing high-order interactions in tabular categorical data, generalizing field-weighted factorization machines with competitive performance and low latency.

DetailsMotivation: Many practical applications like click-through rate prediction and social sciences involve tabular categorical data with multiple attributes. Existing models need better ways to capture high-order interactions between attributes efficiently while maintaining low latency for time-sensitive applications.

Method: Introduces tensorFM, which uses low-rank tensor approximation to represent the strength of high-order interactions between categorical attributes. The model generalizes field-weighted factorization machines by efficiently capturing complex attribute interactions through tensor factorization.
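The computational payoff of a low-rank (CP) tensor of interaction strengths can be checked directly: contracting the CP factors with the features costs O(n·rank) per mode yet matches the full O(n³) tensor contraction. This is a generic CP sketch, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 5, 2

# CP factors of a third-order interaction-strength tensor:
# R[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f].
A, B, C = (rng.standard_normal((n, rank)) for _ in range(3))
R = np.einsum('if,jf,kf->ijk', A, B, C)

x = rng.standard_normal(n)  # one (scalar) feature value per field
# Contract factor-by-factor, never materializing the n^3 tensor ...
fast = float(((A.T @ x) * (B.T @ x) * (C.T @ x)).sum())
# ... which equals the full tensor contraction.
full = float(np.einsum('ijk,i,j,k->', R, x, x, x))
```

This factorized contraction is what keeps inference latency low enough for time-sensitive applications.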

Result: TensorFM demonstrates competitive performance with state-of-the-art methods while maintaining low latency, making it suitable for time-sensitive applications like online advertising.

Conclusion: TensorFM provides an effective and efficient approach for modeling high-order interactions in tabular categorical data, offering practical advantages for real-world applications requiring both accuracy and speed.

Abstract: We address prediction problems on tabular categorical data, where each instance is defined by multiple categorical attributes, each taking values from a finite set. These attributes are often referred to as fields, and their categorical values as features. Such problems frequently arise in practical applications, including click-through rate prediction and social sciences. We introduce and analyze tensorFM, a new model that efficiently captures high-order interactions between attributes via a low-rank tensor approximation representing the strength of these interactions. Our model generalizes field-weighted factorization machines. Empirically, tensorFM demonstrates competitive performance with state-of-the-art methods. Additionally, its low latency makes it well-suited for time-sensitive applications, such as online advertising.

[217] BindCLIP: A Unified Contrastive-Generative Representation Learning Framework for Virtual Screening

Anjie Qiao, Zhen Wang, Yaliang Li, Jiahua Rao, Yuedong Yang

Main category: cs.LG

TL;DR: BindCLIP improves virtual screening by combining contrastive learning with pose-conditioned diffusion for more interaction-aware embeddings, addressing limitations of previous CLIP-style models.

DetailsMotivation: Current CLIP-style models for virtual screening (like DrugCLIP) have limitations: their representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to accurately rank ligands by true binding compatibility.

Method: BindCLIP uses a unified contrastive-generative framework that jointly trains pocket and ligand encoders with CLIP-style contrastive learning plus a pocket-conditioned diffusion objective for binding pose generation. It also introduces hard-negative augmentation and a ligand-ligand anchoring regularizer to prevent representation collapse and mitigate shortcut reliance.
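The contrastive half of the objective is standard symmetric InfoNCE; a sketch of just that part (the pocket-conditioned diffusion objective and hard-negative augmentation are not shown):

```python
import numpy as np

def clip_loss(pocket_emb, ligand_emb, tau=0.1):
    # Symmetric InfoNCE: matched pocket/ligand pairs sit on the diagonal.
    p = pocket_emb / np.linalg.norm(pocket_emb, axis=1, keepdims=True)
    q = ligand_emb / np.linalg.norm(ligand_emb, axis=1, keepdims=True)
    logits = p @ q.T / tau

    def xent(lg):  # row-wise cross-entropy against the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

e = np.eye(4)  # toy embeddings: pocket i matches ligand i
loss_matched = clip_loss(e, e)
loss_shuffled = clip_loss(e, np.roll(e, 1, axis=0))
```

BindCLIP's contribution is to add pose-level generative supervision on top of this retrieval loss so the embedding space reflects binding interactions rather than shortcut correlations.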

Result: BindCLIP shows consistent improvements over strong baselines on two public benchmarks, achieves substantial gains on challenging out-of-distribution virtual screening, and improves ligand-analogue ranking on the FEP+ benchmark.

Conclusion: Integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, advancing virtual screening toward real-world applicability.

Abstract: Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.

[218] Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: Distributional Adversarial Training (DAT) improves LLM robustness by using Diffusion LLMs to approximate the true data distribution and generate diverse adversarial samples for training.

DetailsMotivation: Current adversarial training methods for LLMs fail to cover the full data distribution, leaving models vulnerable to simple in-distribution exploits like tense changes or translations. This persistent fragility stems from inadequate coverage of the data distribution during training.

Method: Proposes Distributional Adversarial Training (DAT) that leverages Diffusion LLMs to approximate the true joint distribution of prompts and responses. This enables generation of diverse, high-likelihood samples that address generalization failures. Combines optimization over the data distribution provided by the diffusion model with continuous adversarial training.

Result: DAT achieves substantially higher adversarial robustness than previous methods against various attacks including simple in-distribution exploits.

Conclusion: By better approximating the true data distribution during adversarial training, DAT significantly improves LLM robustness against diverse attacks, addressing a fundamental limitation in current adversarial training approaches.

Abstract: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.

[219] Size Transferability of Graph Transformers with Convolutional Positional Encodings

Javier Porras-Valenzuela, Zhiyang Wang, Alejandro Ribeiro

Main category: cs.LG

TL;DR: Graph Transformers with GNN positional encodings inherit transferability guarantees from GNNs, enabling training on small graphs and generalization to larger ones with theoretical support from manifold convergence.

DetailsMotivation: To understand Graph Transformers (GTs) with GNN-based positional encodings through theoretical lens and establish their transferability properties, enabling efficient training on small graphs for generalization to larger graphs.

Method: Analyze GTs through manifold limit models for graph sequences, establish theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs), and leverage transferability results for GNNs under manifold convergence.
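A minimal sketch of one such layer: a toy message-passing encoding added to the node features before plain self-attention (illustrative shapes only, not a specific model from the paper):

```python
import numpy as np

def gnn_positional_encoding(adj, x, hops=2):
    # Toy GNN: repeated mean-neighbor aggregation injects graph structure.
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    h = x
    for _ in range(hops):
        h = (adj @ h) / deg
    return h

def gt_layer(adj, x):
    # Add the GNN-derived encoding to the node features, then apply plain
    # softmax self-attention over all nodes.
    h = x + gnn_positional_encoding(adj, x)
    scores = h @ h.T / np.sqrt(h.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ h

# 4-node path graph with one-hot node features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
out = gt_layer(adj, np.eye(4))
```

Because the only graph-dependent component is the GNN encoding, the layer inherits whatever transferability guarantees the encoding satisfies, which is the structure the paper's theory exploits.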

Result: GTs inherit transferability guarantees from their positional encodings, allowing provable generalization from small to large graphs under mild assumptions. Experiments show GTs exhibit scalable behavior comparable to GNNs, with practical demonstration on shortest path distance estimation over terrains.

Conclusion: GTs with GNN positional encodings are theoretically grounded and practically efficient, offering new insights for understanding GTs and enabling efficient training in large-scale settings through transferability properties.

Abstract: Transformers have achieved remarkable success across domains, motivating the rise of Graph Transformers (GTs) as attention-based architectures for graph-structured data. A key design choice in GTs is the use of Graph Neural Network (GNN)-based positional encodings to incorporate structural information. In this work, we study GTs through the lens of manifold limit models for graph sequences and establish a theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs). Building on transferability results for GNNs under manifold convergence, we show that GTs inherit transferability guarantees from their positional encodings. In particular, GTs trained on small graphs provably generalize to larger graphs under mild assumptions. We complement our theory with extensive experiments on standard graph benchmarks, demonstrating that GTs exhibit scalable behavior on par with GNNs. To further demonstrate efficiency in a real-world scenario, we implement GTs for shortest path distance estimation over terrains, illustrating the practical benefit of transferable GTs. Our results provide new insights into the understanding of GTs and suggest practical directions for efficient training of GTs in large-scale settings.
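The GNN-based positional encodings at the center of the analysis can be illustrated by simple feature propagation through a normalized adjacency matrix. This is a generic sketch of the recipe (stacking multi-hop propagated features as per-node encodings), not the paper's exact encoder:

```python
import numpy as np

def gnn_positional_encoding(A, X, hops=3):
    """Illustrative GNN-style positional encoding: stack node features
    propagated through powers of the symmetrically normalized adjacency.
    A generic sketch, not the paper's specific encoder."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}
    feats = [X]
    for _ in range(hops):
        feats.append(A_hat @ feats[-1])
    return np.concatenate(feats, axis=1)  # shape (n, (hops + 1) * d)

# 4-node path graph; the feature is an indicator of node 0
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
X = np.eye(4)[:, :1]
pe = gnn_positional_encoding(A, X, hops=2)
```

The resulting `pe` rows would then be added to (or concatenated with) the token inputs of the transformer, which is how structural information enters the attention layers.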

[220] Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Ihor Kendiukhov

Main category: cs.LG

TL;DR: First systematic study of neural scaling laws for transformers in single-cell genomics shows power-law scaling emerges with sufficient data but not in data-limited regimes.

DetailsMotivation: While neural scaling laws have been extensively documented for language and vision transformers, their existence in single-cell genomics remains largely unexplored. The paper aims to investigate whether similar scaling behaviors emerge in transformers trained on single-cell RNA sequencing data.

Method: Used masked-reconstruction transformers trained on single-cell RNA sequencing data from CELLxGENE Census. Created two experimental regimes: data-rich (512 genes, 200,000 cells) and data-limited (1,024 genes, 10,000 cells). Tested seven model sizes spanning three orders of magnitude (533 to 3.4×10^8 parameters) and fitted parametric scaling laws to validation mean squared error.

Result: Data-rich regime exhibits clear power-law scaling with irreducible loss floor of ~1.44, while data-limited regime shows negligible scaling. Preliminary conversion of asymptotic floor suggests ~2.30 bits of entropy per masked gene position.

Conclusion: Scaling laws analogous to NLP do emerge in single-cell transcriptomics when sufficient data are available, with data-to-parameter ratio as critical determinant of scaling behavior. Findings have implications for single-cell foundation model design.

Abstract: Neural scaling laws – power-law relationships between loss, model size, and data – have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
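A parametric fit of the form L(N) = c + a·N^(−b), with c the irreducible floor, can be sketched with a grid search over the exponent and linear least squares for the remaining coefficients. This is a minimal stand-in for the paper's fit (their exact parameterization and optimizer may differ), run here on noiseless synthetic losses with a floor of 1.44:

```python
import numpy as np

def fit_power_law(N, L, b_grid):
    """Fit L(N) ~ c + a * N**(-b): grid-search the exponent b, solve
    (a, c) by linear least squares at each b, keep the best fit.
    A simple stand-in for the paper's parametric scaling-law fit."""
    best = None
    for b in b_grid:
        Xd = np.column_stack([N ** (-b), np.ones_like(N)])
        sol, *_ = np.linalg.lstsq(Xd, L, rcond=None)
        err = np.sum((Xd @ sol - L) ** 2)
        if best is None or err < best[0]:
            best = (err, sol[0], b, sol[1])
    return best[1:]  # (a, b, c)

N = np.logspace(3, 8, 12)                 # model sizes (parameters)
L_synth = 1.44 + 50.0 * N ** (-0.3)       # synthetic MSE with floor c = 1.44
a, b, c = fit_power_law(N, L_synth, np.linspace(0.1, 0.6, 51))
```

On real validation MSE, the recovered `c` would play the role of the irreducible loss floor that the paper converts to an entropy estimate.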

[221] Fast and Effective On-policy Distillation from Reasoning Prefixes

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman

Main category: cs.LG

TL;DR: Prefix distillation reduces on-policy distillation training cost by applying distillation only to prefixes of student-generated outputs and terminating sampling early, achieving similar performance with 2x-47x FLOP reduction.

DetailsMotivation: On-policy distillation (OPD) provides better generalization than off-policy methods but requires expensive on-the-fly sampling of student policies during training, especially for long responses. Observations show training signals are often concentrated in output prefixes, and short teacher-generated prefixes can significantly help students produce correct answers.

Method: Proposes on-policy prefix distillation: modifies OPD by applying the distillation objective only to prefixes of student-generated outputs and terminating each sampling rollout early during distillation, rather than distilling over full outputs.

Result: Experiments on AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.

Conclusion: Prefix distillation is a simple yet effective modification that maintains OPD’s benefits while dramatically reducing computational cost, making on-policy distillation more practical for training large language models.

Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.
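The prefix-only objective can be sketched as a token-level KL divergence between teacher and student distributions, averaged over only the first K positions of a student rollout. The function below is an illustrative version of that idea (names and the exact averaging are ours):

```python
import numpy as np

def prefix_distill_loss(student_logits, teacher_logits, prefix_len):
    """Token-level KL(teacher || student), averaged over only the first
    `prefix_len` positions of a student-sampled rollout. An illustrative
    version of the prefix-only objective, not the authors' exact loss."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    kl = (np.exp(lt) * (lt - ls)).sum(axis=-1)  # per-position KL, shape (seq_len,)
    return kl[:prefix_len].mean()               # supervise the prefix only

rng = np.random.default_rng(0)
s = rng.normal(size=(10, 5))   # student logits: 10 tokens, vocab of 5
t = rng.normal(size=(10, 5))   # teacher logits
loss = prefix_distill_loss(s, t, prefix_len=4)
```

In training, sampling would also stop at `prefix_len`, which is where the 2x-47x FLOP savings come from: tokens beyond the prefix are neither generated nor supervised.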

[222] Complex-Valued Unitary Representations as Classification Heads for Improved Uncertainty Quantification in Deep Neural Networks

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Main category: cs.LG

TL;DR: Quantum-inspired classification head using complex-valued Hilbert space projections and learned unitary transformations improves model calibration on CIFAR-10, achieving 2.4x lower calibration error than standard softmax.

DetailsMotivation: Deep neural networks achieve high accuracy but remain poorly calibrated - their confidence scores don't reliably reflect true probability of correctness, which is problematic for safety-critical applications.

Method: Proposes a quantum-inspired classification head that projects backbone features into complex-valued Hilbert space and evolves them under learned unitary transformations parameterized via the Cayley map. Uses controlled hybrid experimental design with shared backbone and interchangeable heads to isolate effects of complex-valued unitary representations.

Result: Unitary magnitude head achieves ECE of 0.0146 (2.4x improvement over standard softmax, 3.5x over temperature scaling). On CIFAR-10H human-uncertainty benchmark, wave function head achieves lowest KL-divergence (0.336) to human soft labels. Born rule measurement layer degrades calibration.

Conclusion: Complex-valued unitary representations improve model calibration and better capture human perceptual ambiguity, though the quantum-mechanically motivated Born rule measurement degrades performance. The method shows promise for safety-critical applications but has limitations in out-of-distribution detection and sentiment analysis.

Abstract: Modern deep neural networks achieve high predictive accuracy but remain poorly calibrated: their confidence scores do not reliably reflect the true probability of correctness. We propose a quantum-inspired classification head architecture that projects backbone features into a complex-valued Hilbert space and evolves them under a learned unitary transformation parameterised via the Cayley map. Through a controlled hybrid experimental design - training a single shared backbone and comparing lightweight interchangeable heads - we isolate the effect of complex-valued unitary representations on calibration. Our ablation study on CIFAR-10 reveals that the unitary magnitude head (complex features evolved under a Cayley unitary, read out via magnitude and softmax) achieves an Expected Calibration Error (ECE) of 0.0146, representing a 2.4x improvement over a standard softmax head (0.0355) and a 3.5x improvement over temperature scaling (0.0510). Surprisingly, replacing the softmax readout with a Born rule measurement layer - the quantum-mechanically motivated approach - degrades calibration to an ECE of 0.0819. On the CIFAR-10H human-uncertainty benchmark, the wave function head achieves the lowest KL-divergence (0.336) to human soft labels among all compared methods, indicating that complex-valued representations better capture the structure of human perceptual ambiguity. We provide theoretical analysis connecting norm-preserving unitary dynamics to calibration through feature-space geometry, report negative results on out-of-distribution detection and sentiment analysis to delineate the method’s scope, and discuss practical implications for safety-critical applications. Code is publicly available.
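The Cayley-map parameterization of a unitary is standard and easy to sketch: any complex matrix is mapped to a skew-Hermitian A, and U = (I − A)(I + A)^(−1) is then exactly unitary, so it preserves the norm of the complex feature vector before the magnitude-plus-softmax readout. The code below verifies those properties on a toy head (the dimensions and readout details are illustrative, not the paper's exact architecture):

```python
import numpy as np

def cayley_unitary(W):
    """Cayley map: take A = (W - W^H)/2 (skew-Hermitian), return
    U = (I - A)(I + A)^{-1}, which is unitary. I + A is always
    invertible because A's eigenvalues are purely imaginary."""
    A = 0.5 * (W - W.conj().T)
    I = np.eye(W.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))  # free parameters
U = cayley_unitary(W)

z = rng.normal(size=d) + 1j * rng.normal(size=d)  # complex feature vector
mags = np.abs(U @ z)                              # magnitude readout
probs = np.exp(mags) / np.exp(mags).sum()         # softmax over magnitudes
```

The norm preservation of `U` is the geometric property the paper's theoretical analysis connects to calibration.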

[223] The Information Geometry of Softmax: Probing and Steering

Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

Main category: cs.LG

TL;DR: The paper explores how AI systems encode semantic structure into geometric representation spaces, arguing that information geometry is the natural framework for understanding representations that define softmax distributions, and introduces “dual steering” for robust concept manipulation.

DetailsMotivation: The paper is motivated by understanding how semantic structure gets encoded into the geometric structure of AI representation spaces, with the observation that the natural geometry should reflect how models use representations to produce behavior.

Method: Focuses on representations defining softmax distributions, arguing for information geometry as the natural framework. Develops “dual steering” method for robustly steering representations to exhibit particular concepts using linear probes, with theoretical guarantees.

Result: Proves that dual steering optimally modifies target concepts while minimizing changes to off-target concepts. Empirically demonstrates enhanced controllability and stability of concept manipulation.

Conclusion: Information geometry provides a natural framework for understanding semantic encoding in representation spaces, and dual steering offers an effective method for robust concept manipulation with theoretical guarantees.

Abstract: This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop “dual steering”, a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
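The basic mechanics of steering with a linear probe can be sketched in logit space: move the representation along a probe direction and observe the induced softmax shift toward the probed concept. This is a generic linear-steering sketch to fix ideas, not the paper's information-geometric "dual steering" update (which is precisely what distinguishes their method):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def steer(h, W_unembed, probe, alpha=2.0):
    """Move the representation h along a unit probe direction and read
    out the softmax distribution. Generic linear steering, shown only
    to contrast with the paper's dual (information-geometric) variant."""
    return softmax(W_unembed @ (h + alpha * probe))

W = np.eye(5, 8)                    # toy unembedding: 5 tokens, 8-dim features
h = np.linspace(-1.0, 1.0, 8)       # toy representation
probe = W[2] / np.linalg.norm(W[2]) # probe aligned with token 2's direction
p_before = softmax(W @ h)
p_after = steer(h, W, probe, alpha=2.0)
```

The paper's contribution is choosing the steering direction and magnitude in the information geometry of the softmax so the target concept moves while off-target concepts are provably minimally disturbed; the naive version above has no such guarantee.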

[224] Hybrid Federated and Split Learning for Privacy Preserving Clinical Prediction and Treatment Optimization

Farzana Akter, Rakib Hossain, Deb Kanna Roy Toushi, Mahmood Menon Khan, Sultana Amin, Lisan Al Amin

Main category: cs.LG

TL;DR: Hybrid Federated Learning-Split Learning framework for privacy-preserving clinical decision support without raw data sharing, with empirical privacy auditing and defense mechanisms.

DetailsMotivation: Clinical decision support systems face governance and privacy constraints that prevent pooling patient data across institutions. There's a need for privacy-preserving collaborative modeling that maintains predictive utility while protecting sensitive health information.

Method: Combines Federated Learning (FL) and Split Learning (SL) in a hybrid framework where feature-extraction trunks stay on client devices and prediction heads are on a coordinating server. Includes empirical privacy auditing using membership inference attacks on cut-layer representations and lightweight defenses (activation clipping, additive Gaussian noise).

Result: Hybrid FL-SL variants achieve competitive predictive performance and decision-facing prioritization compared to standalone FL or SL. Provides tunable privacy-utility trade-off that reduces audited leakage without raw-data sharing across three public clinical datasets under non-IID partitions.

Conclusion: Hybrid FL-SL offers a practical design space for privacy-preserving healthcare decision support where utility, leakage risk, and deployment costs must be explicitly balanced, positioning it as a viable solution for collaborative clinical modeling under privacy constraints.

Abstract: Collaborative clinical decision support is often constrained by governance and privacy rules that prevent pooling patient-level records across institutions. We present a hybrid privacy-preserving framework that combines Federated Learning (FL) and Split Learning (SL) to support decision-oriented healthcare modeling without raw-data sharing. The approach keeps feature-extraction trunks on clients while hosting prediction heads on a coordinating server, enabling shared representation learning and exposing an explicit collaboration boundary where privacy controls can be applied. Rather than assuming distributed training is inherently private, we audit leakage empirically using membership inference on cut-layer representations and study lightweight defenses based on activation clipping and additive Gaussian noise. We evaluate across three public clinical datasets under non-IID client partitions using a unified pipeline and assess performance jointly along four deployment-relevant axes: factual predictive utility, uplift-based ranking under capacity constraints, audited privacy leakage, and communication overhead. Results show that hybrid FL-SL variants achieve competitive predictive performance and decision-facing prioritization behavior relative to standalone FL or SL, while providing a tunable privacy-utility trade-off that can reduce audited leakage without requiring raw-data sharing. Overall, the work positions hybrid FL-SL as a practical design space for privacy-preserving healthcare decision support where utility, leakage risk, and deployment cost must be balanced explicitly.
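The lightweight cut-layer defenses the paper describes (activation clipping plus additive Gaussian noise, applied before the client ships activations to the server head) can be sketched directly; parameter names here are ours:

```python
import numpy as np

def defend_cut_layer(h, clip_norm=1.0, sigma=0.1, rng=None):
    """Sketch of the described cut-layer defense: clip the activation's
    L2 norm to `clip_norm`, then add Gaussian noise with std `sigma`
    before sending it across the FL-SL collaboration boundary."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(h)
    h = h * min(1.0, clip_norm / max(norm, 1e-12))  # norm clipping
    return h + rng.normal(scale=sigma, size=h.shape)  # additive noise

h = np.ones(8) * 3.0  # raw cut-layer activation, norm ~ 8.49
safe = defend_cut_layer(h, clip_norm=1.0, sigma=0.0,
                        rng=np.random.default_rng(0))
```

Raising `sigma` (or lowering `clip_norm`) trades predictive utility for lower audited membership-inference leakage, which is the tunable privacy-utility trade-off the paper evaluates.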

[225] On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie

Main category: cs.LG

TL;DR: Masked gradient optimization outperforms sophisticated adaptive optimizers for LLM training through geometric regularization.

DetailsMotivation: Challenge the exclusive reliance on dense adaptive optimizers for LLM training by showing that random masking of parameter updates can be highly effective.

Method: Introduce Momentum-aligned gradient masking (Magma) which modulates masked updates using momentum-gradient alignment, creating a simple drop-in replacement for adaptive optimizers.

Result: Magma consistently outperforms state-of-the-art optimizers, reducing perplexity by over 19% compared to Adam and 9% compared to Muon for 1B models.

Conclusion: Random masking induces curvature-dependent geometric regularization that smooths optimization trajectories, making masked optimization a promising alternative to complex adaptive optimizers.

Abstract: Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
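The core idea (an RMSProp-style update applied to a random subset of coordinates, modulated by momentum-gradient alignment) can be sketched as below. This is our reading of the abstract, not the authors' exact Magma rule; the masking probability, alignment gate, and scaling are all assumptions:

```python
import numpy as np

def masked_rmsprop_step(p, g, v, m, lr=1e-3, mask_p=0.5, beta=0.9,
                        eps=1e-8, rng=None):
    """Sketch of masked adaptive optimization in the spirit of Magma
    (not the authors' exact rule): RMSProp-normalized updates applied
    only on a random coordinate mask, with full step size where
    momentum and gradient agree in sign and a damped step otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    v = beta * v + (1 - beta) * g ** 2            # second-moment estimate
    m = beta * m + (1 - beta) * g                 # momentum
    mask = rng.random(p.shape) < mask_p           # random update mask
    align = np.sign(m) == np.sign(g)              # momentum-gradient alignment
    step = lr * g / (np.sqrt(v) + eps)
    p = p - np.where(mask & align, step, np.where(mask, 0.5 * step, 0.0))
    return p, v, m

p0, v0, m0 = np.zeros(4), np.zeros(4), np.zeros(4)
g = np.ones(4)
p1, v1, m1 = masked_rmsprop_step(p0, g, v0, m0, mask_p=1.0,
                                 rng=np.random.default_rng(0))
```

With `mask_p=1.0` this degenerates to a dense adaptive update; the paper's finding is that `mask_p < 1` acts as a curvature-dependent regularizer rather than merely slowing optimization.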

[226] Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

Main category: cs.LG

TL;DR: Researchers develop scaling laws to predict downstream performance from pre-training compute using quantile regression on large-scale evaluations, finding mostly stable capability boundaries except for math reasoning which advances over time.

DetailsMotivation: Practitioners need reliable scaling laws to translate compute budgets into expected downstream accuracy with contemporary post-training methods, and to understand how stable these mappings are as the field evolves.

Method: Use large-scale observational evaluations (5k existing + 2k new samples) to estimate capability boundaries via smoothed quantile regression with monotone, saturating sigmoid parameterization. Validate temporal reliability by fitting on earlier model generations and evaluating on later releases.

Result: Estimated capability boundaries are mostly stable across tasks, except math reasoning which shows consistently advancing boundaries over time. The method also analyzes task-dependent saturation and contamination-related shifts on math reasoning.

Conclusion: The work introduces a practical methodology for translating compute budgets into reliable performance expectations and monitoring capability boundary shifts, releasing the Proteus 2k evaluation dataset and an efficient algorithm that uses roughly 20% of the evaluation budget.

Abstract: For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled data points on model performance, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus 2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
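The two ingredients of the method are easy to sketch: the pinball (quantile) loss, whose minimizer over constants is the tau-th quantile, and a monotone saturating sigmoid in log-FLOPs. The parameter names in `sigmoid_boundary` are ours; the paper's exact parameterization may differ:

```python
import numpy as np

def pinball_loss(y, y_hat, tau=0.95):
    """Quantile (pinball) loss: minimizing it in y_hat targets the
    tau-th quantile of y, which is how a 'capability boundary'
    (a high conditional quantile) is estimated."""
    r = y - y_hat
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

def sigmoid_boundary(log_flops, lo, hi, mid, scale):
    """Monotone, saturating accuracy-vs-compute curve: rises from `lo`
    to `hi` around `mid` on the log-FLOPs axis (our parameter names)."""
    return lo + (hi - lo) / (1.0 + np.exp(-(log_flops - mid) / scale))

# pinball loss prefers the 95th percentile over the median as a constant fit
y = np.random.default_rng(0).exponential(size=1000)
loss_q95 = pinball_loss(y, np.quantile(y, 0.95), tau=0.95)
loss_med = pinball_loss(y, np.median(y), tau=0.95)
```

Fitting `sigmoid_boundary` by minimizing `pinball_loss` over its four parameters yields a smoothed quantile-regression boundary of the kind the paper tracks across model generations.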

[227] A Scalable Curiosity-Driven Game-Theoretic Framework for Long-Tail Multi-Label Learning in Data Mining

Jing Yang, Keze Wang

Main category: cs.LG

TL;DR: CD-GTMLL is a game-theoretic framework for long-tail multi-label classification that treats sub-predictors as players collaborating to maximize accuracy while using curiosity rewards to focus on rare tail labels.

DetailsMotivation: The long-tail distribution in multi-label classification poses challenges: a few head labels dominate while many tail labels are rare. Existing methods disrupt inter-label dependencies or require brittle hyperparameter tuning, especially with large label spaces (tens of thousands of labels).

Method: Recasts long-tail MLC as a multi-player game in which each sub-predictor specializes in a partition of the label space, collaborating to maximize global accuracy while pursuing intrinsic curiosity rewards based on tail-label rarity and inter-player disagreement, adaptively injecting learning signals into under-represented tail labels.

Result: Extensive experiments across 7 benchmarks, including extreme multi-label classification datasets with 30,000+ labels, show that CD-GTMLL consistently surpasses state-of-the-art methods with gains up to +1.6% P@3 on Wiki10-31K; theoretical analysis shows convergence to a tail-aware equilibrium and improvements in the Rare-F1 metric.

Conclusion: By integrating game theory with curiosity mechanisms, CD-GTMLL enhances model efficiency in resource-constrained environments and paves the way for more adaptive learning in imbalanced data scenarios across industries like e-commerce and healthcare.

Abstract: The long-tail distribution, where a few head labels dominate while rare tail labels abound, poses a persistent challenge for large-scale Multi-Label Classification (MLC) in real-world data mining applications. Existing resampling and reweighting strategies often disrupt inter-label dependencies or require brittle hyperparameter tuning, especially as the label space expands to tens of thousands of labels. To address this issue, we propose Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL), a scalable cooperative framework that recasts long-tail MLC as a multi-player game - each sub-predictor (“player”) specializes in a partition of the label space, collaborating to maximize global accuracy while pursuing intrinsic curiosity rewards based on tail label rarity and inter-player disagreement. This mechanism adaptively injects learning signals into under-represented tail labels without manual balancing or tuning. We further provide a theoretical analysis showing that our CD-GTMLL converges to a tail-aware equilibrium and formally links the optimization dynamics to improvements in the Rare-F1 metric. Extensive experiments across 7 benchmarks, including extreme multi-label classification datasets with 30,000+ labels, demonstrate that CD-GTMLL consistently surpasses state-of-the-art methods, with gains up to +1.6% P@3 on Wiki10-31K. Ablation studies further confirm the contributions of both game-theoretic cooperation and curiosity-driven exploration to robust tail performance. By integrating game theory with curiosity mechanisms, CD-GTMLL not only enhances model efficiency in resource-constrained environments but also paves the way for more adaptive learning in imbalanced data scenarios across industries like e-commerce and healthcare.
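A curiosity signal built from label rarity and inter-player disagreement can be sketched as follows. The functional form here is entirely ours (the paper does not specify it in the abstract); the point is only that rare labels and contested predictions earn larger intrinsic rewards:

```python
import numpy as np

def curiosity_reward(label_freq, player_preds):
    """Illustrative per-label curiosity signal in the spirit of CD-GTMLL:
    rarer labels (small training frequency) and higher inter-player
    disagreement yield larger intrinsic rewards. Exact form is ours."""
    rarity = 1.0 / np.log1p(label_freq)        # rare labels -> large factor
    disagreement = player_preds.var(axis=0)    # variance across players
    return rarity * (1.0 + disagreement)

freq = np.array([10000.0, 5.0])                # head label vs tail label counts
preds = np.array([[0.90, 0.2],                 # player 1's probabilities
                  [0.88, 0.8]])                # player 2's probabilities
r = curiosity_reward(freq, preds)
```

Here the tail label gets a much larger reward both because it is rare and because the two players disagree on it, which is the mechanism that injects learning signal into under-represented labels without manual balancing.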

[228] Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models

Waldemar Chang

Main category: cs.LG

TL;DR: DRTC is a causal interpretability framework that identifies pivot points in language model reasoning and measures how specific context chunks steer the reasoning trajectory.

DetailsMotivation: Existing interpretability methods for language models often highlight tokens correlated with answers but fail to reveal where models make consequential reasoning turns, what earlier context triggers those turns, or whether highlighted text actually steers the reasoning process.

Method: DRTC detects pivot decision points using uncertainty and distribution-shift signals, then applies receiver-side interventions that preserve the realized rollout while blocking information flow from selected earlier chunks at pivots. It measures whether interventions redirect the model’s log-probability trajectory relative to the realized rollout direction, producing signed per-chunk attribution scores.

Result: Directional influence is sharply concentrated across four reasoning models (Gini 0.50-0.58, top-5 percent mass 0.23-0.28). Learned pivots induce stronger intervention magnitudes than random spans. In scaling study on 500 MATH problems, learned spans outperform matched random spans (median delta = 0.409, 355/500 positive).

Conclusion: DRTC provides a causally grounded, trajectory-level view of how specific context elements steer reasoning under on-policy dynamics, offering insights into long-horizon reasoning in language models.

Abstract: Understanding how language models carry out long-horizon reasoning remains an open challenge. Existing interpretability methods often highlight tokens or spans correlated with an answer, but they rarely reveal where the model makes consequential reasoning turns, which earlier context causally triggers those turns, or whether the highlighted text actually steers the reasoning process. We introduce Directional Reasoning Trajectory Change (DRTC), a process-causal framework for interpreting long-form reasoning from a single on-policy rollout. DRTC detects pivot decision points using uncertainty and distribution-shift signals, then applies receiver-side interventions that preserve the realized rollout without resampling the continuation while blocking information flow from selected earlier chunks only at a pivot. It measures whether each intervention redirects the direction of the model’s log-probability trajectory relative to the realized rollout direction, producing a signed per-chunk attribution score. We also compute turning-angle curvature changes on raw logits as a complementary diagnostic and introduce curvature signatures to summarize shared intervention-response geometry. Empirically, directional influence is sharply concentrated across four reasoning models (per-example |DRTC| shares yield Gini 0.50 to 0.58 and top-5 percent mass 0.23 to 0.28), and learned pivots induce stronger intervention magnitudes than matched random spans. In a scaling study on 500 MATH problems with R1-Distill-Qwen-1.5B, learned spans outperform matched random spans (median delta = 0.409, 355 of 500 positive; sign test p = 2.3e-21). Overall, DRTC provides a causally grounded, trajectory-level view of how specific context elements steer reasoning under on-policy dynamics.
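The concentration statistics the paper reports (Gini of per-example |DRTC| shares and top-5 percent mass) are standard and can be computed directly; the implementations below follow the usual definitions:

```python
import numpy as np

def gini(x):
    """Gini coefficient of nonnegative scores: 0 for a uniform
    distribution, approaching 1 as mass concentrates on one element.
    The paper applies this to per-example |DRTC| attribution shares."""
    x = np.sort(np.asarray(x, float))          # ascending
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def top_frac_mass(x, frac=0.05):
    """Fraction of total |score| mass carried by the top `frac` of chunks."""
    x = np.sort(np.abs(np.asarray(x, float)))[::-1]
    k = max(1, int(np.ceil(frac * x.size)))
    return x[:k].sum() / x.sum()

g_uniform = gini(np.ones(10))   # uniform scores -> 0.0
```

Values like Gini 0.50-0.58 and top-5 percent mass 0.23-0.28 thus say that a small minority of context chunks carries a disproportionate share of the directional influence.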

[229] FedPSA: Modeling Behavioral Staleness in Asynchronous Federated Learning

Chaoyi Lu

Main category: cs.LG

TL;DR: FedPSA is a fine-grained asynchronous federated learning framework that uses parameter sensitivity to measure model staleness and a dynamic momentum queue to adjust tolerance for outdated information, achieving significant performance improvements over existing methods.

DetailsMotivation: Asynchronous Federated Learning (AFL) accelerates training by not waiting for slower clients, but suffers from performance degradation due to model staleness. Existing methods use coarse-grained round difference measures that don't observe the model itself, limiting performance.

Method: FedPSA introduces parameter sensitivity to measure model obsolescence more precisely and establishes a dynamic momentum queue to assess the current training phase in real time, allowing dynamic adjustment of tolerance for outdated information.

Result: Extensive experiments on multiple datasets show FedPSA achieves up to 6.37% improvement over baseline methods and 1.93% over state-of-the-art methods.

Conclusion: FedPSA provides a more fine-grained approach to handling staleness in asynchronous federated learning, significantly improving performance through parameter sensitivity analysis and dynamic tolerance adjustment.

Abstract: Asynchronous Federated Learning (AFL) has emerged as a significant research area in recent years. By not waiting for slower clients and executing the training process concurrently, it achieves faster training speed compared to traditional federated learning. However, due to the staleness introduced by the asynchronous process, its performance may degrade in some scenarios. Existing methods often use the round difference between the current model and the global model as the sole measure of staleness, which is coarse-grained and lacks observation of the model itself, thereby limiting the performance ceiling of asynchronous methods. In this paper, we propose FedPSA (Parameter Sensitivity-based Asynchronous Federated Learning), a more fine-grained AFL framework that leverages parameter sensitivity to measure model obsolescence and establishes a dynamic momentum queue to assess the current training phase in real time, thereby adjusting the tolerance for outdated information dynamically. Extensive experiments on multiple datasets and comparisons with various methods demonstrate the superior performance of FedPSA, achieving up to 6.37% improvement over baseline methods and 1.93% over the current state-of-the-art method.
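A fine-grained staleness score in the spirit of FedPSA can be sketched by weighting parameter drift with gradient-based sensitivity rather than using the round gap alone. This construction is ours, following the abstract's description; the paper's actual sensitivity measure and momentum-queue mechanism are not specified here:

```python
import numpy as np

def sensitivity_staleness(client_params, global_params, grads, round_gap):
    """Illustrative fine-grained staleness: weight per-parameter drift
    between the client's base model and the current global model by
    normalized gradient magnitude (a crude sensitivity proxy), then
    scale by the round gap. Our construction, not the paper's exact one."""
    sens = np.abs(grads) / (np.abs(grads).sum() + 1e-12)  # sensitivity weights
    drift = np.abs(client_params - global_params)
    return round_gap * float(np.sum(sens * drift))

g = np.array([1.0, 2.0, 3.0, 4.0])
p_global = np.ones(4)
s_same = sensitivity_staleness(p_global, p_global, g, round_gap=5)
```

Unlike the plain round-difference measure, this score is zero whenever the models actually coincide, however many rounds apart they are, and grows with drift in the parameters the model is most sensitive to.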

[230] Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin

Main category: cs.LG

TL;DR: Obj-Disco automatically decomposes LLM alignment reward signals into interpretable natural language objectives to improve transparency and safety.

DetailsMotivation: Current LLM alignment methods use complex reward signals that obscure specific behaviors, creating risks of misalignment and reward hacking. Existing interpretation methods either rely on predefined rubrics (missing unknown issues) or fail to identify comprehensive causal objectives.

Method: Obj-Disco uses an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain residual reward signals. It decomposes alignment rewards into sparse, weighted combinations of human-interpretable natural language objectives.

Result: The framework consistently captures >90% of reward behavior across diverse tasks, model sizes, and alignment algorithms. Human evaluation corroborates these findings. A case study shows Obj-Disco can identify latent misaligned incentives that emerge alongside intended behaviors.

Conclusion: Obj-Disco provides a crucial tool for uncovering implicit objectives in LLM alignment, enabling more transparent and safer AI development by making reward signals interpretable and identifying potential misalignment issues.

Abstract: Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of “unknown unknowns”, or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework’s robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
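The iterative greedy idea (repeatedly adding the candidate objective that best explains the residual reward) can be sketched as simple matching pursuit over candidate objective scores. The authors' procedure is richer (it validates candidates against behavioral changes across checkpoints); this toy recovers a hidden two-objective mixture:

```python
import numpy as np

def greedy_objective_discovery(reward, objective_scores, k=2):
    """Toy version of iterative greedy decomposition: at each step, add
    the unused candidate objective whose (least-squares weighted) scores
    most reduce the residual reward signal. Matching pursuit, not the
    authors' full Obj-Disco procedure."""
    resid = reward.astype(float).copy()
    chosen, weights = [], []
    for _ in range(k):
        best_err, best_j, best_w = None, None, None
        for j in range(objective_scores.shape[1]):
            if j in chosen:
                continue
            s = objective_scores[:, j]
            w = s @ resid / (s @ s)                    # least-squares weight
            err = np.sum((resid - w * s) ** 2)
            if best_err is None or err < best_err:
                best_err, best_j, best_w = err, j, w
        chosen.append(best_j)
        weights.append(best_w)
        resid = resid - best_w * objective_scores[:, best_j]
    return chosen, weights

rng = np.random.default_rng(0)
S = np.linalg.qr(rng.normal(size=(200, 5)))[0]  # orthonormal candidate score columns
reward = 2.0 * S[:, 1] + 0.5 * S[:, 3]          # hidden mixture of objectives 1 and 3
idx, wts = greedy_objective_discovery(reward, S, k=2)
```

With orthonormal candidates the greedy pass recovers the mixture exactly; on real reward models, the residual that no candidate explains is what prompts generating and validating new natural-language objectives.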

[231] ER-MIA: Black-Box Adversarial Memory Injection Attacks on Long-Term Memory-Augmented Large Language Models

Mitchell Piehl, Zhaohan Xi, Zuobin Xiong, Pan He, Muchao Ye

Main category: cs.LG

TL;DR: First systematic study of black-box adversarial memory injection attacks targeting similarity-based retrieval in memory-augmented LLMs, introducing ER-MIA framework with high success rates.

Motivation: LLMs are increasingly augmented with long-term memory systems to overcome context window limitations, but recent research shows these memory systems create new attack surfaces, particularly through similarity-based retrieval mechanisms.

Method: Introduces ER-MIA, a unified framework for black-box adversarial memory injection attacks with two realistic attack settings: content-based attacks and question-targeted attacks. Includes composable attack primitives and ensemble attacks that work under minimal attacker assumptions.
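
To see why similarity-based retrieval is attackable in a black-box way, consider this toy sketch (the bag-of-words embedding and the injected string are invented for illustration; ER-MIA's actual primitives operate against real embedding-based memory systems):

```python
# Toy demonstration of a question-targeted memory injection: a memory that
# mirrors the anticipated query's wording wins cosine-similarity retrieval
# over benign memories.
import math

def embed(text):
    # Stand-in embedding: just the list of lowercased tokens.
    return text.lower().split()

def cosine(a, b):
    vocab = set(a) | set(b)
    va = [a.count(w) for w in vocab]
    vb = [b.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb or 1e-12)

def retrieve(memories, query, k=1):
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(embed(m), q), reverse=True)
    return ranked[:k]

memories = [
    "user prefers vegetarian restaurants",
    "user's bank is First National",
]
# Question-targeted injection: reuse the target question's wording so the
# malicious memory outranks the benign one at retrieval time.
query = "which bank does the user use"
injected = "which bank does the user use answer Evil Corp Bank"
memories.append(injected)

top = retrieve(memories, query, k=1)
```

Because retrieval scores only surface-level similarity, the attacker never needs model internals, which is what makes the black-box setting realistic.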

Result: Extensive experiments across multiple LLMs and long-term memory systems demonstrate high success rates, showing similarity-based retrieval constitutes a fundamental system-level vulnerability that persists across memory designs and application scenarios.

Conclusion: Similarity-based retrieval in long-term memory-augmented LLMs represents a significant security vulnerability that requires attention, as memory systems create new attack surfaces that can be exploited through black-box adversarial injection attacks.

Abstract: Large language models (LLMs) are increasingly augmented with long-term memory systems to overcome finite context windows and enable persistent reasoning across interactions. However, recent research finds that LLMs become more vulnerable because memory provides extra attack surfaces. In this paper, we present the first systematic study of black-box adversarial memory injection attacks that target the similarity-based retrieval mechanism in long-term memory-augmented LLMs. We introduce ER-MIA, a unified framework that exposes this vulnerability and formalizes two realistic attack settings: content-based attacks and question-targeted attacks. In these settings, ER-MIA includes an arsenal of composable attack primitives and ensemble attacks that achieve high success rates under minimal attacker assumptions. Extensive experiments across multiple LLMs and long-term memory systems demonstrate that similarity-based retrieval constitutes a fundamental and system-level vulnerability, revealing security risks that persist across memory designs and application scenarios.

[232] CDRL: A Reinforcement Learning Framework Inspired by Cerebellar Circuits and Dendritic Computational Strategies

Sibo Zhang, Rui Jing, Liangfu Lv, Jian Zhang, Yunliang Zang

Main category: cs.LG

TL;DR: A biologically-inspired RL architecture based on cerebellar principles improves sample efficiency, robustness, and generalization in noisy, high-dimensional tasks.

Motivation: RL suffers from low sample efficiency, noise sensitivity, and weak generalization under partial observability. While most approaches focus on optimization strategies, the role of architectural priors in representation learning and decision dynamics is under-explored. The paper draws inspiration from the cerebellum's structural principles to address these limitations.

Method: Proposes a biologically grounded RL architecture incorporating cerebellar principles: large expansion, sparse connectivity, sparse activation, and dendritic-level modulation. The architecture is tested on noisy, high-dimensional RL benchmarks, with sensitivity analysis of architectural parameters.
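
Two of the named principles, large sparse expansion and sparse activation, can be sketched directly. The width, fan-in, and top-k values below are invented, and dendritic-level modulation is omitted:

```python
# Minimal sketch (assumed, not the paper's exact architecture) of a
# granule-cell-like layer: a large expansion with sparse random connectivity,
# followed by winner-take-all sparse activation.
import random

random.seed(0)
N_IN, N_EXP = 8, 64        # large expansion: 8 inputs -> 64 units
FAN_IN, TOP_K = 3, 8       # each unit sees only 3 inputs; only 8 units fire

# Sparse connectivity: each expansion unit samples a small fixed fan-in.
conns = [random.sample(range(N_IN), FAN_IN) for _ in range(N_EXP)]
weights = [[random.gauss(0, 1) for _ in range(FAN_IN)] for _ in range(N_EXP)]

def expand(x):
    """Project x into the expansion layer, keeping only the top-k activations."""
    pre = [sum(w * x[i] for w, i in zip(ws, idx))
           for ws, idx in zip(weights, conns)]
    # Winner-take-all sparsification: zero everything outside the top-k.
    thresh = sorted(pre, reverse=True)[TOP_K - 1]
    return [a if a >= thresh else 0.0 for a in pre]

h = expand([random.gauss(0, 1) for _ in range(N_IN)])
active = sum(1 for a in h if a != 0.0)
```

The high-dimensional but sparse code is what such architectures use as the representation fed to the policy or value head.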

Result: Both the cerebellar architecture and dendritic modulation consistently improve sample efficiency, robustness, and generalization compared to conventional designs. Sensitivity analysis suggests cerebellum-inspired structures can offer optimized performance for RL with constrained model parameters.

Conclusion: Cerebellar structural priors serve as effective inductive biases for RL, demonstrating the value of biologically-inspired architectural designs in improving RL performance across multiple dimensions.

Abstract: Reinforcement learning (RL) has achieved notable performance in high-dimensional sequential decision-making tasks, yet remains limited by low sample efficiency, sensitivity to noise, and weak generalization under partial observability. Most existing approaches address these issues primarily through optimization strategies, while the role of architectural priors in shaping representation learning and decision dynamics is less explored. Inspired by structural principles of the cerebellum, we propose a biologically grounded RL architecture that incorporates large expansion, sparse connectivity, sparse activation, and dendritic-level modulation. Experiments on noisy, high-dimensional RL benchmarks show that both the cerebellar architecture and dendritic modulation consistently improve sample efficiency, robustness, and generalization compared to conventional designs. Sensitivity analysis of architectural parameters suggests that cerebellum-inspired structures can offer optimized performance for RL with constrained model parameters. Overall, our work underscores the value of cerebellar structural priors as effective inductive biases for RL.

[233] Fractional-Order Federated Learning

Mohammad Partohaghighi, Roummel Marcia, YangQuan Chen

Main category: cs.LG

TL;DR: FOFedAvg introduces fractional-order stochastic gradient descent to federated learning, improving convergence speed and communication efficiency for non-IID data by incorporating historical information and long-range dependencies.

Motivation: Federated learning faces challenges with slow convergence, high communication costs, and non-IID data distribution across clients. The authors aim to address these issues by incorporating memory-aware fractional-order updates that capture long-range relationships and historical information.

Method: Proposes Fractional-Order Federated Averaging (FOFedAvg) which integrates Fractional-Order Stochastic Gradient Descent (FOSGD) into the federated averaging framework. The method uses fractional-order derivatives to incorporate historical gradient information and long-range dependencies, making the optimization more robust to heterogeneous client data.
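
The memory-aware update can be illustrated with the Grünwald-Letnikov discretization of a fractional derivative, a common way to realize FOSGD; the paper's exact update rule may differ:

```python
# Hedged sketch of one FOSGD step: the plain gradient is replaced by a
# weighted sum of past gradients with Grünwald-Letnikov coefficients,
# which is where the "long memory" comes from.

def gl_coeffs(alpha, n):
    """Grünwald-Letnikov coefficients c_k = (-1)^k * binom(alpha, k).
    c_0 = 1; later coefficients are small and negative for 0 < alpha < 1."""
    c = [1.0]
    for k in range(1, n):
        c.append(c[-1] * (1 - (alpha + 1) / k))
    return c

def fosgd_minimize(grad, x0, alpha=0.7, lr=0.1, steps=1500, memory=10):
    x, history = x0, []
    coeffs = gl_coeffs(alpha, memory)
    for _ in range(steps):
        history.insert(0, grad(x))           # newest gradient first
        history = history[:memory]           # truncated memory window
        frac_grad = sum(c * g for c, g in zip(coeffs, history))
        x = x - lr * frac_grad
    return x

# Minimize f(x) = (x - 3)^2, so grad f(x) = 2 * (x - 3); minimum at x = 3.
x_star = fosgd_minimize(lambda x: 2 * (x - 3), x0=0.0)
```

In FOFedAvg, each client would run steps like this locally before the server averages the resulting models, as in standard FedAvg.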

Result: FOFedAvg outperforms established federated optimization algorithms on multiple benchmark datasets (MNIST, FEMNIST, CIFAR-10/100, EMNIST, etc.) across various non-IID partitioning schemes. It shows improved test performance and faster convergence speed while maintaining communication efficiency.

Conclusion: Fractional-order, memory-aware updates substantially improve federated learning robustness and effectiveness, offering a practical solution for distributed training on heterogeneous data with theoretical convergence guarantees.

Abstract: Federated learning (FL) allows remote clients to train a global model collaboratively while protecting client privacy. Despite its privacy-preserving benefits, FL has significant drawbacks, including slow convergence, high communication cost, and non-independent-and-identically-distributed (non-IID) data. In this work, we present a novel FedAvg variation called Fractional-Order Federated Averaging (FOFedAvg), which incorporates Fractional-Order Stochastic Gradient Descent (FOSGD) to capture long-range relationships and deeper historical information. By introducing memory-aware fractional-order updates, FOFedAvg improves communication efficiency and accelerates convergence while mitigating instability caused by heterogeneous, non-IID client data. We compare FOFedAvg against a broad set of established federated optimization algorithms on benchmark datasets including MNIST, FEMNIST, CIFAR-10, CIFAR-100, EMNIST, the Cleveland heart disease dataset, Sent140, PneumoniaMNIST, and Edge-IIoTset. Across a range of non-IID partitioning schemes, FOFedAvg is competitive with, and often outperforms, these baselines in terms of test performance and convergence speed. On the theoretical side, we prove that FOFedAvg converges to a stationary point under standard smoothness and bounded-variance assumptions for fractional order $0<α\le 1$. Together, these results show that fractional-order, memory-aware updates can substantially improve the robustness and effectiveness of federated learning, offering a practical path toward distributed training on heterogeneous data.

[234] Doubly Stochastic Mean-Shift Clustering

Tom Trigano, Yann Sepulcre, Itshak Lapidot

Main category: cs.LG

TL;DR: DSMS introduces randomized bandwidth selection in Mean-Shift clustering to address sensitivity to bandwidth hyperparameters, improving exploration and preventing over-segmentation in sparse data scenarios.

Motivation: Standard Mean-Shift algorithms are highly sensitive to bandwidth hyperparameters, especially in data-scarce regimes where fixed bandwidths lead to fragmentation and spurious modes. The need for better exploration of density landscapes motivates a more flexible approach.

Method: Doubly Stochastic Mean-Shift (DSMS) introduces randomness in both trajectory updates and kernel bandwidth selection. At each iteration, data samples and radius are drawn from continuous uniform distributions, creating an implicit regularization mechanism.
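
A minimal 1-D sketch of the doubly stochastic update (the flat kernel and the uniform ranges below are assumed for illustration, not taken from the paper):

```python
# Toy DSMS trajectory: at every iteration BOTH the data subset and the kernel
# radius are redrawn at random, instead of using one fixed bandwidth.
import random

random.seed(1)
# Two well-separated 1-D clusters around 0 and 5.
data = [random.gauss(0.0, 0.3) for _ in range(50)] + \
       [random.gauss(5.0, 0.3) for _ in range(50)]

def dsms_mode(x, data, r_lo=0.2, r_hi=1.5, batch=40, iters=200):
    """Follow one trajectory toward a density mode with randomized radius."""
    for _ in range(iters):
        r = random.uniform(r_lo, r_hi)          # stochastic bandwidth
        sample = random.sample(data, batch)     # stochastic data subset
        near = [p for p in sample if abs(p - x) <= r]
        if near:                                # flat-kernel mean-shift step
            x = sum(near) / len(near)
    return x

m0 = dsms_mode(-1.0, data)   # should settle near the cluster at 0
m1 = dsms_mode(6.0, data)    # should settle near the cluster at 5
```

Because no single radius has to be tuned, small random radii refine the mode estimate while large ones prevent the trajectory from stalling on spurious local structure.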

Result: DSMS significantly outperforms standard and stochastic Mean-Shift baselines on synthetic Gaussian mixtures. It exhibits remarkable stability and prevents over-segmentation in sparse clustering scenarios without performance degradation.

Conclusion: Randomized bandwidth policy in DSMS acts as effective regularization, providing better exploration of density landscapes and addressing the bandwidth sensitivity problem in Mean-Shift clustering, particularly in data-scarce regimes.

Abstract: Standard Mean-Shift algorithms are notoriously sensitive to the bandwidth hyperparameter, particularly in data-scarce regimes where fixed-scale density estimation leads to fragmentation and spurious modes. In this paper, we propose Doubly Stochastic Mean-Shift (DSMS), a novel extension that introduces randomness not only in the trajectory updates but also in the kernel bandwidth itself. By drawing both the data samples and the radius from a continuous uniform distribution at each iteration, DSMS explores the density landscape more effectively. We show that this randomized bandwidth policy acts as an implicit regularization mechanism, and provide convergence theoretical results. Comparative experiments on synthetic Gaussian mixtures reveal that DSMS significantly outperforms standard and stochastic Mean-Shift baselines, exhibiting remarkable stability and preventing over-segmentation in sparse clustering scenarios without degrading performance elsewhere.

[235] Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

Gilad Nurko, Roi Benita, Yehoshua Dissen, Tomohiro Nakatani, Marc Delcroix, Shoko Araki, Joseph Keshet

Main category: cs.LG

TL;DR: A framework that couples two diffusion models for joint signal enhancement and classification, enabling mutual guidance between input signal reconstruction and classifier logit refinement without retraining the classifier.

Motivation: Standard approaches treat signal enhancement and classification as separate sequential stages, failing to leverage semantic information from classifier outputs during denoising. This leads to suboptimal performance in noisy environments.

Method: Proposes a domain-agnostic framework with two interacting diffusion models: one operates on the input signal, the other on classifier output logits. Introduces three strategies to model the joint distribution of input and logits, enabling mutual guidance without classifier retraining.

Result: The framework surpasses traditional sequential enhancement baselines, delivering robust improvements in classification accuracy under diverse noise conditions for both image classification and automatic speech recognition tasks.

Conclusion: The joint enhancement framework effectively integrates signal reconstruction and classification, demonstrating that mutual guidance between enhancement and semantic information leads to superior performance in noisy environments across different domains.

Abstract: Robust classification in noisy environments remains a fundamental challenge in machine learning. Standard approaches typically treat signal enhancement and classification as separate, sequential stages: first enhancing the signal and then applying a classifier. This approach fails to leverage the semantic information in the classifier’s output during denoising. In this work, we propose a general, domain-agnostic framework that integrates two interacting diffusion models: one operating on the input signal and the other on the classifier’s output logits, without requiring any retraining or fine-tuning of the classifier. This coupled formulation enables mutual guidance, where the enhancing signal refines the class estimation and, conversely, the evolving class logits guide the signal reconstruction towards discriminative regions of the manifold. We introduce three strategies to effectively model the joint distribution of the input and the logit. We evaluated our joint enhancement method for image classification and automatic speech recognition. The proposed framework surpasses traditional sequential enhancement baselines, delivering robust and flexible improvements in classification accuracy under diverse noise conditions.

[236] Fairness over Equality: Correcting Social Incentives in Asymmetric Sequential Social Dilemmas

Alper Demir, Hüseyin Aydın, Kale-ab Abebe Tessera, David Abel, Stefano V. Albrecht

Main category: cs.LG

TL;DR: Proposes fairness modifications for multi-agent reinforcement learning in asymmetric sequential social dilemmas to improve cooperation without global information.

Motivation: Existing fairness-based methods in multi-agent reinforcement learning assume identical agent incentives and require global information, failing in asymmetric social dilemmas where agents have different capabilities or reward structures.

Method: Three modifications: 1) Redefine fairness using agents’ reward ranges instead of raw equality, 2) Introduce agent-based weighting to handle inherent asymmetries, 3) Localize social feedback to work under partial observability without global information sharing.
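
Modification 1 is easy to illustrate: compare agents on range-normalized rather than raw returns. The reward ranges and returns below are invented; the paper's exact fairness signal is not reproduced:

```python
# Hedged sketch of range-aware fairness: an agent that can earn at most half
# as much as another is not "unfairly behind" when both sit at the same
# fraction of their attainable reward.

def normalized(ret, lo, hi):
    """Map an agent's return into [0, 1] using its own reward range."""
    return (ret - lo) / (hi - lo)

# An asymmetric dilemma: agent B can earn at most half of what A can.
ranges = {"A": (0.0, 10.0), "B": (0.0, 5.0)}
returns = {"A": 8.0, "B": 4.0}

raw_gap = abs(returns["A"] - returns["B"])                  # raw equality view
norm_gap = abs(normalized(returns["A"], *ranges["A"])
               - normalized(returns["B"], *ranges["B"]))    # range-aware view
```

Raw equality sees a gap of 4.0 and would penalize A (incentivizing defection), while the range-normalized view correctly reports no unfairness: both agents sit at 80% of what they can attain.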

Result: In asymmetric scenarios, the proposed method fosters faster emergence of cooperative policies compared to existing approaches while maintaining scalability and practicality.

Conclusion: Addressing agent asymmetries through modified fairness definitions and localized feedback enables more effective cooperation in sequential social dilemmas without requiring global information access.

Abstract: Sequential Social Dilemmas (SSDs) provide a key framework for studying how cooperation emerges when individual incentives conflict with collective welfare. In Multi-Agent Reinforcement Learning, these problems are often addressed by incorporating intrinsic drives that encourage prosocial or fair behavior. However, most existing methods assume that agents face identical incentives in the dilemma and require continuous access to global information about other agents to assess fairness. In this work, we introduce asymmetric variants of well-known SSD environments and examine how natural differences between agents influence cooperation dynamics. Our findings reveal that existing fairness-based methods struggle to adapt under asymmetric conditions by enforcing raw equality that wrongly incentivizes defection. To address this, we propose three modifications: (i) redefining fairness by accounting for agents' reward ranges, (ii) introducing an agent-based weighting mechanism to better handle inherent asymmetries, and (iii) localizing social feedback to make the methods effective under partial observability without requiring global information sharing. Experimental results show that in asymmetric scenarios, our method fosters faster emergence of cooperative policies compared to existing approaches, without sacrificing scalability or practicality.

[237] Logit Distance Bounds Representational Similarity

Beatrix M. B. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz

Main category: cs.LG

TL;DR: The paper shows that KL divergence-based model distillation can match teacher predictions while failing to preserve linear representational similarity, whereas logit-distance distillation better preserves linear representational properties and concept recoverability.

Motivation: While identifiability theory shows that models with identical conditional distributions have linearly equivalent representations, it's unclear if this holds approximately when distributions are close but not equal. KL divergence closeness doesn't guarantee linear representational similarity, raising concerns about distillation methods.

Method: The authors study a distributional distance based on logit differences, define a representational dissimilarity measure based on identifiability classes, and prove it’s bounded by logit distance. They show KL divergence upper-bounds logit distance but provides weak control in practice. They conduct distillation experiments on synthetic and image datasets comparing KL-based vs logit-distance distillation.
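
A small numeric example of the gap between the two distances (the shift-invariant max-norm logit distance used below is an assumption for illustration; the paper defines its own dissimilarity measure):

```python
# Two logit vectors can induce nearly identical softmax distributions (tiny KL)
# while being far apart as logits: KL gives weak control when probabilities
# are close to zero, which is the paper's point.
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def logit_distance(za, zb):
    # Mean-center each vector to mod out the softmax shift invariance.
    ca = [v - sum(za) / len(za) for v in za]
    cb = [v - sum(zb) / len(zb) for v in zb]
    return max(abs(a - b) for a, b in zip(ca, cb))

teacher = [0.0, -10.0]      # class 1 already has near-zero probability
student = [0.0, -30.0]      # matches predictions, very different logits

kl_gap = kl(softmax(teacher), softmax(student))   # ~1e-3: "close" in KL
logit_gap = logit_distance(teacher, student)      # 10.0: far apart in logits
```

A KL-trained student could stop at `student` and still lose the teacher's representation geometry in the low-probability directions; a logit-distance objective would keep pushing the logits together.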

Result: Logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts compared to KL-based distillation. KL divergence can match predictions while failing to preserve linear representational properties.

Conclusion: Logit-distance is a better objective than KL divergence for preserving linear representational similarity in model distillation, which is important for maintaining interpretable concept representations in compressed or student models.

Abstract: For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models’ identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher’s predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts.

[238] Benchmarking IoT Time-Series AD with Event-Level Augmentations

Dmitry Zhevnenko, Ilya Makarov, Aleksandr Kovalenko, Fedor Meshchaninov, Anton Kozhukhov, Vladislav Travnikov, Makar Ippolitov, Kirill Yashunin, Iurii Katser

Main category: cs.LG

TL;DR: A comprehensive evaluation protocol for anomaly detection in IoT time series with event-level augmentations simulating real-world perturbations, showing no universal winner across different model types and datasets.

Motivation: Current anomaly detection research focuses too much on point-level results on curated datasets, limiting practical value for model selection in safety-critical IoT applications. There's a need for realistic evaluation that considers event-level reliability and earliness under real-world perturbations.

Method: Introduced an evaluation protocol with unified event-level augmentations simulating real-world issues: sensor dropout, linear/log drift, additive noise, and window shifts. Also performed sensor-level probing via mask-as-missing zeroing with per-channel influence estimation. Evaluated 14 representative models on 5 public and 2 industrial datasets using unified splits and event aggregation.
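
The four perturbation families can be sketched as simple transforms on a univariate window (the parameterizations below are assumed, not the protocol's calibrated settings):

```python
# Illustrative event-level augmentations for a 1-D sensor window.
import random

random.seed(0)

def sensor_dropout(x, p=0.2):
    """Zero out a random fraction of samples, simulating missing readings."""
    return [0.0 if random.random() < p else v for v in x]

def linear_drift(x, slope=0.01):
    """Add a slowly growing offset, simulating calibration drift."""
    return [v + slope * i for i, v in enumerate(x)]

def additive_noise(x, sigma=0.1):
    """Add Gaussian measurement noise."""
    return [v + random.gauss(0, sigma) for v in x]

def window_shift(x, k=5):
    """Rotate the window by k samples, simulating timestamp misalignment."""
    return x[k:] + x[:k]

clean = [float(i % 10) for i in range(100)]
aug = window_shift(additive_noise(linear_drift(clean), sigma=0.05), k=5)
```

Scoring a detector on windows like `aug` rather than `clean`, aggregated at the event level, is what separates this protocol from point-level evaluation on curated data.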

Result: No universal winner across different scenarios: graph-structured models transfer best under dropout and long events; density/flow models work well on clean stationary plants but are fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies.

Conclusion: The evaluation protocol reveals that model performance depends heavily on specific perturbation types and dataset characteristics, providing practical insights for model selection and design choices in real-world IoT anomaly detection applications.

Abstract: Anomaly detection (AD) for safety-critical IoT time series should be judged at the event level: reliability and earliness under realistic perturbations. Yet many studies still emphasize point-level results on curated base datasets, limiting value for model selection in practice. We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues: calibrated sensor dropout, linear and log drift, additive noise, and window shifts. We also perform sensor-level probing via mask-as-missing zeroing with per-channel influence estimation to support root-cause analysis. We evaluate 14 representative models on five public anomaly datasets (SWaT, WADI, SMD, SKAB, TEP) and two industrial datasets (steam turbine, nuclear turbogenerator) using unified splits and event aggregation. There is no universal winner: graph-structured models transfer best under dropout and long events (e.g., on SWaT under additive noise F1 drops 0.804->0.677 for a graph autoencoder, 0.759->0.680 for a graph-attention variant, and 0.762->0.756 for a hybrid graph attention model); density/flow models work well on clean stationary plants but can be fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after basic sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies but remain window-sensitive. The protocol also informs design choices: on SWaT under log drift, replacing normalizing flows with Gaussian density reduces high-stress F1 from ~0.75 to ~0.57, and fixing a learned DAG gives a small clean-set gain (~0.5-1.0 points) but increases drift sensitivity by ~8x.

[239] On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce

Main category: cs.LG

TL;DR: Evaluation framework for chain-of-thought reasoning generalization in multimodal models using grid navigation tasks, showing limited OOD generalization despite CoT benefits for in-distribution performance.

Motivation: While reasoning integration in LLMs and VLMs has improved capabilities, generalization of reasoning models remains poorly understood. The paper aims to rigorously evaluate how well chain-of-thought approaches generalize on planning tasks.

Method: Uses grid-based navigation task where models must output move sequences from start to goal avoiding obstacles. Fine-tunes model variants with different input representations (visual/textual) and CoT strategies, systematically evaluating under in-distribution and out-of-distribution conditions.
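
The task structure is easy to make concrete; the grid encoding and move alphabet below are assumptions, not the paper's exact format:

```python
# Toy validator for the grid navigation task: a model's answer is a move
# sequence, judged by whether it reaches the goal without leaving the map
# or entering an obstacle.

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def plan_is_valid(grid, start, goal, moves):
    """grid: list of strings, '#' = obstacle, '.' = free cell."""
    r, c = start
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        out = not (0 <= r < len(grid) and 0 <= c < len(grid[0]))
        if out or grid[r][c] == "#":
            return False
    return (r, c) == goal

grid = [
    "....",
    ".##.",
    "....",
]
ok = plan_is_valid(grid, start=(0, 0), goal=(2, 3), moves="RRRDD")
bad = plan_is_valid(grid, start=(0, 0), goal=(2, 3), moves="RDDRR")
```

OOD evaluation in the paper amounts to testing the same kind of checker on maps larger than any seen in fine-tuning.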

Result: CoT reasoning improves in-distribution generalization across all representations, but out-of-distribution generalization (e.g., to larger maps) remains very limited. Reasoning traces combining multiple text formats yield best OOD generalization. Purely text-based models consistently outperform image-based approaches.

Conclusion: Current reasoning models show limited generalization beyond training distribution despite CoT benefits. Text-based representations outperform visual ones for this planning task, suggesting need for better multimodal reasoning approaches.

Abstract: Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.

[240] POP: Prior-fitted Optimizer Policies

Jan Kobiolka, Christian Frey, Gresa Shala, Arlind Kadra, Erind Bedalli, Josif Grabocka

Main category: cs.LG

TL;DR: POP is a meta-learned optimizer that predicts coordinate-wise step sizes using contextual information from optimization trajectories, trained on millions of synthetic problems and outperforming various optimization methods on benchmark functions.

Motivation: Classical gradient-based optimizers are highly sensitive to hyperparameter choices and require careful tuning of learning rates, momentum, and gradient accumulation, especially in non-convex settings. There's a need for more robust optimization methods that don't require extensive task-specific tuning.

Method: POP (Prior-fitted Optimizer Policies) is a meta-learned optimizer that predicts coordinate-wise step sizes conditioned on contextual information from the optimization trajectory. It’s trained on millions of synthetic optimization problems sampled from a novel prior spanning both convex and non-convex objectives.
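
Structurally, a POP-style step replaces a fixed learning rate with per-coordinate step sizes computed from trajectory context. The hand-set policy below is only a stand-in for the trained model, shown to make the interface concrete:

```python
# Sketch of the optimizer-policy interface: at each step, a policy maps recent
# trajectory context (here, a window of past gradients) to coordinate-wise
# step sizes. The policy itself is a toy RMS-normalizing rule, NOT POP.
import math

def policy_step_sizes(grad_history, t, base=0.2):
    """Toy stand-in policy: per-coordinate RMS normalization of the recent
    gradient history, with a decaying global scale."""
    n = len(grad_history[0])
    sizes = []
    for i in range(n):
        rms = math.sqrt(sum(g[i] ** 2 for g in grad_history) / len(grad_history))
        sizes.append(base / ((1 + t) ** 0.5 * (rms + 1e-8)))
    return sizes

def optimize(grad, x0, steps=200, memory=5):
    x, hist = list(x0), []
    for t in range(steps):
        g = grad(x)
        hist = (hist + [g])[-memory:]            # trajectory context window
        eta = policy_step_sizes(hist, t)         # context-conditioned steps
        x = [xi - e * gi for xi, e, gi in zip(x, eta, g)]
    return x

# Badly scaled quadratic f(x) = x0^2 + 100 * x1^2: per-coordinate adaptation
# matters because the two gradients differ by two orders of magnitude.
grad = lambda x: [2 * x[0], 200 * x[1]]
x_star = optimize(grad, [3.0, 3.0])
```

In POP, the hand-set rule would be replaced by a network meta-trained on millions of synthetic objectives, but the calling convention is the same: features of the trajectory in, coordinate-wise step sizes out.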

Result: POP consistently outperforms first-order gradient-based methods, non-convex optimization approaches (e.g., evolutionary strategies), Bayesian optimization, and a recent meta-learned competitor on an established benchmark of 47 optimization functions under matched budget constraints.

Conclusion: POP demonstrates strong generalization capabilities without task-specific tuning, offering a robust alternative to traditional optimization methods that require extensive hyperparameter tuning.

Abstract: Optimization refers to the task of finding extrema of an objective function. Classical gradient-based optimizers are highly sensitive to hyperparameter choices. In highly non-convex settings their performance relies on carefully tuned learning rates, momentum, and gradient accumulation. To address these limitations, we introduce POP (Prior-fitted Optimizer Policies), a meta-learned optimizer that predicts coordinate-wise step sizes conditioned on the contextual information provided in the optimization trajectory. Our model is learned on millions of synthetic optimization problems sampled from a novel prior spanning both convex and non-convex objectives. We evaluate POP on an established benchmark including 47 optimization functions of various complexity, where it consistently outperforms first-order gradient-based methods, non-convex optimization approaches (e.g., evolutionary strategies), Bayesian optimization, and a recent meta-learned competitor under matched budget constraints. Our evaluation demonstrates strong generalization capabilities without task-specific tuning.

[241] Evaluating Federated Learning for Cross-Country Mood Inference from Smartphone Sensing Data

Sharmad Kalpande, Saurabh Shirke, Haroon R. Lone

Main category: cs.LG

TL;DR: FedFAP: A feature-aware personalized federated learning framework for mood inference from smartphone sensing data across diverse populations while preserving privacy.

Motivation: Traditional mood assessment methods are infrequent and retrospective, failing to capture continuous mood instability. Smartphone sensing offers passive mood inference but faces challenges with privacy constraints, uneven sensing availability, and behavioral variability across populations.

Method: FedFAP (feature-aware personalized federated learning framework) operates in a cross-country federated learning setting where each country acts as an independent client. The framework accommodates heterogeneous sensing modalities across regions and uses population-aware personalization to handle diverse behavioral patterns.
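
One plausible reading of "feature-aware" aggregation can be sketched as averaging each sensor's weight only over the clients that actually have that sensor. This is an assumption for illustration, not the paper's exact rule:

```python
# Hedged sketch of feature-aware federated averaging across countries with
# heterogeneous sensing modalities: each client only contributes weights for
# the features it observes, and the server averages per feature.

FEATURES = ["steps", "screen_time", "gps_radius", "call_count"]

clients = {
    # country -> (available sensors, locally fitted weights for them)
    "A": ({"steps", "screen_time"},      {"steps": 0.4, "screen_time": -0.2}),
    "B": ({"steps", "gps_radius"},       {"steps": 0.6, "gps_radius": 0.1}),
    "C": ({"screen_time", "call_count"}, {"screen_time": -0.4, "call_count": 0.3}),
}

def feature_aware_average(clients):
    """Average each feature's weight over only the clients that sense it."""
    global_w = {}
    for f in FEATURES:
        ws = [w[f] for feats, w in clients.values() if f in feats]
        if ws:
            global_w[f] = sum(ws) / len(ws)
    return global_w

global_w = feature_aware_average(clients)
```

Raw data never leaves a client; only weights are shared, which is what preserves the privacy property emphasized in the paper.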

Result: FedFAP achieves an AUROC of 0.744 for mood inference, outperforming both centralized approaches and existing personalized federated baselines across geographically and culturally diverse populations.

Conclusion: The framework demonstrates how population-aware personalization and privacy-preserving learning can enable scalable mood-aware mobile sensing technologies, offering design insights for future mood-aware systems.

Abstract: Mood instability is a key behavioral indicator of mental health, yet traditional assessments rely on infrequent and retrospective reports that fail to capture its continuous nature. Smartphone-based mobile sensing enables passive, in-the-wild mood inference from everyday behaviors; however, deploying such systems at scale remains challenging due to privacy constraints, uneven sensing availability, and substantial variability in behavioral patterns. In this work, we study mood inference using smartphone sensing data in a cross-country federated learning setting, where each country participates as an independent client while retaining local data. We introduce FedFAP, a feature-aware personalized federated framework designed to accommodate heterogeneous sensing modalities across regions. Evaluations across geographically and culturally diverse populations show that FedFAP achieves an AUROC of 0.744, outperforming both centralized approaches and existing personalized federated baselines. Beyond inference, our results offer design insights for mood-aware systems, demonstrating how population-aware personalization and privacy-preserving learning can enable scalable and mood-aware mobile sensing technologies.

[242] LLM-as-Judge on a Budget

Aadirupa Saha, Aniket Wagde, Branislav Kveton

Main category: cs.LG

TL;DR: A variance-adaptive query allocation method for LLM-as-a-judge evaluation that optimizes computational budget allocation across prompt-response pairs to minimize score estimation error.

Motivation: LLM-as-a-judge evaluation requires multiple queries per prompt-response pair due to stochastic judgments, but uniform allocation across pairs is suboptimal given varying score variances. There's a need for principled budget allocation to minimize estimation error within fixed computational constraints.

Method: Proposes a variance-adaptive approach using multi-armed bandit theory and concentration inequalities. Dynamically allocates queries based on estimated score variances, concentrating computational resources where uncertainty is highest. Achieves near-optimal budget allocation with theoretical guarantees.

Result: Method achieves worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$ with near-optimal budget allocation. Experiments on Summarize-From-Feedback and HelpSteer2 datasets show significant outperformance over uniform allocation, reducing worst-case estimation error while maintaining identical budgets.

Conclusion: Establishes theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale. Provides principled approach to optimize computational budget allocation in LLM-as-a-judge evaluation systems.

Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, $\sigma_i^2$ being the unknown score variance for pair $i \in [K]$, with near-optimal budget allocation. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
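To make the allocation idea concrete, here is a minimal two-phase sketch (our illustration, not the paper's algorithm): a uniform warm-up estimates each pair's score variance, and the remaining budget is spent in proportion to those variances, which equalizes and thereby minimizes the worst-case per-pair standard error. The simulated `query` function stands in for repeated stochastic judge calls.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, warmup = 10, 2000, 10                  # pairs, total budget, warm-up per pair
true_means = rng.uniform(0, 1, K)
true_stds = rng.uniform(0.05, 0.5, K)

def query(i, n):
    """Simulate n stochastic judge scores for prompt-response pair i."""
    return rng.normal(true_means[i], true_stds[i], n)

# Phase 1: uniform warm-up to estimate per-pair score variance.
samples = [list(query(i, warmup)) for i in range(K)]
var_hat = np.array([np.var(s, ddof=1) for s in samples])

# Phase 2: spend the remaining budget proportionally to estimated variance.
# Allocating n_i ∝ sigma_i^2 equalizes sigma_i^2 / n_i across pairs, which
# minimizes the worst-case per-pair estimation error for a fixed budget.
remaining = B - warmup * K
alloc = np.floor(remaining * var_hat / var_hat.sum()).astype(int)
for i in range(K):
    samples[i].extend(query(i, alloc[i]))

est_means = np.array([np.mean(s) for s in samples])
print(np.abs(est_means - true_means).max())
```

Because of the floor, up to $K$ queries of the budget may go unspent; a practical variant would distribute the remainder greedily.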

[243] ExLipBaB: Exact Lipschitz Constant Computation for Piecewise Linear Neural Networks

Tom A. Splittgerber

Main category: cs.LG

TL;DR: Generalization of LipBaB algorithm to compute exact Lipschitz constants for arbitrary piecewise linear neural networks with various activation functions beyond just ReLU.

DetailsMotivation: While Lipschitz constants are important for robustness guarantees and regularization, existing exact computation methods are limited to ReLU-activated networks, which have downsides in Lipschitz-constrained contexts. There's a need for exact computation methods that work with more diverse activation functions.

Method: Proposes a generalization of the LipBaB algorithm to compute exact Lipschitz constants for arbitrary piecewise linear neural networks and p-norms. Supports various activations including ReLU, LeakyReLU, GroupSort, MinMax, FullSort, and MaxPool.

Result: The method enables exact Lipschitz constant computation for networks with diverse piecewise linear activation functions, overcoming limitations of previous ReLU-only approaches.

Conclusion: The generalized LipBaB algorithm provides exact Lipschitz constant computation for a broader class of neural networks, supporting benchmarking and robustness guarantee applications for networks with various piecewise linear activations.

Abstract: It has been shown that a neural network’s Lipschitz constant can be leveraged to derive robustness guarantees, to improve generalizability via regularization or even to construct invertible networks. Therefore, a number of methods varying in the tightness of their bounds and their computational cost have been developed to approximate the Lipschitz constant for different classes of networks. However, comparatively little research exists on methods for exact computation, which has been shown to be NP-hard. Nonetheless, there are applications where one might readily accept the computational cost of an exact method. These applications could include the benchmarking of new methods or the computation of robustness guarantees for small models on sensitive data. Unfortunately, existing exact algorithms restrict themselves to only ReLU-activated networks, which are known to come with severe downsides in the context of Lipschitz-constrained networks. We therefore propose a generalization of the LipBaB algorithm to compute exact Lipschitz constants for arbitrary piecewise linear neural networks and $p$-norms. With our method, networks may contain traditional activations like ReLU or LeakyReLU, activations like GroupSort or the related MinMax and FullSort, which have been of increasing interest in the context of Lipschitz constrained networks, or even other piecewise linear functions like MaxPool.
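For intuition on why exact computation is expensive, consider a brute-force baseline (our illustration, not LipBaB's branch-and-bound): on each linear region of a one-hidden-layer ReLU network the Jacobian is $W_2 \, \mathrm{diag}(d) \, W_1$ for an activation pattern $d \in \{0,1\}^h$, so maximizing the spectral norm over all $2^h$ patterns upper-bounds the global 2-norm Lipschitz constant, and equals it when every pattern is realized by some input. Branch-and-bound methods like LipBaB exist precisely to avoid this exponential enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # hidden x input
W2 = rng.normal(size=(2, 4))   # output x hidden

# Enumerate all 2^h ReLU activation patterns. On each linear region the
# network Jacobian is W2 @ diag(d) @ W1; the largest spectral norm over
# patterns upper-bounds the global 2-norm Lipschitz constant.
best = 0.0
for d in itertools.product([0.0, 1.0], repeat=W1.shape[0]):
    J = W2 @ np.diag(d) @ W1
    best = max(best, np.linalg.norm(J, 2))
print(best)
```

Infeasible patterns (never realized by any input) make this a potentially loose upper bound; pruning them is where the real algorithmic work lies.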

[244] GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia’ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan, Jinhua Du, Jinxin Liu, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Liang Xu, Lindong Wu, Lintao Ding, Lu Chen, Minghao Li, Nianyi Lin, Pan Ta, Qiang Zou, Rongjun Song, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuyi Fan, Wei Qin, Wei Tian, Weining Zhang, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xunkai Zhang, Yandong Wu, Yanfu Li, Yidong Wang, Yifan Zhu, Yijun Tan, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yipeng Geng, Yong Yan, Yonglin Tan, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, 
Yurong Wu, Yutao Zhang, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhenhe Yan, Zheyu Zhang, Zhixiang Wei, Zhuo Chen, Zhuoer Feng, Zijun Yao, Ziwei Chai, Ziyuan Wang, Zuzhou Zhang, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang

Main category: cs.LG

TL;DR: GLM-5 is a next-generation foundation model that transitions from vibe coding to agentic engineering, featuring improved training efficiency, novel reinforcement learning algorithms, and state-of-the-art performance on coding tasks.

DetailsMotivation: The paper aims to advance foundation models beyond traditional coding approaches by transitioning to agentic engineering, improving training efficiency, and enhancing model alignment and autonomy for real-world software engineering challenges.

Method: GLM-5 adopts DSA to reduce training/inference costs while maintaining long-context fidelity, implements asynchronous reinforcement learning infrastructure to decouple generation from training, and proposes novel asynchronous agent RL algorithms for complex, long-horizon interactions.

Result: GLM-5 achieves state-of-the-art performance on major open benchmarks and demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges.

Conclusion: GLM-5 represents a significant advancement in foundation models for coding and agentic engineering, with innovations in training efficiency and reinforcement learning that enable superior performance on complex software engineering tasks.

Abstract: We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

[245] Approximation Theory for Lipschitz Continuous Transformers

Takashi Furuya, Davide Murari, Carola-Bibiane Schönlieb

Main category: cs.LG

TL;DR: Lipschitz-continuous Transformers designed for stability and robustness, with universal approximation guarantees in Lipschitz-constrained function spaces.

DetailsMotivation: Transformers need stability and robustness for safety-critical applications, but existing Lipschitz-constrained architectures lack theoretical approximation guarantees.

Method: Introduce gradient-descent-type in-context Transformers with MLP and attention blocks realized as explicit Euler steps of negative gradient flows, ensuring Lipschitz continuity by construction. Use measure-theoretic formalism to analyze Transformers as operators on probability measures.

Result: Prove universal approximation theorem for this class within Lipschitz-constrained function spaces, with guarantees independent of token count.

Conclusion: Provides rigorous theoretical foundation for designing robust, Lipschitz continuous Transformer architectures with stability guarantees.

Abstract: Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model’s Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz continuous Transformer architectures.
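A minimal instance of the Euler-step construction (our notation, not taken from the paper): a residual block of the form $x \mapsto x - h\,\nabla V(x)$ with $V$ convex and $\nabla V$ $\beta$-Lipschitz is nonexpansive (1-Lipschitz) whenever $0 \le h \le 2/\beta$. By co-coercivity, $\langle \nabla V(x) - \nabla V(y),\, x - y \rangle \ge \frac{1}{\beta}\|\nabla V(x) - \nabla V(y)\|^2$, hence

$$\|(x - h\nabla V(x)) - (y - h\nabla V(y))\|^2 \le \|x - y\|^2 - h\left(\tfrac{2}{\beta} - h\right)\|\nabla V(x) - \nabla V(y)\|^2 \le \|x - y\|^2.$$

Composing such blocks keeps the end-to-end Lipschitz constant at most $1$, which is one sense in which an architecture built from explicit Euler steps of negative gradient flows is Lipschitz by construction.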

[246] On the Geometric Coherence of Global Aggregation in Federated GNN

Chethana Prasad Kabgere, Shylaja SS

Main category: cs.LG

TL;DR: GGRS framework addresses geometric failure in federated GNNs by regulating client updates based on geometric admissibility criteria to preserve relational transformation coherence.

DetailsMotivation: Standard aggregation mechanisms in federated GNNs fail when client graphs have heterogeneous structural and propagation characteristics, causing destructive interference in relational transformation space despite numerical convergence.

Method: Proposes GGRS (Global Geometric Reference Structure), a server-side framework that regulates client updates before aggregation using geometric admissibility criteria to preserve directional consistency of relational transformations and maintain diversity of propagation subspaces.

Result: Experiments on heterogeneous GNN-native and Amazon Co-purchase datasets show GGRS preserves global message-passing coherence across training rounds, demonstrating the necessity of geometry-aware regulation in federated graph learning.

Conclusion: Geometry-aware regulation is essential for federated GNNs to maintain relational behavior coherence, and GGRS provides an effective server-side solution without accessing client data or graph topology.

Abstract: Federated Learning (FL) enables distributed training across multiple clients without centralized data sharing, while Graph Neural Networks (GNNs) model relational data through message passing. In federated GNN settings, client graphs often exhibit heterogeneous structural and propagation characteristics. When standard aggregation mechanisms are applied to such heterogeneous updates, the global model may converge numerically while exhibiting degraded relational behavior. Our work identifies a geometric failure mode of global aggregation in Cross-Domain Federated GNNs. Although GNN parameters are numerically represented as vectors, they encode relational transformations that govern the direction, strength, and sensitivity of information flow across graph neighborhoods. Aggregating updates originating from incompatible propagation regimes can therefore introduce destructive interference in this transformation space. This leads to loss of coherence in global message passing. Importantly, this degradation is not necessarily reflected in conventional metrics such as loss or accuracy. To address this issue, we propose GGRS (Global Geometric Reference Structure), a server-side framework that regulates client updates prior to aggregation based on geometric admissibility criteria. GGRS preserves directional consistency of relational transformations as well as maintains diversity of admissible propagation subspaces. It also stabilizes sensitivity to neighborhood interactions, without accessing client data or graph topology. Experiments on heterogeneous GNN-native, Amazon Co-purchase datasets demonstrate that GGRS preserves global message-passing coherence across training rounds by highlighting the necessity of geometry-aware regulation in federated graph learning.
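The abstract does not spell out the admissibility criteria, but the general shape of server-side geometric regulation can be sketched as follows (an illustrative cosine-alignment filter of our own, not GGRS itself): updates whose direction disagrees with a server-side reference direction are excluded before averaging, so destructively interfering updates never reach the global model.

```python
import numpy as np

def admissible_average(updates, reference, min_cos=0.0):
    """Illustrative server-side filter (not the paper's exact criterion):
    keep only client updates whose cosine similarity with a server-side
    reference direction clears a threshold, then average the survivors."""
    ref = reference / (np.linalg.norm(reference) + 1e-12)
    keep = [u for u in updates
            if np.dot(u, ref) / (np.linalg.norm(u) + 1e-12) >= min_cos]
    if not keep:                       # fall back to plain averaging
        keep = updates
    return np.mean(keep, axis=0)

updates = [np.array([1.0, 0.2]), np.array([0.8, -0.1]), np.array([-1.0, 0.0])]
ref = np.array([1.0, 0.0])
print(admissible_average(updates, ref))   # the anti-aligned third update is filtered out
```

GGRS additionally aims to preserve diversity of propagation subspaces, which a single-direction filter like this cannot capture.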

[247] The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, Chris Cundy

Main category: cs.LG

TL;DR: Training AI systems against white-box deception detectors can lead to obfuscation strategies where models learn to hide deception rather than become honest, with two main obfuscation approaches emerging in realistic coding environments.

DetailsMotivation: The paper addresses the challenge of training AI systems to be honest using white-box deception detectors. While such training aims to promote honesty, there's a risk that models might learn to obfuscate their deception to evade detection rather than genuinely becoming honest. Prior work studied this in artificial settings, so the authors aim to examine obfuscation in more realistic environments.

Method: The authors construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs. They introduce a taxonomy of possible outcomes when training against a deception detector and empirically study obfuscation emergence. They analyze two obfuscation strategies: obfuscated activations (modifying internal representations) and obfuscated policies (outputting deceptive text that evades detection). They use theoretical analysis of policy gradient methods and empirical evaluation with KL regularization and detector penalties.

Result: Obfuscation emerges in realistic coding environments. Obfuscated activations arise from representation drift during RL, with or without detector penalty. The probe penalty only incentivizes obfuscated policies, which is theoretically expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.

Conclusion: White-box deception detectors can be effective training signals for promoting honesty in AI systems, but careful design is needed to prevent obfuscation strategies. With appropriate regularization and penalty settings, honest policies can be achieved even in tasks prone to reward hacking.

Abstract: Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The probe penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.
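The basic training signal at issue can be sketched as follows (our simplified rendering of the setup, not the paper's implementation): a linear probe reads a hidden activation and scores "deception", and that score is subtracted from the task reward. The obfuscation findings concern what the policy learns to do to this penalty term over the course of RL.

```python
import numpy as np

def probe_score(activations, w, b):
    """Linear deception probe: sigmoid(w . h + b) on a hidden activation."""
    return 1.0 / (1.0 + np.exp(-(activations @ w + b)))

def shaped_reward(task_reward, activations, w, b, lam=1.0):
    """Illustrative shaped reward: task reward minus a penalty proportional
    to the probe's deception score (hypothetical probe weights w, b)."""
    return task_reward - lam * probe_score(activations, w, b)

w, b = np.array([1.0, -2.0]), 0.0
honest = np.array([-2.0, 1.0])      # probe fires weakly on this activation
deceptive = np.array([3.0, -1.0])   # probe fires strongly on this one
print(shaped_reward(1.0, honest, w, b), shaped_reward(1.0, deceptive, w, b))
```

"Obfuscated activations" correspond to the policy's representations drifting so that `probe_score` stays low even on deceptive outputs; "obfuscated policies" evade the detector through the text itself.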

[248] Guided Diffusion by Optimized Loss Functions on Relaxed Parameters for Inverse Material Design

Jens U. Kreber, Christian Weißenfels, Joerg Stueckler

Main category: cs.LG

TL;DR: A diffusion model-based inverse design method for composite materials that relaxes discrete design spaces into continuous representations, enabling gradient-based optimization through differentiable FEM simulations.

DetailsMotivation: Inverse design problems in engineering often involve discrete parameters and constraints that prevent gradient-based optimization, while multiple design parameters can yield similar outputs requiring multimodal probabilistic approaches.

Method: Relax discrete design space into continuous grid representation, train diffusion model as prior on relaxed space, sample via guided diffusion using gradients from differentiable FEM simulation, then backproject to original parameter space.

Result: Method generates diverse composite material designs matching specified bulk modulus within 1% relative error in 2D/3D settings, and can minimize material density simultaneously via multi-objective loss.

Conclusion: Diffusion models combined with differentiable simulations enable effective inverse design for problems with discrete constraints, providing diverse solutions while maintaining accuracy.

Abstract: Inverse design problems are common in engineering and materials science. The forward direction, i.e., computing output quantities from design parameters, typically requires running a numerical simulation, such as a FEM, as an intermediate step, which is an optimization problem by itself. In many scenarios, several design parameters can lead to the same or similar output values. For such cases, multi-modal probabilistic approaches are advantageous to obtain diverse solutions. A major difficulty in inverse design stems from the structure of the design space, since discrete parameters or further constraints disallow the direct use of gradient-based optimization. To tackle this problem, we propose a novel inverse design method based on diffusion models. Our approach relaxes the original design space into a continuous grid representation, where gradients can be computed by implicit differentiation in the forward simulation. A diffusion model is trained on this relaxed parameter space in order to serve as a prior for plausible relaxed designs. Parameters are sampled by guided diffusion using gradients that are propagated from an objective function specified at inference time through the differentiable simulation. A design sample is obtained by backprojection into the original parameter space. We develop our approach for a composite material design problem where the forward process is modeled as a linear FEM problem. We evaluate the performance of our approach in finding designs that match a specified bulk modulus. We demonstrate that our method can propose diverse designs within 1% relative error margin from medium to high target bulk moduli in 2D and 3D settings. We also demonstrate that the material density of generated samples can be minimized simultaneously by using a multi-objective loss function.
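The guided-sampling loop can be sketched schematically (our toy, not the paper's sampler): each reverse step applies the diffusion model's denoising update and then nudges the sample down the gradient of the inference-time objective, with a scalar toy objective standing in for gradients propagated through the differentiable FEM simulation.

```python
import numpy as np

def guided_step(x, t, denoise, objective_grad, guidance_scale=0.1):
    """One schematic guided-diffusion update: the model's denoising step,
    followed by a gradient nudge from the inference-time objective (here a
    stand-in for gradients through a differentiable FEM simulation)."""
    x = denoise(x, t)
    return x - guidance_scale * objective_grad(x)

# Toy example: the "prior" shrinks toward 0, the objective pulls toward 2,
# and sampling settles between the two.
denoise = lambda x, t: 0.9 * x
objective_grad = lambda x: 2.0 * (x - 2.0)   # gradient of (x - 2)^2

x = np.array([5.0])
for t in reversed(range(50)):
    x = guided_step(x, t, denoise, objective_grad, guidance_scale=0.1)
print(x)
```

In the paper's setting the final sample lives in the relaxed continuous space and is backprojected into the original discrete design space.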

[249] CEPAE: Conditional Entropy-Penalized Autoencoders for Time Series Counterfactuals

Tomàs Garriga, Gerard Sanz, Eduard Serrahima de Cambra, Axel Brando

Main category: cs.LG

TL;DR: CEPAE: A novel autoencoder-based approach for counterfactual inference on time series data using entropy penalization for disentangled representations, outperforming existing methods on synthetic and real-world datasets.

DetailsMotivation: Counterfactual inference on time series is crucial for decision-making in fields like finance, healthcare, and marketing, but existing methods are not well-suited for time series data impacted by market events. The paper addresses this gap motivated by an industrial application.

Method: Uses Structural Causal Model framework with abduction-action-prediction procedure. First adapts variational autoencoders and adversarial autoencoders for time series, then introduces CEPAE (Conditional Entropy-Penalized Autoencoder) which employs entropy penalization loss over latent space to encourage disentangled representations.

Result: CEPAE generally outperforms other approaches on synthetic, semi-synthetic, and real-world datasets across evaluated metrics. The approach is validated both theoretically and experimentally.

Conclusion: CEPAE provides an effective autoencoder-based approach for counterfactual inference on time series data, with entropy penalization enabling better disentangled representations and improved performance over existing methods.

Abstract: The ability to accurately perform counterfactual inference on time series is crucial for decision-making in fields like finance, healthcare, and marketing, as it allows us to understand the impact of events or treatments on outcomes over time. In this paper, we introduce a new counterfactual inference approach tailored to time series data impacted by market events, which is motivated by an industrial application. Utilizing the abduction-action-prediction procedure and the Structural Causal Model framework, we first adapt methods based on variational autoencoders and adversarial autoencoders, both previously used in counterfactual literature although not in time series settings. Then, we present the Conditional Entropy-Penalized Autoencoder (CEPAE), a novel autoencoder-based approach for counterfactual inference, which employs an entropy penalization loss over the latent space to encourage disentangled data representations. We validate our approach both theoretically and experimentally on synthetic, semi-synthetic, and real-world datasets, showing that CEPAE generally outperforms the other approaches in the evaluated metrics.

[250] 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization

Sohir Maskey, Constantin Eichenberg, Johannes Messner, Douglas Orr

Main category: cs.LG

TL;DR: K-means quantization outperforms integer formats for LLMs, with 1-bit weights achieving best downstream performance under fixed memory budget.

DetailsMotivation: Current QAT methods lack exploration of full quantization design space, with poor understanding of trade-offs between quantization and downstream performance, often relying only on perplexity-based evaluations.

Method: Empirical study of QAT in low-bit regime, comparing k-means based weight quantization vs integer formats, evaluating under fixed inference memory budget constraints.

Result: K-means quantization outperforms integer formats and can be efficiently implemented on standard hardware; 1-bit quantized weights achieve best performance on generative downstream tasks under fixed memory budget.

Conclusion: K-means quantization is superior to integer formats for LLM compression, and 1-bit weight quantization offers optimal downstream performance within memory constraints.

Abstract: Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The full design space of quantization is not fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with $1$-bit quantized weights.
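The core quantizer is easy to state. Below is a minimal sketch (illustrative only; the paper's QAT additionally trains the network through this mapping, e.g. with a straight-through estimator): cluster the weights of a tensor into $2^{\text{bits}}$ scalar centroids with 1-D k-means and replace each weight by its nearest centroid.

```python
import numpy as np

def kmeans_quantize(w, bits=1, iters=50, rng=None):
    """Minimal k-means weight quantization: cluster the weights into
    2**bits scalar centroids and snap each weight to its nearest one."""
    rng = rng or np.random.default_rng(0)
    flat = w.ravel()
    k = 2 ** bits
    centroids = rng.choice(flat, size=k, replace=False)
    for _ in range(iters):
        # Assign each weight to its nearest centroid, then recompute means.
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = flat[assign == j].mean()
    return centroids[assign].reshape(w.shape), centroids

w = np.random.default_rng(1).normal(size=(64, 64))
wq, codebook = kmeans_quantize(w, bits=1)
print(codebook.size, np.unique(wq).size)   # 2 centroids at 1 bit
```

Unlike integer formats, the codebook adapts to the weight distribution, which is one intuition for why k-means quantization fares better in the low-bit regime.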

[251] Accelerated Predictive Coding Networks via Direct Kolen-Pollack Feedback Alignment

Davide Casnici, Martin Lefebvre, Justin Dauwels, Charlotte Frenkel

Main category: cs.LG

TL;DR: Direct Kolen-Pollack predictive coding (DKP-PC) improves biologically-inspired neural network training by using learnable feedback connections to eliminate depth-dependent error propagation delays and vanishing updates in early layers.

DetailsMotivation: Standard predictive coding (PC) has two key limitations: error signals must propagate through multiple inference-phase steps from output to early layers, and feedback decays exponentially during this process, causing vanishing updates in early layers. These issues limit PC's practical efficiency and scalability.

Method: DKP-PC introduces learnable feedback connections from the output layer to all hidden layers, establishing direct pathways for error transmission. This combines direct feedback alignment and direct Kolen-Pollack algorithms to simultaneously address feedback delay and exponential decay while preserving update locality.

Result: DKP-PC reduces theoretical error propagation time complexity from O(L) to O(1) (where L is network depth), eliminating depth-dependent delay. It achieves performance at least comparable to, and often exceeding, standard PC while offering improved latency and computational performance.

Conclusion: DKP-PC provides a more efficient and scalable variant of predictive coding that addresses key limitations of standard PC, making it suitable for custom hardware-efficient implementations while maintaining biological plausibility through local updates.

Abstract: Predictive coding (PC) is a biologically inspired algorithm for training neural networks that relies only on local updates, allowing parallel learning across layers. However, practical implementations face two key limitations: error signals must still propagate from the output to early layers through multiple inference-phase steps, and feedback decays exponentially during this process, leading to vanishing updates in early layers. We propose direct Kolen-Pollack predictive coding (DKP-PC), which simultaneously addresses both feedback delay and exponential decay, yielding a more efficient and scalable variant of PC while preserving update locality. Leveraging direct feedback alignment and direct Kolen-Pollack algorithms, DKP-PC introduces learnable feedback connections from the output layer to all hidden layers, establishing a direct pathway for error transmission. This yields an algorithm that reduces the theoretical error propagation time complexity from O(L), with L being the network depth, to O(1), removing depth-dependent delay in error signals. Moreover, empirical results demonstrate that DKP-PC achieves performance at least comparable to, and often exceeding, that of standard PC, while offering improved latency and computational performance, supporting its potential for custom hardware-efficient implementations.
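The two ingredients, direct feedback and Kolen-Pollack-style learnable feedback weights, can be sketched on a two-layer network (a schematic of ours, not the full DKP-PC algorithm with its inference phase): the output error reaches the hidden layer in one step through a feedback matrix `B1`, and `B1` is updated as the transpose of the forward update so it drifts toward `W2.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out, lr = 4, 8, 2, 0.01
W1 = rng.normal(scale=0.5, size=(n_h, n_in))
W2 = rng.normal(scale=0.5, size=(n_out, n_h))
B1 = rng.normal(scale=0.5, size=(n_h, n_out))   # learnable direct feedback

losses = []
for _ in range(2000):
    x = rng.normal(size=n_in)
    y = np.array([x[0] + x[1], x[2] + x[3]])    # toy regression target
    h = np.maximum(W1 @ x, 0.0)                 # ReLU hidden layer
    e = W2 @ h - y                              # output error
    losses.append(0.5 * e @ e)
    # Direct feedback: the output error reaches the hidden layer through B1
    # in a single step instead of propagating back through W2.
    d1 = (B1 @ e) * (h > 0)
    W2 -= lr * np.outer(e, h)
    W1 -= lr * np.outer(d1, x)
    # Kolen-Pollack-style rule: mirror the forward update (its transpose),
    # so B1 drifts toward W2.T over training.
    B1 -= lr * np.outer(h, e)
print(losses[0], losses[-1])
```

Because the hidden-layer error signal never waits on a layer-by-layer backward (or multi-step inference) pass, the error-propagation delay is constant in depth, which is the O(L) to O(1) claim in miniature.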

[252] Uniform error bounds for quantized dynamical models

Abdelkader Metakalard, Fabien Lauer, Kevin Colin, Marion Gilson

Main category: cs.LG

TL;DR: Statistical guarantees for dynamical models learned from dependent data, with uniform error bounds for quantized models and imperfect optimization in system identification.

DetailsMotivation: To provide rigorous statistical guarantees for dynamical models learned from dependent data sequences, addressing practical challenges in system identification including model quantization and imperfect optimization algorithms.

Method: Develops two families of uniform error bounds: 1) slow-rate bounds via block decomposition, and 2) fast-rate, variance-adaptive bounds via a novel spaced-point strategy. The bounds scale with the number of bits required to encode models.

Result: Provides statistical error bounds that translate hardware constraints (model quantization) into interpretable statistical complexities, offering guarantees for practical system identification scenarios.

Conclusion: The paper establishes statistical foundations for learning dynamical models from dependent data with practical constraints, bridging hardware limitations and statistical learning theory.

Abstract: This paper provides statistical guarantees on the accuracy of dynamical models learned from dependent data sequences. Specifically, we develop uniform error bounds that apply to quantized models and imperfect optimization algorithms commonly used in practical contexts for system identification, and in particular hybrid system identification. Two families of bounds are obtained: slow-rate bounds via a block decomposition and fast-rate, variance-adaptive, bounds via a novel spaced-point strategy. The bounds scale with the number of bits required to encode the model and thus translate hardware constraints into interpretable statistical complexities.

[253] A unified theory of feature learning in RNNs and DNNs

Jan P. Bauer, Kirsten Fischer, Moritz Helias, Agostina Palmigiano

Main category: cs.LG

TL;DR: A unified mean-field theory connects RNNs and DNNs through representational kernels, showing weight sharing in RNNs creates correlated representations across timesteps and aids generalization in sequential tasks.

DetailsMotivation: To understand how the structural similarity between RNNs and DNNs (differing only by weight sharing) relates to their distinct functional properties, and to develop a unified theoretical framework connecting architectural structure to functional biases.

Method: Developed a unified mean-field theory for RNNs and DNNs using representational kernels, analyzing fully trained networks in the feature learning (μP) regime. The theory casts training as Bayesian inference over sequences and patterns, revealing functional implications of RNNs’ weight sharing.

Result: In DNN-typical tasks, identified a phase transition: below a threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across timesteps. For sequential tasks, RNNs’ weight sharing induces an inductive bias that aids generalization by interpolating unsupervised time steps.

Conclusion: The theory provides a principled way to connect architectural structure (specifically weight sharing in RNNs) to functional biases, explaining how RNNs develop temporal correlations and generalize better in sequential tasks compared to DNNs.

Abstract: Recurrent and deep neural networks (RNNs/DNNs) are cornerstone architectures in machine learning. Remarkably, RNNs differ from DNNs only by weight sharing, as can be shown through unrolling in time. How does this structural similarity fit with the distinct functional properties these networks exhibit? To address this question, we here develop a unified mean-field theory for RNNs and DNNs in terms of representational kernels, describing fully trained networks in the feature learning (μP) regime. This theory casts training as Bayesian inference over sequences and patterns, directly revealing the functional implications induced by the RNNs’ weight sharing. In DNN-typical tasks, we identify a phase transition when the learning signal overcomes the noise due to randomness in the weights: below this threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across timesteps. For sequential tasks, the RNNs’ weight sharing furthermore induces an inductive bias that aids generalization by interpolating unsupervised time steps. Overall, our theory offers a way to connect architectural structure to functional biases.

Zakaria Shams Siam, Xuefeng Liu, Chong Liu

Main category: cs.LG

TL;DR: A novel multi-objective coverage (MOC) problem formulation for identifying representative samples covering feasible multi-objective space, with applications in drug discovery and materials design.

DetailsMotivation: Accelerating scientific discovery in critical applications such as drug discovery requires identifying small representative sample sets that cover the feasible multi-objective space, since such sets can be evaluated far faster than the whole feasible set.

Method: Proposed MOC-CAS algorithm using upper confidence bound-based acquisition function guided by Gaussian process posterior predictions, with smoothed relaxation of hard feasibility test and approximate optimizer.

Result: MOC-CAS achieves superior performance compared to competitive baselines across large-scale protein-target datasets for SARS-CoV-2 and cancer, assessed on five objectives from SMILES-based features.

Conclusion: The MOC problem formulation and MOC-CAS algorithm effectively address the need for representative sampling in multi-objective spaces for accelerating scientific discovery in drug discovery applications.

Abstract: In this paper, we formulate the new multi-objective coverage (MOC) problem where our goal is to identify a small set of representative samples whose predicted outcomes broadly cover the feasible multi-objective space. This problem is of great importance in many critical real-world applications, e.g., drug discovery and materials design, as this representative set can be evaluated much faster than the whole feasible set, thus significantly accelerating the scientific discovery process. Existing works cannot be directly applied as they either focus on sample space coverage or multi-objective optimization that targets the Pareto front. However, chemically diverse samples often yield identical objective profiles, and safety constraints are usually defined on the objectives. To solve this MOC problem, we propose a novel search algorithm, MOC-CAS, which employs an upper confidence bound-based acquisition function to select optimistic samples guided by Gaussian process posterior predictions. For enabling efficient optimization, we develop a smoothed relaxation of the hard feasibility test and derive an approximate optimizer. Compared to the competitive baselines, we show that our MOC-CAS empirically achieves superior performances across large-scale protein-target datasets for SARS-CoV-2 and cancer, each assessed on five objectives derived from SMILES-based features.

[255] Certified Per-Instance Unlearning Using Individual Sensitivity Bounds

Hanna Benarroch, Jamal Atif, Olivier Cappé

Main category: cs.LG

TL;DR: Certified machine unlearning via adaptive per-instance noise calibration instead of worst-case sensitivity, reducing performance degradation while maintaining formal guarantees.

DetailsMotivation: Traditional certified unlearning uses worst-case sensitivity calibration leading to excessive noise injection and performance degradation, limiting practical applicability. The paper aims to develop more efficient unlearning with adaptive per-instance noise calibration.

Method: Uses per-instance differential privacy to define individual data point sensitivities in noisy gradient dynamics. For ridge regression trained via Langevin dynamics, derives high-probability per-instance sensitivity bounds for certified unlearning with less noise injection.

Result: Achieves certified unlearning with substantially less noise injection compared to worst-case sensitivity approaches. Validated through experiments in linear settings and provides empirical evidence in deep learning settings.

Conclusion: Adaptive per-instance noise calibration enables more practical certified unlearning with better performance while maintaining formal guarantees, showing promise for broader applicability beyond linear models.

Abstract: Certified machine unlearning can be achieved via noise injection leading to differential privacy guarantees, where noise is calibrated to worst-case sensitivity. Such conservative calibration often results in performance degradation, limiting practical applicability. In this work, we investigate an alternative approach based on adaptive per-instance noise calibration tailored to the individual contribution of each data point to the learned solution. This raises the following challenge: how can one establish formal unlearning guarantees when the mechanism depends on the specific point to be removed? To define individual data point sensitivities in noisy gradient dynamics, we consider the use of per-instance differential privacy. For ridge regression trained via Langevin dynamics, we derive high-probability per-instance sensitivity bounds, yielding certified unlearning with substantially less noise injection. We corroborate our theoretical findings through experiments in linear settings and provide further empirical evidence on the relevance of the approach in deep learning settings.
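The core idea, calibrating noise to each point's own influence rather than to a worst-case bound, can be illustrated for ridge regression by measuring how far the solution moves when one point is deleted. A toy sketch using closed-form ridge (not the paper's Langevin-dynamics analysis); the regularization strength and privacy parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def per_instance_sensitivity(X, y, i, lam=1.0):
    # Per-instance sensitivity: how much the ridge solution moves when
    # the i-th point is deleted -- the quantity per-instance calibration
    # uses instead of a worst-case bound over all points.
    theta = ridge(X, y, lam)
    mask = np.arange(len(y)) != i
    theta_i = ridge(X[mask], y[mask], lam)
    return float(np.linalg.norm(theta - theta_i))

X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

sens = np.array([per_instance_sensitivity(X, y, i) for i in range(len(y))])
worst = sens.max()

# Gaussian-mechanism-style noise scale proportional to the calibrating
# sensitivity: per-instance calibration injects less noise for typical points.
eps, delta = 1.0, 1e-5
sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps
sigma_worst = worst * np.sqrt(2 * np.log(1.25 / delta)) / eps
```

The gap between `sigma.mean()` and `sigma_worst` is exactly the utility headroom the paper's per-instance bounds aim to recover.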

[256] Symbolic recovery of PDEs from measurement data

Erion Morina, Philipp Scholl, Martin Holler

Main category: cs.LG

TL;DR: The paper presents a method for symbolic PDE identification using rational function neural networks, with theoretical identifiability guarantees and empirical validation.

DetailsMotivation: PDE models are essential for describing complex natural phenomena, but identifying the underlying physical laws from noisy measurements typically doesn't yield interpretable symbolic expressions, hindering understanding.

Method: Uses neural network architectures based on rational functions for symbolic representation of physical laws, leveraging rational functions’ approximation power and arithmetic flexibility. Provides identifiability results showing unique reconstruction of simplest physical laws from noiseless measurements.

Result: Theoretical identifiability guarantees show symbolic networks can uniquely reconstruct simplest physical laws within PDE models. Regularization promotes interpretability and sparsity. Empirical validation with ParFam architecture supports practical reconstructibility.

Conclusion: Rational function-based neural networks provide a viable approach for symbolic PDE identification with theoretical guarantees and practical applicability, enabling interpretable physical law discovery from measurements.

Abstract: Models based on partial differential equations (PDEs) are powerful for describing a wide range of complex relationships in the natural sciences. Accurately identifying the PDE model, which represents the underlying physical law, is essential for a proper understanding of the problem. This reconstruction typically relies on indirect and noisy measurements of the system’s state and, without specifically tailored methods, rarely yields symbolic expressions, thereby hindering interpretability. In this work, we address this issue by considering existing neural network architectures based on rational functions for the symbolic representation of physical laws. These networks leverage the approximation power of rational functions while also benefiting from their flexibility in representing arithmetic operations. Our main contribution is an identifiability result, showing that, in the limit of noiseless, complete measurements, such symbolic networks can uniquely reconstruct the simplest physical law within the PDE model. Specifically, reconstructed laws remain expressible within the symbolic network architecture, with regularization-minimizing parameterizations promoting interpretability and sparsity in case of $L^1$-regularization. In addition, we provide regularity results for symbolic networks. Empirical validation using the ParFam architecture supports these theoretical findings, providing evidence for the practical reconstructibility of physical laws.
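The appeal of rational parameterizations is that division is represented exactly rather than approximated. A minimal illustration of the building block (a single rational function evaluated with NumPy, not the ParFam architecture itself):

```python
import numpy as np

def rational(x, p_coef, q_coef):
    # A rational function p(x)/q(x): polynomial numerator and denominator,
    # so laws involving division are expressible exactly, while remaining
    # differentiable in the coefficients for training.
    num = np.polyval(p_coef, x)
    den = np.polyval(q_coef, x)
    return num / den

# Example: 1/(1+x^2) is exactly representable with p = 1, q = x^2 + 1,
# whereas a polynomial can only approximate it on a bounded interval.
x = np.linspace(-3, 3, 101)
y = rational(x, p_coef=[1.0], q_coef=[1.0, 0.0, 1.0])
```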

[257] DNN-Enabled Multi-User Beamforming for Throughput Maximization under Adjustable Fairness

Kaifeng Lu, Markus Rupp, Stefan Schwarz

Main category: cs.LG

TL;DR: Proposes an optimization-based unsupervised learning approach using wireless transformer (WiT) architecture to balance fairness and sum rate in wireless communications through Lagrangian formulation with automatic dual-ascent updates.

DetailsMotivation: Addressing the fundamental challenge of ensuring user fairness in wireless communications, which involves balancing the trade-off between fairness and sum rate - a non-convex, multi-objective optimization problem whose complexity grows with network scale.

Method: Uses wireless transformer (WiT) architecture that learns from channel state information (CSI) features. Reformulates the trade-off by combining sum rate and fairness objectives through a Lagrangian multiplier, updated automatically via dual-ascent algorithm. This allows controllable fairness constraint while maximizing sum rate.

Result: The approach offers a flexible solution for managing trade-off optimization under prescribed fairness, effectively realizing a trace on the Pareto front between the two conflicting objectives.

Conclusion: The proposed optimization-based unsupervised learning approach with WiT architecture provides an effective method for balancing fairness and sum rate in wireless communications through automatic Lagrangian multiplier updates.

Abstract: Ensuring user fairness in wireless communications is a fundamental challenge, as balancing the trade-off between fairness and sum rate leads to a non-convex, multi-objective optimization whose complexity grows with network scale. To alleviate this conflict, we propose an optimization-based unsupervised learning approach based on the wireless transformer (WiT) architecture that learns from channel state information (CSI) features. We reformulate the trade-off by combining the sum rate and fairness objectives through a Lagrangian multiplier, which is updated automatically via a dual-ascent algorithm. This mechanism allows for a controllable fairness constraint while simultaneously maximizing the sum rate, effectively realizing a trace on the Pareto front between two conflicting objectives. Our findings show that the proposed approach offers a flexible solution for managing the trade-off optimization under prescribed fairness.
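The Lagrangian reformulation with automatic dual-ascent multiplier updates can be sketched on a toy two-user power-split problem. Here a grid search stands in for the WiT network's primal update; the channel gains, fairness target `tau`, and step size are invented for illustration.

```python
import numpy as np

g1, g2, tau = 4.0, 1.0, 0.55          # toy channel gains and min-rate target
p_grid = np.linspace(0.0, 1.0, 1001)  # candidate power splits between users

def rates(p):
    return np.log2(1 + g1 * p), np.log2(1 + g2 * (1 - p))

lam, eta = 0.0, 0.5
for _ in range(200):
    r1, r2 = rates(p_grid)
    # Primal step: maximize sum rate plus lam-weighted fairness slack
    # (grid search standing in for the learned network update).
    obj = (r1 + r2) + lam * (np.minimum(r1, r2) - tau)
    p = p_grid[np.argmax(obj)]
    # Dual ascent: grow the multiplier while fairness is violated,
    # shrink it (toward zero) once the constraint is met.
    r1s, r2s = rates(p)
    lam = max(0.0, lam + eta * (tau - min(r1s, r2s)))

r1f, r2f = rates(p)
```

At convergence the multiplier settles where the minimum rate meets `tau`, tracing one point on the fairness/sum-rate Pareto front; sweeping `tau` traces the rest.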

[258] Beyond ReLU: Bifurcation, Oversmoothing, and Topological Priors

Erkan Turan, Gaspard Abel, Maysam Behmanesh, Emery Pierson, Maks Ovsjanikov

Main category: cs.LG

TL;DR: Theoretical analysis of GNN oversmoothing using bifurcation theory, showing that replacing monotone activations can destabilize homogeneous states and create stable non-homogeneous patterns that resist oversmoothing.

DetailsMotivation: Deep Graph Neural Networks suffer from oversmoothing where node features converge to homogeneous, non-informative states. The authors aim to understand this representational collapse from a dynamical systems perspective using bifurcation theory.

Method: The authors reframe oversmoothing as convergence to a stable homogeneous fixed point. Using Lyapunov-Schmidt reduction, they analytically prove that replacing standard monotone activations (like ReLU) with a specific class of functions induces a bifurcation that destabilizes the homogeneous state and creates stable non-homogeneous patterns.

Result: The theory predicts a precise scaling law for the amplitude of emergent patterns, which is quantitatively validated in experiments. The authors also derive a closed-form, bifurcation-aware initialization that shows utility in real benchmark experiments.

Conclusion: Oversmoothing in GNNs can be understood and mitigated through bifurcation theory by replacing monotone activations, which creates stable non-homogeneous patterns that resist representational collapse.

Abstract: Graph Neural Networks (GNNs) learn node representations through iterative network-based message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a bifurcation theory perspective, characterizing oversmoothing as convergence to a stable “homogeneous fixed point.” Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous patterns that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.
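The oversmoothing phenomenon itself is easy to reproduce: repeated mean aggregation with a monotone activation drives node features toward the homogeneous fixed point. A toy demonstration on a random graph with a ring backbone added for connectivity (this shows the collapse the paper analyzes, not its bifurcation construction):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.maximum(A, A.T)                    # symmetric random edges
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0   # ring keeps graph connected
np.fill_diagonal(A, 1.0)                  # self-loops
P = A / A.sum(1, keepdims=True)           # row-normalized propagation

x = rng.normal(size=n)
spread0 = float(x.std())
for _ in range(200):
    x = np.maximum(P @ x, 0.0)            # mean aggregation + ReLU
spread = float(x.std())
```

After many layers the feature spread across nodes collapses toward zero, the "stable homogeneous fixed point"; the paper's non-monotone activations are designed to destabilize exactly this state.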

[259] The Stationarity Bias: Stratified Stress-Testing for Time-Series Imputation in Regulated Dynamical Systems

Amirreza Dolatpour Fathkouhi, Alireza Namazi, Heman Shakeri

Main category: cs.LG

TL;DR: Paper identifies Stationarity Bias in time-series imputation benchmarks and proposes Stratified Stress-Test to evaluate models separately on stationary vs transient regimes, showing linear methods fail during critical transients despite good RMSE scores.

DetailsMotivation: Current time-series imputation benchmarks use uniform random masking and shape-agnostic metrics like MSE/RMSE, which create a systematic Stationarity Bias. This bias makes simple methods appear superior because they predominantly sample easy, low-entropy stationary regimes where these methods trivially succeed, while failing to evaluate performance during critical transient events.

Method: Proposes Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes. Uses Continuous Glucose Monitoring (CGM) as testbed with ground-truth forcing functions (meals, insulin) for precise regime identification. Derives empirical missingness distributions from clinical trials and imposes them on complete training data to prevent models from exploiting unrealistically clean observations.

Result: Three key findings: (1) Linear interpolation achieves state-of-the-art reconstruction during stable intervals, (2) During critical transients, linear methods show drastically degraded morphological fidelity despite good RMSE (RMSE Mirage), (3) Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical tasks.

Conclusion: The proposed evaluation framework addresses Stationarity Bias by separating performance assessment into different regimes, revealing that complex models are necessary for safety-critical transient events despite their computational cost. The framework generalizes to any regulated system where routine stationarity dominates critical transients.

Abstract: Time-series imputation benchmarks employ uniform random masking and shape-agnostic metrics (MSE, RMSE), implicitly weighting evaluation by regime prevalence. In systems with a dominant attractor – homeostatic physiology, nominal industrial operation, stable network traffic – this creates a systematic Stationarity Bias: simple methods appear superior because the benchmark predominantly samples the easy, low-entropy regime where they trivially succeed. We formalize this bias and propose a Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) as a testbed – chosen for its rigorous ground-truth forcing functions (meals, insulin) that enable precise regime identification – we establish three findings with broad implications: (i) Stationary Efficiency: Linear interpolation achieves state-of-the-art reconstruction during stable intervals, confirming that complex architectures are computationally wasteful in low-entropy regimes. (ii) Transient Fidelity: During critical transients (post-prandial peaks, hypoglycemic events), linear methods exhibit drastically degraded morphological fidelity (DTW), disproportionate to their RMSE – a phenomenon we term the RMSE Mirage, where low pointwise error masks the destruction of signal shape. (iii) Regime-Conditional Model Selection: Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical downstream tasks. We further derive empirical missingness distributions from clinical trials and impose them on complete training data, preventing models from exploiting unrealistically clean observations and encouraging robustness under real-world missingness. This framework generalizes to any regulated system where routine stationarity dominates critical transients.
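The stratified evaluation idea reduces to scoring the same imputation separately on each regime. A toy sketch with an invented flat signal plus one transient peak, showing how a peak-missing imputer can look acceptable on the pooled metric while failing badly on the transient stratum:

```python
import numpy as np

def stratified_rmse(y_true, y_pred, transient_mask):
    # Stratified stress-test idea: score imputations separately on
    # stationary vs. transient regimes instead of one pooled RMSE.
    def rmse(a, b):
        return float(np.sqrt(np.mean((a - b) ** 2)))
    return {
        "pooled": rmse(y_true, y_pred),
        "stationary": rmse(y_true[~transient_mask], y_pred[~transient_mask]),
        "transient": rmse(y_true[transient_mask], y_pred[transient_mask]),
    }

# Toy signal: mostly flat (stationary) with a short peak (transient).
t = np.arange(200)
mask = (t > 90) & (t < 110)
y = np.where(mask, 5.0, 0.0)
flat_pred = np.zeros_like(y)    # an imputer that reproduces the dominant regime
scores = stratified_rmse(y, flat_pred, mask)
```

Because the stationary regime dominates the samples, the pooled score stays modest even though the transient error equals the full peak amplitude, which is the Stationarity Bias in miniature.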

[260] Continuous-Time Piecewise-Linear Recurrent Neural Networks

Alena Brändle, Lukas Eisenmann, Florian Götz, Daniel Durstewitz

Main category: cs.LG

TL;DR: Continuous-time piecewise-linear RNNs (cPLRNNs) for dynamical systems reconstruction that handle irregular time intervals while maintaining mathematical tractability.

DetailsMotivation: Current piecewise-linear RNNs are discrete-time models, which conflicts with the continuous-time nature of most physical/biological processes and cannot handle irregular temporal data. Neural ODEs are continuous but lack the performance and tractability of PLRNNs.

Method: Developed continuous-time PLRNNs (cPLRNNs) with a novel training algorithm that bypasses numerical integration by exploiting piecewise-linear structure. The method allows semi-analytical determination of topological objects like equilibria and limit cycles.

Result: cPLRNNs outperform both discrete-time PLRNNs and Neural ODEs on dynamical systems reconstruction benchmarks, including systems with discontinuities and hard thresholds.

Conclusion: cPLRNNs provide a continuous-time alternative to discrete PLRNNs that maintains mathematical tractability while handling irregular time intervals and achieving superior reconstruction performance.

Abstract: In dynamical systems reconstruction (DSR) we aim to recover the dynamical system (DS) underlying observed time series. Specifically, we aim to learn a generative surrogate model which approximates the underlying, data-generating DS, and recreates its long-term properties (‘climate statistics’). In scientific and medical areas, in particular, these models need to be mechanistically tractable – through their mathematical analysis we would like to obtain insight into the recovered system’s workings. Piecewise-linear (PL), ReLU-based RNNs (PLRNNs) have a strong track-record in this regard, representing SOTA DSR models while allowing mathematical insight by virtue of their PL design. However, all current PLRNN variants are discrete-time maps. This is in disaccord with the assumed continuous-time nature of most physical and biological processes, and makes it hard to accommodate data arriving at irregular temporal intervals. Neural ODEs are one solution, but they do not reach the DSR performance of PLRNNs and often lack their tractability. Here we develop theory for continuous-time PLRNNs (cPLRNNs): We present a novel algorithm for training and simulating such models, bypassing numerical integration by efficiently exploiting their PL structure. We further demonstrate how important topological objects like equilibria or limit cycles can be determined semi-analytically in trained models. We compare cPLRNNs to both their discrete-time cousins as well as Neural ODEs on DSR benchmarks, including systems with discontinuities which come with hard thresholds.

[261] Relative Geometry of Neural Forecasters: Linking Accuracy and Alignment in Learned Latent Geometry

Deniz Kucukahmetler, Maximilian Jean Hemmann, Julian Mosig von Aehrenfeld, Maximilian Amthor, Christian Deubel, Nico Scherf, Diaaeldin Taha

Main category: cs.LG

TL;DR: Neural forecasters for dynamical systems show reproducible family-level representational alignment patterns, with MLPs aligning with MLPs, RNNs with RNNs, while transformers and echo-state networks achieve strong forecasts despite weaker alignment.

DetailsMotivation: To understand how neural networks internally represent the underlying latent geometry of complex dynamical systems, which remains poorly understood despite their accurate forecasting capabilities.

Method: Introduce anchor-based, geometry-agnostic relative embeddings to remove rotational and scaling ambiguities in latent spaces. Apply this framework across seven canonical dynamical systems (ranging from periodic to chaotic) to study neural forecasters through representational alignment.

Result: Reveal reproducible family-level structure: multilayer perceptrons align with other MLPs, recurrent networks with RNNs, while transformers and echo-state networks achieve strong forecasts despite weaker alignment. Alignment generally correlates with forecasting accuracy, but high accuracy can coexist with low alignment.

Conclusion: Relative geometry provides a simple, reproducible foundation for comparing how different model families internalize and represent dynamical structure, offering insights into neural network representations beyond just forecasting performance.

Abstract: Neural networks can accurately forecast complex dynamical systems, yet how they internally represent underlying latent geometry remains poorly understood. We study neural forecasters through the lens of representational alignment, introducing anchor-based, geometry-agnostic relative embeddings that remove rotational and scaling ambiguities in latent spaces. Applying this framework across seven canonical dynamical systems - ranging from periodic to chaotic - we reveal reproducible family-level structure: multilayer perceptrons align with other MLPs, recurrent networks with RNNs, while transformers and echo-state networks achieve strong forecasts despite weaker alignment. Alignment generally correlates with forecasting accuracy, yet high accuracy can coexist with low alignment. Relative geometry thus provides a simple, reproducible foundation for comparing how model families internalize and represent dynamical structure.
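Anchor-based relative embeddings can be sketched directly: each sample is re-described by its similarities to a fixed anchor set, which removes rotational (and, after normalization, scaling) ambiguity so that latent spaces of different models become comparable. A minimal version using cosine similarities on random data; the anchor count and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_embedding(Z, anchor_idx):
    # Re-express each latent vector by its cosine similarities to a fixed
    # set of anchor samples. The result is invariant to any rotation (or
    # uniform scaling) of the latent space, so two models' spaces can be
    # compared even though their raw coordinates differ.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    A = Zn[anchor_idx]
    return Zn @ A.T

Z = rng.normal(size=(100, 16))
anchors = np.arange(10)

# A random rotation of the latent space leaves the relative embedding unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
R1 = relative_embedding(Z, anchors)
R2 = relative_embedding(Z @ Q, anchors)
```

Alignment between two models can then be measured by comparing their relative embeddings directly, with no Procrustes fitting needed.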

[262] CAMEL: An ECG Language Model for Forecasting Cardiac Events

Neelay Velingker, Alaia Solko-Breslin, Mayank Keoliya, Seewon Choi, Jiayi Xin, Anika Marathe, Alireza Oraii, Rajat Deo, Sameed Khatana, Rajeev Alur, Mayur Naik, Eric Wong

Main category: cs.LG

TL;DR: CAMEL is a novel ECG language model that can forecast future cardiac events by understanding longer ECG signals through a specialized encoder and curriculum learning, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Current ECG language models can only classify existing conditions and generate reports, but cannot forecast future cardiac events, which has immense clinical value for early intervention planning.

Method: Proposes CAMEL with a specialized ECG encoder for cross-understanding of ECG signals and text, trained using LoRA adaptation with curriculum learning including ECG classification, metrics calculation, and multi-turn conversations to elicit reasoning.

Result: Achieves strong zero-shot performance across 6 tasks and 9 datasets, including the newly introduced ECGForecastBench for arrhythmia forecasting. Outperforms existing ELMs and supervised baselines, with a +7.0% absolute average gain on ECGBench and gains of +12.4% over fully supervised models and +21.1% over zero-shot ELMs on ECGForecastBench.

Conclusion: CAMEL is the first ECG language model capable of forecasting future cardiac events, demonstrating superior performance through specialized encoding and curriculum learning, with significant clinical implications for early intervention.

Abstract: Electrocardiograms (ECG) are electrical recordings of the heart that are critical for diagnosing cardiovascular conditions. ECG language models (ELMs) have recently emerged as a promising framework for ECG classification accompanied by report generation. However, current models cannot forecast future cardiac events despite the immense clinical value for planning earlier intervention. To address this gap, we propose CAMEL, the first ELM that is capable of inference over longer signal durations which enables its forecasting capability. Our key insight is a specialized ECG encoder which enables cross-understanding of ECG signals with text. We train CAMEL using established LLM training procedures, combining LoRA adaptation with a curriculum learning pipeline. Our curriculum includes ECG classification, metrics calculations, and multi-turn conversations to elicit reasoning. CAMEL demonstrates strong zero-shot performance across 6 tasks and 9 datasets, including ECGForecastBench, a new benchmark that we introduce for forecasting arrhythmias. CAMEL is on par with or surpasses ELMs and fully supervised baselines both in- and out-of-distribution, achieving SOTA results on ECGBench (+7.0% absolute average gain) as well as ECGForecastBench (+12.4% over fully supervised models and +21.1% over zero-shot ELMs).
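The LoRA adaptation used in training is a standard low-rank weight update; a minimal forward-pass sketch in NumPy with toy shapes (this illustrates LoRA generically, not CAMEL's specific adapter placement):

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_forward(x, W, A, B, alpha=1.0):
    # Low-rank adaptation: the frozen weight W is augmented by a trainable
    # low-rank update B @ A, so only r*(d_in + d_out) parameters are
    # trained instead of d_in*d_out.
    return x @ (W + alpha * (B @ A)).T

d_in, d_out, r = 64, 32, 4
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init

x = rng.normal(size=(8, d_in))
y = lora_forward(x, W, A, B)
```

With `B` zero-initialized, the adapted model starts exactly at the pretrained behavior, which is what makes LoRA a safe fine-tuning entry point.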

[263] Controlled oscillation modeling using port-Hamiltonian neural networks

Maximino Linares, Guillaume Doras, Thomas Hélie

Main category: cs.LG

TL;DR: Port-Hamiltonian neural networks with second-order discrete gradient method outperform Runge-Kutta for learning dynamical systems with conservation laws.

DetailsMotivation: Data-driven methods for learning dynamical systems often fail to capture underlying conservation laws, limiting generalization. Existing port-Hamiltonian neural network methods use Runge-Kutta discretizations that may not preserve power balance principles.

Method: Proposes using a second-order discrete gradient method embedded in port-Hamiltonian neural networks for learning dynamical systems. Tests on three systems: harmonic oscillator, Duffing oscillator, and self-sustained oscillator. Compares with Runge-Kutta method and analyzes different port-Hamiltonian formulations.

Result: The discrete gradient method outperforms Runge-Kutta of the same order. Experiments compare two theoretically equivalent port-Hamiltonian formulations and analyze Jacobian regularization impact during training.

Conclusion: Second-order discrete gradient methods improve port-Hamiltonian neural network performance for learning dynamical systems with conservation laws, enabling better generalization through power-preserving discretizations.

Abstract: Learning dynamical systems through purely data-driven methods is challenging as they do not learn the underlying conservation laws that enable them to correctly generalize. Existing port-Hamiltonian neural network methods have recently been successfully applied for modeling mechanical systems. However, even though these methods are designed on power-balance principles, they usually do not consider power-preserving discretizations and often rely on Runge-Kutta numerical methods. In this work, we propose to use a second-order discrete gradient method embedded in the learning of dynamical systems with port-Hamiltonian neural networks. Numerical results are provided for three systems deliberately selected to span different ranges of dynamical behavior under control: a baseline harmonic oscillator with quadratic energy storage; a Duffing oscillator, with a non-quadratic Hamiltonian offering amplitude-dependent effects; and a self-sustained oscillator, which can stabilize in a controlled limit cycle through the incorporation of a nonlinear dissipation. We show how the use of this discrete gradient method outperforms the performance of a Runge-Kutta method of the same order. Experiments are also carried out to compare two theoretically equivalent port-Hamiltonian systems formulations and to analyze the impact of regularizing the Jacobian of port-Hamiltonian neural networks during training.
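For a quadratic Hamiltonian such as the baseline harmonic oscillator, a midpoint discrete gradient scheme reduces to the implicit midpoint rule and conserves the energy exactly, while an explicit Runge-Kutta scheme of the same order drifts. A toy comparison (this illustrates the discretization contrast only, not the paper's neural-network training):

```python
import numpy as np

# Harmonic oscillator H(q, p) = (q^2 + p^2) / 2, dynamics z' = J grad H(z).
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

def midpoint_step(z, h):
    # Implicit midpoint rule; for a quadratic H the midpoint discrete
    # gradient coincides with it, and it conserves H exactly.
    M = np.eye(2) - 0.5 * h * J
    N = np.eye(2) + 0.5 * h * J
    return np.linalg.solve(M, N @ z)

def rk2_step(z, h):
    # Explicit midpoint Runge-Kutta of the same (second) order.
    k1 = J @ z
    k2 = J @ (z + 0.5 * h * k1)
    return z + h * k2

def energy(z):
    return 0.5 * float(z @ z)

z_dg = np.array([1.0, 0.0])
z_rk = np.array([1.0, 0.0])
h, steps = 0.1, 1000
for _ in range(steps):
    z_dg = midpoint_step(z_dg, h)
    z_rk = rk2_step(z_rk, h)

drift_dg = abs(energy(z_dg) - 0.5)
drift_rk = abs(energy(z_rk) - 0.5)
```

The explicit scheme's energy drift accumulates over long horizons, which is precisely the failure mode that motivates power-preserving discretizations in the learned setting.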

[264] Random Wavelet Features for Graph Kernel Machines

Valentin de Bassompierre, Jean-Charles Delvenne, Laurent Jacques

Main category: cs.LG

TL;DR: Randomized spectral node embeddings that approximate graph kernels via random feature methods, achieving more accurate kernel approximations than existing approaches.

DetailsMotivation: Graph kernels provide principled node similarity measures but are computationally expensive for large networks. There's a need for scalable methods that can approximate graph kernels effectively while preserving structural information.

Method: Introduces randomized spectral node embeddings using random feature methods for kernel approximation. The approach constructs embeddings whose dot products estimate low-rank approximations of graph kernels, particularly effective for spectrally localized kernels.

Result: The method achieves more accurate kernel approximations than existing approaches, with theoretical and empirical validation showing effectiveness for spectrally localized kernels.

Conclusion: Randomized spectral constructions provide a scalable and principled approach for graph representation learning, enabling efficient approximation of graph kernels while preserving structural information.

Abstract: Node embeddings map graph vertices into low-dimensional Euclidean spaces while preserving structural information. They are central to tasks such as node classification, link prediction, and signal reconstruction. A key goal is to design node embeddings whose dot products capture meaningful notions of node similarity induced by the graph. Graph kernels offer a principled way to define such similarities, but their direct computation is often prohibitive for large networks. Inspired by random feature methods for kernel approximation in Euclidean spaces, we introduce randomized spectral node embeddings whose dot products estimate a low-rank approximation of any specific graph kernel. We provide theoretical and empirical results showing that our embeddings achieve more accurate kernel approximations than existing methods, particularly for spectrally localized kernels. These results demonstrate the effectiveness of randomized spectral constructions for scalable and principled graph representation learning.
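The random-feature construction can be sketched for the graph heat kernel: applying the kernel square root f(L)^(1/2) to m random Gaussian probes yields features whose Gram matrix is an unbiased estimate of f(L). A toy version on a ring graph (the filter, graph, and m are illustrative; the paper targets general kernels and scalable computation rather than an explicit eigendecomposition):

```python
import numpy as np

def heat_kernel(L, t=1.0):
    # Exact graph heat kernel K = exp(-t L) via eigendecomposition.
    w, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(-t * w)) @ U.T

def random_spectral_features(L, t=1.0, m=20000, seed=0):
    # Randomized features Z with E[Z Z^T] = exp(-t L): apply the kernel
    # square root exp(-t L / 2) to m random Gaussian probes.
    w, U = np.linalg.eigh(L)
    half = U @ np.diag(np.exp(-0.5 * t * w)) @ U.T
    G = np.random.default_rng(seed).normal(size=(L.shape[0], m))
    return half @ G / np.sqrt(m)

# Small ring graph Laplacian.
n = 12
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(1)) - A

K = heat_kernel(L)
Z = random_spectral_features(L)
err = float(np.abs(Z @ Z.T - K).max())
```

The dot products of the rows of `Z` approximate kernel similarities, so the rows serve directly as node embeddings.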

[265] MRC-GAT: A Meta-Relational Copula-Based Graph Attention Network for Interpretable Multimodal Alzheimer’s Disease Diagnosis

Fatemeh Khalvandi, Saadat Izadi, Abdolah Chalechale

Main category: cs.LG

TL;DR: MRC-GAT: A meta-relational copula-based graph attention network for Alzheimer’s disease diagnosis using multimodal data (risk factors, cognitive tests, MRI).

DetailsMotivation: Early and precise AD diagnosis is crucial but current graph-based approaches use fixed structural designs that limit flexibility and generalization across heterogeneous patient data.

Method: Proposes MRC-GAT with copula-based similarity alignment, relational attention, and node fusion integrated into episodic meta-learning. Multimodal features (RF, cognitive scores, MRI) are aligned via copula transformation and combined using multi-relational attention.

Result: Achieved 96.87% accuracy on TADPOLE dataset and 92.31% on NACC dataset, demonstrating state-of-the-art performance compared to existing diagnostic models.

Conclusion: MRC-GAT shows robustness and applicability for AD diagnosis with interpretability at various disease stages, overcoming limitations of fixed graph structures.

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative condition necessitating early and precise diagnosis to provide prompt clinical management. Given the paramount importance of early diagnosis, recent studies have increasingly focused on computer-aided diagnostic models to enhance precision and reliability. However, most graph-based approaches still rely on fixed structural designs, which restrict their flexibility and limit generalization across heterogeneous patient data. To overcome these limitations, the Meta-Relational Copula-Based Graph Attention Network (MRC-GAT) is proposed as an efficient multimodal model for AD classification tasks. In the proposed architecture, copula-based similarity alignment, relational attention, and node fusion are integrated as the core components of episodic meta-learning, such that the multimodal features, including risk factors (RF), cognitive test scores, and MRI attributes, are first aligned via a copula-based transformation in a common statistical space and then combined by a multi-relational attention mechanism. According to evaluations performed on the TADPOLE and NACC datasets, the MRC-GAT model achieved accuracies of 96.87% and 92.31%, respectively, demonstrating state-of-the-art performance compared to existing diagnostic models. Finally, by providing interpretability at various stages of disease diagnosis, the model further confirms the robustness and applicability of the proposed method.
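The "common statistical space" step can be illustrated with a Gaussian copula transform, which maps each feature's empirical ranks to standard-normal quantiles so that modalities on wildly different scales become comparable. This is a hedged sketch of the general idea; `copula_align` is a hypothetical helper, and the paper's exact transformation may differ:

```python
import numpy as np
from scipy.stats import norm, rankdata

def copula_align(X):
    """Gaussian copula transform: ranks -> uniform scores -> normal quantiles.

    Each column of X is mapped to an (approximately) standard-normal marginal,
    placing heterogeneous features in a common statistical space.
    """
    n = X.shape[0]
    U = rankdata(X, axis=0) / (n + 1)  # empirical CDF values in (0, 1)
    return norm.ppf(U)

rng = np.random.default_rng(6)
# Two modalities on very different scales (e.g. risk factors vs. MRI attributes).
risk = rng.exponential(2.0, size=(200, 3))
mri = rng.normal(1000.0, 250.0, size=(200, 4))
aligned = copula_align(np.hstack([risk, mri]))
print(aligned.mean(axis=0).round(2), aligned.std(axis=0).round(2))
```

After the transform, every column has (near) zero mean and unit variance regardless of its original distribution, which is what makes cross-modal similarity comparisons meaningful.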

[266] UrbanVerse: Learning Urban Region Representation Across Cities and Tasks

Fengze Sun, Egemen Tanin, Shanika Karunasekera, Zuqing Li, Flora D. Salim, Jianzhong Qi

Main category: cs.LG

TL;DR: UrbanVerse is a foundation-style model for cross-city urban representation learning and cross-task urban analytics that uses graph-based region sequences and a novel diffusion module for multi-task learning.

DetailsMotivation: Existing urban representation learning methods are limited in their ability to generalize across different cities and analytic tasks. The authors aim to create a foundation-style model that can work across diverse urban environments and multiple prediction tasks.

Method: UrbanVerse models urban regions as nodes on a graph and uses random walks to create “sequences of regions” that capture both local and neighborhood structural features. For cross-task generalization, they propose HCondDiffCT, a diffusion-based module that integrates region-conditioned prior knowledge and task-conditioned semantics to jointly model multiple downstream urban prediction tasks.

Result: Experiments on real-world datasets show UrbanVerse consistently outperforms state-of-the-art methods across six tasks under cross-city settings, achieving up to 35.89% improvements in prediction accuracy.

Conclusion: UrbanVerse successfully addresses the limitations of existing urban representation learning methods by providing a foundation-style model that generalizes well across cities and tasks, with the HCondDiffCT module being generic enough to enhance existing models.

Abstract: Recent advances in urban region representation learning have enabled a wide range of applications in urban analytics, yet existing methods remain limited in their capability to generalize across cities and analytic tasks. We aim to generalize urban representation learning beyond city- and task-specific settings, towards a foundation-style model for urban analytics. To this end, we propose UrbanVerse, a model for cross-city urban representation learning and cross-task urban analytics. For cross-city generalization, UrbanVerse focuses on features local to the target regions and structural features of the nearby regions rather than the entire city. We model regions as nodes on a graph, which enables a random walk-based procedure to form “sequences of regions” that reflect both local and neighborhood structural features for urban region representation learning. For cross-task generalization, we propose a cross-task learning module named HCondDiffCT. This module integrates region-conditioned prior knowledge and task-conditioned semantics into the diffusion process to jointly model multiple downstream urban prediction tasks. HCondDiffCT is generic. It can also be integrated with existing urban representation learning models to enhance their downstream task effectiveness. Experiments on real-world datasets show that UrbanVerse consistently outperforms state-of-the-art methods across six tasks under cross-city settings, achieving up to 35.89% improvements in prediction accuracy.
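The random-walk step can be sketched in the style of DeepWalk-like pipelines, where walks over the region graph become token sequences for a representation learner. `region_walks` is a hypothetical helper for illustration; the paper's actual procedure (and how it biases walks toward local and neighborhood structure) may differ:

```python
import random

def region_walks(adj, num_walks=10, walk_len=5, seed=0):
    """Generate "sequences of regions" by uniform random walks on a region graph.

    adj: dict mapping each region id to a list of neighbouring region ids.
    Returns num_walks * len(adj) walks, each a list of region ids.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy region graph: four regions on a 2x2 grid.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
walks = region_walks(adj)
print(len(walks), walks[0])
```

Each walk stays inside a region's neighborhood, so a downstream sequence model (e.g. skip-gram) sees local structure rather than whole-city context.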

[267] Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito

Main category: cs.LG

TL;DR: A new algorithm called MRet (Matching for Retention) that optimizes user retention rather than match maximization or fairness in two-sided matching platforms like online dating.

DetailsMotivation: Traditional matching platforms optimize for total matches, creating imbalance where some users get too many matches while others get few, leading to user abandonment. Fairness objectives don't directly address retention, which is crucial for subscription-based platforms.

Method: Introduces a dynamic learning-to-rank algorithm called MRet that models user retention by learning personalized retention curves from user profiles and interaction history. It dynamically adapts recommendations by jointly considering retention gains for both the user receiving recommendations and those being recommended.

Result: Empirical evaluations on synthetic and real-world datasets from a major online dating platform show MRet achieves higher user retention compared to conventional methods that optimize matches or fairness.

Conclusion: Directly optimizing for user retention rather than matches or fairness is more effective for platforms where retention is the ultimate goal, addressing the core problem of user abandonment in two-sided matching systems.

Abstract: On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Fairness objectives are sometimes used to address the problems of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user’s profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention than conventional methods, which optimize matches or fairness rather than retention.
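The allocation idea (give limited matching opportunities to the users where they most improve overall retention) can be sketched with toy logistic retention curves. The parameters and the greedy rule below are illustrative assumptions, not MRet itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def allocate_matches(a, b, budget):
    """Greedily allocate a fixed budget of matches to maximize total retention.

    Each user's retention probability is a toy logistic curve
    sigma(a_i + b_i * matches_i); every step gives the next match to the user
    with the largest marginal retention gain.
    """
    m = np.zeros_like(a)
    for _ in range(budget):
        gain = sigmoid(a + b * (m + 1)) - sigmoid(a + b * m)
        i = int(np.argmax(gain))
        m[i] += 1
    return m

a = np.array([-2.0, -1.0, 3.0])  # user 2 is already near-certain to stay
b = np.array([0.8, 0.8, 0.8])
m = allocate_matches(a, b, budget=6)
print(m)
```

The already-retained user receives nothing: on the flat top of the sigmoid an extra match barely moves retention, which is exactly the imbalance argument the abstract makes against pure match maximization.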

[268] The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

Main category: cs.LG

TL;DR: Fine-tuning language models on benign tasks can unexpectedly degrade safety guardrails due to geometric instability in alignment subspaces, where curvature effects systematically steer optimization into safety-critical regions.

DetailsMotivation: Current safety paradigms fail to explain why fine-tuning aligned language models on harmless tasks unpredictably degrades safety guardrails, even with no harmful training data or adversarial intent.

Method: Geometric analysis of alignment structure, proving that safety alignment concentrates in low-dimensional subspaces with sharp curvature. Formalizes Alignment Instability Condition with three geometric properties and establishes quartic scaling law for alignment loss.

Result: Reveals structural instability where initial orthogonal updates collapse under gradient descent dynamics due to curvature coupling, causing safety degradation that scales with the fourth power of training time.

Conclusion: Alignment fragility is an intrinsic geometric property of gradient descent on curved manifolds, requiring curvature-aware methods and a shift from reactive safety testing to predictive diagnostics.

Abstract: Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.

[269] Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Oswin So, Eric Yang Yu, Songyuan Zhang, Matthew Cleaveland, Mitchell Black, Chuchu Fan

Main category: cs.LG

TL;DR: FGE is a deep RL method that simultaneously identifies feasible initial conditions and learns safe reachability policies, outperforming existing methods by 50%+ coverage on challenging tasks.

DetailsMotivation: There's a fundamental mismatch between RL (optimizing expected returns) and reachability problems (maximizing safe state coverage), leading to poor performance on low-probability states. Existing methods don't address feasibility of initial conditions.

Method: Feasibility-Guided Exploration (FGE) simultaneously identifies a subset of feasible initial conditions where safe policies exist and learns policies to solve reachability problems over this set.

Result: FGE learns policies with over 50% more coverage than best existing methods for challenging initial conditions across MuJoCo and Kinetix simulators with pixel observations.

Conclusion: FGE effectively addresses the RL-reachability mismatch by jointly exploring feasibility and learning safe policies, significantly improving coverage for challenging safety-critical tasks.

Abstract: Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.

[270] Stabilizing Test-Time Adaptation of High-Dimensional Simulation Surrogates via D-Optimal Statistics

Anna Zimmel, Paul Setinek, Gianluca Galletti, Johannes Brandstetter, Werner Zellinger

Main category: cs.LG

TL;DR: A test-time adaptation framework for machine learning surrogates in engineering simulations that addresses distribution shifts using D-optimal statistics for stable adaptation in high-dimensional regression problems.

DetailsMotivation: Machine learning surrogates in engineering face performance degradation due to distribution shifts between training and deployment (e.g., unseen geometries/configurations). Existing test-time adaptation methods are designed for lower-dimensional classification with structured outputs, making them unstable for high-dimensional, unstructured regression problems common in simulations.

Method: Proposes a TTA framework based on storing maximally informative (D-optimal) statistics that enables stable adaptation and principled parameter selection at test time. Applied to pretrained simulation surrogates to handle distribution shifts.

Result: Achieves up to 7% out-of-distribution improvements at negligible computational cost. First systematic demonstration of effective TTA for high-dimensional simulation regression and generative design optimization, validated on SIMSHIFT and EngiBench benchmarks.

Conclusion: The proposed D-optimal statistics-based TTA framework effectively addresses distribution shifts in engineering simulation surrogates, providing stable adaptation for high-dimensional regression problems with minimal computational overhead.

Abstract: Machine learning surrogates are increasingly used in engineering to accelerate costly simulations, yet distribution shifts between training and deployment often cause severe performance degradation (e.g., unseen geometries or configurations). Test-Time Adaptation (TTA) can mitigate such shifts, but existing methods are largely developed for lower-dimensional classification with structured outputs and visually aligned input-output relationships, making them unstable for the high-dimensional, unstructured regression problems common in simulation. We address this challenge by proposing a TTA framework based on storing maximally informative (D-optimal) statistics, which jointly enables stable adaptation and principled parameter selection at test time. When applied to pretrained simulation surrogates, our method yields up to 7% out-of-distribution improvements at negligible computational cost. To the best of our knowledge, this is the first systematic demonstration of effective TTA for high-dimensional simulation regression and generative design optimization, validated on the SIMSHIFT and EngiBench benchmarks.
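D-optimality means choosing data whose information matrix has maximal determinant. As a hedged illustration of that criterion only (not the paper's TTA statistics), the greedy selector below exploits the matrix-determinant lemma det(M + xxᵀ) = det(M)(1 + xᵀM⁻¹x), so at each step the best addition is the candidate with the largest leverage under the current information matrix:

```python
import numpy as np

def greedy_d_optimal(X, k, ridge=1e-3):
    """Greedily pick k rows of X that (approximately) maximize det(X_S^T X_S).

    By the matrix-determinant lemma, adding row x multiplies the determinant
    by (1 + x^T M^{-1} x), so each step picks the highest-leverage candidate.
    """
    n, d = X.shape
    M = ridge * np.eye(d)  # regularized information matrix
    chosen, remaining = [], set(range(n))
    for _ in range(k):
        Minv = np.linalg.inv(M)
        gains = {i: X[i] @ Minv @ X[i] for i in remaining}
        best = max(gains, key=gains.get)
        chosen.append(best)
        remaining.remove(best)
        M = M + np.outer(X[best], X[best])
    return chosen

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
idx = greedy_d_optimal(X, k=5)
print(idx)
```

Storing only such maximally informative rows keeps the adaptation statistics small while preserving as much of the parameter-identification signal as possible.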

[271] CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad

Main category: cs.LG

TL;DR: CrispEdit is a second-order LLM editing algorithm that treats capability preservation as an explicit constraint using constrained optimization and projects updates onto low-curvature subspaces to prevent capability degradation.

DetailsMotivation: Current LLM editing methods often corrupt general capabilities while changing targeted behavior, producing degenerate behaviors similar to proxy/reward hacking. There's a need for editing methods that preserve model capabilities while making targeted changes.

Method: Formulates editing as constrained optimization, enforces capability preservation by projecting edit updates onto low-curvature subspace of capability-loss landscape using Bregman divergence. Uses K-FAC for efficiency and novel matrix-free projector exploiting Kronecker structure to avoid massive projection matrices.

Result: Achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors on standard model-editing benchmarks.

Conclusion: CrispEdit provides a scalable and principled second-order editing algorithm that successfully addresses capability preservation in LLM editing through constrained optimization and efficient second-order methods.

Abstract: A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing the capability constraint via a Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly, even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
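The projection step can be sketched directly: given a curvature matrix H of the capability loss, keep only the components of the edit update that lie in H's low-eigenvalue subspace. The dense toy below illustrates the idea; CrispEdit itself works at LLM scale with K-FAC and a matrix-free projector rather than an explicit eigendecomposition:

```python
import numpy as np

def project_low_curvature(update, H, keep_frac=0.5):
    """Project an edit update onto the low-curvature eigen-subspace of H.

    H is a symmetric PSD curvature matrix of the capability loss; directions
    with large eigenvalues are 'sharp' and are removed from the update.
    """
    lam, V = np.linalg.eigh(H)  # eigenvalues in ascending order
    k = int(np.ceil(keep_frac * len(lam)))
    V_low = V[:, :k]  # the k flattest directions
    return V_low @ (V_low.T @ update)

rng = np.random.default_rng(2)
# Toy curvature: one very sharp direction, the rest nearly flat.
Q = np.linalg.qr(rng.standard_normal((6, 6)))[0]
H = Q @ np.diag([100.0, 0.1, 0.1, 0.05, 0.01, 0.0]) @ Q.T
u = rng.standard_normal(6)
u_safe = project_low_curvature(u, H, keep_frac=5 / 6)

# Quadratic capability-loss increase 0.5 u^T H u drops after projection.
print(0.5 * u @ H @ u, 0.5 * u_safe @ H @ u_safe)
```

Because the sharp direction carries almost all of uᵀHu, removing it keeps most of the update's norm while nearly eliminating the second-order capability damage.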

[272] Operationalising the Superficial Alignment Hypothesis via Task Complexity

Tomás Vergara-Browne, Darshan Patil, Ivan Titov, Siva Reddy, Tiago Pimentel, Marius Mosbach

Main category: cs.LG

TL;DR: The paper proposes a new metric called “task complexity” to formalize the Superficial Alignment Hypothesis, showing that pre-trained models drastically reduce the complexity of achieving high performance on tasks, and post-training collapses this complexity by orders of magnitude.

DetailsMotivation: The Superficial Alignment Hypothesis (SAH) lacks precise definition, leading to different interpretations and critiques. The authors aim to provide a formal framework to understand how pre-training and post-training affect task performance complexity.

Method: Propose “task complexity” metric: the length of the shortest program that achieves target performance on a task. Use this framework to formalize SAH claims. Experimentally estimate task complexity for mathematical reasoning, machine translation, and instruction following tasks by measuring program length needed when conditioned on pre-trained models.

Result: Task complexities can be remarkably low when conditioned on pre-trained models. Pre-training enables access to strong performances but may require gigabytes of program length. Post-training collapses complexity by several orders of magnitude, often requiring just kilobytes of information for task adaptation.

Conclusion: The task complexity framework provides a precise definition of SAH, unifying prior arguments. Results show that task adaptation requires surprisingly little information, supporting the view that post-training primarily surfaces knowledge already present in pre-trained models.

Abstract: The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques of it. We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH simply claims that pre-trained models drastically reduce the complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate the task complexity of mathematical reasoning, machine translation, and instruction following; we then show that these complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performances on our tasks, but it can require programs of gigabytes of length to access them. Post-training, on the other hand, collapses the complexity of reaching this same performance by several orders of magnitude. Overall, our results highlight that task adaptation requires surprisingly little information: often just a few kilobytes.

Thomas Roland Barillot, Alex De Castro

Main category: cs.LG

TL;DR: Persistent homology metrics (1-Wasserstein norm of H₀ and maximum loop lifetime of H₁) can quantify semantic ambiguity in sentence embeddings by analyzing local topological structure, validated through simulations and real-world Nobel Prize physics lectures.

DetailsMotivation: To extend word-level polysemy analysis using persistent homology to full sentences, quantifying semantic ambiguity in sentence embeddings for improved semantic search and ambiguity detection.

Method: Generalized persistent homology concepts from words to sentences, using two topological metrics: 1-Wasserstein norm of H₀ and maximum loop lifetime of H₁. Validated through ab-initio simulations with random topic vectors and real-world analysis of Nobel Prize physics lectures (1901-2024) using four embedding models.

Result: Ambiguous sentences separate from unambiguous ones in both topological metrics across all tested embedding models. Results remain stable despite changes in embedding architecture, demonstrating model-agnostic detection of semantic discontinuities.

Conclusion: Persistent homology provides a robust, model-agnostic signal for detecting semantic ambiguity and discontinuities in sentence embeddings, with practical applications for ambiguity detection and semantic search recall improvement.

Abstract: We studied how the local topological structure of sentence-embedding neighborhoods encodes semantic ambiguity. Extending ideas that link word-level polysemy to non-trivial persistent homology, we generalized the concept to full sentences and quantified the ambiguity of a query in a semantic search process with two persistent homology metrics: the 1-Wasserstein norm of $H_{0}$ and the maximum loop lifetime of $H_{1}$. We formalized the notion of ambiguity as the relative presence of semantic domains or topics in sentences. We then used this formalism in “ab-initio” simulations that encode datapoints as linear combinations of randomly generated single-topic vectors in an arbitrary embedding space, and demonstrated that ambiguous sentences separate from unambiguous ones in both metrics. Finally, we validated these findings on a real-world case by investigating a fully open corpus comprising Nobel Prize Physics lectures from 1901 to 2024, segmented into contiguous, non-overlapping chunks at two granularities: $\sim\!250$ tokens and $\sim\!750$ tokens. We tested embeddings from four publicly available models. Results across all models reproduce the simulations and remain stable despite changes in embedding architecture. We conclude that persistent homology provides a model-agnostic signal of semantic discontinuities, suggesting practical use for ambiguity detection and improving semantic-search recall.
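For the $H_0$ part, the persistence diagram of a Vietoris-Rips filtration on a point cloud has a convenient closed form: every class is born at 0, and the death times are exactly the edge lengths of the Euclidean minimum spanning tree (each edge records a cluster merge). The sketch below uses that fact, with the summed lifetimes as a simple proxy for the paper's 1-Wasserstein norm (whose exact convention may differ):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_lifetimes(points):
    """Finite H0 lifetimes of a Vietoris-Rips filtration on a point cloud.

    All H0 classes are born at 0 and die when clusters merge, so the death
    times are exactly the Euclidean minimum-spanning-tree edge lengths.
    """
    D = squareform(pdist(points))
    mst = minimum_spanning_tree(D).toarray()
    return np.sort(mst[mst > 0])

# Two well-separated "topics" -> one long-lived H0 bar at the merge scale.
rng = np.random.default_rng(3)
topic_a = rng.normal(0.0, 0.05, size=(10, 2))
topic_b = rng.normal(5.0, 0.05, size=(10, 2))
life = h0_lifetimes(np.vstack([topic_a, topic_b]))
print(f"largest H0 lifetime: {life[-1]:.2f}, total (W1-norm proxy): {life.sum():.2f}")
```

An ambiguous query whose neighborhood spans two topics produces one dominant long bar, which is the separation signal the two topological metrics pick up.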

[274] Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

Rahul Garg, Trilok Padhi, Hemang Jain, Ugur Kursuncu, Ponnurangam Kumaraguru

Main category: cs.LG

TL;DR: A novel framework combining Knowledge Distillation from Large Visual Language Models and knowledge infusion from ConceptNet to enhance toxicity detection in hateful memes through hybrid neurosymbolic approach.

DetailsMotivation: Toxicity identification in online multimodal environments is challenging due to complex contextual connections across modalities (textual and visual). Current methods struggle with understanding the relational context between toxic phrases and visual concepts in hateful memes.

Method: Proposes a framework that integrates Knowledge Distillation from Large Visual Language Models (LVLMs) and knowledge infusion from ConceptNet. Extracts sub-knowledge graphs from ConceptNet to infuse within a compact VLM framework, capturing relational context between toxic phrases in captions/memes and visual concepts.

Result: Superior performance on two hate speech benchmark datasets, outperforming state-of-the-art baselines across AU-ROC (1.1% improvement), F1 (7% improvement), and Recall (35% improvement).

Conclusion: The approach demonstrates significance of learning from both explicit (knowledge graphs) and implicit (LVLMs) contextual cues through hybrid neurosymbolic approach, crucial for accurate and scalable toxic content recognition in real-world applications.

Abstract: Toxicity identification in online multimodal environments remains a challenging task due to the complexity of contextual connections across modalities (e.g., textual and visual). In this paper, we propose a novel framework that integrates Knowledge Distillation (KD) from Large Visual Language Models (LVLMs) and knowledge infusion to enhance the performance of toxicity detection in hateful memes. Our approach extracts sub-knowledge graphs from ConceptNet, a large-scale commonsense Knowledge Graph (KG), to be infused within a compact VLM framework. The relational context between toxic phrases in captions and memes, as well as visual concepts in memes, enhances the model’s reasoning capabilities. Experimental results from our study on two hate speech benchmark datasets demonstrate superior performance over the state-of-the-art baselines across AU-ROC, F1, and Recall, with improvements of 1.1%, 7%, and 35%, respectively. Given the contextual complexity of the toxicity detection task, our approach showcases the significance of learning from both explicit (i.e., KG) and implicit (i.e., LVLM) contextual cues incorporated through a hybrid neurosymbolic approach. This is crucial for real-world applications where accurate and scalable recognition of toxic content is critical for creating safer online environments.

[275] Should You Use Your Large Language Model to Explore or Exploit?

Keegan Harris, Aleksandrs Slivkins

Main category: cs.LG

TL;DR: LLMs evaluated for exploration-exploitation tradeoffs in bandit tasks, finding reasoning models good for exploitation but too slow, while LLMs help explore semantic action spaces but underperform simple baselines.

DetailsMotivation: To systematically evaluate LLMs' ability to handle exploration-exploitation tradeoffs in decision-making contexts, moving beyond previous work that studied combined tasks to examine exploration and exploitation separately in various bandit settings.

Method: Tested LLMs on various (contextual) bandit tasks, evaluating reasoning models for exploitation, and studied tool use and in-context summarization with non-reasoning models as mitigations for practical deployment challenges.

Result: Reasoning models show promise for exploitation but are too expensive/slow for practical use; mitigations improve medium-difficulty tasks but all LLMs underperform simple linear regression; LLMs help explore large semantic action spaces by suggesting candidates.

Conclusion: LLMs have limited practical utility for exploration-exploitation tradeoffs in bandit tasks compared to simple baselines, but show potential for exploration in semantic action spaces where they can suggest candidates.

Abstract: We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. While previous work has largely studied the ability of LLMs to solve combined exploration-exploitation tasks, we take a more systematic approach and use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that reasoning models show the most promise for solving exploitation tasks, although they are still too expensive or too slow to be used in many practical settings. Motivated by this, we study tool use and in-context summarization using non-reasoning models. We find that these mitigations may be used to substantially improve performance on medium-difficulty tasks; however, even then, all LLMs we study perform worse than a simple linear regression, even in non-linear settings. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
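The kind of "simple linear regression" baseline the LLMs are compared against can be sketched as a per-arm ridge-regression bandit with epsilon-greedy exploration. The class below is an illustrative assumption, not the paper's exact baseline:

```python
import numpy as np

class LinearEpsilonGreedy:
    """Per-arm ridge-regression contextual bandit with epsilon-greedy exploration."""

    def __init__(self, n_arms, dim, eps=0.1, ridge=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.eps = eps
        self.A = [ridge * np.eye(dim) for _ in range(n_arms)]  # X^T X + ridge*I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]        # X^T y per arm

    def act(self, x):
        if self.rng.random() < self.eps:  # explore uniformly at random
            return int(self.rng.integers(len(self.A)))
        est = [x @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        return int(np.argmax(est))        # exploit current reward estimates

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy contextual bandit: 3 arms with hidden linear reward weights.
rng = np.random.default_rng(4)
true_w = rng.standard_normal((3, 5))
agent = LinearEpsilonGreedy(n_arms=3, dim=5)
total = 0.0
for t in range(2000):
    x = rng.standard_normal(5)
    a = agent.act(x)
    r = true_w[a] @ x + 0.1 * rng.standard_normal()
    agent.update(a, x, r)
    total += r
print(f"average reward: {total / 2000:.3f}")
```

Even this few-line baseline accumulates substantially more reward than a uniform-random policy on linear tasks, which is the bar the paper finds the LLMs fail to clear.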

[276] Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction

Somrita Ghosh, Yuelin Xu, Xiao Zhang

Main category: cs.LG

TL;DR: Proposes data reduction strategies for semi-supervised adversarial training using latent clustering to select/generate critical boundary-adjacent data, reducing data requirements 5-10x while maintaining robustness.

DetailsMotivation: Semi-supervised adversarial training (SSAT) requires substantial extra unlabeled data for robustness, leading to prolonged training time and increased memory usage. Need to improve efficiency by optimizing additional data incorporation.

Method: Design latent clustering-based techniques to select/generate a small, critical subset of data near the model’s decision boundary. These include a latent-space selection scheme based on k-means clustering and a guided diffusion-based approach with LCG-KM, both maintaining a balanced ratio between boundary and non-boundary points.

Result: Achieves nearly identical robust accuracies with 5-10 times less unlabeled data. Reduces total runtime by approximately 3-4 times compared to full SSAT trained to convergence through strategic prioritization of unlabeled data.

Conclusion: Proposed data reduction strategies effectively reduce SSAT’s data requirements and computational costs while preserving strong robustness advantages, making adversarial training more efficient.

Abstract: Learning robust models under adversarial settings is widely recognized as requiring a considerably large number of training samples. Recent work proposes semi-supervised adversarial training (SSAT), which utilizes external unlabeled or synthetically generated data and is currently the state of the art. However, SSAT requires substantial extra data to attain high robustness, resulting in prolonged training time and increased memory usage. In this paper, we propose data reduction strategies to improve the efficiency of SSAT by optimizing the amount of additional data incorporated. Specifically, we design novel latent clustering-based techniques to select or generate a small, critical subset of data samples near the model’s decision boundary. While focusing on boundary-adjacent points, our methods maintain a balanced ratio between boundary and non-boundary data points, thereby avoiding overfitting. Comprehensive experiments across image benchmarks demonstrate that our methods can effectively reduce SSAT’s data requirements and computational costs while preserving its strong robustness advantages. In particular, our latent-space selection scheme based on k-means clustering and our guided diffusion-based approach with LCG-KM are the most effective, achieving nearly identical robust accuracies with 5 times to 10 times less unlabeled data. When compared to full SSAT trained to convergence, our methods reduce total runtime by approximately 3 times to 4 times due to strategic prioritization of unlabeled data.
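The selection side of the method can be sketched as a toy latent-space pipeline: cluster latent features, then keep the lowest-margin (boundary-adjacent) points per cluster plus a balancing share of other points. The cluster count, budget, boundary fraction, and margin proxy below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in for any clusterer)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def select_boundary_subset(latents, margins, k=4, boundary_frac=0.5, budget=40, seed=0):
    """Per cluster, keep the smallest-margin (boundary-adjacent) points, plus a
    balancing share of the rest to avoid overfitting to the boundary."""
    rng = np.random.default_rng(seed)
    labels = kmeans(latents, k, seed=seed)
    per_cluster = budget // k
    n_boundary = int(per_cluster * boundary_frac)
    chosen = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        order = idx[np.argsort(margins[idx])]   # small margin = near boundary
        chosen.extend(order[:n_boundary])
        rest = order[n_boundary:]
        if len(rest):
            chosen.extend(rng.choice(rest, min(per_cluster - n_boundary, len(rest)),
                                     replace=False))
    return np.array(sorted(set(chosen)))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))             # latent features (toy)
margins = np.abs(rng.normal(size=200))    # e.g. |top-2 logit gap| as boundary proxy
subset = select_boundary_subset(X, margins, k=4, budget=40)
```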

[277] General Exploratory Bonus for Optimistic Exploration in RLHF

Wendi Li, Changdae Oh, Sharon Li

Main category: cs.LG

TL;DR: GEB is a new theoretical framework for optimistic exploration in RLHF that counteracts bias toward high-probability regions, outperforming existing methods across multiple divergence settings.

DetailsMotivation: Current exploratory bonus methods in RLHF fail to achieve true optimism, unintentionally biasing exploration toward high-probability regions of the reference model instead of promoting discovery of uncertain regions.

Method: Introduces General Exploratory Bonus (GEB), a theoretical framework that provably satisfies the optimism principle through reference-dependent reward regulation, unifying prior heuristic bonuses as special cases across the full α-divergence family.

Result: GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones, demonstrating both principled and practical advantages.

Conclusion: GEB offers a principled and practical solution for optimistic exploration in RLHF by addressing fundamental biases in current exploration methods.

Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $α$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $α$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
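The pitfall GEB targets can be seen in a toy example; the numbers and the two bonus forms below are illustrative only, not the paper's GEB formula. A bonus that tracks the reference model's log-probability reinforces what the reference already prefers, while a reference-dependent optimistic bonus grows where the reference assigns little mass.

```python
import numpy as np

# Reference-model probabilities over 5 candidate responses.
p_ref = np.array([0.55, 0.25, 0.10, 0.07, 0.03])
reward = np.full(5, 0.1)                  # flat reward: only the bonus matters

# Naive bonus proportional to log p_ref: biased toward the reference's modes.
naive = reward + 0.5 * np.log(p_ref)
# Reference-dependent optimistic bonus (in the spirit of GEB's correction):
# larger where the reference assigns little mass, steering exploration there.
optimistic = reward + 0.5 * (-np.log(p_ref))

print(np.argmax(naive), np.argmax(optimistic))  # → 0 4
```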

[278] Terminal Velocity Matching

Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

Main category: cs.LG

TL;DR: TVM is a flow matching generalization for high-fidelity one/few-step generative modeling that models transitions between diffusion timesteps with terminal time regularization, achieving state-of-the-art few-step generation on ImageNet.

DetailsMotivation: Current flow matching methods require many steps for high-quality generation. The authors aim to develop a method that achieves high-fidelity generation with very few function evaluations (1-4 steps) while maintaining theoretical guarantees and practical efficiency.

Method: TVM generalizes flow matching by modeling transitions between any two diffusion timesteps and regularizing behavior at terminal time rather than initial time. They prove theoretical bounds on Wasserstein distance, address Lipschitz continuity issues in Diffusion Transformers with minimal architectural changes, and develop a fused attention kernel for efficient Jacobian-Vector Product backward passes.

Result: Achieves 3.29 FID with 1 NFE and 1.99 FID with 4 NFEs on ImageNet-256x256, and 4.32 FID with 1 NFE and 2.94 FID with 4 NFEs on ImageNet-512x512, representing state-of-the-art performance for one/few-step models trained from scratch.

Conclusion: TVM enables high-fidelity generative modeling with very few function evaluations, bridging the gap between flow matching theory and practical implementation while achieving state-of-the-art results on large-scale image generation tasks.

Abstract: We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
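TVM builds on the standard flow-matching regression pair, which is easy to make concrete; the sketch below shows only the setup TVM generalizes (linear interpolant plus velocity target), not the terminal-time regularization or architectural changes, which follow the paper.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t = (1-t) x0 + t x1 and its velocity target x1 - x0,
    the standard flow-matching regression pair. TVM generalizes this to
    transitions between arbitrary timestep pairs and regularizes the model at
    the terminal time (per the paper; not shown here)."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    return x_t, x1 - x0

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))     # noise endpoints
x1 = rng.normal(size=(8, 2))     # data endpoints
t = rng.uniform(size=8)
x_t, v_target = flow_matching_pair(x0, x1, t)
# A network v_theta(x_t, t) would be trained with MSE against v_target.
```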

[279] ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System

Anantha Sharma

Main category: cs.LG

TL;DR: Argus: A framework for detecting distributional drift in high-dimensional data streams using Voronoi tessellations over fixed spatial partitions, achieving O(N) complexity with cell-level spatial localization and invariance to orthogonal transformations.

DetailsMotivation: Existing drift detection methods have fundamental limitations: global comparison methods scale poorly in high dimensions, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability. There's a need for efficient, geometrically-aware drift detection that preserves high-dimensional structure.

Method: Reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. Uses Voronoi tessellations over canonical orthonormal frames to achieve invariance to orthogonal transformations. Introduces product quantization tessellation for scaling to very high dimensions (d>500) by decomposing space into independent subspaces and aggregating drift signals. Develops graph-theoretic characterization of drift propagation to distinguish coherent shifts from isolated perturbations.

Result: Theoretical foundations are formalized with proven invariance properties. Experimental validation shows the framework correctly identifies drift under coordinate rotation while existing methods produce false positives. Achieves O(N) complexity per snapshot with cell-level spatial localization of distributional change.

Conclusion: Argus offers a principled geometric foundation for distribution monitoring that preserves high-dimensional structure without the computational burden of pairwise comparisons. The tessellated approach provides efficient, geometrically-aware drift detection with theoretical guarantees.

Abstract: Detecting distributional drift in high-dimensional data streams presents fundamental challenges: global comparison methods scale poorly, projection-based approaches lose geometric structure, and re-clustering methods suffer from identity instability. This paper introduces Argus, a framework that reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. The key contributions are fourfold. First, it is proved that Voronoi tessellations over canonical orthonormal frames yield drift metrics that are invariant to orthogonal transformations, i.e., the rotations and reflections that preserve Euclidean geometry. Second, it is established that this framework achieves O(N) complexity per snapshot while providing cell-level spatial localization of distributional change. Third, a graph-theoretic characterization of drift propagation is developed that distinguishes coherent distributional shifts from isolated perturbations. Fourth, product quantization tessellation is introduced for scaling to very high dimensions (d>500) by decomposing the space into independent subspaces and aggregating drift signals across subspaces. This paper formalizes the theoretical foundations, proves invariance properties, and presents experimental validation demonstrating that the framework correctly identifies drift under coordinate rotation while existing methods produce false positives. The tessellated approach offers a principled geometric foundation for distribution monitoring that preserves high-dimensional structure without the computational burden of pairwise comparisons.
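The core tessellation statistic can be sketched in a few lines; the fixed anchor points below stand in for the paper's canonical orthonormal frames, and the per-cell occupancy difference is a simplified drift score (one pass per snapshot, localized to cells).

```python
import numpy as np

def cell_histogram(X, anchors):
    """Assign each point to its nearest anchor (its Voronoi cell) and return
    the fraction of points per cell."""
    d = ((X[:, None] - anchors[None]) ** 2).sum(-1)
    labels = np.argmin(d, axis=1)
    return np.bincount(labels, minlength=len(anchors)) / len(X)

def cell_drift(X_ref, X_new, anchors):
    """Per-cell absolute change in occupancy between two snapshots: O(N) per
    snapshot and spatially localized, unlike global pairwise comparisons."""
    return np.abs(cell_histogram(X_new, anchors) - cell_histogram(X_ref, anchors))

# Toy example: half the mass moves from one cell to the other.
X_ref = np.zeros((10, 2))
X_new = np.vstack([np.zeros((5, 2)), 5.0 * np.ones((5, 2))])
anchors = np.array([[0.0, 0.0], [5.0, 5.0]])
drift = cell_drift(X_ref, X_new, anchors)   # [0.5, 0.5]
```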

[280] Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models

Seunghwan Jang, SooJean Han

Main category: cs.LG

TL;DR: Stratified Hazard Sampling (SHS) improves discrete diffusion models by reducing variance in token editing during sampling, addressing under/over-editing issues while maintaining multimodality.

DetailsMotivation: Current uniform-noise discrete diffusion models suffer from Poisson-binomial variance in per-position jump counts during sampling, leading to under-editing (residual noise) and over-editing (cascading substitutions) that degrade sample quality, especially with limited discretization steps.

Method: Proposes Stratified Hazard Sampling (SHS), a training-free inference method that models per-token edits as events driven by cumulative hazard (CTMC) or cumulative jump mass (DTMC). It places events by stratifying this cumulative quantity with a single random phase per position, updating tokens when accumulated hazard crosses unit-spaced thresholds.

Result: SHS consistently improves sample quality in uniform-noise discrete diffusion language models and enhances robustness under token-level blacklist filtering, with benefits increasing as lexical constraints become more severe.

Conclusion: SHS provides a principled, hyperparameter-free approach to reduce sampling variance in discrete diffusion models while preserving multimodality, addressing key failure modes of existing methods.

Abstract: Uniform-noise discrete diffusion and flow models (e.g., D3PM, SEDD, UDLM, DFM) generate sequences non-autoregressively by iteratively refining randomly initialized vocabulary tokens through multiple context-dependent replacements. These models are typically formulated as time-inhomogeneous CTMC/DTMC processes and sampled using independent Bernoulli change decisions at each discretization step. This induces Poisson-binomial variance in per-position jump counts that grows with the number of required edits, leading to the characteristic under-editing (residual noise) and over-editing (cascading substitutions) failure modes that degrade sample quality, especially under tight discretization budgets. In contrast, absorbing-state (mask-start) models avoid this instability by allowing each position to jump at most once. We propose Stratified Hazard Sampling (SHS), a training-free, drop-in, and hyperparameter-free inference principle for any sampler that admits a stay-vs.-replace decomposition. SHS models per-token edits as events driven by cumulative hazard (CTMC) or cumulative jump mass (DTMC) and places events by stratifying this cumulative quantity: with a single random phase per position, a token is updated whenever its accumulated hazard crosses unit-spaced thresholds. This preserves the expected number of jumps while achieving the minimum possible conditional variance among unbiased integer estimators (bounded by 1/4 for any fixed cumulative mass), without altering per-jump destination sampling and thus retaining multimodality. Experiments on uniform-noise discrete diffusion language models show that SHS consistently improves sample quality. We further show that SHS improves robustness under token-level blacklist filtering, with benefits increasing as lexical constraints grow more severe.
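The variance claim can be checked numerically. The sketch below assumes a constant cumulative hazard H per position and a small-hazard Bernoulli baseline; with a single random phase, the stratified count is always floor(H) or floor(H)+1, so the mean matches while the variance is bounded by 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_jumps(H, steps, rng):
    """Baseline sampler: independent per-step jump decisions, giving a
    Poisson-binomial jump count (small-hazard step probability H/steps)."""
    return int((rng.random(steps) < H / steps).sum())

def stratified_jumps(H, rng):
    """SHS-style placement: one random phase per position; an event fires each
    time cumulative hazard crosses a unit-spaced threshold, so the count is
    floor(H) or floor(H)+1 with mean exactly H."""
    return int(np.floor(H + rng.random()))

H = 2.5
bern = np.array([bernoulli_jumps(H, 10, rng) for _ in range(20000)])
strat = np.array([stratified_jumps(H, rng) for _ in range(20000)])
# Means match (~2.5); stratified variance is <= 1/4 vs ~1.9 for the baseline.
```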

[281] IGC-Net for conditional average potential outcome estimation over time

Konstantin Hess, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel

Main category: cs.LG

TL;DR: IGC-Net is a neural end-to-end model for estimating conditional average potential outcomes over time from observational data, addressing time-varying confounding through iterative G-computation.

DetailsMotivation: Existing methods for estimating potential outcomes from observational data often fail to properly adjust for time-varying confounding, leading to biased estimates. Current neural methods have limitations like division by near-zero propensity scores, resulting in poor performance.

Method: IGC-Net is a novel neural end-to-end model that performs fully regression-based iterative G-computation to adjust for time-varying confounding. It’s the first neural model to implement this approach for conditional average potential outcomes in time-varying settings.

Result: The paper evaluates IGC-Net across various experiments and shows it represents a significant step toward personalized decision-making from electronic health records.

Conclusion: IGC-Net successfully addresses limitations of existing methods for estimating potential outcomes over time from observational data, providing a robust neural approach for personalized medical decision-making.

Abstract: Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. However, many existing methods for this task fail to properly adjust for time-varying confounding and thus yield biased estimates. There are only a few neural methods with proper adjustments, but these have inherent limitations (e.g., division by propensity scores that are often close to zero), which result in poor performance. As a remedy, we introduce the iterative G-computation network (IGC-Net). Our IGC-Net is a novel, neural end-to-end model which adjusts for time-varying confounding in order to estimate conditional average potential outcomes (CAPOs) over time. Specifically, our IGC-Net is the first neural model to perform fully regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our IGC-Net across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.
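Iterative G-computation itself can be illustrated with OLS regressions standing in for IGC-Net's neural regressions, on a simulated two-step trajectory with time-varying confounding (the data-generating process below is hypothetical). Fitting proceeds backwards in time: regress the outcome, plug in the target treatment, and regress the resulting pseudo-outcome one step earlier.

```python
import numpy as np

def fit_predict(features, target, features_eval):
    """OLS stand-in for the neural regressions in IGC-Net (illustrative only)."""
    F = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(F, target, rcond=None)
    Fe = np.column_stack([np.ones(len(features_eval)), features_eval])
    return Fe @ beta

# Toy longitudinal data: X1 -> A1 -> X2 -> A2 -> Y, with time-varying confounding
# (X2 confounds A2 and Y, and is itself affected by A1).
rng = np.random.default_rng(0)
n = 5000
X1 = rng.normal(size=n)
A1 = (X1 + rng.normal(size=n) > 0).astype(float)
X2 = X1 + A1 + rng.normal(size=n)
A2 = (X2 + rng.normal(size=n) > 0).astype(float)
Y = X2 + A1 + A2 + rng.normal(size=n)

# Iterative G-computation for the plan (A1, A2) = (1, 1), fitted backwards:
step2 = fit_predict(np.column_stack([X1, A1, X2, A2]), Y,
                    np.column_stack([X1, A1, X2, np.ones(n)]))   # set A2 := 1
capo = fit_predict(np.column_stack([X1, A1]), step2,
                   np.column_stack([X1, np.ones(n)]))            # set A1 := 1
# Under this toy model the true CAPO given X1 is X1 + 3, which capo recovers.
```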

[282] Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation

Shojiro Yamabe, Kazuto Fukuchi, Jun Sakuma

Main category: cs.LG

TL;DR: Proposes imitation learning-based behavior-targeted attacks on RL agents and time-discounted regularization defense

DetailsMotivation: Existing behavior-targeted attacks on reinforcement learning have limitations like requiring white-box access to victim's policy; need for more practical attacks and effective defenses

Method: 1) Attack: Uses imitation learning from adversarial demonstrations that works with limited policy access and is environment-agnostic; 2) Defense: Time-discounted regularization based on theoretical analysis showing policy sensitivity to state changes impacts defense performance

Result: Proposed attack method effective under limited access conditions; defense strategy enhances robustness against attacks while maintaining task performance

Conclusion: First defense strategy specifically designed for behavior-targeted attacks in RL; theoretical insights about policy sensitivity guide effective defense design

Abstract: This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures. Behavior-targeted attacks aim to manipulate the victim’s behavior as desired by the adversary through adversarial interventions in state observations. Existing behavior-targeted attacks have some limitations, such as requiring white-box access to the victim’s policy. To address this, we propose a novel attack method using imitation learning from adversarial demonstrations, which works under limited access to the victim’s policy and is environment-agnostic. In addition, our theoretical analysis proves that the policy’s sensitivity to state changes impacts defense performance, particularly in the early stages of the trajectory. Based on this insight, we propose time-discounted regularization, which enhances robustness against attacks while maintaining task performance. To the best of our knowledge, this is the first defense strategy specifically designed for behavior-targeted attacks.
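The defense's shape can be sketched numerically; the exact regularizer is the paper's, so the form below (per-step policy sensitivity weighted by gamma**t) is an assumption for illustration. It penalizes sensitivity hardest early in the trajectory, where the theoretical analysis says attacks matter most.

```python
import numpy as np

def time_discounted_penalty(sensitivities, gamma=0.5):
    """Hypothetical shape of a time-discounted regularizer: per-step policy
    sensitivity ||pi(s_t + delta) - pi(s_t)|| weighted by gamma**t, so
    early-trajectory sensitivity is penalized most."""
    t = np.arange(len(sensitivities))
    return float(np.sum(gamma ** t * np.asarray(sensitivities, dtype=float)))

early = time_discounted_penalty([1.0, 0.0, 0.0])   # sensitivity placed early
late = time_discounted_penalty([0.0, 0.0, 1.0])    # same total, placed late
# early (1.0) exceeds late (0.25): the early step carries full weight.
```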

[283] Individualized Federated Learning for Traffic Prediction with Error Driven Aggregation

Hang Chen, Collin Meese, Mark Nejad, Chien-Chung Shen

Main category: cs.LG

TL;DR: NeighborFL: A federated learning scheme for traffic prediction using personalized local models grouping based on haversine distance and error-driven heuristics to address non-IID traffic data and enable real-time model updates.

DetailsMotivation: Current federated learning traffic prediction frameworks lack real-time model updating capabilities and rely on conventional aggregation methods that assign identical global models to all devices, neglecting the non-IID characteristics of traffic data from different locations.

Method: Proposes NeighborFL, an individualized real-time federated learning scheme that uses haversine distance-based and error-driven personalized local models grouping heuristics from each traffic node’s perspective to create location-aware, tailored prediction models while enabling collaborative learning.

Result: Simulations show NeighborFL offers improved real-time prediction accuracy over three baseline models, with one experimental setting achieving a 16.9% reduction in MSE value compared to a naive FL setting.

Conclusion: NeighborFL effectively addresses limitations of existing FLTP frameworks by enabling real-time model updates and handling non-IID traffic data through personalized local model grouping, resulting in improved prediction accuracy for smart city traffic management.

Abstract: Low-latency traffic prediction is vital for smart city traffic management. Federated Learning has emerged as a promising technique for Traffic Prediction (FLTP), offering several advantages such as privacy preservation, reduced communication overhead, improved prediction accuracy, and enhanced adaptability to changing traffic conditions. However, the majority of current FLTP frameworks lack a real-time model updating scheme, which hinders their ability to continuously incorporate new incoming traffic data and adapt effectively to the changing dynamics of traffic trends. Another concern with the existing FLTP frameworks is their reliance on the conventional FL model aggregation method, which involves assigning an identical model (i.e., the global model) to all traffic monitoring devices to predict their individual local traffic trends, thereby neglecting the non-IID characteristics of traffic data collected in different locations. Building upon these findings and harnessing insights from reinforcement learning, we propose NeighborFL, an individualized real-time federated learning scheme that introduces a haversine distance-based and error-driven, personalized local models grouping heuristic from the perspective of each individual traffic node. This approach allows NeighborFL to create location-aware and tailored prediction models for each client while fostering collaborative learning. Simulations demonstrate the effectiveness of NeighborFL, offering improved real-time prediction accuracy over three baseline models, with one experimental setting showing a 16.9% reduction in MSE value compared to a naive FL setting.
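The distance side of the grouping heuristic is easy to make concrete; the coordinates and radius below are hypothetical, and the error-driven keep/drop step is only indicated in a comment.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def candidate_neighbors(me, others, radius_km):
    """Distance-based candidate set; NeighborFL would then keep or drop each
    candidate depending on whether aggregating its model reduces local
    prediction error (the error-driven part, not shown)."""
    return [i for i, (lat, lon) in enumerate(others)
            if haversine_km(me[0], me[1], lat, lon) <= radius_km]

me = (39.68, -75.75)                          # hypothetical sensor location
others = [(39.70, -75.75), (40.50, -74.00)]   # one nearby node, one far away
near = candidate_neighbors(me, others, radius_km=50)   # [0]
```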

[284] Score-based change point detection via tracking the best of infinitely many experts

Anna Markovich, Nikita Puchkin

Main category: cs.LG

TL;DR: Online change point detection algorithm using sequential score function estimation and tracking best expert approach with infinite experts and quadratic loss.

DetailsMotivation: The paper addresses the problem of online change point detection in nonparametric settings, where traditional methods may not apply or require strong assumptions about data distributions.

Method: Uses sequential score function estimation combined with a tracking the best expert approach. Implements a version of the fixed share forecaster adapted for infinite number of experts and quadratic loss functions.

Result: Algorithm shows promising results in numerical experiments on both artificial and real-world datasets. Performance is supported by rigorous high-probability bounds for test statistic behavior in pre-change and post-change regimes.

Conclusion: Proposes an effective nonparametric online change point detection method with theoretical guarantees and empirical validation.

Abstract: We propose an algorithm for nonparametric online change point detection based on sequential score function estimation and the tracking the best expert approach. The core of the procedure is a version of the fixed share forecaster tailored to the case of an infinite number of experts and quadratic loss functions. The algorithm shows promising results in numerical experiments on artificial and real-world data sets. Its performance is supported by rigorous high-probability bounds describing behaviour of the test statistic in the pre-change and post-change regimes.
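For finitely many experts, one fixed share round is a few lines; the paper's contribution is extending this to infinitely many experts with quadratic losses, which the sketch below does not attempt.

```python
import numpy as np

def fixed_share_step(weights, losses, eta=1.0, alpha=0.05):
    """One round of the fixed share forecaster over finitely many experts:
    an exponential-weights update followed by mixing in a uniform share alpha,
    which lets the forecaster track a changing best expert."""
    w = weights * np.exp(-eta * losses)
    w /= w.sum()
    return (1 - alpha) * w + alpha / len(w)

w = np.full(3, 1 / 3)
w = fixed_share_step(w, np.array([0.0, 1.0, 1.0]))
# Expert 0 (zero loss) gains weight, but every expert keeps at least alpha/3.
```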

[285] Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning

Olivier Lepel, Anas Barakat

Main category: cs.LG

TL;DR: Policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in RL, with algorithm design and convergence guarantees

DetailsMotivation: Extend RL to incorporate behavioral economics principles from Cumulative Prospect Theory, which models human decision-making under risk with asymmetric utility and probability distortion

Method: Derive policy gradient theorem for CPT objectives, design first-order policy gradient algorithm using Monte Carlo gradient estimator based on order statistics

Result: Established statistical guarantees for the estimator and proved asymptotic convergence to first-order stationary points of the non-convex CPT objective

Conclusion: Successfully extended policy gradient methods to CPT-RL, providing theoretical foundations and practical algorithm for risk-sensitive RL with behavioral economics principles

Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
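The order-statistics flavor of the CPT estimator can be sketched from sampled returns. This is a simplified rank-dependent version with Tversky-Kahneman-style defaults: it applies one inverse-S weighting to the whole sorted sample, whereas full CPT weights gains and losses from opposite tails.

```python
import numpy as np

def cpt_value(returns, alpha=0.88, lam=2.25, gamma=0.61, ref=0.0):
    """Rank-dependent estimate of a CPT-style objective from sampled returns
    (sorted, i.e. order statistics): decision weights from differences of a
    probability-weighting function, applied to a loss-averse utility around
    a reference point."""
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    def w(p):                                   # inverse-S probability weighting
        return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)
    def u(v):                                   # asymmetric utility around ref
        d = v - ref
        return np.where(d >= 0,
                        np.clip(d, 0, None) ** alpha,
                        -lam * np.clip(-d, 0, None) ** alpha)
    dw = w(np.arange(1, n + 1) / n) - w(np.arange(n) / n)   # decision weights
    return float(np.sum(dw * u(x)))

# Loss aversion: a -1/+1 coin flip has negative CPT value.
mixed = cpt_value([-1.0, 1.0])
```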

[286] ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

Ehsan Futuhi, Shayan Karimi, Chao Gao, Martin Müller

Main category: cs.LG

TL;DR: ETGL-DDPG enhances DDPG for sparse-reward RL with ε_t-greedy exploration, dual replay buffers (GDRB), and longest n-step returns, achieving state-of-the-art performance in continuous control tasks.

DetailsMotivation: Deep Deterministic Policy Gradient (DDPG) struggles with sparse rewards due to insufficient exploration and inefficient use of rare reward signals. The paper aims to improve DDPG's performance in sparse-reward continuous control environments.

Method: Three key techniques: 1) ε_t-greedy search for exploration with polynomial sample complexity guarantees, 2) GDRB (dual experience replay buffer) to efficiently use rewarded transitions, and 3) longest n-step returns for better credit assignment. These are integrated into ETGL-DDPG.

Result: ETGL-DDPG outperforms vanilla DDPG and other state-of-the-art methods across all tested sparse-reward continuous environments. Ablation studies confirm each component individually enhances DDPG performance.

Conclusion: The proposed techniques effectively address DDPG’s limitations in sparse-reward settings, with ε_t-greedy providing theoretical exploration guarantees and practical components improving sample efficiency and performance.

Abstract: We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, \emph{$ε_t$-greedy}, which generates exploratory options for exploring less-visited states. We prove that search using $ε_t$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, \emph{GDRB}, and implement \emph{longest $n$-step returns}. The resulting algorithm, \emph{ETGL-DDPG}, integrates all three techniques: $ε_t$-greedy, \textbf{G}DRB, and \textbf{L}ongest $n$-step, into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
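Of the three components, the longest n-step target is the easiest to make concrete: instead of a fixed small n, each step's target discounts all remaining rewards in the trajectory and bootstraps only at its end. A minimal sketch under the usual bootstrapped-return convention:

```python
import numpy as np

def longest_nstep_targets(rewards, bootstrap, gamma=0.99):
    """'Longest n-step' target for each step t: discount every remaining reward
    in the trajectory (n = T - t) and bootstrap only at the trajectory's end.
    `bootstrap` is the critic's value at the final state, or 0 if terminal."""
    T = len(rewards)
    targets = np.empty(T)
    g = bootstrap
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        targets[t] = g
    return targets

# Sparse reward only at the last step; gamma=0.5 for readable numbers.
tgt = longest_nstep_targets(np.array([0.0, 0.0, 1.0]), 0.0, gamma=0.5)
```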

[287] NeuroLifting: Neural Inference on Markov Random Fields at Scale

Yaomin Wang, Chaolong Ying, Xiaodong Luo, Tianshu Yu

Main category: cs.LG

TL;DR: NeuroLifting uses Graph Neural Networks to reparameterize MRF variables for gradient descent optimization, achieving near-exact solver performance on moderate scales and superior results on large-scale problems with linear complexity growth.

DetailsMotivation: Traditional MRF inference methods (belief propagation, mean field, exact solvers like Toulbar2) struggle to balance efficiency and solution quality, especially at large scales. There's a need for scalable methods that maintain high solution quality.

Method: NeuroLifting leverages Graph Neural Networks to reparameterize decision variables in MRFs, extending traditional lifting techniques into a non-parametric neural framework. This enables gradient descent optimization on a smooth loss landscape, making optimization efficient and parallelizable.

Result: On moderate-scale MRFs, NeuroLifting performs very close to exact solver Toulbar2 in solution quality, significantly surpassing existing approximate methods. On large-scale MRFs, it delivers superior solution quality against all baselines while exhibiting linear computational complexity growth.

Conclusion: NeuroLifting represents a significant advancement in MRF inference, offering a scalable and effective solution for large-scale problems by combining neural network smoothness with traditional lifting techniques.

Abstract: Inference in large-scale Markov Random Fields (MRFs) is a critical yet challenging task, traditionally approached through approximate methods like belief propagation and mean field, or exact methods such as the Toulbar2 solver. These strategies often fail to strike an optimal balance between efficiency and solution quality, particularly as the problem scale increases. This paper introduces NeuroLifting, a novel technique that leverages Graph Neural Networks (GNNs) to reparameterize decision variables in MRFs, facilitating the use of standard gradient descent optimization. By extending traditional lifting techniques into a non-parametric neural network framework, NeuroLifting benefits from the smooth loss landscape of neural networks, enabling efficient and parallelizable optimization. Empirical results demonstrate that, on moderate scales, NeuroLifting performs very close to the exact solver Toulbar2 in terms of solution quality, significantly surpassing existing approximate methods. Notably, on large-scale MRFs, NeuroLifting delivers superior solution quality against all baselines, as well as exhibiting linear computational complexity growth. This work presents a significant advancement in MRF inference, offering a scalable and effective solution for large-scale problems.
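The "lift then gradient-descend" idea can be sketched without the GNN: free softmax logits stand in for NeuroLifting's network outputs on a tiny two-node MRF, and numerical gradients keep the sketch dependency-free. Everything below (the toy potentials, the plain-logit parameterization) is illustrative, not the paper's architecture.

```python
import numpy as np

def relaxed_mrf_energy(theta, unary, pairwise, edges):
    """Expected energy of a pairwise MRF under independent softmax marginals.
    NeuroLifting produces these logits with a GNN; free logits stand in here."""
    q = np.exp(theta - theta.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    e = np.sum(q * unary)
    for (i, j) in edges:
        e += q[i] @ pairwise @ q[j]
    return e, q

# Tiny 2-node, 2-label MRF: node 0 prefers label 0; the edge rewards agreement.
unary = np.array([[0.0, 2.0], [1.0, 1.0]])
pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])   # penalize disagreement
edges = [(0, 1)]

theta = np.zeros((2, 2))
for _ in range(300):                             # gradient descent on the logits
    base = relaxed_mrf_energy(theta, unary, pairwise, edges)[0]
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        tp = theta.copy(); tp[idx] += 1e-5
        grad[idx] = (relaxed_mrf_energy(tp, unary, pairwise, edges)[0] - base) / 1e-5
    theta -= 0.5 * grad

energy, q = relaxed_mrf_energy(theta, unary, pairwise, edges)
labels = q.argmax(1)    # converges to the minimum-energy assignment [0, 0]
```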

[288] RobustBlack: Challenging Black-Box Adversarial Attacks on State-of-the-Art Defenses

Mohamed Djilani, Salah Ghamizi, Maxime Cordy

Main category: cs.LG

TL;DR: Black-box attacks struggle against robust models, especially adversarially trained ones, and robustness alignment between surrogate and target models is crucial for transfer attacks

DetailsMotivation: There's a significant gap in evaluating black-box attacks against robust models, as most benchmarks focus on weak defenses rather than modern robust models featured in leaderboards like Robustbench

Method: Established a framework to evaluate recent black-box attacks against top-performing and standard defense mechanisms on ImageNet, examining both transfer-based and query-based approaches

Result: 1) Advanced black-box attacks struggle against simple adversarially trained models; 2) Models robust against white-box attacks (like AutoAttack) also show enhanced resilience to black-box attacks; 3) Robustness alignment between surrogate and target models is key for transfer attack success

Conclusion: Robust models provide significant protection against black-box attacks, and the effectiveness of transfer-based attacks depends heavily on the robustness alignment between surrogate and target models

Abstract: Although adversarial robustness has been extensively studied in white-box settings, recent advances in black-box attacks (including transfer- and query-based approaches) are primarily benchmarked against weak defenses, leaving a significant gap in the evaluation of their effectiveness against more recent and moderate robust models (e.g., those featured in the Robustbench leaderboard). In this paper, we question this lack of attention from black-box attacks to robust models. We establish a framework to evaluate the effectiveness of recent black-box attacks against both top-performing and standard defense mechanisms, on the ImageNet dataset. Our empirical evaluation reveals the following key findings: (1) the most advanced black-box attacks struggle to succeed even against simple adversarially trained models; (2) robust models that are optimized to withstand strong white-box attacks, such as AutoAttack, also exhibit enhanced resilience against black-box attacks; and (3) robustness alignment between the surrogate models and the target model plays a key role in the success rate of transfer-based attacks.

[289] Functional multi-armed bandit and the best function identification problems

Yuriy Dorn, Aleksandr Katrutsa, Ilgam Latypov, Anastasiia Soboleva

Main category: cs.LG

TL;DR: The paper proposes two new problem classes (Functional MAB and Best Function Identification) for black-box function optimization with bandit feedback, introduces a reduction scheme to create UCB-type algorithms, and applies this to competitive LLM training.

DetailsMotivation: Current bandit optimization terminology is misleading as it lacks connection to multi-armed bandits. The authors aim to create better problem classes that model real-world scenarios like competitive LLM training where each arm represents an unknown black-box function.

Method: Proposes Functional MAB (FMAB) and Best Function Identification problem classes. Introduces a reduction scheme to construct F-LCB algorithms (UCB-type) based on existing nonlinear optimization algorithms with known convergence rates.

Result: Provides theoretical regret upper bounds for the reduction scheme based on base algorithms’ convergence rates. Includes numerical experiments demonstrating the performance of the proposed scheme.

Conclusion: The proposed problem classes better model real-world optimization problems like competitive LLM training, and the reduction scheme provides effective algorithms with theoretical guarantees.

Abstract: Bandit optimization usually refers to the class of online optimization problems with limited feedback, namely, a decision maker uses only the objective value at the current point to make a new decision and does not have access to the gradient of the objective function. While this name accurately captures the limitation in feedback, it is somewhat misleading since it does not have any connection with the multi-armed bandits (MAB) problem class. We propose two new classes of problems: the functional multi-armed bandit problem (FMAB) and the best function identification problem. They are modifications of a multi-armed bandit problem and the best arm identification problem, respectively, where each arm represents an unknown black-box function. These problem classes are a surprisingly good fit for modeling real-world problems such as competitive LLM training. To solve the problems from these classes, we propose a new reduction scheme to construct UCB-type algorithms, namely, the F-LCB algorithm, based on algorithms for nonlinear optimization with known convergence rates. We provide the regret upper bounds for this reduction scheme based on the base algorithms’ convergence rates. We add numerical experiments that demonstrate the performance of the proposed scheme.
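To make the FMAB setting concrete, here is a toy F-LCB-style rule; the index and update below are assumptions for illustration, not the paper's algorithm. Each arm is a black-box scalar function being minimized, and each round advances one optimization step on the arm whose current value minus an exploration bonus (an optimistic lower confidence bound) is smallest.

```python
import math

def f_lcb(arms, horizon, step=0.1, c=1.0):
    # Toy functional-bandit loop: maintain an iterate per arm, pick the
    # arm with the smallest lower confidence bound on its current value,
    # and advance it by one finite-difference gradient step.
    K = len(arms)
    x = [0.0] * K                  # current iterate per arm
    pulls = [1] * K
    vals = [f(0.0) for f in arms]  # current objective value per arm
    for t in range(K, horizon):
        lcb = [vals[i] - c * math.sqrt(math.log(t + 1) / pulls[i])
               for i in range(K)]
        i = min(range(K), key=lambda j: lcb[j])
        g = (arms[i](x[i] + 1e-4) - arms[i](x[i] - 1e-4)) / 2e-4
        x[i] -= step * g
        vals[i] = arms[i](x[i])
        pulls[i] += 1
    best = min(range(K), key=lambda j: vals[j])
    return best, vals[best]
```

The exploration bonus keeps suboptimal arms from being abandoned before their optimization has had a chance to converge, which is the role the base algorithms' convergence rates play in the paper's regret analysis.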

[290] CAAT-EHR: Cross-Attentional Autoregressive Transformer for Multimodal Electronic Health Record Embeddings

Mohammad Al Olaimat, Shaika Chowdhury, Serdar Bozdag

Main category: cs.LG

TL;DR: CAAT-EHR: Cross-Attentional Autoregressive Transformer for learning generalizable, multimodal patient representations from Electronic Health Records using self-attention for temporal dependencies and cross-attention for modality fusion.

DetailsMotivation: Existing deep learning methods for EHR analysis often optimize for specific downstream tasks and overlook creating generalizable patient representations that can be reused across multiple clinical tasks, despite EHRs containing rich multimodal data across structured and unstructured formats.

Method: Proposes CAAT-EHR architecture with self-attention layers capturing temporal dependencies within each modality, cross-attention layers fusing information across modalities, and an autoregressive decoder that predicts future time steps during pre-training to enforce temporal consistency and enrich encoder outputs.

Result: CAAT-EHR demonstrates significant improvements on benchmark EHR datasets for mortality prediction, ICU length-of-stay estimation, and Alzheimer’s disease diagnosis prediction, outperforming models trained on raw EHR data in 11 out of 12 comparisons for F1 score and AUC across all three tasks.

Conclusion: CAAT-EHR provides a unified framework for learning generalizable, temporally consistent multimodal EHR representations that support more reliable clinical decision support systems, with ablation studies confirming the critical roles of cross-modality fusion and autoregressive refinement.

Abstract: Electronic Health Records (EHRs) contain rich, longitudinal patient information across structured (e.g., labs, vitals, and imaging) and unstructured (e.g., clinical notes) modalities. While deep learning models such as RNNs and Transformers have advanced single- and multimodal EHR analysis, existing methods often optimize for specific downstream tasks and overlook the creation of generalizable patient representations that can be reused across multiple tasks. To address this gap, we propose CAAT-EHR, a novel Cross-Attentional Autoregressive Transformer architecture that produces task-agnostic, longitudinal embeddings of multimodal EHR data. In CAAT-EHR, self-attention layers capture temporal dependencies within each modality, while cross-attention layers fuse information across modalities to model complex interrelationships. During pre-training, an autoregressive decoder predicts future time steps from the fused embeddings, enforcing temporal consistency and enriching the encoder output. Once trained, the encoder alone generates versatile multimodal EHR embeddings that can be applied directly to a variety of predictive tasks. CAAT-EHR demonstrates significant improvements on benchmark EHR datasets for mortality prediction, ICU length-of-stay estimation, and Alzheimer’s disease diagnosis prediction. Models using EHR embeddings generated by CAAT-EHR outperform models trained on raw EHR data in eleven out of twelve comparisons for F1 score and AUC across all three downstream tasks. Ablation studies confirm the critical roles of cross-modality fusion and autoregressive refinement. Overall, CAAT-EHR provides a unified framework for learning generalizable, temporally consistent multimodal EHR representations that support more reliable clinical decision support systems.
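The fusion step can be illustrated with a minimal scaled dot-product cross-attention in plain Python. This is illustrative only: CAAT-EHR's actual layers are learned multi-head modules, while here the queries simply come from one modality and the keys/values from another.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    # Each query position (modality A) is rewritten as an
    # attention-weighted mix of the value vectors (modality B).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

A query that aligns with one modality's key pulls the output toward that key's value, which is how cross-attention lets, say, a lab-value time step attend to the relevant clinical note.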

[291] Clone-Robust Weights in Metric Spaces: Handling Redundancy Bias for Benchmark Aggregation

Damien Berriaud, Roger Wattenhofer

Main category: cs.LG

TL;DR: The paper proposes clone-proof weighting functions for elements in metric spaces to resist adversarial manipulations, extending maximum uncertainty principles with symmetry, continuity, and clone-proofness axioms.

DetailsMotivation: The paper addresses the problem of weighting elements in metric spaces where the distribution may be adversarial. This arises in contexts like robust domain adaptation (data points), benchmark aggregation (tasks), or voting advice applications (political opinions). The need is for weighting schemes resistant to manipulation through duplication of similar elements.

Method: The authors introduce a theoretical framework with clone-proof weighting functions as the solution concept. They extend the maximum uncertainty principle to general metric spaces and propose three axioms: symmetry (permutation invariance), continuity (small changes in positions cause small weight changes), and clone-proofness (similar elements share weights to avoid bias from multiplicity). They address existence in Euclidean spaces and provide construction methods.

Result: The paper establishes a theoretical framework for clone-proof weighting in metric spaces. It demonstrates the existence of weighting functions satisfying the proposed axioms in Euclidean spaces and provides general construction methods for such functions.

Conclusion: Clone-proof weighting functions provide a principled approach to resist adversarial manipulations in metric spaces. The framework extends maximum uncertainty principles and offers practical solutions for applications requiring robust weighting of elements where distributions may be adversarial.

Abstract: We are given a set of elements in a metric space. The distribution of the elements is arbitrary, possibly adversarial. Can we weigh the elements in a way that is resistant to such (adversarial) manipulations? This problem arises in various contexts. For instance, the elements could represent data points, requiring robust domain adaptation. Alternatively, they might represent tasks to be aggregated into a benchmark; or questions about personal political opinions in voting advice applications. This article introduces a theoretical framework for dealing with such problems. We propose clone-proof weighting functions as a solution concept. These functions distribute importance across elements of a set such that similar objects (“clones”) share (some of) their weights, thus avoiding a potential bias introduced by their multiplicity. Our framework extends the maximum uncertainty principle to accommodate general metric spaces and includes a set of axioms – symmetry, continuity, and clone-proofness – that guide the construction of weighting functions. Finally, we address the existence of weighting functions satisfying our axioms in the significant case of Euclidean spaces and propose a general method for their construction.
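A crude way to see what clone-proofness asks for is a weighting in which near-duplicates split a unit of weight. The rule below is deliberately naive and is not the paper's construction: it is discontinuous in the element positions (a point crossing the eps threshold changes weights abruptly), so it violates the continuity axiom, which is precisely why a more careful construction is needed.

```python
def similarity_weights(points, eps=1e-6):
    # Naive clone-splitting rule: each point's weight is the reciprocal
    # of the number of points within distance eps of it, so a cluster of
    # near-duplicates shares one unit of weight instead of multiplying
    # its influence.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [1.0 / sum(1 for q in points if dist(p, q) <= eps)
            for p in points]
```

Duplicating an element leaves its cluster's total weight unchanged, which is the redundancy-bias behavior the paper axiomatizes.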

[292] Efficient and Sharp Off-Policy Learning under Unobserved Confounding

Konstantin Hess, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel

Main category: cs.LG

TL;DR: A novel method for personalized off-policy learning that addresses unobserved confounding using causal sensitivity analysis and semi-parametrically efficient estimators.

DetailsMotivation: Standard policy learning assumes unconfoundedness (no unobserved factors affecting both treatment and outcomes), which is often violated in practice, leading to biased estimates and potentially harmful policies.

Method: Uses causal sensitivity analysis to derive a semi-parametrically efficient estimator for sharp bounds on the value function under unobserved confounding, avoiding unstable minimax optimization based on inverse propensity weighting.

Result: The method outperforms simple plug-in approaches and existing baselines in experiments with synthetic and real-world data, and is proven to lead to optimal confounding-robust policies.

Conclusion: Provides a robust approach for decision-making in domains like healthcare and public policy where unobserved confounding is problematic, with theoretical guarantees of optimality and efficiency.

Abstract: We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. We thereby address a key limitation of standard policy learning: it assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated in practice, in which case standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a semi-parametrically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is semi-parametrically efficient. (3) We prove that our estimator leads to the optimal confounding-robust policy. Finally, we extend our theory to the related task of policy improvement under unobserved confounding, i.e., when a baseline policy such as the standard of care is available. We show in experiments with synthetic and real-world data that our method outperforms simple plug-in approaches and existing baselines. Our method is highly relevant for decision-making where unobserved confounding can be problematic, such as in healthcare and public policy.
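For context, the simple plug-in baseline such methods are compared against can be as basic as an inverse-propensity-weighted value estimate. This sketch uses illustrative names; under unobserved confounding the recorded propensities are wrong for some units, so this point estimate can be badly biased, which is what motivates estimating sharp bounds instead.

```python
def ipw_value(data, policy, propensity):
    # Standard IPW plug-in estimate of a policy's value.
    # data: list of (covariates, observed_action, observed_outcome).
    # Outcomes where the logged action matches the policy's choice are
    # reweighted by the inverse of the logging propensity.
    total = 0.0
    for x, action, outcome in data:
        if policy(x) == action:
            total += outcome / propensity(x, action)
    return total / len(data)
```

When propensities are correct and there is no hidden confounding, this is unbiased; the paper's setting is exactly the case where that premise fails.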

[293] Qronos: Correcting the Past by Shaping the Future… in Post-Training Quantization

Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab

Main category: cs.LG

TL;DR: Qronos is a new post-training quantization algorithm that sequentially rounds and updates neural network weights, explicitly correcting errors from weight, activation, and previous layer quantization through an iterative optimization framework.

DetailsMotivation: Existing quantization methods often fail to adequately address cumulative errors from quantizing multiple layers and components (weights, activations, KV caches) in large language models, leading to significant performance degradation.

Method: Qronos uses an iterative algorithm based on an interpretable optimization framework that alternates between error correction and diffusion via optimal update rules, with efficient implementation using Cholesky decomposition for solving least-squares problems.

Result: Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing weights, activations, and KV caches in Llama3 family models.

Conclusion: Qronos represents a significant advancement in post-training quantization, offering a disciplined optimization approach that effectively handles cumulative quantization errors and is compatible with existing transformation techniques.

Abstract: We introduce Qronos – a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.
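The generic round-then-compensate idea behind sequential rounding can be shown in one dimension with a toy error-diffusion quantizer. This is not Qronos's update rule, which solves least-squares corrections against calibration data via a Cholesky factorization; it only illustrates why correcting the past by shaping the future beats rounding each weight in isolation.

```python
def round_with_diffusion(weights, step=0.5):
    # Toy 1-D sequential quantizer: snap each weight to the grid, then
    # carry the rounding error into the next unquantized weight so that
    # errors cancel instead of accumulating.
    q = []
    carry = 0.0
    for w in weights:
        target = w + carry
        snapped = round(target / step) * step
        q.append(snapped)
        carry = target - snapped
    return q
```

On [0.3, 0.3, 0.3] with a grid step of 0.5, naive rounding drifts by 0.6 in total, while the diffused version keeps the cumulative error within half a grid step.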

[294] How Global Calibration Strengthens Multiaccuracy

Sílvia Casacuberta, Parikshit Gopalan, Varun Kanade, Omer Reingold

Main category: cs.LG

TL;DR: Multiaccuracy alone is weak for fairness learning, but combined with global calibration (calibrated multiaccuracy) it becomes powerful enough to achieve strong agnostic learning and optimal hardcore measures.

DetailsMotivation: To understand the power of multiaccuracy as a learning primitive for multigroup fairness, and investigate how calibration enhances its capabilities compared to standard weak agnostic learning.

Method: Theoretical analysis comparing multiaccuracy, calibrated multiaccuracy, and multicalibration as fairness notions. Examines whether multiaccurate predictors can be post-processed to get weak learners, and analyzes their ability to derive hardcore measures.

Result: Multiaccuracy alone is weak and cannot be post-processed to get weak learners even with strong correlation assumptions. Calibrated multiaccuracy enables strong agnostic learning and achieves optimal density hardcore measures, while multiaccuracy only yields half-optimal density.

Conclusion: Multiaccuracy and global calibration are individually weak but together form a powerful combination (calibrated multiaccuracy) that achieves strong learning outcomes, revealing complementary roles in multigroup fairness.

Abstract: Multiaccuracy and multicalibration are multigroup fairness notions for prediction that have found numerous applications in learning and computational complexity. They can be achieved from a single learning primitive: weak agnostic learning. Here we investigate the power of multiaccuracy as a learning primitive, both with and without the additional assumption of calibration. We find that multiaccuracy in itself is rather weak, but that the addition of global calibration (this notion is called calibrated multiaccuracy) boosts its power substantially, enough to recover implications that were previously known only assuming the stronger notion of multicalibration. We give evidence that multiaccuracy might not be as powerful as standard weak agnostic learning, by showing that there is no way to post-process a multiaccurate predictor to get a weak learner, even assuming the best hypothesis has correlation $1/2$. Rather, we show that it yields a restricted form of weak agnostic learning, which requires some concept in the class to have correlation greater than $1/2$ with the labels. However, by also requiring the predictor to be calibrated, we recover not just weak, but strong agnostic learning. A similar picture emerges when we consider the derivation of hardcore measures from predictors satisfying multigroup fairness notions. On the one hand, while multiaccuracy only yields hardcore measures of density half the optimal, we show that (a weighted version of) calibrated multiaccuracy achieves optimal density. Our results yield new insights into the complementary roles played by multiaccuracy and calibration in each setting. They shed light on why multiaccuracy and global calibration, although not particularly powerful by themselves, together yield considerably stronger notions.
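Multiaccuracy itself is straightforward to audit: it only asks that the prediction residual be nearly uncorrelated with every test function in the class. A minimal checker (illustrative code, not from the paper):

```python
def multiaccuracy_violation(preds, labels, test_fns):
    # Multiaccuracy w.r.t. a class C requires the residual f(x) - y to
    # have small correlation with every c in C. Each entry of test_fns
    # is the vector of c(x_i) values for one test function; return the
    # largest absolute residual correlation observed.
    n = len(preds)
    worst = 0.0
    for c in test_fns:
        corr = sum((preds[i] - labels[i]) * c[i] for i in range(n)) / n
        worst = max(worst, abs(corr))
    return worst
```

A predictor can pass the global (all-ones) test while failing a subgroup test, which is the kind of gap multigroup notions are designed to expose; calibration imposes a further, orthogonal condition on the predictions themselves.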

[295] Latent Veracity Inference for Identifying Errors in Stepwise Reasoning

Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio

Main category: cs.LG

TL;DR: The paper proposes Veracity Search (VS), a discrete search algorithm that augments Chain-of-Thought reasoning with latent veracity variables to identify and correct inaccurate statements in reasoning chains, and introduces Amortized Veracity Inference (AVI) for zero-shot veracity inference.

DetailsMotivation: Chain-of-Thought reasoning in language models often contains inaccurate statements that reduce performance and trustworthiness. Current methods lack efficient ways to verify the correctness of intermediate reasoning steps, leading to propagation of errors through the reasoning chain.

Method: The approach introduces latent veracity variables for each reasoning step in CoT. Veracity Search (VS) performs discrete search over veracity assignments using the LM’s joint likelihood over veracity and final answer as a proxy reward. This enables supervised fine-tuning of Amortized Veracity Inference (AVI) models that generalize to zero-shot veracity inference.

Result: VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks. AVI achieves comparable zero-shot accuracy to VS, demonstrating effective generalization. The method also shows utility for providing feedback during self-correction and self-improvement.

Conclusion: Latent veracity inference provides an effective approach for verifying reasoning chains in language models, improving accuracy and trustworthiness. The proposed methods enable both inference-time verification and zero-shot generalization to novel contexts.

Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM’s joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
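The space VS searches can be shown with a brute-force stand-in. This is illustrative only: `proxy_reward` abstracts the LM's joint likelihood over veracity and the final answer, and VS itself avoids enumerating all 2^n assignments.

```python
from itertools import product

def veracity_search(num_steps, proxy_reward):
    # Brute-force inference over latent veracity variables: score every
    # true/false labeling of the reasoning steps with the caller's proxy
    # reward and return the highest-scoring assignment.
    best, best_score = None, float("-inf")
    for assignment in product([True, False], repeat=num_steps):
        s = proxy_reward(assignment)
        if s > best_score:
            best, best_score = list(assignment), s
    return best, best_score
```

The returned assignments are exactly the kind of pseudo-labels the paper uses to fine-tune the amortized (AVI) model.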

[296] Hybrid quantum recurrent neural network for remaining useful life prediction

Olga Tsurkan, Aleksandra Konstantinova, Aleksandr Sedykh, Arsenii Senokosov, Daniil Tarpanov, Matvei Anoshin, Asel Sagingalieva, Alexey Melnikov

Main category: cs.LG

TL;DR: Hybrid quantum-classical RNN using Quantum LSTM gates for jet engine RUL prediction, showing improved performance over classical methods despite fewer parameters.

DetailsMotivation: Predictive maintenance in aerospace requires accurate remaining useful life (RUL) estimation for jet engines. The paper aims to leverage quantum computing advantages for time-series forecasting under limited-data conditions.

Method: Hybrid Quantum Recurrent Neural Network framework combining Quantum Long Short-Term Memory (QLSTM) layers with classical dense layers. QLSTM gates replace conventional linear transformations with Quantum Depth-Infused circuits to better capture high-frequency components.

Result: Achieves up to 5% improvement over stacked LSTM RNN in mean RMSE/MAE despite fewer parameters. Outperforms Random Forest (13.68% better), CNN (16.21% better), and MLP (7.87% better) with RMSE of 15.46. Some advanced joint architectures still outperform it.

Conclusion: Hybrid quantum-classical approaches show promise for robust time-series forecasting in limited-data scenarios, offering new avenues for enhancing reliability in predictive maintenance tasks.

Abstract: Predictive maintenance in aerospace heavily relies on accurate estimation of the remaining useful life of jet engines. In this paper, we introduce a Hybrid Quantum Recurrent Neural Network framework, combining Quantum Long Short-Term Memory layers with classical dense layers for Remaining Useful Life forecasting on NASA’s Commercial Modular Aero-Propulsion System Simulation dataset. Each Quantum Long Short-Term Memory gate replaces conventional linear transformations with Quantum Depth-Infused circuits, allowing the network to learn high-frequency components more effectively. Experimental results demonstrate that, despite having fewer trainable parameters, the Hybrid Quantum Recurrent Neural Network achieves up to a 5% improvement over a Recurrent Neural Network based on stacked Long Short-Term Memory layers in terms of mean root-mean-square error and mean absolute error. Moreover, a thorough comparison of our method with established techniques, including Random Forest, Convolutional Neural Network, and Multilayer Perceptron, demonstrates that our approach, which achieves a Root Mean Squared Error of 15.46, surpasses these baselines by approximately 13.68%, 16.21%, and 7.87%, respectively. Nevertheless, certain advanced joint architectures still outperform it. Our findings highlight the potential of hybrid quantum-classical approaches for robust time-series forecasting under limited-data conditions, offering new avenues for enhancing reliability in predictive maintenance tasks.

[297] Enhanced Generative Model Evaluation with Clipped Density and Coverage

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

Main category: cs.LG

TL;DR: Clipped Density and Clipped Coverage: New robust metrics for evaluating generative model quality with interpretable scores that degrade linearly with proportion of bad samples.

DetailsMotivation: Current generative model evaluation metrics lack reliable, interpretable values due to absence of calibration and insufficient robustness to outliers, hindering use in critical applications.

Method: Introduces two novel metrics: Clipped Density and Clipped Coverage, which clip individual sample contributions and nearest neighbor ball radii to prevent out-of-distribution samples from biasing aggregated values.

Result: Metrics demonstrate linear score degradation as proportion of bad samples increases, allowing straightforward interpretation as equivalent proportions of good samples. Outperform existing methods in robustness, sensitivity, and interpretability.

Conclusion: Clipped Density and Clipped Coverage provide more reliable, interpretable evaluation of generative model quality, addressing shortcomings of current metrics for critical applications.

Abstract: Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by an inability to reliably evaluate the quality of their generated samples. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics: Clipped Density and Clipped Coverage. By clipping individual sample contributions, as well as the radii of nearest neighbor balls for fidelity, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics demonstrate linear score degradation as the proportion of bad samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability when evaluating generative models.
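The clipping idea can be sketched for a coverage-style metric as follows. The exact form of the authors' estimators differs; this only shows how capping nearest-neighbor radii stops outliers from faking coverage.

```python
def clipped_coverage(real, fake, k=1, clip=1.0):
    # A real point counts as covered if some generated point lies inside
    # its k-NN ball (radius taken among the other real points), with the
    # radius capped at `clip` so an outlier's huge ball cannot claim
    # coverage from arbitrarily distant generated samples.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    covered = 0
    for i, r in enumerate(real):
        radii = sorted(dist(r, o) for j, o in enumerate(real) if j != i)
        radius = min(radii[k - 1], clip)
        if any(dist(r, f) <= radius for f in fake):
            covered += 1
    return covered / len(real)
```

In the test below, an isolated real outlier has a huge unclipped 1-NN ball that a distant fake point "covers" for free; clipping the radius removes that spurious credit.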

[298] Variance-Optimal Arm Selection: Misallocation Minimization and Best Arm Identification

Sabrina Khurshid, Gourab Ghatak, Mohammad Shahid Abdulla

Main category: cs.LG

TL;DR: Novel algorithms for selecting arms with highest variance in multi-armed bandit settings, with applications to financial option trading.

DetailsMotivation: The paper addresses the problem of identifying the arm with highest variance in multi-armed bandit problems, which has applications in finance (e.g., option trading where variance affects option prices) and other domains where variance is the key metric of interest rather than mean reward.

Method: Develops two algorithms: UCB-VV for misallocation minimization (minimizing pulls of suboptimal arms) and SHVV for fixed-budget best arm identification. Extends framework from bounded to sub-Gaussian distributions using novel concentration inequalities for sample variance/standard deviation and empirical Sharpe ratio.

Result: UCB-VV achieves O(log n) misallocation bound (order optimal), SHVV achieves error probability decaying as exp(-n/(log(K)H)) (matching lower bound). Empirical results show UCB-VV outperforms ε-greedy, SHVV outperforms uniform sampling, and both perform well in call option trading case study.

Conclusion: The paper provides theoretically sound algorithms for variance-based arm selection with optimal regret bounds and practical effectiveness demonstrated in financial applications.

Abstract: This paper focuses on selecting the arm with the highest variance from a set of $K$ independent arms. Specifically, we focus on two settings: (i) a misallocation minimization setting, which penalizes the number of pulls of suboptimal arms in terms of variance, and (ii) a fixed-budget best arm identification setting, which evaluates the ability of an algorithm to determine the arm with the highest variance after a fixed number of pulls. We develop a novel online algorithm called UCB-VV for the misallocation minimization (MM) and show that its upper bound on misallocation for bounded rewards evolves as $\mathcal{O}\left(\log{n}\right)$ where $n$ is the horizon. By deriving the lower bound on the misallocation, we show that UCB-VV is order optimal. For the fixed budget best arm identification (BAI) setting we propose the SHVV algorithm. We show that the upper bound of the error probability of SHVV evolves as $\exp\left(-\frac{n}{\log(K) H}\right)$, where $H$ represents the complexity of the problem, and this rate matches the corresponding lower bound. We extend the framework from bounded distributions to sub-Gaussian distributions using a novel concentration inequality on the sample variance and standard deviation. Leveraging the same, we derive a concentration inequality for the empirical Sharpe ratio (SR) for sub-Gaussian distributions, which was previously unknown in the literature. Empirical simulations show that UCB-VV consistently outperforms $\varepsilon$-greedy across different sub-optimality gaps, though it is surpassed by VTS, which exhibits the lowest misallocation, albeit lacking theoretical guarantees. We also illustrate the superior performance of SHVV, for a fixed budget setting under six different setups, against uniform sampling. Finally, we conduct a case study to empirically evaluate the performance of UCB-VV and SHVV in call option trading on $100$ stocks generated using geometric Brownian motion (GBM).
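A variance-targeting UCB rule can be sketched as follows; the exploration bonus here is a generic assumption for illustration, not the paper's exact index.

```python
import math
import random

def sample_var(xs):
    # Unbiased sample variance (requires at least two samples).
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def ucb_vv_sketch(arms, horizon, c=1.0, seed=0):
    # Pull every arm twice so a sample variance exists, then repeatedly
    # pull the arm with the largest optimistic (UCB) estimate of its
    # variance; finally report the arm with the highest sample variance.
    rng = random.Random(seed)
    samples = [[arm(rng), arm(rng)] for arm in arms]
    for t in range(2 * len(arms), horizon):
        ucb = [sample_var(s) + c * math.sqrt(math.log(t) / len(s))
               for s in samples]
        i = max(range(len(arms)), key=lambda j: ucb[j])
        samples[i].append(arms[i](rng))
    return max(range(len(arms)), key=lambda j: sample_var(samples[j]))
```

In the option-trading motivation, each arm's reward stream would be a stock's returns and the index targets the most volatile one.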

[299] Partition Generative Modeling: Masked Modeling Without Masks

Justin Deschenaux, Lan Tran, Caglar Gulcehre

Main category: cs.LG

TL;DR: Partition Generative Models (PGMs) replace masking with partitioning to enable parallel, any-order generation while processing only clean tokens during sampling, achieving significant throughput improvements over masked generative models.

DetailsMotivation: Masked generative models (MGMs) can generate tokens in parallel and any order, but they process full-length sequences including uninformative mask tokens at every step. Autoregressive models (ARMs) process only previously generated tokens but generate sequentially. The goal is to combine the benefits of both approaches.

Method: PGMs replace masking with partitioning - tokens are split into two groups that cannot attend to each other. The model learns to predict each group conditioned on the other, eliminating mask tokens entirely. During sampling, PGMs process only clean tokens like ARMs while retaining parallel, any-order generation like MGMs.

Result: On OpenWebText, PGMs achieve 5-5.5× higher throughput than MDLM with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with 7.5× throughput improvement. With twice as many steps, FID improves to 4.56 while remaining 3.9× faster than MGMs.

Conclusion: PGMs offer an efficient alternative to MGMs by eliminating mask tokens while maintaining parallel generation capabilities, achieving significant speed improvements while maintaining or improving sample quality across text and image domains.

Abstract: Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including mask tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce “Partition Generative Models” (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve $5-5.5\times$ higher throughput than MDLM while producing samples with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with a $7.5\times$ throughput improvement. With twice as many steps, the FID improves to 4.56 while remaining $3.9\times$ faster than MGMs. Finally, PGMs remain compatible with existing MGM samplers and distillation methods.
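The data side of partitioning is simple to sketch. No model or attention masking is shown here; in PGMs the two groups additionally cannot attend to each other, and the function name is illustrative.

```python
import random

def partition_examples(tokens, seed=0):
    # Split the token positions into two random halves and emit two
    # (context, targets) training examples, each predicting one group's
    # tokens from the other's. No mask tokens are needed anywhere.
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    half = len(idx) // 2
    group_a = sorted(idx[:half])
    group_b = sorted(idx[half:])
    ctx_a = [(i, tokens[i]) for i in group_a]
    ctx_b = [(i, tokens[i]) for i in group_b]
    return (ctx_a, ctx_b), (ctx_b, ctx_a)
```

Every position in both training examples carries a real token, which is the contrast with masked modeling, where half the processed sequence can be uninformative mask symbols.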

[300] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

Main category: cs.LG

TL;DR: Filtering dual-use topics from training data makes LLMs resistant to adversarial fine-tuning attacks on biothreat knowledge, outperforming post-training methods by orders of magnitude.

DetailsMotivation: Open-weight AI systems are vulnerable to tampering attacks that can efficiently elicit harmful behaviors. Existing safety fine-tuning methods struggle to make LLMs resistant to adversarial fine-tuning beyond a few dozen steps. The paper investigates whether filtering dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard.

Method: Introduced a multi-stage pipeline for scalable data filtering to minimize biothreat proxy knowledge in LLMs. Pretrained multiple 6.9B-parameter models from scratch using filtered data and tested their resistance to adversarial fine-tuning attacks of up to 10,000 steps and 300M tokens of biothreat-related text.

Result: Filtered models exhibited substantial resistance to adversarial fine-tuning attacks, outperforming existing post-training baselines by over an order of magnitude, with no observed degradation to unrelated capabilities. However, filtered models could still leverage dangerous information when provided in context (e.g., via search tool augmentation).

Conclusion: Pretraining data curation is a promising layer of defense for open-weight AI systems, but filtered models remain vulnerable to contextual information provision, demonstrating a need for defense-in-depth approaches combining multiple safety measures.

Abstract: Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text – outperforming existing post-training baselines by over an order of magnitude – with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
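The filtering idea can be illustrated with a single-stage toy filter; the paper's pipeline is multi-stage and far more sophisticated, and `filter_corpus` with its whole-word blocklist matching is an assumption for illustration only:

```python
def filter_corpus(documents, blocklist):
    """Drop any document containing a blocklisted term (case-insensitive
    whole-word match). A toy stand-in for the paper's multi-stage,
    scalable pretraining-data filtering pipeline."""
    blocked = {term.lower() for term in blocklist}
    return [doc for doc in documents
            if not blocked & set(doc.lower().split())]
```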

[301] Calibrated and uncertain? Evaluating uncertainty estimates in binary classification models

Aurora Grefsrud, Nello Blaser, Trygve Buanes

Main category: cs.LG

TL;DR: Comparative study of six probabilistic ML algorithms for uncertainty quantification in classification tasks, focusing on calibration and out-of-distribution detection capabilities.

DetailsMotivation: With the increasing complexity of deep learning models, uncertainty quantification has become challenging but essential for scientific validity. The paper aims to provide empirical comparisons of different probabilistic ML algorithms for class probability and uncertainty estimation.

Method: Uses approximate Bayesian inference framework with empirical tests on carefully created synthetic classification datasets. Evaluates six algorithms: neural network ensemble, neural network ensemble with conflictual loss, evidential deep learning, single neural network with Monte Carlo Dropout, Gaussian process classification, and Dirichlet process mixture model.

Result: All algorithms show reasonably good calibration performance on synthetic test sets, but none of the deep learning based algorithms provide uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points.

Conclusion: The study serves as a clarifying example for researchers using or developing uncertainty estimation methods for scientific data-driven modeling, highlighting limitations of current deep learning approaches for out-of-distribution uncertainty quantification.

Abstract: Rigorous statistical methods, including parameter estimation with accompanying uncertainties, underpin the validity of scientific discovery, especially in the natural sciences. With increasingly complex data models such as deep learning techniques, uncertainty quantification has become exceedingly difficult and a plethora of techniques have been proposed. In this case study, we use the unifying framework of approximate Bayesian inference combined with empirical tests on carefully created synthetic classification datasets to investigate qualitative properties of six different probabilistic machine learning algorithms for class probability and uncertainty estimation: (i) a neural network ensemble, (ii) neural network ensemble with conflictual loss, (iii) evidential deep learning, (iv) a single neural network with Monte Carlo Dropout, (v) Gaussian process classification and (vi) a Dirichlet process mixture model. We check if the algorithms produce uncertainty estimates which reflect commonly desired properties, such as being well calibrated and exhibiting an increase in uncertainty for out-of-distribution data points. Our results indicate that all algorithms show reasonably good calibration performance on our synthetic test sets, but none of the deep learning based algorithms provide uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points. We hope our study may serve as a clarifying example for researchers that are using or developing methods of uncertainty estimation for scientific data-driven modeling and analysis.
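Calibration of the kind evaluated here is typically quantified with a binned metric such as expected calibration error (ECE); a minimal binary-classification version (the paper's exact evaluation protocol may differ) looks like:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: bucket predictions by confidence, then average the
    gap between mean confidence and empirical accuracy per bucket,
    weighted by bucket size. Zero means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(acc - conf)
    return ece
```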

[302] Out of Distribution Detection for Efficient Continual Learning in Quality Prediction for Arc Welding

Yannik Hahn, Jan Voets, Antonin Koenigsfeld, Hasan Tercan, Tobias Meisen

Main category: cs.LG

TL;DR: Extends VQ-VAE Transformer for weld quality prediction with OOD detection using autoregressive loss, integrates continual learning for adaptation, and introduces a novel evaluation metric for dynamic manufacturing environments.

DetailsMotivation: Current ML models for weld quality prediction fail under distribution shifts in dynamic manufacturing environments. Need robust OOD detection and adaptive learning to maintain prediction accuracy when process parameters change frequently.

Method: Extends VQ-VAE Transformer architecture by leveraging its autoregressive loss as OOD detection mechanism. Integrates OOD detection with continual learning strategies to trigger updates only when necessary. Introduces novel quantitative metric evaluating both OOD detection and in-distribution performance.

Result: Superior performance compared to conventional reconstruction methods, embedding error-based techniques, and established baselines. Effectively maintains robust quality prediction capabilities across significant distribution shifts in real-world welding scenarios.

Conclusion: Provides explainable and adaptive solution for quality assurance in dynamic manufacturing processes, contributing to robust practical AI systems in industrial environments.

Abstract: Modern manufacturing relies heavily on fusion welding processes, including gas metal arc welding (GMAW). Despite significant advances in machine learning-based quality prediction, current models exhibit critical limitations when confronted with the inherent distribution shifts that occur in dynamic manufacturing environments. In this work, we extend the VQ-VAE Transformer architecture - previously demonstrating state-of-the-art performance in weld quality prediction - by leveraging its autoregressive loss as a reliable out-of-distribution (OOD) detection mechanism. Our approach exhibits superior performance compared to conventional reconstruction methods, embedding error-based techniques, and other established baselines. By integrating OOD detection with continual learning strategies, we optimize model adaptation, triggering updates only when necessary and thereby minimizing costly labeling requirements. We introduce a novel quantitative metric that simultaneously evaluates OOD detection capability while interpreting in-distribution performance. Experimental validation in real-world welding scenarios demonstrates that our framework effectively maintains robust quality prediction capabilities across significant distribution shifts, addressing critical challenges in dynamic manufacturing environments where process parameters frequently change. This research makes a substantial contribution to applied artificial intelligence by providing an explainable and at the same time adaptive solution for quality assurance in dynamic manufacturing processes - a crucial step towards robust, practical AI systems in the industrial environment.
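The update-trigger logic can be sketched as a simple score threshold: calibrate on in-distribution autoregressive losses, flag high-scoring samples as OOD, and only then spend labeling budget on an update. The quantile rule and function names here are illustrative assumptions, not the paper's mechanism:

```python
def fit_ood_threshold(in_dist_scores, quantile=0.95):
    """Pick a threshold from in-distribution autoregressive-loss scores;
    samples scoring above it are treated as out-of-distribution."""
    ranked = sorted(in_dist_scores)
    idx = min(int(quantile * len(ranked)), len(ranked) - 1)
    return ranked[idx]

def is_ood(score, threshold):
    """Flag a sample as OOD when its loss exceeds the calibrated bar."""
    return score > threshold
```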

[303] Morephy-Net: An Evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Neural Operator Learning Networks

Binghang Lu, Changhong Mou, Guang Lin

Main category: cs.LG

TL;DR: Morephy-Net uses evolutionary multi-objective optimization and replica-exchange methods to solve parametric PDEs with noisy data, improving accuracy and uncertainty quantification over existing operator-learning models.

DetailsMotivation: Existing physics-informed neural networks and operator-learning models face challenges in balancing data/operator vs physics residual losses, maintaining robustness under noisy/sparse observations, and providing reliable uncertainty quantification.

Method: Integrates: (1) evolutionary multi-objective optimization treating data/operator and physics residual terms as separate objectives to search Pareto front, (2) replica-exchange stochastic gradient Langevin dynamics for enhanced global exploration and training stability, (3) Bayesian uncertainty quantification from stochastic sampling.

Result: Demonstrates consistent improvements in accuracy, noise robustness, and calibrated uncertainty estimates over standard operator-learning baselines on forward and inverse problems including 1D Burgers equation and time-fractional mixed diffusion-wave equation.

Conclusion: Morephy-Net effectively addresses key challenges in physics-informed operator learning for parametric PDEs in noisy data regimes through multi-objective optimization and enhanced sampling techniques.

Abstract: We propose an evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed operator-learning Networks (Morephy-Net) to solve parametric partial differential equations (PDEs) in noisy data regimes, for both forward prediction and inverse identification. Existing physics-informed neural networks and operator-learning models (e.g., DeepONets and Fourier neural operators) often face three coupled challenges: (i) balancing data/operator and physics residual losses, (ii) maintaining robustness under noisy or sparse observations, and (iii) providing reliable uncertainty quantification. Morephy-Net addresses these issues by integrating: (i) evolutionary multi-objective optimization that treats data/operator and physics residual terms as separate objectives and searches the Pareto front, thereby avoiding ad hoc loss weighting; (ii) replica-exchange stochastic gradient Langevin dynamics to enhance global exploration and stabilize training in non-convex landscapes; and (iii) Bayesian uncertainty quantification obtained from stochastic sampling. We validate Morephy-Net on representative forward and inverse problems, including the one-dimensional Burgers equation and the time-fractional mixed diffusion–wave equation. The results demonstrate consistent improvements in accuracy, noise robustness, and calibrated uncertainty estimates over standard operator-learning baselines.
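Treating the data/operator and physics residual losses as separate objectives means searching for non-dominated (Pareto-optimal) trade-offs rather than fixing ad hoc loss weights. A minimal non-dominated filter, assuming both objectives are minimized, can be sketched as:

```python
def pareto_front(points):
    """Return indices of non-dominated points, where each point is a
    tuple of objective values and all objectives are minimized.
    A point is dominated if another is no worse in every objective
    and strictly better in at least one."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] <= p[k] for k in range(len(p))) and
            any(q[k] < p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i)
        if not dominated:
            front.append(i)
    return front
```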

[304] Randomness and signal propagation in physics-informed neural networks (PINNs): A neural PDE perspective

Jean-Michel Tucny, Abhisek Ganguly, Santosh Ansumali, Sauro Succi

Main category: cs.LG

TL;DR: PINNs exhibit random weight matrices after training; analysis shows they follow random matrix theory predictions, and their signal propagation stability is governed by neural PDE discretization schemes.

DetailsMotivation: To understand why PINNs develop statistically random weight matrices after training and how this affects signal propagation stability and interpretability, which remains poorly understood.

Method: Analyze spectral/statistical properties of trained PINN weights using 1D Burgers’ equation variants; study signal evolution through neural PDEs lens; connect weight matrices to specific discretization schemes of neural PDEs.

Result: Learned weights reside in high-entropy regime consistent with random matrix theory; random/structured weight matrices correspond to specific neural PDE discretizations; numerical stability of these discretizations governs signal propagation stability.

Conclusion: Numerical stability and network architecture shape signal propagation in deep networks, providing explicit connection between random matrix theory, neural PDE descriptions, and PINN behavior.

Abstract: Physics-informed neural networks (PINNs) often exhibit weight matrices that appear statistically random after training, yet their implications for signal propagation and stability remain unsatisfactorily understood, let alone the interpretability. In this work, we analyze the spectral and statistical properties of trained PINN weights using viscous and inviscid variants of the one-dimensional Burgers’ equation, and show that the learned weights reside in a high-entropy regime consistent with predictions from random matrix theory. To investigate the dynamical consequences of such weight structures, we study the evolution of signal features inside a network through the lens of neural partial differential equations (neural PDEs). We show that random and structured weight matrices can be associated with specific discretizations of neural PDEs, and that the numerical stability of these discretizations governs the stability of signal propagation through the network. In particular, explicit unstable schemes lead to degraded signal evolution, whereas stable implicit and higher-order schemes yield well-behaved dynamics for the same underlying neural PDE. Our results offer an explicit example of how numerical stability and network architecture shape signal propagation in deep networks, in relation to random matrix and neural PDE descriptions in PINNs.
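A crude proxy for the "high-entropy regime" diagnosed in the paper is the entropy of the normalized squared-weight distribution: a flat distribution (entropy near the log of the number of weights) suggests a random-like, unstructured matrix. This sketch is illustrative only, not the paper's actual spectral analysis:

```python
import math

def weight_entropy(W):
    """Shannon entropy of the normalized squared-weight distribution.
    The maximum, log(number of weights), is attained by a perfectly
    flat distribution, a crude signature of 'random-like' weights;
    a low value indicates mass concentrated on a few entries."""
    flat = [w * w for row in W for w in row]
    total = sum(flat)
    probs = [v / total for v in flat if v > 0]
    return -sum(p * math.log(p) for p in probs)
```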

[305] GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series

Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille

Main category: cs.LG

TL;DR: GenFacts is a generative framework for creating plausible and interpretable counterfactual explanations for multivariate time series data, outperforming existing methods in plausibility and human interpretability.

DetailsMotivation: Existing counterfactual explanation methods for multivariate time series often produce invalid, implausible, or unintuitive results, limiting their practical usefulness for model transparency and user understanding.

Method: GenFacts uses a class-discriminative variational autoencoder with contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization to generate high-quality counterfactuals.

Result: GenFacts outperforms state-of-the-art baselines by +18.7% in plausibility and achieves highest interpretability scores in human studies on radar gesture data and handwritten letter trajectories.

Conclusion: Plausibility and user-centered interpretability, rather than sparsity alone, are crucial for actionable counterfactual explanations in time series data.

Abstract: Counterfactual explanations aim to enhance model transparency by showing how inputs can be minimally altered to change predictions. For multivariate time series, existing methods often generate counterfactuals that are invalid, implausible, or unintuitive. We introduce GenFacts, a generative framework based on a class-discriminative variational autoencoder. It integrates contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization. We evaluate GenFacts on radar gesture data as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts outperforms state-of-the-art baselines in plausibility (+18.7%) and achieves the highest interpretability scores in a human study. These results highlight that plausibility and user-centered interpretability, rather than sparsity alone, are key to actionable counterfactuals in time series data.
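GenFacts itself optimizes in a VAE latent space under realism constraints; as a much simpler illustration of its prototype-based initialization, one can interpolate from an input toward a target-class prototype until the prediction flips. Everything here (`prototype_counterfactual`, input-space interpolation) is a toy assumption, not the paper's method:

```python
def prototype_counterfactual(x, predict, prototype, steps=100):
    """Walk from input x toward a target-class prototype and return
    the first point whose prediction flips, or None if none does.
    A toy, input-space stand-in for prototype-initialized
    counterfactual search."""
    original = predict(x)
    for step in range(1, steps + 1):
        alpha = step / steps
        candidate = [(1 - alpha) * xi + alpha * pi
                     for xi, pi in zip(x, prototype)]
        if predict(candidate) != original:
            return candidate
    return None
```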

[306] Learning Admissible Heuristics for A*: Theory and Practice

Ehsan Futuhi, Nathan R. Sturtevant

Main category: cs.LG

TL;DR: Learning admissible heuristics for A-star search via constrained optimization with Cross-Entropy Admissibility loss, achieving near-admissible heuristics with strong guidance on Rubik’s Cube and providing generalization bounds for neural network heuristics.

DetailsMotivation: Deep learning approaches for heuristic functions often disregard admissibility (which guarantees solution optimality) and provide limited guarantees on generalization beyond training data. The paper aims to address both limitations.

Method: Poses heuristic learning as constrained optimization problem and introduces Cross-Entropy Admissibility (CEA) loss function that enforces admissibility during training. Studies sample complexity of learning heuristics using PDB abstractions and graph structural properties, replacing general hypothesis class with ReLU neural networks.

Result: On Rubik’s Cube domain, the method yields near-admissible heuristics with significantly stronger guidance than compressed pattern database (PDB) heuristics. Provides theoretical bounds on number of training samples needed for A-star to generalize, with bounds depending primarily on network width/depth rather than graph size.

Conclusion: The paper successfully addresses limitations of deep learning approaches for heuristic functions by enforcing admissibility and providing generalization guarantees, with applications to combinatorial search problems like Rubik’s Cube.

Abstract: Heuristic functions are central to the performance of search algorithms such as A-star, where admissibility - the property of never overestimating the true shortest-path cost - guarantees solution optimality. Recent deep learning approaches often disregard admissibility and provide limited guarantees on generalization beyond the training data. This paper addresses both of these limitations. First, we pose heuristic learning as a constrained optimization problem and introduce Cross-Entropy Admissibility (CEA), a loss function that enforces admissibility during training. On the Rubik’s Cube domain, this method yields near-admissible heuristics with significantly stronger guidance than compressed pattern database (PDB) heuristics. Theoretically, we study the sample complexity of learning heuristics. By leveraging PDB abstractions and the structural properties of graphs such as the Rubik’s Cube, we tighten the bound on the number of training samples needed for A-star to generalize. Replacing a general hypothesis class with a ReLU neural network gives bounds that depend primarily on the network’s width and depth, rather than on graph size. Using the same network, we also provide the first generalization guarantees for goal-dependent heuristics.
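The exact CEA loss is cross-entropy based and defined in the paper; a generic way to see the underlying idea of discouraging overestimation (the direction that violates admissibility) is an asymmetric squared error, sketched here under assumed names:

```python
def admissibility_penalized_loss(pred, target, over_weight=10.0):
    """Asymmetric squared error: overestimates (pred > target) are
    penalized over_weight times more heavily than underestimates,
    nudging a learned heuristic toward admissibility. Illustrative
    only; the paper's CEA loss is cross-entropy based."""
    total = 0.0
    for p, t in zip(pred, target):
        err = p - t
        w = over_weight if err > 0 else 1.0
        total += w * err * err
    return total / len(pred)
```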

[307] Flock: A Knowledge Graph Foundation Model via Learning on Random Walks

Jinwoo Kim, Xingyue Huang, Krzysztof Olejniczak, Kyungbin Min, Michael Bronstein, Seunghoon Hong, İsmail İlkan Ceylan

Main category: cs.LG

TL;DR: Flock introduces probabilistic node-relation equivariance for knowledge graph foundation models, enabling better zero-shot link prediction by breaking structural symmetries while preserving equivariance in distribution.

DetailsMotivation: Current knowledge graph foundation models use deterministic equivariance, which limits their expressive power by preventing them from distinguishing structurally similar but semantically distinct relations, hindering zero-shot link prediction performance.

Method: Flock uses probabilistic node-relation equivariance with structured randomness to break symmetries at inference time. It iteratively samples random walks, encodes them into sequences, embeds them with a sequence model, and aggregates node/relation representations through learned pooling.

Result: Flock perfectly solves the new diagnostic dataset Petals where current KGFMs fail, and achieves state-of-the-art performance on entity and relation prediction tasks across 54 diverse knowledge graphs.

Conclusion: Probabilistic node-relation equivariance enables more expressive knowledge graph foundation models that can better handle zero-shot link prediction by distinguishing structurally similar but semantically distinct relations.

Abstract: We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize to novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, which enables them to learn structural properties of nodes and relations that transfer to novel KGs with similar structure. However, the conventional notion of deterministic equivariance inherently limits the expressive power of KGFMs, as it prevents them from distinguishing relations that are structurally similar but semantically distinct. To overcome this limitation, we propose to leverage probabilistic node-relation equivariance, which preserves equivariance in distribution while using structured randomness to break symmetries at inference time. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences, embeds them with a sequence model, and aggregates node and relation representations through learned pooling. Flock respects probabilistic node-relation equivariance and, crucially, is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals on which current KGFMs fail, and achieves state-of-the-art performance on entity and relation prediction tasks across 54 KGs from diverse domains. Code is available at https://github.com/jw9730/flock.
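The first stage of Flock, sampling random walks that emit alternating node and relation tokens, can be sketched as follows; the sequence-model embedding and learned pooling are omitted, and `kg_random_walk` is an assumed name:

```python
import random

def kg_random_walk(triples, start, length, seed=0):
    """Sample a random walk over a knowledge graph given as
    (head, relation, tail) triples, recording alternating node and
    relation tokens. Walks stop early at nodes with no outgoing edge."""
    rng = random.Random(seed)
    adjacency = {}
    for head, rel, tail in triples:
        adjacency.setdefault(head, []).append((rel, tail))
    walk, node = [start], start
    for _ in range(length):
        if node not in adjacency:
            break
        rel, node = rng.choice(adjacency[node])
        walk.extend([rel, node])
    return walk
```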

[308] TabImpute: Universal Zero-Shot Imputation for Tabular Data

Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, Raaz Dwivedi

Main category: cs.LG

TL;DR: TabImpute: A pre-trained transformer for zero-shot tabular data imputation that requires no fitting or hyperparameter tuning, outperforming existing methods across diverse real-world datasets.

DetailsMotivation: Existing tabular data imputation methods suffer from large performance variance across domains and require time-consuming hyperparameter tuning, especially problematic for small datasets with limited information. No universal imputation method exists that works well across different real-world scenarios.

Method: Builds on TabPFN foundation model; introduces entry-wise featurization for 100x speedup, synthetic training data generation with diverse missingness patterns, and MissBench benchmark with 42 OpenML tables and 13 missingness patterns across medicine, finance, and engineering domains.

Result: TabImpute delivers accurate and fast zero-shot imputations without fitting or hyperparameter tuning at inference time, showing robust performance compared to numerous established imputation methods across diverse real-world domains.

Conclusion: TabImpute provides a universal, zero-shot solution for tabular data imputation that outperforms existing methods and addresses the performance variance problem, particularly benefiting small datasets.

Abstract: Missing data is a widespread problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks, but due to each method’s large variance in performance across real-world domains and time-consuming hyperparameter tuning, no universal imputation method exists. This performance variance is particularly pronounced in small datasets, where the models have the least amount of information. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations, requiring no fitting or hyperparameter tuning at inference time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, enabling a 100x speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating a diverse set of missingness patterns to enhance accuracy on real-world missing data problems, and (iii) MissBench, a comprehensive benchmark with 42 OpenML tables and 13 new missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute’s robust performance compared to numerous established imputation methods.
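For context, the simplest of the established imputers that methods like TabImpute are benchmarked against is column-mean imputation; a minimal version, with `None` marking missing entries, is:

```python
def mean_impute(table):
    """Replace each missing entry (None) with its column mean,
    the classic baseline for tabular imputation. Columns that are
    entirely missing fall back to 0.0."""
    cols = list(zip(*table))
    means = []
    for col in cols:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in table]
```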

[309] Syndrome-Flow Consistency Model Achieves One-step Denoising Error Correction Codes

Haoyu Lei, Chin Wa Lau, Kaiwen Zhou, Nian Guo, Farzan Farnia

Main category: cs.LG

TL;DR: ECCFM is a novel consistency model framework for error correction codes that enables one-step neural decoding by re-parameterizing the reverse PF-ODE with soft-syndrome conditioning to handle discrete decoding trajectories.

DetailsMotivation: Existing neural decoders for ECC face a trade-off: diffusion models achieve state-of-the-art performance but are slow due to iterative sampling, while faster methods sacrifice accuracy. Consistency models could enable one-step decoding but struggle with the discrete, non-smooth nature of ECC decoding trajectories.

Method: Proposes Error Correction Syndrome-Flow Consistency Model (ECCFM) that re-parameterizes the reverse Probability Flow ODE using soft-syndrome conditioning to create smooth trajectories from noisy signals to original codewords. This model-agnostic framework enables single-step decoding while handling the discrete nature of ECC.

Result: ECCFM achieves lower bit-error-rate (BER) and frame-error-rate (FER) than transformer-based decoders, with inference speeds 30x to 100x faster than iterative denoising diffusion decoders across multiple benchmarks.

Conclusion: ECCFM successfully bridges the gap between accuracy and efficiency in neural ECC decoding by enabling high-fidelity one-step decoding through a novel consistency model approach that handles discrete decoding trajectories.

Abstract: Error Correction Codes (ECC) are fundamental to reliable digital communication, yet designing neural decoders that are both accurate and computationally efficient remains challenging. Recent denoising diffusion decoders achieve state-of-the-art performance, but their iterative sampling limits practicality in low-latency settings. To bridge this gap, consistency models (CMs) offer a potential path to high-fidelity one-step decoding. However, applying CMs to ECC presents a significant challenge: the discrete nature of error correction means the decoding trajectory is highly non-smooth, making it incompatible with a simple continuous timestep parameterization. To address this, we re-parameterize the reverse Probability Flow Ordinary Differential Equation (PF-ODE) by soft-syndrome condition, providing a smooth trajectory of signal corruption. Building on this, we propose the Error Correction Syndrome-Flow Consistency Model (ECCFM), a model-agnostic framework designed specifically for ECC task, ensuring the model learns a smooth trajectory from any noisy signal directly to the original codeword in a single step. Across multiple benchmarks, ECCFM attains lower bit-error-rate (BER) and frame-error-rate (FER) than transformer-based decoders, while delivering inference speeds 30x to 100x faster than iterative denoising diffusion decoders.
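The quantity ECCFM relaxes is the classical hard-decision syndrome s = H·x (mod 2), which is all-zero exactly when x is a codeword. A minimal version (ECCFM itself conditions on a soft, real-valued relaxation of this):

```python
def syndrome(H, word):
    """Hard-decision syndrome s = H * word (mod 2) for a binary
    parity-check matrix H; the all-zero syndrome certifies that
    word is a valid codeword."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]
```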

[310] Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability

Tao Tao, Maissam Barkeshli

Main category: cs.LG

TL;DR: Transformers can learn to predict sequences from complex pseudo-random number generators (PCGs) in-context, even with single-bit outputs, showing scaling laws and curriculum learning requirements.

DetailsMotivation: To understand Transformer models' capabilities in learning complex pseudo-random number generation patterns, particularly Permuted Congruential Generators, which are more challenging than simpler linear generators due to bit-wise operations.

Method: Train Transformer models on sequences generated by various PCG variants, scaling up to moduli of 2^22 with up to 50M parameters and 5B tokens. Analyze prediction accuracy, scaling laws, and embedding representations.

Result: Transformers successfully predict PCG sequences beyond classical attacks, even with single-bit outputs. Models can jointly learn multiple PRNGs. Scaling law shows required context length grows as √m. Larger moduli require curriculum learning from smaller moduli. Embeddings show bitwise rotationally-invariant clustering.

Conclusion: Transformers demonstrate surprising ability to learn complex pseudo-random patterns, revealing insights about scaling laws, curriculum learning requirements, and novel representation learning phenomena in sequence prediction tasks.

Abstract: We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
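The generator family under study is PCG; a minimal Python port of the reference pcg32 (XSH-RR output permutation over a 64-bit LCG state) shows the bit-level shifts, XORs, rotations, and truncation that make these sequences hard targets:

```python
MASK64 = (1 << 64) - 1
MULT = 6364136223846793005  # reference pcg32 LCG multiplier

class PCG32:
    """Minimal pcg32 (PCG-XSH-RR): 64-bit LCG state, 32-bit output
    produced by an xorshift followed by a data-dependent rotation."""
    def __init__(self, seed, seq=1442695040888963407):
        self.inc = ((seq << 1) | 1) & MASK64  # increment must be odd
        self.state = 0
        self.next()
        self.state = (self.state + seed) & MASK64
        self.next()

    def next(self):
        old = self.state
        self.state = (old * MULT + self.inc) & MASK64
        # XSH: xorshift the high bits down, truncate to 32 bits
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        # RR: rotate right by the top 5 bits of the old state
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF
```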

[311] BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training

Luca Colombo, Fabrizio Pittorino, Daniele Zambon, Carlo Baldassi, Manuel Roveri, Cesare Alippi

Main category: cs.LG

TL;DR: Binary Error Propagation (BEP) enables end-to-end binary training of neural networks using only bitwise operations, achieving significant accuracy improvements for both MLPs and RNNs.

DetailsMotivation: Current binary neural network training methods require maintaining full-precision parameters and floating-point arithmetic during backpropagation, losing the efficiency benefits of binary operations during training. Existing local learning approaches can't handle global credit assignment in multi-layer architectures.

Method: BEP introduces a principled, discrete analog of the backpropagation chain rule that propagates binary error signals backward through multiple layers. All forward and backward computations use only bitwise operations, enabling true end-to-end binary training.

Result: BEP achieves gains of up to +6.89% test accuracy for multi-layer perceptrons and +10.57% for recurrent neural networks compared to existing methods, while operating entirely on binary variables.

Conclusion: BEP is the first solution enabling end-to-end binary training for RNN architectures and provides a principled approach for binary error propagation that maintains computational efficiency throughout training.

Abstract: Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to +6.89% and +10.57% in test accuracy, respectively. The proposed algorithm is released as an open-source repository.
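The abstract does not spell out the BEP update rule. As a loose illustration only, the sketch below shows what a sign-only (binary) error propagated backward through one binary linear layer could look like; the layer names and the `sign(W^T e)` rule are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def binarize(x):
    """Map values to {-1, +1} (sign, with ties sent to +1)."""
    return np.where(x >= 0, 1, -1)

def forward(Wb, x_b):
    """Binary linear layer: popcount-style sum followed by a sign activation."""
    return binarize(Wb @ x_b)

def backward_error(Wb, err_b):
    """Propagate a binary error vector backward by taking the sign of the
    transposed product -- a discrete stand-in for the chain rule."""
    return binarize(Wb.T @ err_b)

# Toy example: binary weights, input, and error signal.
rng = np.random.default_rng(0)
Wb = binarize(rng.standard_normal((4, 6)))
x_b = binarize(rng.standard_normal(6))
y_b = forward(Wb, x_b)
e_b = binarize(rng.standard_normal(4))   # binary error from the layer above
e_prev = backward_error(Wb, e_b)         # binary error for the layer below
```

Because every quantity stays in {-1, +1}, both passes can in principle be realized with XNOR/popcount hardware, which is the efficiency argument the abstract makes.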

[312] Is nasty noise actually harder than malicious noise?

Guy Blanc, Yizhi Huang, Tal Malkin, Rocco A. Servedio

Main category: cs.LG

TL;DR: The paper analyzes computational learning algorithms under two adversarial noise models (malicious and nasty noise) in distribution-independent vs. fixed-distribution settings, showing strong equivalence in the former but arbitrarily large separation in the latter.

DetailsMotivation: To understand the relative capabilities and limitations of computationally efficient learning algorithms under challenging adversarial noise models, particularly comparing malicious noise (random corruption) vs. nasty noise (adversarial corruption) in different learning settings.

Method: Theoretical analysis of Boolean function learning under two noise models: malicious noise (adversary corrupts random subset) and nasty noise (adversary corrupts adversarially chosen subset). Examines both distribution-independent and fixed-distribution settings, and analyzes a specific class of algorithms called ICE (ignore contradictory examples).

Result: 1) Distribution-independent setting: Strong equivalence between the malicious and nasty noise models. 2) Fixed-distribution setting: Arbitrarily large separation between the two noise models under a standard cryptographic assumption. 3) For ICE algorithms: Malicious and nasty noise are equivalent up to a factor of 2 in the noise rate, and this factor is necessary.

Conclusion: The relationship between malicious and nasty noise depends crucially on the learning setting: they are essentially equivalent for distribution-independent learning but can be arbitrarily different for fixed-distribution learning, with ICE algorithms providing a bridge between them with a tight factor of 2.

Abstract: We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions: malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner. We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings: For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class ${\cal C}$ of functions is efficiently learnable in the presence of $η$-rate malicious noise, then it is also efficiently learnable in the presence of $η$-rate nasty noise. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value $r$ there exists a concept class for which there is a ratio of $r$ between the rate $η_{malicious}$ of malicious noise that polynomial-time learning algorithms can tolerate, versus the rate $η_{nasty}$ of nasty noise that such learning algorithms can tolerate. To offset the negative result for the fixed-distribution setting, we define a broad and natural class of algorithms, namely those that ignore contradictory examples (ICE). We show that for these algorithms, malicious noise and nasty noise are equivalent up to a factor of two in the noise rate: Any efficient ICE learner that succeeds with $η$-rate malicious noise can be converted to an efficient learner that succeeds with $η/2$-rate nasty noise. We further show that the above factor of two is necessary, again under a standard cryptographic assumption.

[313] Improving Variational Autoencoder using Random Fourier Transformation: An Aviation Safety Anomaly Detection Case-Study

Ata Akbari Asanjan, Milad Memarzadeh, Bryan Matthews, Nikunj Oza

Main category: cs.LG

TL;DR: Random Fourier Transformation improves autoencoder training by enabling simultaneous learning of low and high frequencies, unlike conventional DNNs that learn frequencies sequentially.

DetailsMotivation: To improve the training process and inference of deep neural networks (specifically autoencoders and variational autoencoders) for anomaly detection by addressing frequency learning limitations in conventional DNNs.

Method: Use Random Fourier Transformation (RFT) in autoencoder training, analyze training behavior using Frequency Principle analysis, introduce trainable variant of RFT, and test on synthetic datasets and aviation safety dataset (Dashlink).

Result: Models with Fourier transformation outperform conventional counterparts; RFT enables simultaneous learning of low and high frequencies; trainable RFT shows inconclusive benefits compared to random variant.

Conclusion: Fourier transformation improves autoencoder performance for anomaly detection by changing frequency learning dynamics, though optimal implementation (trainable vs random) requires further investigation.

Abstract: In this study, we focus on the training process and inference improvements of deep neural networks (DNNs), specifically Autoencoders (AEs) and Variational Autoencoders (VAEs), using Random Fourier Transformation (RFT). We further explore the role of RFT in model training behavior using Frequency Principle (F-Principle) analysis and show that models with RFT tend to learn low frequency and high frequency at the same time, whereas conventional DNNs start from low frequency and gradually learn (if successful) high-frequency features. We focus on reconstruction-based anomaly detection using autoencoder and variational autoencoder and investigate the RFT’s role. We also introduce a trainable variant of RFT that uses the existing computation graph to train the expansion of RFT instead of it being random. We showcase our findings with two low-dimensional synthetic datasets for data representation, and an aviation safety dataset, called Dashlink, for high-dimensional reconstruction-based anomaly detection. The results indicate the superiority of models with Fourier transformation compared to the conventional counterpart and remain inconclusive regarding the benefits of using trainable Fourier transformation in contrast to the Random variant.
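The RFT expansion is in the spirit of standard random Fourier features: project the input through a fixed random matrix and take cosines. A minimal sketch of such an input mapping, assuming the common Rahimi-Recht formulation rather than the paper's exact variant (in the trainable variant, `W` and `b` would become learnable parameters in the computation graph):

```python
import numpy as np

def random_fourier_features(X, n_features=256, sigma=1.0, seed=0):
    """Random Fourier feature map: z(x) = sqrt(2/D) * cos(W x + b),
    with W ~ N(0, 1/sigma^2) and b ~ Uniform[0, 2*pi).
    Feeding z(x) instead of x into the (V)AE encoder is one common way
    to apply an RFT-style input expansion."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

Because the cosines inject high-frequency components of the input directly into the first layer, the network no longer has to discover them late in training, which is the F-Principle argument the abstract makes.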

[314] Efficient Personalization of Generative Models via Optimal Experimental Design

Guy Schacht, Ziyad Sheebaelhamd, Riccardo De Santi, Mojmír Mutný, Andreas Krause

Main category: cs.LG

TL;DR: A novel approach using optimal experimental design to select the most informative preference queries for learning human preferences efficiently, applied to personalizing text-to-image generative models.

DetailsMotivation: Human feedback for preference learning is costly and time-consuming, creating demand for data-efficient query selection methods to align generative models with user needs.

Method: Formulates preference query selection as maximizing information about the underlying latent preference model, develops a convex optimization formulation, and introduces the ED-PBRL algorithm, which can efficiently construct structured queries (images/text).

Result: Empirical results show the framework requires fewer preference queries compared to random selection when personalizing text-to-image generative models to user-specific styles.

Conclusion: The proposed optimal experimental design approach enables efficient preference learning from limited human feedback, particularly valuable for aligning complex generative models with user preferences.

Abstract: Preference learning from human feedback has the ability to align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can elucidate the latent reward function modeling user preferences efficiently. We formulate the problem of preference query selection as the one that maximizes the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm ED-PBRL that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically present the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires less preference queries compared to random query selection.
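The "most informative queries" idea can be illustrated with the classic greedy D-optimal design heuristic: pick the candidate query whose feature vector most increases the log-determinant of the accumulated information matrix. This is a generic sketch of optimal experimental design, not the paper's convex formulation or the ED-PBRL algorithm.

```python
import numpy as np

def greedy_d_optimal(candidates, k, reg=1e-3):
    """Greedily pick k query feature vectors maximizing the log-determinant
    of the information matrix reg*I + sum_i x_i x_i^T -- the standard
    D-optimal surrogate for 'most informative'."""
    d = candidates.shape[1]
    A = reg * np.eye(d)
    chosen = []
    for _ in range(k):
        # Gain of each candidate = log det after adding its outer product.
        gains = [np.linalg.slogdet(A + np.outer(x, x))[1] for x in candidates]
        best = int(np.argmax(gains))
        A += np.outer(candidates[best], candidates[best])
        chosen.append(best)
    return chosen, A
```

In the preference-learning setting, each candidate vector would encode the feature difference between the two items in a pairwise query, so high-gain queries are those that pin down the latent reward model fastest.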

[315] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates

Nilin Abrahamsen

Main category: cs.LG

TL;DR: PROMA is a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of policy gradients using two variants: accumulation-based and intra-microbatch projection.

DetailsMotivation: The paper aims to improve policy optimization methods by better controlling KL divergence during training, addressing issues with high-variance gradient components that can lead to unstable training and poor convergence.

Method: Two variants: 1) The accumulation-based variant projects the running gradient orthogonal to the sequence-wise log-probability gradients of each microbatch; 2) the intra-microbatch variant applies a factored projection, built from dominant subspaces of activations and gradient outputs, independently within each microbatch, making it compatible with standard data-parallel training.

Result: Empirical results show the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.

Conclusion: PROMA provides effective KL divergence control through gradient projection techniques, with different variants offering trade-offs between KL control precision and overall validation performance.

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
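The accumulation-based variant's core operation, projecting the running gradient onto the orthogonal complement of the per-sequence log-probability gradients, can be sketched directly. This is illustrative only, using a dense QR factorization in place of whatever factorization the note actually uses.

```python
import numpy as np

def project_orthogonal(g, directions):
    """Return g with its components along span(directions) removed, so the
    update cannot move log-probabilities along those high-variance axes."""
    D = np.stack(directions, axis=1)   # (dim, n_dirs) matrix of log-prob gradients
    Q, _ = np.linalg.qr(D)             # orthonormal basis for their span
    return g - Q @ (Q.T @ g)           # orthogonal-complement projection
```

Removing the gradient components along the sequence log-probability directions is what limits how far each step can shift the policy's likelihoods, which is the mechanism behind the KL control claimed in the abstract.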

[316] GenDA: Generative Data Assimilation on Complex Urban Areas via Classifier-Free Diffusion Guidance

Francisco Giral, Álvaro Manzano, Ignacio Gómez, Ricardo Vinuesa, Soledad Le Clainche

Main category: cs.LG

TL;DR: GenDA: A generative data assimilation framework using multiscale graph-based diffusion to reconstruct high-resolution urban wind fields from sparse sensor data, with geometry-aware priors and observational constraints.

DetailsMotivation: Urban wind flow reconstruction is crucial for air quality assessment, heat dispersion, and pedestrian comfort, but existing methods struggle with sparse sensor data and complex urban geometries.

Method: Multiscale graph-based diffusion architecture trained on CFD simulations, using classifier-free guidance as learned posterior reconstruction: unconditional branch learns geometry-aware flow prior, sensor-conditioned branch injects observational constraints during sampling.

Result: Reduces relative root-mean-square error by 25-57% and increases structural similarity index by 23-33% compared to supervised GNN baselines and classical reduced-order data assimilation methods.

Conclusion: GenDA provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains, enabling obstacle-aware reconstruction and generalization across unseen geometries.

Abstract: Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs high-resolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighbourhood in Bristol, United Kingdom, at a characteristic Reynolds number of $\mathrm{Re}\approx2\times10^{7}$, featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.
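The guidance mechanism itself is the standard classifier-free guidance combination of the unconditional score (the geometry-aware prior) and the sensor-conditioned score, applied at each sampling step. A minimal sketch of that blend:

```python
def cfg_score(score_uncond, score_cond, guidance_w):
    """Classifier-free guidance: interpolate/extrapolate between the
    unconditional prior score and the sensor-conditioned score.
    guidance_w = 0 recovers the pure prior; w > 1 pushes samples
    harder toward the observations."""
    return score_uncond + guidance_w * (score_cond - score_uncond)
```

In GenDA's reading, `score_uncond` encodes what flows around this geometry look like in general, while the `(score_cond - score_uncond)` term injects the constraint from the sparse sensors.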

[317] Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF

Wang Zixian

Main category: cs.LG

TL;DR: OPO is a unified theoretical framework for LLM alignment based on work-dissipation principles, with three equivalent interpretations and superior performance on mathematical reasoning tasks.

DetailsMotivation: To provide a unified theoretical foundation for large language model alignment that connects different optimization perspectives through a single variational principle grounded in physics-inspired concepts.

Method: Orthogonalized Policy Optimization (OPO) formulates policy updates as constrained proximal responses maximizing external work while paying intrinsic dissipation costs in chi-square ratio geometry. It has three equivalent interpretations: mirror descent in ratio space, Hilbert-space projection, and linear-response law from statistical mechanics.

Result: OPO outperforms GRPO, GSPO, and DAPO on mathematical reasoning tasks while maintaining healthy gradient dynamics throughout training. The framework reveals that advantage z-score normalization is a conservation-law projection rather than a heuristic.

Conclusion: OPO provides a unified theoretical account of LLM alignment with strong empirical performance, connecting optimization geometry, sampling geometry, and statistical mechanics principles in a coherent framework.

Abstract: We present Orthogonalized Policy Optimization (OPO), a unified theoretical account of large language model alignment grounded in a work-dissipation principle. The policy update is characterized as a constrained proximal response that maximizes external work induced by an alpha-escort sampling field, while paying an intrinsic dissipation cost given by a quadratic fluctuation energy in chi-square ratio geometry. This single variational principle admits three equivalent interpretations: (i) a mirror-descent step with a Euclidean mirror map in ratio space, (ii) a Hilbert-space projection via the orthogonal projection theorem in L2(pi_k), and (iii) a linear-response law from near-equilibrium statistical mechanics. Their convergence to the same closed-form update confirms that OPO is the unique quadratic proximal response within ratio geometry. The framework cleanly decouples sampling geometry (alpha) from optimization geometry (mu), yields a constant Hessian and non-saturating linear gradients, and reveals that advantage z-score normalization is not a heuristic but a conservation-law projection. Experiments on mathematical reasoning tasks demonstrate that OPO outperforms GRPO, GSPO, and DAPO while maintaining healthy gradient dynamics throughout training.
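The advantage z-score normalization that the abstract reinterprets as a conservation-law projection is the usual group-wise operation over a set of sampled rewards; a minimal sketch:

```python
import numpy as np

def zscore_advantages(rewards, eps=1e-8):
    """Group-wise advantage z-score normalization: center by the group mean
    and scale by the group standard deviation. In OPO's reading this is a
    projection enforcing a conservation law, not merely a variance trick."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Centering removes the component of the advantage that would do no "work" (it shifts all log-probabilities uniformly), which is what makes the projection interpretation natural.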

[318] Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

Gong Gao, Weidong Zhao, Xianhui Liu, Ning Jia

Main category: cs.LG

TL;DR: IRA algorithm improves online RL efficiency via Q-representation discrepancy evolution, greedy action guidance, and instant policy updates for better exploration and faster exploitation.

DetailsMotivation: Existing value-based online RL algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates, limiting learning efficiency and final performance.

Method: Proposes Instant Retrospect Action (IRA) with three key components: Q-Representation Discrepancy Evolution (RDE) for discriminative representations of neighboring state-action pairs, Greedy Action Guidance (GAG) via backtracking historical actions for policy constraints, and Instant Policy Update (IPU) mechanism to increase policy update frequency.

Result: IRA significantly improves learning efficiency and final performance on eight MuJoCo continuous control tasks, with early-stage training conservatism helping alleviate overestimation bias in value-based RL.

Conclusion: IRA addresses key challenges in online RL through representation learning, policy constraints, and update frequency optimization, demonstrating substantial performance gains in continuous control tasks.

Abstract: Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit method to policy constraints by enabling Greedy Action Guidance (GAG). This is achieved through backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action value estimates and learning to design a fast-adaptable policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We further discover that the early-stage training conservatism of the IRA method can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA can significantly improve the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks. The code is available at https://github.com/2706853499/IRA.

[319] Don’t Forget Its Variance! The Minimum Path Variance Principle for Accurate and Stable Score-Based Models

Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang, John Paisley, Delu Zeng

Main category: cs.LG

TL;DR: MinPV principle resolves score-based method paradox by minimizing path variance of score function, achieving state-of-the-art results through data-adaptive path parameterization.

DetailsMotivation: Score-based methods face a paradox: theoretically they should be path-independent, but in practice they show path dependence. This discrepancy arises because practical training objectives differ from the ideal ground-truth objective by an overlooked term - the path variance of the score function.

Method: Proposes the MinPV (Minimum Path Variance) Principle to minimize path variance. Derives a closed-form expression for the variance to make optimization tractable. Parameterizes paths using a flexible Kumaraswamy Mixture Model to learn data-adaptive, low-variance paths without heuristic manual selection.

Result: The method yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks. Provides a general framework for optimizing score-based interpolation.

Conclusion: The MinPV principle resolves the path dependence paradox in score-based methods by explicitly minimizing path variance, leading to improved performance and a principled optimization framework for score-based interpolation.

Abstract: Score-based methods are powerful across machine learning, but they face a paradox: theoretically path-independent, yet practically path-dependent. We resolve this by proving that practical training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the score function. We propose the MinPV (Minimum Path Variance) Principle to minimize this path variance. Our key contribution is deriving a closed-form expression for the variance, making optimization tractable. By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns data-adaptive, low-variance paths without heuristic manual selection. This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks and providing a general framework for optimizing score-based interpolation.

[320] Green-NAS: A Global-Scale Multi-Objective Neural Architecture Search for Robust and Efficient Edge-Native Weather Forecasting

Md Muhtasim Munif Fahim, Soyda Humyra Yesmin, Saiful Islam, Md. Palash Bin Faruque, Md. A. Salam, Md. Mahfuz Uddin, Samiul Islam, Tofayel Ahmed, Md. Binyamin, Md. Rezaul Karim

Main category: cs.LG

TL;DR: Green-NAS is a multi-objective neural architecture search framework for low-resource environments that optimizes for both accuracy and efficiency, specifically minimizing computational energy costs and carbon footprints while maintaining competitive weather forecasting performance.

DetailsMotivation: The paper addresses the need for sustainable AI deployment in low-resource environments, particularly for weather forecasting applications. It focuses on reducing computational energy costs and carbon footprints while maintaining accuracy, adhering to 'Green AI' principles.

Method: Green-NAS uses a multi-objective neural architecture search framework that simultaneously optimizes for model accuracy and efficiency. It finds lightweight models with minimal parameters through an optimization process that explicitly minimizes computational energy costs and carbon footprints.

Result: The best-performing model (Green-NAS-A) achieved RMSE of 0.0988 (within 1.4% of manually tuned baseline) using only 153k parameters - 239 times fewer than other globally applied weather forecasting models like GraphCast. Transfer learning improved forecasting accuracy by approximately 5.2% compared to training new models for each city.

Conclusion: Green-NAS demonstrates that sustainable AI deployment is achievable through multi-objective optimization, producing highly efficient models with minimal computational footprint while maintaining competitive accuracy for weather forecasting applications.

Abstract: We introduce Green-NAS, a multi-objective NAS (neural architecture search) framework designed for low-resource environments using weather forecasting as a case study. By adhering to ‘Green AI’ principles, the framework explicitly minimizes computational energy costs and carbon footprints, prioritizing sustainable deployment over raw computational scale. The Green-NAS architecture search method is optimized for both model accuracy and efficiency to find lightweight models with high accuracy and very few model parameters; this is accomplished through an optimization process that simultaneously optimizes multiple objectives. Our best-performing model, Green-NAS-A, achieved an RMSE of 0.0988 (i.e., within 1.4% of our manually tuned baseline) using only 153k model parameters, which is 239 times fewer than other globally applied weather forecasting models, such as GraphCast. In addition, we also describe how the use of transfer learning will improve the weather forecasting accuracy by approximately 5.2%, in comparison to a naive approach of training a new model for each city, when there is limited historical weather data available for that city.

[321] On the Role of Iterative Computation in Reinforcement Learning

Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach

Main category: cs.LG

TL;DR: This paper formalizes compute-bounded policies in RL, showing that policies using more compute can solve harder problems and generalize better to longer-horizon tasks than policies with fewer parameters but less compute.

DetailsMotivation: The paper addresses the fundamental question of how computational resources affect RL policy learning. Current RL frameworks conflate compute and parameters, making it impossible to formally analyze how additional compute affects policy performance independent of parameter count.

Method: The authors formalize compute-bounded policies theoretically and propose a minimal architecture that can use variable amounts of compute. The approach builds on prior work in algorithmic learning and model-free planning, allowing policies to leverage additional computational resources without increasing parameter count.

Result: Experiments across 31 different RL tasks show that: (1) the proposed architecture achieves stronger performance simply by using more compute, and (2) demonstrates better generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.

Conclusion: Compute is a distinct resource from parameters in RL, and policies can benefit from additional compute even with fixed parameter counts. This provides a formal framework for understanding computational resource allocation in RL systems.

Abstract: How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.
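A weight-tied recurrent block applied a variable number of times is one minimal way to decouple compute from parameter count, in the spirit of the architecture the abstract describes. This is a sketch under that assumption, not the authors' exact model.

```python
import numpy as np

def recurrent_policy_logits(x, W_in, W_rec, W_out, n_steps):
    """Weight-tied policy: the same recurrent block is applied n_steps
    times, so test-time compute can be scaled without adding parameters."""
    h = np.tanh(W_in @ x)
    for _ in range(n_steps):
        h = np.tanh(W_rec @ h + W_in @ x)   # shared weights at every iteration
    return W_out @ h
```

The parameter count is fixed by `W_in`, `W_rec`, and `W_out`, while `n_steps` sets the compute budget; the paper's claim is that larger `n_steps`-style budgets let the same parameters solve harder, longer-horizon problems.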

[322] Achieving Optimal Static and Dynamic Regret Simultaneously in Bandits with Deterministic Losses

Jian Qian, Chen-Yu Wei

Main category: cs.LG

TL;DR: An algorithm achieving optimal static and dynamic regret simultaneously against oblivious adversaries with deterministic losses, revealing fundamental separation between adaptive and oblivious adversaries.

DetailsMotivation: Previous work showed simultaneous optimality for static and dynamic regret is impossible against adaptive adversaries, but it was unknown if possible against oblivious adversaries. The paper aims to investigate this possibility and provide new insights into simultaneous regret optimization.

Method: Extends the impossibility result to deterministic losses, then presents an algorithm that uses negative static regret to compensate for the exploration overhead and leverages Blackwell approachability to jointly control both regrets.

Result: Achieves optimal static and dynamic regret simultaneously against oblivious adversaries with deterministic losses, demonstrating fundamental separation between adaptive and oblivious adversaries.

Conclusion: Simultaneous optimality for static and dynamic regret is possible against oblivious adversaries but impossible against adaptive adversaries, providing new insights into multi-benchmark bandit problems.

Abstract: In adversarial multi-armed bandits, two performance measures are commonly used: static regret, which compares the learner to the best fixed arm, and dynamic regret, which compares it to the best sequence of arms. While optimal algorithms are known for each measure individually, there is no known algorithm achieving optimal bounds for both simultaneously. Marinov and Zimmert [2021] first showed that such simultaneous optimality is impossible against an adaptive adversary. Our work takes a first step to demonstrate its possibility against an oblivious adversary when losses are deterministic. First, we extend the impossibility result of Marinov and Zimmert [2021] to the case of deterministic losses. Then, we present an algorithm achieving optimal static and dynamic regret simultaneously against an oblivious adversary. Together, they reveal a fundamental separation between adaptive and oblivious adversaries when multiple regret benchmarks are considered simultaneously. It also provides new insight into the long open problem of simultaneously achieving optimal regret against switching benchmarks of different numbers of switches. Our algorithm uses negative static regret to compensate for the exploration overhead incurred when controlling dynamic regret, and leverages Blackwell approachability to jointly control both regrets. This yields a new model selection procedure for bandits that may be of independent interest.

[323] Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models

Alexander W. Goodall, Francesco Belardinelli

Main category: cs.LG

TL;DR: A recovery-based shielding framework for safe reinforcement learning that integrates backup policies with RL agents using Gaussian process uncertainty quantification to ensure safety in continuous control systems.

DetailsMotivation: Reinforcement learning lacks provable safety guarantees for critical applications, especially for unknown nonlinear continuous systems. There's a need for safe RL approaches that can ensure safety while maintaining exploration and learning efficiency.

Method: Proposes a recovery-based shielding framework that combines RL agents with backup policies (shields). Uses Gaussian process uncertainty quantification to predict safety violations and dynamically recover to safe trajectories. Experience from shielded agents builds GP models, with policy optimization via internal model-based sampling.

Result: Empirically demonstrates strong performance and strict safety compliance on continuous control environments. Enables unrestricted exploration and sample-efficient learning without compromising safety.

Conclusion: The framework provides provable safety guarantees for RL in unknown nonlinear continuous systems while maintaining learning efficiency and exploration capabilities.

Abstract: Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. In this paper, we introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems. The proposed approach integrates a backup policy (shield) with the RL agent, leveraging Gaussian process (GP) based uncertainty quantification to predict potential violations of safety constraints, dynamically recovering to safe trajectories only when necessary. Experience gathered by the ‘shielded’ agent is used to construct the GP models, with policy optimization via internal model-based sampling, enabling unrestricted exploration and sample-efficient learning without compromising safety. Empirically, our approach demonstrates strong performance and strict safety compliance on a suite of continuous control environments.
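The shield's recovery decision can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's implementation: a 1-D state, an RBF kernel, a hand-set confidence multiplier, and a scalar safety margin h(s) that the GP regresses; the agent's action is kept only while the GP's lower confidence bound on h stays above a threshold.

```python
import numpy as np

# Minimal GP regression over a scalar safety margin h(s); the shield
# overrides the RL action with the backup policy whenever the lower
# confidence bound mu - beta*sigma dips below the safety threshold.

def rbf(X1, X2, ls=0.5):
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

class SafetyGP:
    def __init__(self, noise=1e-3):
        self.X = np.empty(0); self.y = np.empty(0); self.noise = noise

    def add(self, s, h):
        self.X = np.append(self.X, s); self.y = np.append(self.y, h)

    def predict(self, s):
        if self.X.size == 0:
            return 0.0, 1.0                      # uninformative prior
        K = rbf(self.X, self.X) + self.noise * np.eye(self.X.size)
        k = rbf(np.atleast_1d(s), self.X)[0]
        mu = k @ np.linalg.solve(K, self.y)
        var = 1.0 - k @ np.linalg.solve(K, k)
        return mu, max(var, 1e-12)

def shielded_action(gp, s, rl_action, backup_action, beta=2.0, h_min=0.0):
    """Recover to the backup policy only when the lower bound is unsafe."""
    mu, var = gp.predict(s)
    return rl_action if mu - beta * np.sqrt(var) > h_min else backup_action

gp = SafetyGP()
for s in np.linspace(-1, 1, 9):      # observed margins shrink toward s = 1
    gp.add(s, 1.0 - s)
print(shielded_action(gp, -0.5, "rl", "backup"))  # well-covered, safe -> rl
print(shielded_action(gp, 1.2, "rl", "backup"))   # past the data -> backup
```

Because the bound tightens as shielded experience accumulates, the override fires less often over time, which is what lets exploration stay unrestricted away from the constraint boundary.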

[324] Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models

Lior Cohen, Ofir Nabati, Kaixin Wang, Navdeep Kumar, Shie Mannor

Main category: cs.LG

TL;DR: Horizon Imagination (HI) is an efficient diffusion-based world model for RL that enables parallel denoising of multiple future observations with sub-frame budgets, improving computational efficiency while maintaining control performance.

DetailsMotivation: Current diffusion-based world models for reinforcement learning face efficiency challenges - they either require heavyweight models at inference or rely on highly sequential imagination, both imposing prohibitive computational costs that limit practical application.

Method: Proposes Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. Includes stabilization mechanism and novel sampling schedule that decouples denoising budget from effective horizon while supporting sub-frame budgets.

Result: Experiments on Atari 100K and Craftium show HI maintains control performance with sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules compared to existing methods.

Conclusion: HI provides an efficient approach to diffusion-based world modeling for RL that addresses computational bottlenecks while preserving generative fidelity and control performance, making diffusion models more practical for reinforcement learning applications.

Abstract: We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor-c/horizon-imagination.

[325] Reducing Estimation Uncertainty Using Normalizing Flows and Stratification

Paweł Lorek, Rafał Nowak, Rafał Topolnicki, Tomasz Trzciński, Maciej Zięba, Aleksandra Krystecka

Main category: cs.LG

TL;DR: A flow-based model with stratified sampling for flexible estimation of expectations of functions of random variables, reducing uncertainty compared to parametric methods.

DetailsMotivation: Current expectation estimation methods rely on parametric distribution assumptions (Gaussian/mixed Gaussian) which can lead to significant uncertainty when assumptions don't hold. Need more flexible approaches for unknown data distributions.

Method: Proposes a flow-based model integrated with stratified sampling, using a parametrized neural network to model unknown data distributions more flexibly.

Result: Shows marked reduction in estimation uncertainty across multiple datasets including high-dimensional ones (30 and 128 dimensions), outperforming crude Monte Carlo estimators and Gaussian mixture models.

Conclusion: The flow-based stratified sampling approach provides more flexible and accurate expectation estimation for unknown distributions, reducing uncertainty compared to traditional parametric methods.

Abstract: Estimating the expectation of a real-valued function of a random variable from sample data is a critical aspect of statistical analysis, with far-reaching implications in various applications. Current methodologies typically assume (semi-)parametric distributions such as Gaussian or mixed Gaussian, leading to significant estimation uncertainty if these assumptions do not hold. We propose a flow-based model, integrated with stratified sampling, that leverages a parametrized neural network to offer greater flexibility in modeling unknown data distributions, thereby mitigating this limitation. Our model shows a marked reduction in estimation uncertainty across multiple datasets, including high-dimensional (30 and 128) ones, outperforming crude Monte Carlo estimators and Gaussian mixture models. Reproducible code is available at https://github.com/rnoxy/flowstrat.
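The estimator can be sketched in one dimension, where a trained flow reduces to an inverse CDF (here of an exponential, purely for illustration): stratify the base uniform variable, push each stratum's samples through the transform, and average. The function f and all constants below are invented for the demo.

```python
import numpy as np

# Because the flow T is a bijection from the base distribution to the data
# distribution, equal-probability strata in base space induce strata in data
# space, giving a variance-reduced estimate of E[f(X)].

rng = np.random.default_rng(0)
f = lambda x: x ** 2                 # function whose expectation we want
T = lambda u: -np.log1p(-u)          # "flow": inverse CDF of Exp(1)

def crude_mc(n):
    return f(T(rng.uniform(size=n))).mean()

def stratified(n, strata=32):
    per = n // strata                # proportional allocation per stratum
    edges = np.linspace(0, 1, strata + 1)
    u = rng.uniform(edges[:-1, None], edges[1:, None], size=(strata, per))
    return f(T(u)).mean()

runs = [(crude_mc(4096), stratified(4096)) for _ in range(200)]
crude, strat = map(np.array, zip(*runs))
print(f"true E[X^2] = 2, crude std = {crude.std():.4f}, "
      f"stratified std = {strat.std():.4f}")  # stratified std is smaller
```

With proportional allocation the plain sample mean is already the stratified estimator, so the only change from crude Monte Carlo is where the base samples are drawn.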

[326] How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?

Tatsuya Sagawa, Ryosuke Kojima

Main category: cs.LG

TL;DR: Scaling training resources for Chemical Language Models improves pretraining loss but yields limited downstream performance gains on molecular property prediction tasks, revealing a gap between pretraining metrics and actual task performance.

DetailsMotivation: To systematically validate whether increasing training resources (model size, dataset size, training compute) for Chemical Language Models actually improves downstream molecular property prediction performance, contrary to common assumptions in the field.

Method: Pretrained CLMs while scaling training resources and measured transfer performance across diverse molecular property prediction tasks. Analyzed alternative metrics (Hessian, loss landscape) and conducted parameter space visualizations to understand failure modes.

Result: Pretraining loss consistently decreases with increased resources, but downstream task performance shows limited improvement. Alternative metrics also fail to estimate downstream performance. Identified conditions where downstream performance saturates or degrades despite pretraining improvements.

Conclusion: There’s a significant gap between pretraining-based evaluation and downstream performance in CLMs, emphasizing the need for model selection and evaluation strategies that explicitly account for downstream task characteristics rather than relying solely on pretraining metrics.

Abstract: Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources such as model size, dataset size, and training compute improves both pretraining loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we evaluate this assumption by pretraining CLMs while scaling training resources and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pretraining loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or loss landscape also fail to estimate downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pretraining metrics, and analyze the underlying task-dependent failure modes through parameter space visualizations. These results expose a gap between pretraining-based evaluation and downstream performance, and emphasize the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.

cs.MA

[327] Beyond Context Sharing: A Unified Agent Communication Protocol (ACP) for Secure, Federated, and Autonomous Agent-to-Agent (A2A) Orchestration

Naveen Kumar Krishnan

Main category: cs.MA

TL;DR: The paper introduces Agent Communication Protocol (ACP), a standardized framework for secure, cross-platform agent-to-agent interaction in autonomous AI systems.

DetailsMotivation: The transition from isolated large language models to autonomous agents faces challenges with cross-platform, decentralized, and secure interactions, hindering the realization of a truly Agentic Web.

Method: Proposes ACP as a standardized framework building on AI agent architectures and Model Context Protocol (MCP), featuring federated orchestration with decentralized identity verification, semantic intent mapping, and automated service-level agreements.

Result: ACP reduces inter-agent communication latency by an unspecified percentage while maintaining zero-trust security posture, enabling heterogeneous agents to discover, negotiate, and execute collaborative workflows across disparate environments.

Conclusion: ACP represents a critical advancement toward a scalable and interoperable ecosystem of autonomous digital entities, addressing key barriers in agent-to-agent communication.

Abstract: In the artificial intelligence space, we are transitioning from isolated large language models to autonomous agents capable of complex reasoning and tool use. While foundational architectures and local context management protocols have been established, the challenge of cross-platform, decentralized, and secure interaction remains a significant barrier to the realization of a truly Agentic Web. Building upon the foundations of AI agent architectures and the Model Context Protocol (MCP) for multi-agent coordination, this paper introduces the Agent Communication Protocol (ACP). ACP provides a standardized framework for Agent-to-Agent (A2A) interaction, enabling heterogeneous agents to discover, negotiate, and execute collaborative workflows across disparate environments. We propose a federated orchestration model that integrates decentralized identity verification, semantic intent mapping, and automated service-level agreements. Our evaluation demonstrates that ACP reduces inter-agent communication latency by % while maintaining a zero-trust security posture. This work represents a critical advancement toward a scalable and interoperable ecosystem of autonomous digital entities.

[328] Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

Main category: cs.MA

TL;DR: Colosseum is a framework for auditing LLM agents’ collusive behavior in multi-agent systems, measuring collusion through regret relative to cooperative optimum in Distributed Constraint Optimization Problems.

DetailsMotivation: Multi-agent systems with LLM agents communicating through free-form language enable sophisticated coordination but create unique safety problems when agents form coalitions to pursue secondary goals at the expense of joint objectives.

Method: Grounds agent cooperation through Distributed Constraint Optimization Problem (DCOP), measures collusion via regret relative to cooperative optimum, tests LLMs under different objectives, persuasion tactics, and network topologies.

Result: Most out-of-the-box models exhibited propensity to collude when secret communication channels were artificially formed; discovered “collusion on paper” where agents plan collusion in text but pick non-collusive actions.

Conclusion: Colosseum provides a new way to study collusion by measuring communications and actions in rich yet verifiable environments for multi-agent LLM systems.

Abstract: Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when individual agents form a coalition and collude to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents’ collusive behavior in multi-agent settings. We ground how agents cooperate through a Distributed Constraint Optimization Problem (DCOP) and measure collusion via regret relative to the cooperative optimum. Colosseum tests each LLM for collusion under different objectives, persuasion tactics, and network topologies. Through our audit, we show that most out-of-the-box models exhibited a propensity to collude when a secret communication channel was artificially formed. Furthermore, we discover “collusion on paper”, where agents plan to collude in text but would often pick non-collusive actions, thus providing little effect on the joint task. Colosseum provides a new way to study collusion by measuring communications and actions in rich yet verifiable environments.
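The regret-based audit metric can be made concrete on a toy DCOP; the agents, domains, and constraint utilities below are invented for illustration, not Colosseum's benchmark tasks.

```python
import itertools

# Cooperation as a DCOP: agents pick values, binary constraints score the
# joint assignment, and collusion shows up as regret of the observed joint
# action relative to the cooperative optimum.

domains = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}
constraints = [                      # (scope, utility over that scope)
    (("a1", "a2"), lambda x, y: 3 if x == y else 0),
    (("a2", "a3"), lambda x, y: 2 if x != y else 1),
]

def joint_utility(assign):
    return sum(f(*(assign[v] for v in scope)) for scope, f in constraints)

names = list(domains)
optimum = max(
    (dict(zip(names, vals)) for vals in itertools.product(*domains.values())),
    key=joint_utility,
)

def collusion_regret(observed):
    """Regret of the observed joint action w.r.t. the cooperative optimum."""
    return joint_utility(optimum) - joint_utility(observed)

print(collusion_regret({"a1": 0, "a2": 0, "a3": 1}))  # cooperative play -> 0
print(collusion_regret({"a1": 1, "a2": 0, "a3": 0}))  # defecting pair -> 4
```

The same scoring also catches "collusion on paper": if agents plan a coalition in their messages but their actions yield near-zero regret, the communications and the action audit disagree.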

[329] Enhancing Computational Efficiency in NetLogo: Best Practices for Running Large-Scale Agent-Based Models on AWS and Cloud Infrastructures

Michael A. Duprey, Georgiy V. Bobashev

Main category: cs.MA

TL;DR: Guide to optimizing NetLogo agent-based models on AWS cloud infrastructure for better performance and cost efficiency

DetailsMotivation: The increasing complexity and scale of agent-based models require more computational power and memory, necessitating efficient strategies to manage these demands on cloud platforms like AWS.

Method: Provides comprehensive optimization guide covering memory management, Java options, BehaviorSpace execution, and AWS instance selection. Uses comparative analysis of NetLogo simulations on different AWS instances with the wolf-sheep predation model.

Result: Achieved 32% reduction in computational costs and improved performance consistency through implemented optimizations and appropriate AWS instance selection.

Conclusion: Cloud optimization strategies for NetLogo ABMs can significantly improve performance and reduce costs, making large-scale simulations more accessible and efficient.

Abstract: The rising complexity and scale of agent-based models (ABMs) necessitate efficient computational strategies to manage the increasing demand for processing power and memory. This manuscript provides a comprehensive guide to optimizing NetLogo, a widely used platform for ABMs, for running large-scale models on Amazon Web Services (AWS) and other cloud infrastructures. It covers best practices in memory management, Java options, BehaviorSpace execution, and AWS instance selection. By implementing these optimizations and selecting appropriate AWS instances, we achieved a 32% reduction in computational costs and improved performance consistency. Through a comparative analysis of NetLogo simulations on different AWS instances using the wolf-sheep predation model, we demonstrate the performance gains achievable through these optimizations.
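In practice the BehaviorSpace-on-AWS workflow the guide describes comes down to a headless invocation sized to the instance. The heap size, GC flag, experiment name, and file paths below are illustrative; `--model`, `--experiment`, `--table`, and `--threads` are documented netlogo-headless flags, but how JVM options are passed varies by NetLogo version (some releases require editing the launcher script).

```shell
#!/bin/sh
# Sketch: run a BehaviorSpace experiment headlessly on an EC2 instance.
export JAVA_OPTS="-Xmx12g -XX:+UseParallelGC"   # size the heap to the instance
./netlogo-headless.sh \
  --model "Wolf Sheep Predation.nlogo" \
  --experiment sweep \
  --table results.csv \
  --threads "$(nproc)"    # match BehaviorSpace parallelism to the vCPU count
```

Matching `--threads` to vCPUs and the heap to instance memory is where most of the reported cost reduction comes from: undersized heaps trigger GC thrashing, and oversized instances bill for idle cores.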

[330] MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management

Heming Fu, Guojun Xiong, Shan Lin

Main category: cs.MA

TL;DR: MARLIN is a decentralized reservoir management framework that combines multi-agent reinforcement learning with bio-inspired coordination rules and LLM-based reward shaping to handle uncertainties in water resource management.

DetailsMotivation: Climate change intensifies extreme weather events, making adaptive reservoir management critical. Traditional centralized optimization suffers from exponential computational complexity and cannot handle real-world uncertainties like water transfer losses, while existing MARL methods fail to achieve effective coordination under uncertainty.

Method: MARLIN integrates bio-inspired alignment, separation, and cohesion rules (inspired by starling murmurations) with multi-agent reinforcement learning. It uses a decentralized approach where individual reservoirs make local decisions while achieving emergent global coordination. An LLM provides real-time reward shaping signals to guide agents to adapt to environmental changes and human-defined preferences.

Result: Experiments on USGS data show MARLIN improves uncertainty handling by 23%, cuts computation by 35%, accelerates flood response by 68%, and exhibits super-linear coordination with complexity scaling 5.4x from 400 to 10,000 nodes.

Conclusion: MARLIN demonstrates potential for disaster prevention and protecting communities through intelligent, scalable water resource management by effectively handling uncertainties while maintaining computational efficiency.

Abstract: As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8-12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real-world uncertainties, while existing multi-agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by starling murmurations intelligence. Integrating bio-inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, an LLM provides real-time reward shaping signals, guiding agents to adapt to environmental changes and human-defined preferences. Experiments on USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super-linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN’s potential for disaster prevention and protecting communities through intelligent, scalable water resource management.
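The alignment, separation, and cohesion rules borrowed from starling murmurations can be sketched as a local velocity update; the 2-D positions, radius, and weights below are illustrative stand-ins for MARLIN's reservoir-network state, not the paper's code.

```python
import numpy as np

# Each agent adjusts its "velocity" using only neighbors within a radius:
# alignment matches neighbor velocities, cohesion drifts toward the local
# center, separation pushes away from crowding.

def murmuration_step(pos, vel, radius=0.5, w_align=0.05, w_sep=0.1, w_coh=0.01):
    new_vel = vel.copy()
    for i in range(len(pos)):
        d = np.linalg.norm(pos - pos[i], axis=1)
        nbr = (d > 0) & (d < radius)
        if not nbr.any():
            continue                                 # no neighbors in range
        align = vel[nbr].mean(axis=0) - vel[i]       # match neighbor velocity
        coh = pos[nbr].mean(axis=0) - pos[i]         # drift toward local center
        sep = (pos[i] - pos[nbr]).sum(axis=0)        # push away from crowding
        new_vel[i] += w_align * align + w_coh * coh + w_sep * sep
    return new_vel

rng = np.random.default_rng(0)
pos = rng.uniform(0, 1, size=(20, 2))
vel = rng.normal(0, 1, size=(20, 2))
for _ in range(50):
    vel = murmuration_step(pos, vel)
    pos = pos + 0.01 * vel
print("mean velocity after 50 steps:", vel.mean(axis=0))
```

Because every update uses only neighbors, the per-agent cost stays local as the network grows, which is the property behind the reported scaling from 400 to 10,000 nodes; in MARLIN these rules are combined with MARL policies and the LLM reward-shaping signal.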

Rishav Sen, Fangqi Liu, Jose Paolo Talusan, Ava Pettet, Yoshinori Suzue, Mark Bailey, Ayan Mukhopadhyay, Abhishek Dubey

Main category: cs.MA

TL;DR: A negotiation-based framework for EV charging in vehicle-to-building settings that balances building operator costs with driver convenience through incentive-backed flexibility options.

DetailsMotivation: The growth of EVs creates conflicts in V2B settings between building operators facing high energy costs from uncoordinated charging and drivers prioritizing convenience and full charges. There's a need to align these conflicting objectives.

Method: Proposes a negotiation-based framework that guarantees voluntary participation, strategy-proofness, and budget feasibility. Offers drivers incentive-backed options for modest flexibility in departure time or requested state of charge. Calibrated with user survey data and validated using real operational data from commercial building and EV manufacturer.

Result: Simulations show the negotiation protocol creates mutually beneficial outcomes: lowers building operator’s costs by over 3.5% compared to optimized non-negotiating smart charging policy, while reducing user charging expenses by 22% below utility’s retail energy rate.

Conclusion: The framework provides a strategic bridge between energy and mobility systems, transforming EV charging from operational friction into a platform for collaboration and shared savings by aligning operator and EV user objectives.

Abstract: The growth of Electric Vehicles (EVs) creates a conflict in vehicle-to-building (V2B) settings between building operators, who face high energy costs from uncoordinated charging, and drivers, who prioritize convenience and a full charge. To resolve this, we propose a negotiation-based framework that, by design, guarantees voluntary participation, strategy-proofness, and budget feasibility. It transforms EV charging into a strategic resource by offering drivers a range of incentive-backed options for modest flexibility in their departure time or requested state of charge (SoC). Our framework is calibrated with user survey data and validated using real operational data from a commercial building and an EV manufacturer. Simulations show that our negotiation protocol creates a mutually beneficial outcome: lowering the building operator’s costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while simultaneously reducing user charging expenses by 22% below the utility’s retail energy rate. By aligning operator and EV user objectives, our framework provides a strategic bridge between energy and mobility systems, transforming EV charging from a source of operational friction into a platform for collaboration and shared savings.
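The incentive-backed flexibility options can be illustrated as posted-menu pricing; the menu values and the linear per-hour inconvenience model below are invented for illustration. Including the status-quo option makes participation voluntary, a fixed take-it-or-leave-it menu makes truthful selection optimal, and budget feasibility holds whenever each incentive is priced below the operator's saving from that option.

```python
# A driver with a private inconvenience cost simply picks the option that
# maximizes incentive minus inconvenience; no bargaining, no misreporting.

menu = [  # (extra hours of departure flexibility, incentive in $)
    (0, 0.00),   # status quo: opting out costs the driver nothing
    (1, 1.50),
    (2, 2.75),
    (4, 4.00),
]

def best_option(inconvenience_per_hour):
    """Utility-maximizing choice for a driver with a private per-hour cost."""
    return max(menu, key=lambda o: o[1] - inconvenience_per_hour * o[0])

print(best_option(0.5))   # flexible driver takes 4 h  -> (4, 4.0)
print(best_option(2.0))   # inflexible driver opts out -> (0, 0.0)
```

The operator's side of the design is then a pricing problem: choose incentives low enough to stay within the smart-charging savings but high enough that flexible drivers self-select into the options that flatten the building's load.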

[332] From Agent Simulation to Social Simulator: A Comprehensive Review (Part 2)

Xiao Xue, Deyu Zhou, Ming Zhang, Xiangning Yu, Fei-Yue Wang

Main category: cs.MA

TL;DR: Computational experiments as a method for causal inference in complex systems, addressing limitations of traditional Agent-Based Modeling by emphasizing counterfactual experimentation rather than just simulation.

DetailsMotivation: Traditional Agent-Based Modeling (ABM) focuses more on simulation than experimentation, limiting its ability to uncover governing operational principles and causal relationships in complex systems. There's a need for methods that can provide robust causal inference through systematic experimentation.

Method: Proposes computational experiments that emphasize counterfactual experiments - creating parallel worlds to simulate alternative evolutionary paths. This involves systematically adjusting input variables and observing resulting changes in output variables to establish causal relationships.

Result: Computational experiments provide a robust tool for causal inference that addresses limitations of traditional ABM, offering deeper insights into system dynamics and governing principles.

Conclusion: Computational experiments combined with ABM offer enhanced causal insights into the dynamic evolution of complex systems, moving beyond mere simulation to systematic experimentation and counterfactual analysis.

Abstract: The study of system complexity primarily has two objectives: to explore underlying patterns and to develop theoretical explanations. Pattern exploration seeks to clarify the mechanisms behind the emergence of system complexity, while theoretical explanations aim to identify the fundamental causes of this complexity. Laws are generally defined as mappings between variables, whereas theories offer causal explanations of system behavior. Agent-Based Modeling (ABM) is an important approach for studying complex systems, but it tends to emphasize simulation over experimentation. As a result, ABM often struggles to deeply uncover the governing operational principles. Unlike conventional scenario analysis that relies on human reasoning, computational experiments emphasize counterfactual experiments, that is, creating parallel worlds that simulate alternative “evolutionary paths” of real-world events. By systematically adjusting input variables and observing the resulting changes in output variables, computational experiments provide a robust tool for causal inference, thereby addressing the limitations of traditional ABM. Together, these methods offer causal insights into the dynamic evolution of systems. This part can help readers gain a preliminary understanding of the entire computational experiment method, laying the foundation for the subsequent study.
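The counterfactual "parallel worlds" idea can be made concrete with a toy simulation: replicate runs under paired random seeds, change exactly one input, and attribute the output difference to that input. The epidemic model below is an invented example, not from the review.

```python
import random

# Each seed defines one "world"; factual and counterfactual runs of that
# world share the seed and differ only in the input variable under study.

def epidemic(infect_prob, seed, agents=500, steps=25):
    rng = random.Random(seed)
    infected = {0}
    for _ in range(steps):
        for _agent in list(infected):
            contact = rng.randrange(agents)       # random mixing
            if rng.random() < infect_prob:
                infected.add(contact)
    return len(infected)

effects = []
for seed in range(20):                            # replicate the paired design
    factual = epidemic(0.30, seed)
    counterfactual = epidemic(0.15, seed)         # same world, one input halved
    effects.append(factual - counterfactual)
avg = sum(effects) / len(effects)
print(f"estimated causal effect of doubling infectivity: {avg:+.1f} infections")
```

Pairing seeds is a common-random-numbers device: it removes between-world noise from the comparison, so far fewer replications are needed than with independent factual and counterfactual runs.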

[333] Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu, Yang Yan

Main category: cs.MA

TL;DR: Survey paper on agent skills - modular packages of instructions, code, and resources that LLM agents load on demand to extend capabilities without retraining, covering architecture, acquisition, deployment, security, and open challenges.

DetailsMotivation: The transition from monolithic LLMs to modular, skill-equipped agents represents a significant shift in deployment. Rather than encoding all procedural knowledge in model weights, agent skills enable dynamic capability extension without retraining, addressing limitations of static models.

Method: Comprehensive survey organizing the field along four axes: (1) architectural foundations (SKILL.md specification, progressive context loading, MCP integration), (2) skill acquisition (reinforcement learning with skill libraries, autonomous discovery, compositional synthesis), (3) deployment at scale (CUA stack, GUI grounding, OSWorld/SWE-bench benchmarks), and (4) security (vulnerability analysis, proposed Skill Trust and Lifecycle Governance Framework).

Result: Identifies that 26.1% of community-contributed skills contain vulnerabilities, motivating a four-tier, gate-based permission model. Organizes current landscape and identifies seven open challenges including cross-platform portability and capability-based permission models.

Conclusion: The emerging skill abstraction layer represents a fundamental shift in agentic systems, enabling dynamic capability extension without retraining. The survey provides a research agenda for realizing trustworthy, self-improving skill ecosystems, focusing specifically on the skill abstraction layer rather than general LLM agents or tool use.

Abstract: The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills – composable packages of instructions, code, and resources that agents load on demand – enable dynamic capability extension without retraining. It is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the SKILL.md specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries, autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework – a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges – from cross-platform skill portability to capability-based permission models – and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: https://github.com/scienceaix/agentskills
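The progressive-disclosure mechanism the survey describes can be illustrated with a toy loader; the frontmatter format, directory layout, and the example skill below are simplified assumptions for illustration, not the SKILL.md specification itself.

```python
import pathlib
import textwrap

# Only each skill's name and description enter the agent context up front;
# the full SKILL.md body is read from disk only when the skill is invoked.

def parse_frontmatter(text):
    _, fm, body = text.split("---", 2)
    meta = dict(line.split(":", 1) for line in fm.strip().splitlines())
    return {k.strip(): v.strip() for k, v in meta.items()}, body.strip()

class SkillRegistry:
    def __init__(self, root):
        self.paths = {}          # name -> (path, description); bodies stay on disk
        for p in pathlib.Path(root).glob("*/SKILL.md"):
            meta, _ = parse_frontmatter(p.read_text())
            self.paths[meta["name"]] = (p, meta["description"])

    def catalog(self):           # cheap: all the model sees initially
        return {name: desc for name, (_, desc) in self.paths.items()}

    def load(self, name):        # expensive: full instructions on demand
        path, _ = self.paths[name]
        _, body = parse_frontmatter(path.read_text())
        return body

skill_dir = pathlib.Path("skills/pdf-tools")
skill_dir.mkdir(parents=True, exist_ok=True)
(skill_dir / "SKILL.md").write_text(textwrap.dedent("""\
    ---
    name: pdf-tools
    description: Extract text and tables from PDF files
    ---
    ## Instructions
    1. Prefer pdftotext for plain extraction.
    """))
reg = SkillRegistry("skills")
print(reg.catalog())                           # name -> description only
print(reg.load("pdf-tools").splitlines()[0])   # '## Instructions'
```

The security axis of the survey starts exactly here: anything `load` pulls in executes with the agent's permissions, which is what motivates gate-based trust tiers for community-contributed skills.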

cs.MM

[334] “The Intangible Victory”, Interactive Audiovisual Installation

Konstantinos Tsioutas, Panagiotis Pangalos, Konstantinos Tiligadis, Andreas Sitorengo

Main category: cs.MM

TL;DR: Interactive audiovisual installation reimagining the Victory of Samothrace sculpture using conductive string sensors and sound to represent absence/void through viewer interaction

DetailsMotivation: To explore the symbolism of absence/void in ancient sculpture through digital media, using the Victory of Samothrace as a case study to examine time as wear factor (entropy) and create new viewer-space-time dialogues

Method: Created an interactive installation using colored conductive strings arranged cylindrically to reconstruct the sculpture’s form, with sensors enabling visitor interaction that generates sound environments through movement

Result: An audiovisual experience where sound replaces physical volume, with the void of the sculptural form and viewer interaction creating a new symbolic representation of the Victory of Samothrace

Conclusion: Digital media can reveal and reinterpret absence in sculpture, creating new interactive experiences where sound and viewer participation transform traditional artistic interpretation through multimodal engagement

Abstract: “Intangible Victory” is an audiovisual installation in the form of the intangible being of the Victory of Samothrace that uses interactive digital media. Specifically, through this installation, we redefine the visual symbolism of the ancient sculpture, paying attention to time as a wear factor (entropy) and the special importance of the void as an absence of the sculptural form. Emptiness completes the intangible essence of the sculpture in the field of symbolism as well as in that of artistic significance for the interpretation of the work today. The function of the void and the interaction of the viewer with the work cause the emergence of a new experience-dialogue between space and time. The use of digital media and technology reveals the absence of the sculptural form as it is visualized in the Victory of Samothrace. The sculptural form is reconstructed from fibers in space in a cylindrical arrangement. The form is rendered with colored strings (conductive sensors) that allow the visitor to interact with the work, creating a sound environment through movement. The sound completely replaces the volume, as the void of the sculptural form together with the viewer in unison presents an audiovisual symbolism of the Victory of Samothrace.

[335] Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

Main category: cs.MM

TL;DR: A real-time conversational assistant for procedural tasks that uses only audio and IMU inputs from wearables, with a novel finetuning method that improves dialogue efficiency and enables edge deployment.

DetailsMotivation: Current conversational assistants for procedural tasks rely on video input, which is computationally expensive and privacy-invasive. There's a need for lightweight, privacy-preserving alternatives using modalities like audio and IMU from wearable devices.

Method: Proposes a real-time conversational assistant using only audio and IMU inputs for furniture assembly tasks. Introduces a User Whim Agnostic (UWA) LoRA finetuning method that suppresses less informative dialogues while preserving the communication of important instructions, and eliminates the need for in-context examples in prompts.

Result: Achieves >30% improvement in F-score for dialogue quality. Finetuning results in 16x speedup by eliminating in-context examples. System implemented on edge devices with no cloud dependence.

Conclusion: Demonstrates feasibility of privacy-preserving conversational assistants using lightweight modalities, with significant performance improvements through specialized finetuning and edge deployment.

Abstract: Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user’s wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model’s ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
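The >30% F-score gain above is measured over the assistant's decisions about when to speak. As a purely illustrative sketch of how such a score penalizes a talkative baseline, here is a toy computation; the step sets and matching criteria below are hypothetical, not the paper's annotation scheme:

```python
# Illustrative only: a toy F-score over "when the assistant spoke",
# with made-up step sets (the paper's matching criteria differ).

def f_score(predicted, reference):
    """predicted/reference: sets of step indices at which the
    assistant spoke / should have spoken."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    precision = tp / len(predicted)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

baseline = {1, 2, 3, 4, 5, 6, 7, 8}   # talkative: speaks at every step
finetuned = {1, 3, 5, 7}              # suppresses uninformative turns
reference = {1, 3, 5, 7, 9}           # steps that truly need guidance

print(round(f_score(baseline, reference), 3))   # 0.615
print(round(f_score(finetuned, reference), 3))  # 0.889
```

Suppressing the uninformative turns raises precision without losing recall, which is exactly the trade-off the UWA finetuning targets.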

eess.AS

[336] What Do Neurons Listen To? A Neuron-level Dissection of a General-purpose Audio Model

Takao Kawamura, Daisuke Niizumi, Nobutaka Ono

Main category: eess.AS

TL;DR: First systematic neuron-level analysis of general-purpose audio self-supervised learning models reveals class-specific neurons with shared responses across semantic categories and acoustic similarities, providing insights into internal representations.

DetailsMotivation: Despite strong empirical performance of audio SSL models as feature extractors, the internal mechanisms underlying their robust generalization remain unclear. The paper aims to understand how these models achieve generalization through mechanistic interpretability.

Method: Uses mechanistic interpretability framework to analyze conditional activation patterns across diverse tasks. Identifies and examines class-specific neurons by analyzing their activation patterns across different semantic categories and acoustic similarities.

Result: Reveals that SSL models foster emergence of class-specific neurons with extensive coverage across novel task classes. These neurons exhibit shared responses across different semantic categories and acoustic similarities (speech attributes, musical pitch). Confirms functional impact on classification performance.

Conclusion: First systematic neuron-level analysis of general-purpose audio SSL model provides new insights into internal representation mechanisms, showing how class-specific neurons with shared responses contribute to robust generalization.

Abstract: In this paper, we analyze the internal representations of a general-purpose audio self-supervised learning (SSL) model from a neuron-level perspective. Despite their strong empirical performance as feature extractors, the internal mechanisms underlying the robust generalization of SSL audio models remain unclear. Drawing on the framework of mechanistic interpretability, we identify and examine class-specific neurons by analyzing conditional activation patterns across diverse tasks. Our analysis reveals that SSL models foster the emergence of class-specific neurons that provide extensive coverage across novel task classes. These neurons exhibit shared responses across different semantic categories and acoustic similarities, such as speech attributes and musical pitch. We also confirm that these neurons have a functional impact on classification performance. To our knowledge, this is the first systematic neuron-level analysis of a general-purpose audio SSL model, providing new insights into its internal representation.
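One simple way to operationalize "class-specific neurons" is a selectivity index over per-class mean activations. The sketch below is only a toy analogue of that idea, with invented activation values; the paper's conditional-activation analysis is more involved:

```python
# Toy selectivity index: how much one class's mean activation stands
# out from the rest (activation values here are invented).

def selectivity(acts):
    """acts: mean activations of one neuron, one entry per class.
    Returns (best_class, selectivity index in [0, 1])."""
    best = max(range(len(acts)), key=lambda c: acts[c])
    a_max = acts[best]
    others = [a for c, a in enumerate(acts) if c != best]
    a_rest = sum(others) / len(others)
    return best, (a_max - a_rest) / (a_max + a_rest + 1e-12)

# Three hypothetical neurons observed over four classes.
neurons = [
    [0.9, 0.1, 0.1, 0.1],  # strongly selective for class 0
    [0.5, 0.4, 0.5, 0.4],  # broadly tuned, low selectivity
    [0.1, 0.1, 0.8, 0.2],  # selective for class 2
]
for i, acts in enumerate(neurons):
    cls, s = selectivity(acts)
    print(i, cls, round(s, 2))
```

Neurons with a high index would be flagged as class-specific; the paper then checks whether ablating such neurons actually hurts classification.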

[337] Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction

Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mondal, Prasanta Kumar Ghosh

Main category: eess.AS

TL;DR: A bottleneck transformer model for non-intrusive STOI prediction that outperforms SSL-based methods using convolution blocks for frame-level features and multi-head self-attention for information aggregation.

DetailsMotivation: Traditional STOI calculation requires clean reference speech, limiting real-world applicability. While deep learning-based non-intrusive speech assessment models have shown promise, there's room for improvement over existing methods.

Method: Proposes a bottleneck transformer architecture with convolution blocks for learning frame-level features and a multi-head self-attention layer to aggregate information and focus on key aspects of input data.

Result: The model achieves higher correlation and lower mean squared error for both seen and unseen scenarios compared to state-of-the-art models using self-supervised learning and spectral features as inputs.

Conclusion: The bottleneck transformer approach effectively predicts STOI without requiring clean reference speech, demonstrating superior performance over existing non-intrusive methods.

Abstract: In this study, we present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this, numerous deep learning-based non-intrusive speech assessment models have garnered significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose the use of a bottleneck transformer, incorporating convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios compared to the state-of-the-art model using self-supervised learning (SSL) and spectral features as inputs.
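The MHSA layer's role here is to aggregate frame-level features into a single utterance-level representation. A single-head, plain-Python analogue of that attention pooling, with toy vectors rather than the paper's learned features:

```python
# Toy single-head attention pooling over frame-level features
# (a simplified analogue of MHSA aggregation; vectors are made up).
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, frames):
    """frames: list of frame feature vectors; query: pooling query.
    Returns a softmax-weighted sum of the frames."""
    scores = [sum(q * f for q, f in zip(query, fr)) for fr in frames]
    w = softmax(scores)
    dim = len(frames[0])
    return [sum(w[t] * frames[t][d] for t in range(len(frames)))
            for d in range(dim)]

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = attend([1.0, 0.0], frames)
print([round(v, 3) for v in pooled])  # frames matching the query dominate
```

Frames that align with the query receive larger weights, which is the "focus on key aspects of the input" behavior the summary describes.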

[338] Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Yiming Yang, Guangyong Wang, Haixin Guan, Yanhua Long

Main category: eess.AS

TL;DR: Enroll-on-Wakeup (EoW) framework uses wake-word segments as enrollment reference for target speech extraction, eliminating need for pre-recorded speech and enabling seamless human-machine interaction.

DetailsMotivation: Traditional target speech extraction requires pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interactions. The paper aims to create a more natural and seamless experience by leveraging wake-word segments captured during normal interaction.

Method: Proposes Enroll-on-Wakeup framework where wake-word segments are automatically used as enrollment reference. Evaluates discriminative and generative TSE models under diverse acoustic conditions. Investigates enrollment augmentation using LLM-based TTS to address challenges of short and noisy wake-word segments.

Result: Current TSE models face performance degradation in EoW-TSE due to short/noisy wake-word segments. TTS-based assistance significantly enhances listening experience, but gaps remain in speech recognition accuracy compared to traditional enrollment methods.

Conclusion: EoW framework enables seamless target speech extraction without pre-recorded enrollment. While current models show limitations, TTS augmentation improves performance, suggesting promising direction for more natural human-machine interaction systems.

Abstract: Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.

[339] Interpretable Binaural Deep Beamforming Guided by Time-Varying Relative Transfer Function

Ilai Zaidel, Sharon Gannot

Main category: eess.AS

TL;DR: Deep learning framework for adaptive beamforming that uses neural networks to learn time-varying beamformer weights, guided by tracked RTFs of moving speakers, with applications to binaural enhancement for hearables.

DetailsMotivation: To develop a speech enhancement system that can operate in dynamic acoustic environments with moving speakers, overcoming limitations of traditional fixed beamformers that struggle with spatial tracking and adaptation.

Method: Proposes a deep beamforming framework where a neural network learns time-varying beamformer weights from multichannel signals, guided by continuously tracked relative transfer functions (RTFs) of moving target speakers. Evaluates three modes: oracle guidance with true RTFs, guidance with subspace-tracked RTF estimates, and unguided operation. Extends to binaural beamforming using HRTF-based acoustic simulation.

Result: RTF guidance produces smoother, more spatially consistent beampatterns that track target DOA, while unguided models fail to maintain clear spatial focus. Binaural extension preserves spatial cues (ILD and ITD) effectively, demonstrating suitability for hearable applications.

Conclusion: The deep beamforming framework with RTF guidance enables effective speech enhancement in dynamic environments with moving speakers, maintaining spatial focus and preserving binaural cues for hearable applications.

Abstract: In this work, we propose a deep beamforming framework for speech enhancement in dynamic acoustic environments. The framework learns time-varying beamformer weights from noisy multichannel signals via a deep neural network, guided by a continuously tracked relative transfer function (RTF) of a moving target speaker. We analyze the network’s spatial behavior on an 8-microphone linear array by evaluating narrowband and wideband beampatterns in three modes: (i) oracle guidance with true RTFs, (ii) guidance with subspace-tracked RTF estimates, and (iii) operation without RTF guidance. Results show that RTF guidance yields smoother, more spatially consistent beampatterns that track the target direction of arrival (DOA), whereas the unguided model fails to maintain a clear spatial focus. We further extend the framework to binaural beamforming for dynamic target-speaker enhancement. The system is trained using a head-related transfer function (HRTF)-based acoustic simulation of a moving source, enabling realistic spatial rendering at the left and right ears. Spatial cue preservation is quantitatively evaluated in terms of interaural level differences (ILD) and interaural time differences (ITD), demonstrating the method’s suitability for hearable applications.
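The ILD/ITD evaluation above uses standard definitions: ILD is the interaural energy ratio in dB, and ITD is the lag maximizing the interaural cross-correlation. A self-contained toy version on impulse signals (not HRTF-rendered audio):

```python
# Standard binaural cue computations on a toy delayed-and-attenuated
# impulse pair; real evaluation would use HRTF-rendered signals.
import math

def ild_db(left, right):
    """Interaural level difference: energy ratio in dB."""
    el = sum(x * x for x in left)
    er = sum(x * x for x in right)
    return 10 * math.log10(el / er)

def itd_samples(left, right, max_lag=8):
    """Lag (in samples) by which the right channel trails the left,
    found by maximizing the cross-correlation."""
    best_lag, best = 0, -float("inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        c = sum(right[t] * left[t - lag]
                for t in range(n) if 0 <= t - lag < n)
        if c > best:
            best, best_lag = c, lag
    return best_lag

left = [0.0] * 32
left[10] = 1.0      # impulse at t = 10
right = [0.0] * 32
right[13] = 0.5     # 3 samples later, 6 dB softer

print(itd_samples(left, right))       # 3
print(round(ild_db(left, right), 1))  # 6.0
```

A binaural beamformer "preserves spatial cues" when these quantities, measured on its output, stay close to those of the clean target.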

eess.IV

[340] StrokeNeXt: A Siamese-encoder Approach for Brain Stroke Classification in Computed Tomography Imagery

Leo Thomas Ramos, Angel D. Sappa

Main category: eess.IV

TL;DR: StrokeNeXt: A dual-branch ConvNeXt model for stroke classification in 2D CT images, achieving high accuracy for stroke detection and subtype classification between ischemic and hemorrhage cases.

DetailsMotivation: The paper addresses the need for accurate and efficient stroke classification in 2D CT images, which is crucial for timely diagnosis and treatment decisions in clinical settings.

Method: Uses a dual-branch design with two ConvNeXt encoders, feature fusion through a lightweight convolutional decoder with stacked 1D operations (bottleneck projection and transformation layers), and a compact classification head.

Result: Achieves accuracies and F1-scores up to 0.988, outperforms convolutional and Transformer-based baselines with statistically significant gains, shows robust behavior across diagnostic categories, reduced prediction error, low misclassification rates, low inference time, and fast convergence.

Conclusion: StrokeNeXt provides an effective solution for stroke classification in CT images with high accuracy, efficiency, and clinical applicability.

Abstract: We present StrokeNeXt, a model for stroke classification in 2D Computed Tomography (CT) images. StrokeNeXt employs a dual-branch design with two ConvNeXt encoders, whose features are fused through a lightweight convolutional decoder based on stacked 1D operations, including a bottleneck projection and transformation layers, and a compact classification head. The model is evaluated on a curated dataset of 6,774 CT images, addressing both stroke detection and subtype classification between ischemic and hemorrhage cases. StrokeNeXt consistently outperforms convolutional and Transformer-based baselines, reaching accuracies and F1-scores of up to 0.988. Paired statistical tests confirm that the performance gains are statistically significant, while class-wise sensitivity and specificity demonstrate robust behavior across diagnostic categories. Calibration analysis shows reduced prediction error compared to competing methods, and confusion matrix results indicate low misclassification rates. In addition, the model exhibits low inference time and fast convergence.
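The "bottleneck projection" in the fusion decoder can be pictured as projecting the concatenated branch features down to a narrow width and back up. Below is a plain-Python sketch with made-up dimensions and weights; the actual model uses learned 1D convolutional layers:

```python
# Hypothetical sketch of dual-branch fusion with a bottleneck
# projection; dimensions and weights are invented for illustration.

def matmul(vec, w):
    """Multiply a row vector by a weight matrix (list of rows)."""
    return [sum(v * w[i][j] for i, v in enumerate(vec))
            for j in range(len(w[0]))]

def bottleneck(fused, w_down, w_up):
    hidden = [max(0.0, h) for h in matmul(fused, w_down)]  # project + ReLU
    return matmul(hidden, w_up)                            # expand back

# Two encoder branches produce 4-dim features; fusion is concatenation.
branch_a = [1.0, 0.0, 0.5, 0.2]
branch_b = [0.3, 0.7, 0.0, 0.1]
fused = branch_a + branch_b                 # 8-dim fused feature
w_down = [[0.1] * 2 for _ in range(8)]      # 8 -> 2 bottleneck
w_up = [[0.5] * 8 for _ in range(2)]        # 2 -> 8 expansion
out = bottleneck(fused, w_down, w_up)
print(len(out))  # 8
```

The narrow middle layer forces the decoder to compress the joint information from both branches before the classification head sees it.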

[341] Benchmarking Self-Supervised Models for Cardiac Ultrasound View Classification

Youssef Megahed, Salma I. Megahed, Robin Ducharme, Inok Lee, Adrian D. C. Chan, Mark C. Walker, Steven Hawken

Main category: eess.IV

TL;DR: USF-MAE self-supervised learning framework outperforms MoCo v3 for cardiac ultrasound view classification on CACTUS dataset, achieving near-perfect metrics.

DetailsMotivation: Reliable interpretation of cardiac ultrasound images is essential for clinical diagnosis. Self-supervised learning can leverage large unlabelled medical datasets to learn meaningful representations, but different frameworks need evaluation for cardiac imaging tasks.

Method: Comparative evaluation of two self-supervised frameworks (USF-MAE and MoCo v3) on CACTUS dataset (37,736 cardiac ultrasound images) for automated view classification. Used 5-fold cross-validation with identical training protocols (learning rate 0.0001, weight decay 0.01). Performance measured via ROC-AUC, accuracy, F1-score, and recall.

Result: USF-MAE consistently outperformed MoCo v3 across all metrics: average testing AUC 99.99% vs 99.97%, mean accuracy 99.33% vs 98.99%. Improvements were statistically significant (p=0.0048). USF-MAE learns more discriminative features for cardiac view classification.

Conclusion: USF-MAE shows superior performance over MoCo v3 for cardiac ultrasound classification, demonstrating potential for improving automated medical image analysis through self-supervised learning.

Abstract: Reliable interpretation of cardiac ultrasound images is essential for accurate clinical diagnosis and assessment. Self-supervised learning has shown promise in medical imaging by leveraging large unlabelled datasets to learn meaningful representations. In this study, we evaluate and compare two self-supervised learning frameworks, USF-MAE, developed by our team, and MoCo v3, on the recently introduced CACTUS dataset (37,736 images) for automated simulated cardiac view (A4C, PL, PSAV, PSMV, Random, and SC) classification. Both models used 5-fold cross-validation, enabling robust assessment of generalization performance across multiple random splits. The CACTUS dataset provides expert-annotated cardiac ultrasound images with diverse views. We adopt an identical training protocol for both models to ensure a fair comparison. Both models are configured with a learning rate of 0.0001 and a weight decay of 0.01. For each fold, we record performance metrics including ROC-AUC, accuracy, F1-score, and recall. Our results indicate that USF-MAE consistently outperforms MoCo v3 across metrics. The average testing AUC for USF-MAE is 99.99% (+/-0.01% 95% CI), compared to 99.97% (+/-0.01%) for MoCo v3. USF-MAE achieves a mean testing accuracy of 99.33% (+/-0.18%), higher than the 98.99% (+/-0.28%) reported for MoCo v3. Similar trends are observed for the F1-score and recall, with improvements statistically significant across folds (paired t-test, p=0.0048 < 0.01). This proof-of-concept analysis suggests that USF-MAE learns more discriminative features for cardiac view classification than MoCo v3 when applied to this dataset. The enhanced performance across multiple metrics highlights the potential of USF-MAE for improving automated cardiac ultrasound classification.
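The significance claim rests on a paired t-test across the five folds. A sketch of that statistic with hypothetical per-fold accuracies (the actual fold values are not reported in the summary):

```python
# Paired t statistic over per-fold scores; the fold accuracies below
# are hypothetical, chosen only to mirror the reported means.
import math

def paired_t(a, b):
    """t statistic for paired samples a, b (same folds)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

usf_mae = [99.1, 99.4, 99.3, 99.5, 99.35]  # hypothetical fold accuracies
moco_v3 = [98.7, 99.1, 98.9, 99.2, 99.05]
t = paired_t(usf_mae, moco_v3)
print(round(t, 2))  # well above the 5%-level critical value for df=4
```

Pairing by fold removes the split-to-split variance, which is why even small but consistent gaps (like 99.33% vs 98.99%) can be significant.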

[342] Rate-Distortion Optimization for Ensembles of Non-Reference Metrics

Xin Xiong, Samuel Fernández-Menduiña, Eduardo Pavez, Antonio Ortega, Neil Birkbeck, Balu Adsumilli

Main category: eess.IV

TL;DR: A framework for optimizing video compression using ensembles of non-reference quality metrics with gradient stabilization to improve robustness across different quality estimators.

DetailsMotivation: Current video coding uses full-reference metrics like SSE, but non-reference metrics (NRMs) are better for user-generated content. However, linearizing single NRMs for rate-distortion optimization can yield limited gains or even degradations due to NRMs' non-linearity and locally unstable gradients.

Method: Extends linearized NRM (LNRM) framework to optimize ensembles of NRMs instead of single metrics. Introduces smoothing-based formulation to stabilize NRM gradients before linearization. Designed for hybrid codecs and overfitted codecs to avoid iterative evaluations and backpropagation of neural network-based NRMs.

Result: Validated on AVC and Cool-chic codecs using YouTube UGC dataset. Achieves consistent bitrate savings across multiple NRMs with no decoder complexity overhead. For Cool-chic, substantially reduces encoding runtime compared to direct NRM optimization.

Conclusion: Proposed ensemble optimization with gradient stabilization provides robust quality improvements across multiple NRMs while reducing encoder complexity, making it practical for real-world video coding applications.

Abstract: Non-reference metrics (NRMs) can assess the visual quality of images and videos without a reference, making them well-suited for the evaluation of user-generated content. Nonetheless, rate-distortion optimization (RDO) in video coding is still mainly driven by full-reference metrics, such as the sum of squared errors, which treat the input as an ideal target. A way to incorporate NRMs into RDO is through linearization (LNRM), where the gradient of the NRM with respect to the input guides bit allocation. While this strategy improves the quality predicted by some metrics, we show that it can yield limited gains or degradations when evaluated with other NRMs. We argue that NRMs are highly non-linear predictors with locally unstable gradients that can compromise the quality of the linearization; furthermore, optimizing a single metric may exploit model-specific biases that do not generalize across quality estimators. Motivated by this observation, we extend the LNRM framework to optimize ensembles of NRMs and, to further improve robustness, we introduce a smoothing-based formulation that stabilizes NRM gradients prior to linearization. Our framework is well-suited to hybrid codecs, and we advocate for its use with overfitted codecs, where it avoids iterative evaluations and backpropagation of neural network-based NRMs, reducing encoder complexity relative to direct NRM optimization. We validate the proposed approach on AVC and Cool-chic, using the YouTube UGC dataset. Experiments demonstrate consistent bitrate savings across multiple NRMs with no decoder complexity overhead and, for Cool-chic, a substantial reduction in encoding runtime compared to direct NRM optimization.
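The smoothing idea, averaging the metric's gradient over small input perturbations before linearizing, can be seen on a synthetic one-dimensional "metric" with a locally unstable gradient. The function and constants below are invented for illustration and are not a real NRM:

```python
# Synthetic demo of gradient smoothing: a smooth trend plus a
# high-frequency wiggle gives locally unstable pointwise gradients.
import math
import random

def noisy_metric(x):
    return -x * x + 0.05 * math.sin(200 * x)  # trend + wiggle

def grad(f, x, h=1e-4):
    """Central finite-difference gradient."""
    return (f(x + h) - f(x - h)) / (2 * h)

def smoothed_grad(f, x, sigma=0.05, n=1000, seed=0):
    """Average the gradient over Gaussian input perturbations."""
    rng = random.Random(seed)
    return sum(grad(f, x + rng.gauss(0, sigma)) for _ in range(n)) / n

x = 0.5  # the smooth component has gradient -2x = -1 here
raw = grad(noisy_metric, x)              # dominated by the wiggle term
smooth = smoothed_grad(noisy_metric, x)  # tracks the underlying trend
print(round(raw, 2), round(smooth, 2))
```

The raw gradient is swamped by the oscillatory term, while the smoothed estimate recovers the trend that is actually useful for bit allocation.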

[343] SNIC: Synthesized Noisy Images using Calibration

Nik Bhatt

Main category: eess.IV

TL;DR: Paper presents SNIC dataset with realistic synthesized noisy images using calibrated heteroscedastic noise models, achieving 54-64% PSNR improvement over manufacturer models.

DetailsMotivation: Advanced denoising algorithms need large, high-quality datasets, but physically-based statistical noise models lack proper calibration guidance and published datasets using them.

Method: Developed methods for building high-quality heteroscedastic noise models that produce realistic synthesized noisy images in RAW and TIFF formats, creating the SNIC dataset with over 6000 images from 30 scenes across four different sensors.

Result: Synthesized images achieve comparable LPIPS results to real noisy images; reduce PSNR gap versus real noise by 54-64% compared to manufacturer-provided DNG noise models.

Conclusion: The SNIC dataset provides high-quality synthesized noisy images with improved realism, addressing the need for better training data for denoising algorithms.

Abstract: Advanced denoising algorithms require large, high-quality datasets. Physically-based statistical noise models can create such datasets by realistically simulating noise in digital images. However, there is little information on the correct way to calibrate and tune these heteroscedastic models, and a lack of published datasets using them. In this paper, we explore the process of building high-quality heteroscedastic noise models. Our methods produce realistic synthesized noisy images in both RAW and TIFF formats. Our synthesized noisy images achieve comparable LPIPS results to real noisy images; when tested with a state-of-the-art denoising model, our images reduce the PSNR gap versus real noise by 54-64% compared to those synthesized using manufacturer-provided DNG noise models. Using our approach, we created the Synthesized Noisy Images using Calibration dataset (SNIC) containing over 6000 noisy images, comprising 30 scenes from four sensors, including two smartphone sensors, a point-and-shoot, and a DSLR. SNIC is the first synthesized noisy image dataset provided in both RAW and TIFF format.
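Heteroscedastic noise models of this kind typically take a Poisson-Gaussian form: per-pixel variance is affine in the clean signal, with shot noise scaling with intensity and read noise setting a constant floor. A sketch with made-up gains (real use requires per-sensor, per-ISO calibration, which is what the paper addresses):

```python
# Sketch of a heteroscedastic noise model: var(x) = a*x + b, with
# hypothetical gains a, b (not calibrated values from the paper).
import random

def synthesize_noisy(clean, a=0.01, b=0.0004, seed=0):
    """clean: linear-RAW intensities in [0, 1]; returns noisy copies."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, (a * x + b) ** 0.5) for x in clean]

clean = [0.05, 0.2, 0.5, 0.9]
noisy = synthesize_noisy(clean)
# Noise std grows with brightness: ~0.03 at x=0.05 up to ~0.097 at x=0.9.
for x in clean:
    print(round((0.01 * x + 0.0004) ** 0.5, 4))
```

Calibration amounts to fitting a and b from flat-field captures so that the synthesized noise matches the sensor's real statistics.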

[344] Scan-Adaptive Dynamic MRI Undersampling Using a Dictionary of Efficiently Learned Patterns

Siddhant Gautam, Angqi Li, Prachi P. Agarwal, Anil K. Attili, Jeffrey A. Fessler, Nicole Seiberlich, Saiprasad Ravishankar

Main category: eess.IV

TL;DR: Learning-based framework designs scan-adaptive Cartesian undersampling masks for dynamic cardiac MRI acceleration, improving reconstruction quality across multiple acceleration factors.

DetailsMotivation: Cardiac MRI suffers from long acquisition times causing patient discomfort and motion artifacts; need for efficient undersampling patterns that preserve diagnostic quality while accelerating scans.

Method: Develop learning-based framework to design scan/slice-adaptive Cartesian undersampling masks using fully sampled training data; at inference, nearest-neighbor search in low-frequency k-space selects optimized mask from learned pattern dictionary.

Result: Learned sampling improves reconstruction quality across multiple acceleration factors: 2-3 dB PSNR gains, reduced NMSE, improved SSIM, and higher radiologist ratings on public and in-house cardiac MRI datasets.

Conclusion: Scan-adaptive sampling framework enables faster, higher-quality dynamic cardiac MRI by adapting k-space sampling to individual scans, addressing acquisition time limitations.

Abstract: Cardiac MRI is limited by long acquisition times, which can lead to patient discomfort and motion artifacts. We aim to accelerate Cartesian dynamic cardiac MRI by learning efficient, scan-adaptive undersampling patterns that preserve diagnostic image quality. We develop a learning-based framework for designing scan- or slice-adaptive Cartesian undersampling masks tailored to dynamic cardiac MRI. Undersampling patterns are optimized using fully sampled training dynamic time-series data. At inference time, a nearest-neighbor search in low-frequency $k$-space selects an optimized mask from a dictionary of learned patterns. Our learned sampling approach improves reconstruction quality across multiple acceleration factors on public and in-house cardiac MRI datasets, including PSNR gains of 2-3 dB, reduced NMSE, improved SSIM, and higher radiologist ratings. The proposed scan-adaptive sampling framework enables faster and higher-quality dynamic cardiac MRI by adapting $k$-space sampling to individual scans.
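The inference-time step is simple: match the new scan's low-frequency k-space signature against a dictionary built during training and reuse the nearest entry's mask. A toy version with hypothetical signatures and mask labels:

```python
# Toy nearest-neighbor mask selection; signatures and mask names are
# invented (the paper uses low-frequency k-space features).

def select_mask(signature, dictionary):
    """dictionary: list of (signature, mask) pairs learned in training.
    Returns the mask of the nearest signature (squared L2 distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, mask = min(dictionary, key=lambda entry: dist2(entry[0], signature))
    return mask

dictionary = [
    ([1.0, 0.8, 0.2], "mask_A"),
    ([0.2, 0.9, 0.7], "mask_B"),
    ([0.5, 0.5, 0.5], "mask_C"),
]
print(select_mask([0.3, 0.85, 0.6], dictionary))  # mask_B
```

Because the low-frequency region is always acquired, this lookup adds no sampling overhead yet adapts the rest of the mask to the scan at hand.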

Last updated: 2026-03-06
Built with Hugo, using a modified Stack theme