Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 64]
- cs.CV [Total: 165]
- cs.AI [Total: 72]
- cs.SD [Total: 7]
- cs.LG [Total: 150]
- cs.MA [Total: 2]
- cs.MM [Total: 1]
- eess.AS [Total: 4]
- eess.IV [Total: 12]
cs.CL
[1] TabReX: Tabular Referenceless eXplainable Evaluation
Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
Main category: cs.CL
TL;DR: TabReX is a reference-less, property-driven framework that evaluates LLM-generated tables using graph-based reasoning and achieves human-aligned judgments with interpretable scores.
Details
Motivation: Existing metrics for evaluating LLM-generated tables either flatten tables into text (ignoring structure) or rely on fixed references that limit generalization, creating a need for better evaluation methods.
Method: TabReX converts source text and generated tables into canonical knowledge graphs, aligns them through LLM-guided matching, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity.
Result: TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis on the TabReX-Bench benchmark spanning six domains and twelve perturbation types.
Conclusion: TabReX establishes a new paradigm for trustworthy, explainable evaluation of structured generation systems through its reference-less, property-driven framework with graph-based reasoning.
Abstract: Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
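To make the triple-alignment idea above concrete, here is a toy sketch of reference-less table scoring; the flat (row, column, value) triple format and the exact-match F1 rule are illustrative assumptions, not the paper's LLM-guided matching or rubric-aware scoring.

```python
# Toy reference-less table scoring via triple alignment (illustrative only).
def table_to_triples(table):
    """Flatten a table (list of row dicts) into (row_id, column, value) triples."""
    triples = set()
    for i, row in enumerate(table):
        for col, val in row.items():
            triples.add((f"row{i}", col, str(val)))
    return triples

def fidelity(source_triples, table_triples):
    """F1-style fidelity over exactly matched triples: precision penalizes
    hallucinated cells, recall penalizes missing facts."""
    if not source_triples or not table_triples:
        return 0.0
    matched = source_triples & table_triples
    precision = len(matched) / len(table_triples)
    recall = len(matched) / len(source_triples)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical facts extracted from source text vs. a generated one-row table.
source = {("row0", "city", "Paris"), ("row0", "population", "2.1M")}
generated = [{"city": "Paris", "population": "2.1M"}]
print(fidelity(source, table_to_triples(generated)))  # 1.0 when every cell aligns
```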
[2] Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti, Andrew Piper, Maarten Sap
Main category: cs.CL
TL;DR: SocialStoryFrames is a computational formalism for modeling nuanced reader responses to stories, enabling large-scale analysis of narrative practices across online communities.
Details
Motivation: Current computational models of reader response are limited and cannot capture the rich interpretive, affective, and evaluative responses that readers have to stories, preventing nuanced analysis of storytelling at scale.
Method: Developed SocialStoryFrames formalism using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. Created two models: SSF-Generator and SSF-Classifier, validated through human surveys (382 participants) and expert annotations. Applied models to SSF-Corpus of 6,140 social media stories from diverse contexts.
Result: Validated models through human surveys and expert annotations. Applied to large dataset to characterize frequency and interdependence of storytelling intents, and compare narrative practices across communities. Enabled new research into storytelling in online communities.
Conclusion: SocialStoryFrames bridges fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, enabling scalable analysis of storytelling practices and reader interpretations across diverse online communities.
Abstract: Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.
[3] BRAID: Bounded Reasoning for Autonomous Inference and Decisions
Armağan Amcalar, Eyup Cinar
Main category: cs.CL
TL;DR: BRAID introduces bounded reasoning with structured prompts using Mermaid graphs, improving accuracy and cost efficiency for LLM agents across multiple benchmarks.
Details
Motivation: LLMs show nonlinear relationships between performance, cost, and token usage, creating a need for more efficient reasoning frameworks that don’t rely on unbounded natural-language token expansion.
Method: BRAID (Bounded Reasoning for Autonomous Inference and Decisions) uses Mermaid-based instruction graphs to enable structured reasoning, evaluated across multiple GPT model tiers on AdvancedIF, GSM-Hard, and SCALE MultiChallenge benchmarks.
Result: Structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems across all tested benchmarks.
Conclusion: BRAID establishes an effective and scalable technique for optimizing inference efficiency in autonomous agent systems, with all datasets and detailed results available for verification.
Abstract: Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Autonomous Inference and Decisions) across multiple GPT model tiers, evaluated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason structurally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.
[4] Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms
Kieran Henderson, Kian Omoomi, Vasudha Varadarajan, Allison Lahnala, Charles Welch
Main category: cs.CL
TL;DR: Analyzes which categories of personal information (demographics, attitudes, relationships, experiences) best predict annotator labels on social norms; demographics are most impactful, theory-based groupings beat automatic clustering, only a few related comments are needed, and more diverse self-disclosure samples perform best.
Details
Motivation: Previous work used personal information for modeling individual characteristics, but there has been limited exploration of what types of information are most informative for predicting annotator labels in subjective tasks.
Method: Categorize self-disclosure sentences and build annotator models for predicting judgments of social norms. Perform ablations and analyses to examine the impact of information types on predicting annotation patterns.
Result: Demographics are more impactful than attitudes, relationships, and experiences. Theory-based approaches worked better than automatic clusters. Only a small number of related comments needed. More diverse sample of annotator self-disclosures leads to best performance.
Conclusion: Understanding what types of personal information are most informative improves annotator modeling for subjective tasks, with demographics being particularly valuable and diversity enhancing predictive performance.
Abstract: Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.
[5] Are We on the Right Way to Assessing LLM-as-a-Judge?
Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
Main category: cs.CL
TL;DR: Sage is a new evaluation suite that assesses LLM-as-a-Judge quality without human annotation, using local self-consistency and global logical consistency metrics based on rational choice theory axioms.
Details
Motivation: Existing LLM-as-a-Judge benchmarks rely on human-annotated ground truth, which introduces human bias and scalability limitations. There’s a need for evaluation methods that don’t require human annotation to assess LLM judge reliability objectively.
Method: Sage introduces two novel metrics: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across full preference sets), based on rational choice theory axioms. The method uses a curated dataset of 650 questions combining structured benchmark problems with real-world user queries.
Result: Sage metrics show stability and high correlation with supervised benchmarks like LLMBar and RewardBench2. Current state-of-the-art LLMs (Gemini-2.5-Pro and GPT-5) fail to maintain consistent preferences in nearly a quarter of difficult cases. The study identifies the “situational preference” phenomenon and shows that finetuned LLM-as-a-Judge models, panel-based judging, and deep reasoning can improve consistency. Human judgments also show substantial inconsistency.
Conclusion: Sage provides a reliable, human-annotation-free evaluation suite for LLM judges. Current LLMs have significant reliability issues as judges, and human annotation may not be a reliable gold standard. The framework enables better assessment and improvement of LLM judging capabilities.
Abstract: LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage’s reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
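The two Sage lenses reduce to simple checks over a judge's pairwise preferences. The sketch below shows one way such checks could be computed; the preference encoding and the toy data are assumptions for illustration, not the paper's implementation.

```python
from itertools import permutations

def local_self_consistency(repeated_prefs):
    """Fraction of answer pairs where the judge picks the same winner on every repeat."""
    stable = sum(1 for votes in repeated_prefs.values() if len(set(votes)) == 1)
    return stable / len(repeated_prefs)

def global_logical_consistency(pref):
    """Fraction of ordered triples (a, b, c) satisfying transitivity:
    if a beats b and b beats c, then a must beat c."""
    items = sorted({x for pair in pref for x in pair})
    triples = list(permutations(items, 3))
    ok = sum(
        1 for a, b, c in triples
        if not (pref.get((a, b)) and pref.get((b, c))) or pref.get((a, c))
    )
    return ok / len(triples)

# Hypothetical judge outputs over three candidate answers A, B, C.
repeated = {("A", "B"): ["A", "A", "A"], ("B", "C"): ["B", "C", "B"]}   # second pair flips
pairwise = {("A", "B"): True, ("B", "C"): True, ("A", "C"): False}      # intransitive cycle
print(local_self_consistency(repeated), global_logical_consistency(pairwise))
```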
[6] Convolutional Lie Operator for Sentence Classification
Daniela N. Rim, Heeyoul Choi
Main category: cs.CL
TL;DR: Integrating Lie Convolutions into CNN sentence classifiers improves accuracy by capturing complex non-Euclidean symmetries in language.
Details
Motivation: Traditional CNNs capture local, position-invariant features in text but have limited capacity to model complex transformations within language. The authors aim to explore whether Lie group operations can capture more sophisticated, non-Euclidean symmetries in language.
Method: Proposed novel integration of Lie Convolutions into convolutional-based sentence classifiers, creating two models: SCLie and DPCLie. These models leverage Lie group operations to capture complex transformations in language.
Result: SCLie and DPCLie empirically outperform traditional convolutional-based sentence classifiers. Lie-based models relatively improve accuracy by capturing transformations not commonly associated with language.
Conclusion: The findings demonstrate the value of Lie-based approaches for language modeling and motivate further exploration of new paradigms beyond traditional Euclidean-based methods.
Abstract: Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformation within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.
[7] MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation
Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim
Main category: cs.CL
TL;DR: Medical report generation using semantic-driven reinforcement learning (SRL) with report-level clinical correctness rewards instead of token-level objectives, achieving state-of-the-art performance on medical imaging datasets.
Details
Motivation: Existing medical report generation methods focus on linguistic style imitation rather than clinical correctness, as they use token-level objectives that prioritize word-choice and sentence structure over medical accuracy.
Method: Proposes semantic-driven reinforcement learning (SRL) using Group Relative Policy Optimization (GRPO) with report-level reward: margin-based cosine similarity between key radiological findings from generated and reference reports. Includes lightweight reasoning format constraint for structured “thinking report” outputs.
Result: MRG-R1 achieves state-of-the-art performance with CE-F1 scores of 51.88 on IU X-Ray and 40.39 on MIMIC-CXR datasets, demonstrating that label-semantic reinforcement outperforms conventional token-level supervision.
Conclusion: Optimizing clinically grounded, report-level rewards rather than token overlap meaningfully improves clinical correctness in medical report generation, establishing semantic reinforcement as a promising approach for supervising medical correctness in large vision-language models.
Abstract: Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, adopted on a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured “thinking report” outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap meaningfully improves clinical correctness. This work is a preliminary step toward exploring semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
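As a rough illustration of a margin-based cosine-similarity reward over extracted findings, consider the sketch below; the placeholder `embed_findings` encoder, the keyword list, and the margin value are assumptions for illustration, not the paper's MCCS implementation.

```python
import numpy as np

def embed_findings(report_text):
    """Hypothetical encoder: maps a report to a binary vector of key findings.
    A real system would use a clinical labeler or text encoder instead."""
    findings = ["effusion", "cardiomegaly", "pneumothorax"]
    return np.array([1.0 if f in report_text.lower() else 0.0 for f in findings])

def mccs_reward(generated, reference, margin=0.2):
    """Margin-shifted cosine similarity between finding vectors, so only
    clinically close reports receive positive reward."""
    g, r = embed_findings(generated), embed_findings(reference)
    denom = np.linalg.norm(g) * np.linalg.norm(r)
    cos = float(g @ r / denom) if denom else 0.0
    return max(0.0, cos - margin)

print(mccs_reward("Mild cardiomegaly, small pleural effusion.",
                  "Cardiomegaly with pleural effusion."))  # high agreement -> high reward
```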
[8] Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning
Yash Bhaskar, Sankalp Bahad, Parameswari Krishnamurthy
Main category: cs.CL
TL;DR: A system for detecting hate speech driven by fake narratives (Faux-Hate) in Hindi-English code-mixed social media text, achieving competitive results using NLP techniques and domain-specific pretraining.
Details
Motivation: Social media platforms enable rapid spread of harmful content like hate speech and fake narratives. The Faux-Hate shared task specifically addresses hate speech generated from fake narratives in code-mixed Hindi-English text, requiring effective detection systems.
Method: Combines advanced natural language processing techniques with domain-specific pretraining. Uses multi-task learning to address both binary Faux-Hate detection (fake and hate speech classification) and target/severity prediction.
Result: The system achieved competitive results in the shared task, demonstrating efficacy of the multi-task learning approach for this complex problem.
Conclusion: The proposed approach effectively addresses the challenge of detecting hate speech driven by fake narratives in code-mixed social media text, showing promise for combating harmful content spread on social platforms.
Abstract: Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives \cite{davidson2017automated, shu2017fake}. The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.
[9] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
Main category: cs.CL
TL;DR: SpeechLLMs don’t yet beat cascaded systems for speech translation; cascades remain most reliable overall despite SpeechLLMs matching them in some specific settings.
Details
Motivation: To determine whether integrating speech as a native modality in LLMs (SpeechLLMs) actually improves speech-to-text translation quality compared to established cascaded architectures that combine speech foundation models with multilingual LLMs.
Method: Created “Hearing to Translate” test suite benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems using leading speech foundation models with multilingual LLMs. Evaluation spanned 16 benchmarks, 13 language pairs, and 9 challenging conditions including disfluent, noisy, and long-form speech.
Result: Cascaded systems remain the most reliable overall. Current SpeechLLMs only match cascades in selected settings. Speech foundation models lag behind both approaches, showing that integrating an LLM (either within the model or in a pipeline) is essential for high-quality speech translation.
Conclusion: Despite the promise of SpeechLLMs as integrated speech-text models, cascaded architectures combining specialized speech models with multilingual LLMs currently provide more reliable speech translation performance across diverse conditions.
Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
[10] A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang, Ziqiang Han
Main category: cs.CL
TL;DR: Domain-adapted extraction pipeline using Qwen2.5-7B with LoRA fine-tuning achieves high accuracy (98.36%+) for structured information extraction from noisy police incident announcements on Chinese social media.
Details
Motivation: Structured information extraction from police incident announcements is crucial for timely data processing but challenging due to variability and informality of social media text sources like Weibo posts.
Method: Developed a domain-adapted extraction pipeline using targeted prompt engineering with parameter-efficient fine-tuning of Qwen2.5-7B model via Low-Rank Adaptation (LoRA). Trained on manually annotated dataset of 4,933 instances from 27,822 police briefing posts on Chinese Weibo (2019-2020) to extract 15 key fields.
Result: LoRA-based fine-tuning significantly outperformed both base and instruction-tuned models, achieving >98.36% accuracy for mortality detection, 95.31% Exact Match Rate for fatality counts, and 95.54% for province-level location extraction.
Conclusion: The pipeline provides a validated, efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured social media text into reliable structured data for social science research.
Abstract: Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
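For readers unfamiliar with the setup, a minimal LoRA fine-tuning configuration of this kind could look like the sketch below, written against the Hugging Face `peft` API; the rank, target modules, and checkpoint name are illustrative choices rather than the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to a causal LM so only low-rank update matrices are trained.
model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (illustrative)
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapters on attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Training would then proceed with a standard causal-LM objective on prompts that
# pair a noisy Weibo post with the 15 target fields serialized as structured text.
```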
[11] Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation
Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam
Main category: cs.CL
TL;DR: Researchers propose a two-part system for reliable healthcare LLM outputs: a fact-checking module that validates LLM-generated facts against EHRs, and a domain-specific summarization model fine-tuned on MIMIC-III data to reduce hallucinations.
Details
Motivation: LLM outputs in healthcare are often unreliable due to hallucinations, which poses serious risks for decision-making and patient safety. There’s a critical need for ensuring accuracy and reliability in medical applications of LLMs.
Method: 1) Developed an independent fact-checking module using numerical tests and logical checks via discrete logic in NLP to validate facts against EHRs. 2) Fine-tuned a domain-specific summarization model using LoRA on the MIMIC-III dataset to minimize hallucinations. The LLM was trained on the full MIMIC-III dataset.
Result: Fact-checking module achieved precision: 0.8904, recall: 0.8234, F1-score: 0.8556 on 3,786 propositions from 104 summaries. LLM summary model achieved ROUGE-1: 0.5797 and BERTScore: 0.9120 for summary quality.
Conclusion: The proposed system effectively addresses LLM hallucination issues in healthcare by combining a robust fact-checking module with a specialized summarization model, achieving strong performance metrics for both fact validation and summary quality.
Abstract: In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM model on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
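A toy illustration of granular, LLM-free fact checks (numerical tests plus a simple logical check against structured EHR fields) is sketched below; the proposition format and field names are hypothetical, not taken from the paper.

```python
import re

# Hypothetical structured EHR fields for one patient.
ehr = {"age": 67, "hemoglobin": 9.2, "on_ventilator": True}

def check_numeric(proposition):
    """Numerical test for propositions of the form '<field> is <value>'."""
    m = re.match(r"(\w+) is ([\d.]+)", proposition)
    return bool(m) and abs(ehr.get(m.group(1), float("nan")) - float(m.group(2))) < 1e-6

def check_logical(proposition):
    """Discrete-logic check: 'not <flag>' must match an absent or False flag."""
    if proposition.startswith("not "):
        return not ehr.get(proposition[4:], False)
    return bool(ehr.get(proposition, False))

print(check_numeric("hemoglobin is 9.2"), check_logical("not on_ventilator"))
# -> True False: the second proposition contradicts the record
```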
[12] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Maureen de Seyssel, Eeshan Gunesh Dhekane
Main category: cs.CL
TL;DR: Proposes a unified taxonomy for evaluating speech foundation models across different tasks and capabilities, identifying gaps in current evaluation methods.
Details
Motivation: Speech foundation models have achieved remarkable capabilities but their evaluation remains disjointed across tasks and model types, with different models excelling at different aspects and requiring different evaluation protocols.
Method: Develops a three-axis taxonomy: (1) evaluation aspect being measured, (2) model capabilities required, and (3) task/protocol requirements. Classifies existing evaluations and benchmarks along these axes across representation learning, speech generation, and interactive dialogue.
Result: Provides a principled framework for aligning models with suitable evaluation methods, reveals systematic gaps in current evaluations (limited coverage of prosody, interaction, reasoning), and identifies priorities for future benchmark design.
Conclusion: Offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models, addressing the fundamental question of which evaluation is appropriate for which model.
Abstract: Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
[13] LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
Yi Zhao, Siqi Wang, Jing Li
Main category: cs.CL
TL;DR: LaF-GRPO uses LLM-simulated VI user feedback to train VLMs for generating precise navigation instructions, with new NIG4VI dataset showing significant improvements over baselines.
Details
Motivation: Navigation instruction generation for visually impaired individuals is critical but underexplored, with challenges in generating precise, in-situ instructions and scarcity of dedicated benchmarks.
Method: Proposes LaF-GRPO (LLM-as-Follower GRPO) where an LLM simulates VI user responses to provide feedback rewards for post-training VLMs, reducing real-world data needs. Introduces NIG4VI dataset with 27k samples for training/evaluation.
Result: Experiments show LaF-GRPO boosts BLEU by 14% and achieves METEOR 0.542 vs GPT-4o’s 0.323. Qualitative analysis confirms more intuitive and safer instructions.
Conclusion: LaF-GRPO effectively enhances navigation instruction accuracy and usability for VI users while reducing data collection costs, with NIG4VI dataset facilitating future research.
Abstract: Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing costly real-world data collection needs. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed and open-ended in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU 14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.
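The LLM-as-Follower reward can be pictured as querying a role-playing model and converting its feedback into a scalar, as in the sketch below; the prompt wording, feedback format, and reward shaping are illustrative assumptions, not the paper's reward design.

```python
def follower_llm(prompt):
    """Placeholder for a call to an LLM that role-plays a visually impaired
    pedestrian following the instruction in the described scene."""
    return "REACHED_GOAL: yes\nSAFETY_ISSUES: 0"

def laf_reward(scene_description, instruction):
    """Convert simulated-follower feedback into a scalar reward (shaping assumed)."""
    prompt = (
        "You are a visually impaired pedestrian.\n"
        f"Scene: {scene_description}\n"
        f"Instruction: {instruction}\n"
        "Simulate following it. Report REACHED_GOAL (yes/no) and SAFETY_ISSUES (count)."
    )
    feedback = follower_llm(prompt)
    reached = "REACHED_GOAL: yes" in feedback
    issues = int(feedback.split("SAFETY_ISSUES:")[-1].strip() or 0)
    return (1.0 if reached else 0.0) - 0.2 * issues

# Rewards for a group of sampled instructions would then be normalized within the
# group, as in GRPO, to produce advantages for the policy update.
print(laf_reward("Crosswalk 3 m ahead, curb on the right.",
                 "Walk forward three meters, then stop at the curb."))
```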
[14] An Information-Theoretic Framework for Robust Large Language Model Editing
Qizhou Chen, Chengyu Wang, Taolin Zhang, Xiaofeng He
Main category: cs.CL
TL;DR: IBKE: A novel LLM editing framework using information bottleneck theory to compress essential knowledge for generalizable corrections while minimizing disruption to unrelated model behaviors.
Details
Motivation: LLMs contain errors and outdated information that undermine their accuracy and safe deployment. Current editing techniques struggle to generalize corrections beyond narrow domains and cause unintended consequences, limiting practical impact.
Method: Information Bottleneck Knowledge Editor (IBKE) uses information bottleneck theory to compress and isolate essential information for knowledge correction. Leverages compact latent representations to guide gradient-based updates for robust model editing.
Result: Validated across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy with improved generality and specificity of edits.
Conclusion: Establishes a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.
Abstract: Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE’s effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.
[15] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
Main category: cs.CL
TL;DR: Spoken DialogSum is the first dataset aligning raw conversational audio with factual/emotion-rich summaries and paralinguistic labels, enabling emotion-aware spoken dialogue summarization research.
Details
Motivation: Research on emotion-aware or spoken dialogue summarization is limited by the lack of data linking speech, summaries, and paralinguistic cues (emotion, age, gender, pitch, speaking rate).
Method: Two-stage approach: 1) LLM rewrites DialogSum scripts with Switchboard-style fillers/back-channels and tags utterances with emotion, pitch, speaking rate; 2) Expressive TTS synthesizes speech from tagged scripts aligned with paralinguistic labels.
Result: Created Spoken DialogSum with 13,460 emotion-diverse dialogues, each with factual and emotion-focused summaries. Audio-LLM baseline shows 28% relative improvement in emotional-summary ROUGE-L over cascaded ASR-LLM system.
Conclusion: The dataset enables emotion-aware spoken dialogue summarization research, and end-to-end speech modeling (Audio-LLM) significantly outperforms cascaded approaches, confirming the value of direct speech processing.
Abstract: Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. We release an online demo at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/, with plans to release the full dataset in the near future. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
[16] LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Tianqi Hou, Junchi Yan, Zhijie Deng
Main category: cs.CL
TL;DR: LoPA is a training-free algorithm that improves diffusion LLM inference speed by optimizing token filling order selection, achieving up to 10.1 tokens per forward pass while maintaining performance.
Details
Motivation: Current diffusion LLMs have limited parallelism (1-3 tokens per forward pass) due to inefficient token filling order strategies, creating a bottleneck for high-speed inference.
Method: LoPA explores multiple candidate token filling orders in parallel branches, selects the one with highest potential for future parallelism based on branch confidence, and uses specialized multi-device inference with Branch Parallelism.
Result: LoPA increases D2F-Dream’s tokens per forward pass to 10.1 on GSM8K while maintaining superior performance, achieving 1073.9 tokens per second throughput under multi-GPU deployment.
Conclusion: LoPA significantly accelerates diffusion LLM inference through optimized token filling order selection and parallel processing, enabling unprecedented parallelism without training.
Abstract: Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1–3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.
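The branch-selection step in LoPA can be illustrated with a toy sketch: several candidate token filling orders are scored in parallel and the most confident branch is kept. The confidence model below is random and purely illustrative; a real system reads confidences from the diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch_confidence(order):
    """Stand-in for the per-token confidences a dLLM would assign when filling
    masked positions in this order."""
    return rng.uniform(0.5, 1.0, size=len(order))

def select_tfo(candidate_orders):
    """Keep the token filling order whose lookahead branch is most confident."""
    scores = {tuple(o): float(branch_confidence(o).mean()) for o in candidate_orders}
    return max(scores, key=scores.get), scores

orders = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 2, 1, 3]]  # hypothetical fill orders
best, scores = select_tfo(orders)
print(best, {k: round(v, 3) for k, v in scores.items()})
```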
[17] Sigma-Moe-Tiny Technical Report
Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu, Yuting Jiang, Ruizhe Wang, Tianyu Chen, Zhongxin Guo, Yifan Xiong, Rui Gao, Lei Qu, Jinsong Su, Peng Cheng, Yeyun Gong
Main category: cs.CL
TL;DR: Sigma-MoE-Tiny achieves extreme sparsity (20B total params, 0.5B activated) with fine-grained expert segmentation (96 experts/layer, 1 expert/token), using progressive sparsification for load balancing.
Details
Motivation: To push the boundaries of sparsity in MoE models for foundation models, achieving higher efficiency while maintaining performance, and addressing load balancing challenges in extremely sparse settings.
Method: Fine-grained expert segmentation (96 experts per layer, 1 expert per token activation), progressive sparsification schedule for load balancing, pre-training on diverse high-quality corpus, and post-training for capability unlocking.
Result: Sigma-MoE-Tiny achieves top-tier performance among comparable/larger models despite activating only 0.5B parameters, with stable training (no irrecoverable loss spikes) and highest sparsity among open-source models.
Conclusion: Extreme sparsity in MoE models is achievable with proper load balancing techniques, offering insights for advancing sparsity in future MoE architectures while maintaining performance and training stability.
Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny Code: https://github.com/microsoft/ltp-megatron-lm
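One simple way to realize a progressive sparsification schedule is to anneal the number of experts routed per token over training, as sketched below; the linear schedule and step counts are assumptions for illustration, not the paper's recipe.

```python
def active_experts(step, total_steps, start_k=4, final_k=1):
    """Linearly anneal the number of experts routed per token from start_k to final_k."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(final_k, round(start_k - frac * (start_k - final_k)))

# Example: over 10k steps the router moves from 4 active experts down to 1.
print([active_experts(s, 10_000) for s in (0, 2_500, 5_000, 10_000)])  # [4, 3, 2, 1]
```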
[18] Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
Yehor Tereshchenko, Mika Hämäläinen, Svitlana Myroniuk
Main category: cs.CL
TL;DR: This study compares reasoning vs non-reasoning GPT models for translating between Finnish and four low-resource Uralic languages, finding that reasoning models have refusal rates 16 percentage points lower than non-reasoning models.
Details
Motivation: LLM evaluation for translation has focused on high-resource languages, leaving a gap in understanding performance on low-resource and endangered languages, particularly Uralic languages.
Method: Comprehensive comparison of OpenAI’s GPT models using parallel corpus of literary texts, analyzing refusal rates across reasoning and non-reasoning architectures for Finnish to four Uralic languages translation.
Result: Significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates for translation attempts.
Conclusion: The study provides valuable insights for Uralic language researchers and contributes to understanding reasoning model capabilities for endangered language preservation.
Abstract: The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI’s GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.
[19] Hacking Neural Evaluation Metrics with Single Hub Text
Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai
Main category: cs.CL
TL;DR: Researchers propose a method to find “hub texts” - single adversarial texts that consistently get high scores from neural evaluation metrics regardless of test cases, revealing vulnerabilities in widely-used metrics like COMET.
Details
Motivation: To raise concerns about the reliability and safety of neural text evaluation metrics like COMET, which are widely used but operate as black-box systems with no guarantee of reliable evaluation results.
Method: Propose a method for finding a single adversarial text in discrete space that is consistently evaluated as high-quality regardless of test cases, to identify vulnerabilities in evaluation metrics.
Result: The hub text found achieved 79.1 COMET% for En-Ja and 67.8 COMET% for En-De translation tasks, outperforming translations from M2M100. The hub text also generalized across multiple language pairs (Ja-En and De-En).
Conclusion: The method successfully identifies vulnerabilities in neural evaluation metrics, demonstrating that single adversarial texts can consistently achieve high scores regardless of actual translation quality, raising concerns about metric reliability.
Abstract: Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En–Ja) and English-to-German (En–De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja–En and De–En.
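A hub-text attack of this kind can be pictured as a greedy discrete search that maximizes a metric's average score over many test cases, as in the sketch below; `metric_score` is a stand-in for a neural metric such as COMET, and the vocabulary and edit operations are illustrative assumptions.

```python
import random

VOCAB = ["the", "translation", "is", "good", "accurate", "report", "today"]

def metric_score(source, hypothesis):
    """Stand-in for a neural metric; a real attack would score with the fixed
    metric under test (e.g., COMET) instead of this toy function."""
    return -abs(len(hypothesis.split()) - len(source.split())) + random.random()

def mean_score(hypothesis, sources):
    return sum(metric_score(s, hypothesis) for s in sources) / len(sources)

def greedy_hub_search(sources, length=5, iters=50, seed=0):
    """Greedy token-substitution search for a single text with a high average score."""
    random.seed(seed)
    hub = [random.choice(VOCAB) for _ in range(length)]
    best = mean_score(" ".join(hub), sources)
    for _ in range(iters):
        i, w = random.randrange(length), random.choice(VOCAB)   # propose one token edit
        cand = hub[:i] + [w] + hub[i + 1:]
        score = mean_score(" ".join(cand), sources)
        if score > best:                                        # keep score-raising edits
            hub, best = cand, score
    return " ".join(hub), best

print(greedy_hub_search(["das ist ein Test", "guten Morgen zusammen"]))
```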
[20] Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains
Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Dhruv Kumar, Prateek Narang
Main category: cs.CL
TL;DR: Privacy-preserving LoRA adaptation framework enables continual learning on edge devices for clinical ASR, improving WER by 17.1% while reducing catastrophic forgetting by 47%.
Details
Motivation: Clinical ASR can streamline documentation in resource-constrained healthcare, but faces barriers: data privacy constraints, limited computational resources, and acoustic domain shifts that degrade existing models to 40.94% WER in real-world settings.
Method: Proposed efficient, privacy-preserving adaptation framework using Low-Rank Adaptation (LoRA) for continual learning on edge devices, with multi-domain experience replay to mitigate catastrophic forgetting.
Result: 17.1% relative WER improvement on target domain, 47% reduction in catastrophic forgetting compared to naive adaptation, demonstrating viable deployment pathway for clinical ASR.
Conclusion: The framework enables reliable, self-improving ASR systems that can operate within real-world constraints of privacy, computation, and domain shifts, bringing clinical ASR closer to practical deployment.
Abstract: Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
[21] Plain language adaptations of biomedical text using LLMs: Comparison of evaluation metrics
Primoz Kocbek, Leon Kopitar, Gregor Stiglic
Main category: cs.CL
TL;DR: LLMs for biomedical text simplification to improve health literacy, with GPT-4o mini performing best and fine-tuning underperforming.
Details
Motivation: To enhance health literacy by simplifying complex biomedical texts using Large Language Models, making medical information more accessible to the general public.
Method: Used public dataset of plain language biomedical abstracts; developed three approaches: baseline prompt template, two AI agent approach, and fine-tuning approach; evaluated using GPT-4o and GPT-4o mini models with quantitative metrics (Flesch-Kincaid, SMOG, SARI, BERTScore, G-Eval) and qualitative 5-point Likert scales (simplicity, accuracy, completeness, brevity).
Result: GPT-4o mini showed superior performance while fine-tuning approaches underperformed; G-Eval (LLM-based metric) correlated well with qualitative human evaluations.
Conclusion: LLMs, particularly GPT-4o mini, are effective for biomedical text simplification, and LLM-based evaluation metrics like G-Eval show promise for automated assessment of simplification quality.
Abstract: This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval, as well as with a qualitative metric, namely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of the fine-tuning (FT) approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.
[22] UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification
Primoz Kocbek, Gregor Stiglic
Main category: cs.CL
TL;DR: Submission to CLEF 2025 SimpleText Track Task 1 using GPT-4.1 models for scientific text simplification at sentence and document levels, comparing no-context prompt engineering vs fine-tuning approaches.
Details
Motivation: To develop effective methods for simplifying scientific texts at both sentence and document levels for the CLEF 2025 SimpleText track competition, addressing the challenge of making complex scientific content more accessible.
Method: Used OpenAI’s GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano models with two approaches: 1) no-context method relying on prompt engineering, and 2) fine-tuned models. Evaluated performance at both sentence-level and document-level simplification.
Result: GPT-4.1-mini with no-context approach showed robust performance at both simplification levels. Fine-tuned models had mixed results, with GPT-4.1-nano-FT performing well at document-level simplification in one case, highlighting the complexity of multi-granularity text simplification.
Conclusion: Prompt engineering with GPT-4.1-mini provides reliable scientific text simplification, while fine-tuning presents challenges with varying performance across different granularities, suggesting context-free approaches may be more effective for this task.
Abstract: This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentence- and document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.
[23] Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus
Main category: cs.CL
TL;DR: Refusal Steering is an inference-time method that uses LLM-as-a-judge to assign refusal confidence scores and ridge-regularized steering vectors to control LLM refusal behavior on politically sensitive topics without retraining.
Details
Motivation: Current methods for controlling LLM refusal behavior on politically sensitive topics rely on fragile pattern-based refusal detection and often require retraining. There's a need for fine-grained control over refusal behavior at inference time while maintaining safety alignment for harmful content.Method: Replace pattern-based refusal detection with LLM-as-a-judge that assigns refusal confidence scores. Use ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction. Apply activation steering at inference time without model retraining.
Result: On Qwen3-Next-80B-A3B-Thinking, the method successfully removes refusal behavior on politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired.
Conclusion: Activation steering can effectively remove political refusal behavior while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time. Refusal signals concentrate in deeper transformer layers and are distributed across many dimensions.
Abstract: We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal–compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analyze the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
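A rough sketch of the two ingredients described above: a ridge-regularized steering vector fit to judge-assigned refusal confidence scores, and a forward hook that shifts activations away from that direction at inference. The layer choice, scaling coefficient, and hook mechanics are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import torch

def ridge_steering_vector(acts: np.ndarray, refusal_scores: np.ndarray, lam: float = 10.0) -> np.ndarray:
    """Ridge regression of judge-assigned refusal confidence onto hidden states.

    acts: (n_examples, hidden_dim) activations at a chosen layer.
    refusal_scores: (n_examples,) scores in [0, 1] from an LLM-as-a-judge.
    Returns a unit vector along the refusal-compliance direction.
    """
    X = acts - acts.mean(axis=0)
    y = refusal_scores - refusal_scores.mean()
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return w / np.linalg.norm(w)

def make_steering_hook(direction: np.ndarray, alpha: float = 8.0):
    """Forward hook that shifts the residual stream away from the refusal direction."""
    d = torch.tensor(direction, dtype=torch.float32)
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * d.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage on a decoder layer of an already-loaded model:
# model.model.layers[40].register_forward_hook(make_steering_hook(vec))
```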
[24] JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
Main category: cs.CL
TL;DR: JustRL: A minimal single-stage RL approach with fixed hyperparameters achieves SOTA performance on 1.5B reasoning models using 2× less compute than complex methods, suggesting field complexity may be unnecessary.
Details
Motivation: Recent RL for LLMs has become increasingly complex with multi-stage pipelines, dynamic hyperparameters, and curriculum learning. The paper questions whether this complexity is necessary and aims to demonstrate that simpler approaches can achieve comparable or better results.Method: JustRL uses single-stage training with fixed hyperparameters that transfer across different 1.5B reasoning models without tuning. The approach avoids “standard tricks” like explicit length penalties and robust verifiers, which the authors found can degrade performance by collapsing exploration.
Result: Achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. Training shows smooth, monotonic improvement over 4,000+ steps without collapses or plateaus.
Conclusion: The field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. The authors release models and code to establish a simple, validated baseline for the community, challenging the necessity of complex RL training pipelines.
Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
[25] GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation
William English, Chase Walker, Dominic Simon, Rickard Ewetz
Main category: cs.CL
TL;DR: GinSign is a framework that improves NL-to-temporal logic translation by learning to ground NL spans to system signatures, achieving 95.5% logical equivalence and a 1.4× improvement over SOTA.
Details
Motivation: Existing NL-to-TL translation frameworks either assume accurate atom grounding or suffer from low grounded translation accuracy, limiting their practical utility for building trustworthy autonomous systems.Method: Proposes GinSign framework with a grounding model that hierarchically maps NL spans to system signatures: first predicting predicate labels, then selecting appropriately typed constant arguments. Converts free-form generation into structured classification using smaller masked language models instead of expensive LLMs.
Result: Achieves 95.5% grounded logical-equivalence scores, a 1.4× improvement over state-of-the-art. Frameworks without proper grounding produce syntactically correct but semantically nonequivalent LTL, while GinSign supports downstream model checking.
Conclusion: GinSign effectively addresses the grounding problem in NL-to-TL translation through hierarchical structured classification, enabling practical use for system specification and verification without manual formal specification crafting.
Abstract: Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications, an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically: first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of 95.5%, a 1.4× improvement over SOTA.
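A toy sketch of the hierarchical grounding step: pick a predicate from the system signature first, then fill each typed argument slot with a constant. The signature, the keyword-overlap scorers, and the dataclass are made-up stand-ins for the paper's masked-language-model classifiers.

```python
from dataclasses import dataclass

# Hypothetical system signature: predicates with typed argument slots, and typed constants.
SIGNATURE = {
    "predicates": {"at": ["robot", "location"], "holding": ["robot", "object"]},
    "constants": {"robot": ["r1"], "location": ["kitchen", "lab"], "object": ["cup"]},
}

@dataclass
class GroundedAtom:
    predicate: str
    args: tuple

def ground_span(nl_span: str, score_predicate, score_constant) -> GroundedAtom:
    """Hierarchical grounding: choose a predicate first, then a typed constant per slot.

    score_predicate(span, name) and score_constant(span, slot_type, value) stand in
    for the trained classifiers; any scoring function works in this sketch.
    """
    predicate = max(SIGNATURE["predicates"], key=lambda p: score_predicate(nl_span, p))
    args = tuple(
        max(SIGNATURE["constants"][slot], key=lambda c: score_constant(nl_span, slot, c))
        for slot in SIGNATURE["predicates"][predicate]
    )
    return GroundedAtom(predicate, args)

# Toy keyword-overlap scorers, just to make the sketch runnable end to end.
score_p = lambda span, p: span.lower().count({"at": "go", "holding": "pick"}[p])
score_c = lambda span, slot, c: span.lower().count(c)
print(ground_span("go to the kitchen", score_p, score_c))
```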
[26] From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs
Shubham Mishra, Samyek Jain, Gorang Mehrishi, Shiv Tiwari, Harsh Sharma, Pratik Narang, Dhruv Kumar
Main category: cs.CL
TL;DR: A reasoning-trace-augmented RAG framework with structured reasoning across three stages (document adjudication, conflict analysis, grounded synthesis) and a Conflict-Aware Trust-Score (CATS) evaluation pipeline that significantly improves answer correctness and behavioral adherence in LLMs.
Details
Motivation: Current RAG systems fail when retrieved sources contain conflicting, outdated, or subjective information, and prior work lacks unified reasoning supervision to address these issues comprehensively.Method: Proposes a reasoning-trace-augmented RAG framework with three structured reasoning stages: document-level adjudication, conflict analysis, and grounded synthesis. Introduces a Conflict-Aware Trust-Score (CATS) pipeline for evaluation using LLM-as-a-Judge to assess groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment.
Result: Experimental results show substantial improvements over baselines, particularly with Qwen model: Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722. Created a 539-query reasoning dataset and evaluation pipeline.
Conclusion: The proposed framework establishes a foundation for conflict-aware, interpretable RAG systems that can handle conflicting, outdated, or subjective information through structured reasoning and produces citation-linked answers or justified refusals.
Abstract: Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
[27] Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology
Primož Kocbek, Azra Frkatović-Hodžić, Dora Lalić, Vivian Hui, Gordan Lauc, Gregor Štiglic
Main category: cs.CL
TL;DR: The paper studies when to convert figures/tables to text vs. use OCR-free visual retrieval for biomedical QA, finding that text conversion works better for mid-size models while visual retrieval becomes competitive with frontier models.
Details
Motivation: There's uncertainty about when to convert figures/tables to text versus using OCR-free visual retrieval in multi-modal RAG for biomedical QA, especially in visually dense domains like glycobiology.Method: Built a benchmark of 120 MCQs from 25 papers stratified by retrieval difficulty, implemented four RAG augmentations (None, Text, Multi-modal conversion, and OCR-free visual retrieval using ColPali), and evaluated with various models including Gemma-3-27B-IT and GPT-4o/5 families.
Result: With mid-size models (Gemma-3-27B-IT), text conversion outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 accuracy). With frontier models (GPT-4o), all methods were closer (0.808 Multi-modal, 0.782 Text, 0.745 ColPali). GPT-5 family improved results further to ~0.828.
Conclusion: Pipeline choice depends on model capacity: text conversion lowers reader burden and works better for mid-size models, while OCR-free visual retrieval becomes competitive with frontier models. Among retrievers, ColFlor offers parity with heavier options at smaller footprint.
Abstract: Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
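Since the abstract reports accuracy with Agresti-Coull 95% confidence intervals, the small sketch below computes such an interval directly from its definition; the success count is an illustrative example (roughly the 0.808 accuracy mentioned above), not a number taken from the paper.

```python
import math

def agresti_coull_interval(successes: int, trials: int, z: float = 1.96):
    """Agresti-Coull interval: add z^2/2 pseudo-successes and z^2/2 pseudo-failures,
    then use the normal approximation around the adjusted proportion."""
    n_tilde = trials + z ** 2
    p_tilde = (successes + z ** 2 / 2) / n_tilde
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return max(0.0, p_tilde - half_width), min(1.0, p_tilde + half_width)

# Illustrative numbers only: e.g. 97 correct answers out of 120 MCQs (~0.808 accuracy).
print(agresti_coull_interval(97, 120))
```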
[28] Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs
William English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Main category: cs.CL
TL;DR: GraFT framework improves NL to temporal logic translation by restricting output tokens at each step, boosting accuracy by 5.49% end-to-end and 14.06% out-of-domain.
Details
Motivation: Current NL-to-temporal-logic translation methods struggle with accurate atomic proposition lifting, handling co-references, and learning from limited data, requiring a more efficient approach.Method: Grammar Forced Translation (GraFT) reduces task complexity by restricting valid output tokens from full vocabulary to only a handful at each step, exploiting unique properties of lifting and translation problems.
Result: GraFT improves end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average across CW, GLTL, and Navi benchmarks compared to state-of-the-art approaches.
Conclusion: GraFT provides an effective framework for NL-to-temporal-logic translation with theoretical justification for solution space reduction leading to more efficient learning and better performance.
Abstract: Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
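A minimal sketch of the core idea of restricting valid output tokens at each decoding step: a grammar supplies the allowed token ids and every other logit is masked out before the argmax. The callables standing in for the language model and the grammar are assumptions, not the paper's interface.

```python
import torch

def constrain_logits(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    """Mask the vocabulary so only grammar-permitted tokens can be generated this step."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

def grammar_forced_decode(step_logits, next_allowed, eos_id: int, max_steps: int = 32) -> list:
    """Greedy decoding where the grammar dictates the candidate set at every step.

    step_logits(prefix) -> 1-D tensor of vocabulary logits (stand-in for the LM).
    next_allowed(prefix) -> list of token ids valid under the TL grammar (assumption).
    """
    prefix = []
    for _ in range(max_steps):
        allowed = next_allowed(prefix)
        token = int(torch.argmax(constrain_logits(step_logits(prefix), allowed)))
        prefix.append(token)
        if token == eos_id:
            break
    return prefix
```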
[29] What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels
Aditya Yadavalli, Tiago Pimentel, Tamar I Regev, Ethan Wilcox, Alex Warstadt
Main category: cs.CL
TL;DR: Researchers propose an information-theoretic method to quantify how much information prosody (speech melody) conveys beyond text, finding it provides over 10x more information about sarcasm and emotion than text alone.
Details
Motivation: Prosody conveys critical information not captured by text, but there's a need to quantify exactly how much information it provides and what that information is about. Current approaches lack systematic measurement of prosodic information separate from textual content.Method: Use large speech and language models to estimate mutual information between specific meaning dimensions (emotion, sarcasm, questionhood) and communication channels (audio vs text). Apply this approach to speech from television and podcasts to quantify information transmission.
Result: For sarcasm and emotion, the audio channel (and thus prosody) transmits over an order of magnitude more information than text alone when long-term context is unavailable. For questionhood, prosody provides comparatively less additional information.
Conclusion: Prosody is crucial for conveying sarcasm and emotional meaning, providing substantially more information than text alone. The approach can be extended to study more meaning dimensions, communication channels, and languages.
Abstract: Prosody – the melody of speech – conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance’s meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel – and by implication the prosodic channel – transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
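A small sketch of how mutual information between a meaning dimension (e.g., sarcasm) and a channel can be estimated: I(Y; X) is lower-bounded by H(Y) minus the cross-entropy of a channel-specific classifier, so a more informative channel yields a larger gap. The toy classifier probabilities below are invented for illustration only.

```python
import numpy as np

def entropy(probs: np.ndarray) -> float:
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def mutual_information_estimate(labels: np.ndarray, predicted_probs: np.ndarray) -> float:
    """Lower-bound estimate: I(Y; X) >= H(Y) - CE(Y, q(Y|X)), where the cross-entropy of a
    channel-specific classifier stands in for the conditional entropy H(Y|X).

    labels: (n,) integer labels, e.g. sarcastic vs. not.
    predicted_probs: (n, n_classes) classifier probabilities given audio or text.
    """
    marginal = np.bincount(labels) / len(labels)
    cross_entropy = float(-np.mean(np.log2(predicted_probs[np.arange(len(labels)), labels])))
    return entropy(marginal) - cross_entropy

# Toy illustration: an audio-based classifier that is confident vs. a text-based one that is not.
y = np.array([0, 1, 0, 1])
audio_q = np.array([[0.9, 0.1], [0.1, 0.9], [0.85, 0.15], [0.2, 0.8]])
text_q = np.array([[0.55, 0.45], [0.45, 0.55], [0.6, 0.4], [0.5, 0.5]])
print(mutual_information_estimate(y, audio_q), mutual_information_estimate(y, text_q))
```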
[30] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
Harsh Vardhan Bansal
Main category: cs.CL
TL;DR: LLMCache is a layer-wise caching framework that accelerates transformer inference by reusing intermediate activations for semantically similar inputs, achieving up to 3.1× speedup with minimal accuracy loss.
Details
Motivation: Transformer models have high inference latency that limits real-time and large-scale deployment. Existing caching methods like token-level key-value caches are limited in scope and applicability.Method: Proposes LLMCache, a model-agnostic layer-wise caching framework that reuses intermediate activations based on semantic similarity. Uses lightweight fingerprinting for input matching and adaptive eviction strategies to manage cache staleness. Works across encoder/decoder architectures at arbitrary transformer layers.
Result: Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1Ă speedup in inference time with <0.5% accuracy degradation.
Conclusion: LLMCache is a practical and general-purpose solution for optimizing transformer inference in real-world applications, offering significant speedups with minimal accuracy trade-offs.
Abstract: Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
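A simplified sketch of a layer-wise activation cache with semantic fingerprint matching and recency-based eviction; the embedder, similarity threshold, and capacity are illustrative choices rather than details from the paper.

```python
from collections import OrderedDict

import numpy as np

class LayerActivationCache:
    """Semantic cache for intermediate activations at one transformer layer."""

    def __init__(self, capacity: int = 256, threshold: float = 0.95):
        self.capacity = capacity
        self.threshold = threshold
        self.entries = OrderedDict()  # id -> (unit-norm fingerprint, cached activation)
        self._next_id = 0

    def lookup(self, fingerprint: np.ndarray):
        """Return a cached activation whose fingerprint is cosine-similar enough, else None."""
        f = fingerprint / np.linalg.norm(fingerprint)
        hit_key = None
        for key, (stored_f, _) in self.entries.items():
            if float(f @ stored_f) >= self.threshold:
                hit_key = key
                break
        if hit_key is None:
            return None
        self.entries.move_to_end(hit_key)  # mark as recently used
        return self.entries[hit_key][1]

    def store(self, fingerprint: np.ndarray, activation: np.ndarray):
        """Insert a new entry, evicting the stalest one when the cache is full."""
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        f = fingerprint / np.linalg.norm(fingerprint)
        self.entries[self._next_id] = (f, activation)
        self._next_id += 1

# Usage idea: embed the input, call lookup() before running layers 0..k, and on a hit
# feed the cached activation into layer k+1 instead of recomputing the prefix.
```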
[31] AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning
Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen, Yu Meng
Main category: cs.CL
TL;DR: AdaSearch: A two-stage RL framework that improves LLM search agents’ ability to adaptively balance parametric knowledge with external search, reducing unnecessary searches while maintaining performance and providing interpretable decision-making.
Details
Motivation: Current LLM search agents either over-rely on search (costly, risky) or under-use it (hallucination risk). Existing methods use reward engineering to penalize search calls, but this leads to ambiguous credit assignment, exploitability, and conflates necessary vs unnecessary searches. The paper aims to develop agents that truly understand when to search vs use parametric knowledge.Method: AdaSearch: A two-stage, outcome-driven RL framework that disentangles problem solving from the search invocation decision. First stage: LLM decides whether to search based on self-knowledge. Second stage: either uses parametric knowledge or executes search. This makes the decision process explicit and interpretable.
Result: Experiments across multiple model families and sizes show AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls by 40-60%, preserves strong task performance, and offers more transparent, interpretable decision behaviors compared to baselines like Search-R1.
Conclusion: AdaSearch provides a principled approach to building adaptive search agents that better understand their knowledge boundaries, reduce unnecessary search costs/risks, and offer interpretable decision-making crucial for high-stakes domains like finance and medicine.
Abstract: Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
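A schematic sketch of the two-stage decision flow plus the F1-style decision metric mentioned in the abstract; the callables and the "SEARCH"/"ANSWER" convention are hypothetical stand-ins for the trained policy and the retrieval tool.

```python
def adasearch_answer(question: str, decide_search, answer_from_memory, run_search, answer_with_docs):
    """Two-stage loop: an explicit search/no-search decision first, then problem solving."""
    decision = decide_search(question)  # e.g. the policy emits "SEARCH" or "ANSWER"
    if decision == "ANSWER":
        return {"answer": answer_from_memory(question), "searched": False}
    docs = run_search(question)
    return {"answer": answer_with_docs(question, docs), "searched": True}

def search_decision_f1(decisions: list, search_needed: list) -> float:
    """F1 over the search-invocation decision, treating 'search needed' as the positive class."""
    tp = sum(d and g for d, g in zip(decisions, search_needed))
    fp = sum(d and not g for d, g in zip(decisions, search_needed))
    fn = sum(not d and g for d, g in zip(decisions, search_needed))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```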
[32] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
Main category: cs.CL
TL;DR: MMRB2 is the first comprehensive benchmark for multimodal reward models, covering 4 tasks with 1,000 expert-annotated preference pairs, revealing current models achieve 59-80% accuracy vs >90% for humans.
Details
Motivation: Reward models are crucial for training LLMs but remain underexplored for multimodal (image+text) models, creating a need for comprehensive evaluation benchmarks.Method: Created MMRB2 benchmark with: 1) practical challenging prompts, 2) responses from state-of-the-art models, 3) preference pairs with strong human-expert consensus using ensemble filtering strategy. Evaluated multimodal LLM-as-a-judge and preference-trained models.
Result: Gemini 3 Pro: 75-80% accuracy; GPT-5 & Gemini 2.5 Pro: 66-75%; GPT-4o: 59%; Qwen3-VL-32B (best open-source) matches Gemini 2.5 Flash at 64%. Humans achieve >90% accuracy. MMRB2 performance strongly correlates with downstream task success.
Conclusion: MMRB2 provides the first comprehensive benchmark for multimodal reward models, revealing significant performance gaps between current models and human-level judgment, and identifies key areas for improvement in reward modeling.
Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (“thinking-with-images”), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
[33] In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau
Main category: cs.CL
TL;DR: Transformers learn symbolic reasoning mechanisms when trained on arithmetic with variable symbols whose meanings change between sequences, developing specialized heads for commutative copying, identity recognition, and closure-based cancellation.
Details
Motivation: Prior work shows transformers develop geometric embeddings for arithmetic with fixed symbol meanings, but this paper investigates what happens when symbol meanings vary between sequences - a more challenging setup that requires in-context reasoning with variables.Method: Created a new task where assignment of symbols to algebraic group elements varies between sequences. Used targeted data distributions to create causal tests of hypothesized mechanisms, analyzing what transformers learn in this variable-symbol setting.
Result: Transformers achieve near-perfect accuracy on the variable-symbol task and generalize to unseen algebraic groups. They consistently learn three mechanisms: commutative copying (dedicated head copies answers), identity element recognition, and closure-based cancellation (tracks group membership to constrain answers).
Conclusion: Unlike fixed-symbol settings that produce geometric embeddings, variable-symbol training leads transformers to develop symbolic reasoning mechanisms, showing models can learn to reason in-context with variables whose meanings are not fixed.
Abstract: We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
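A sketch of how such variable-symbol sequences could be generated for a cyclic group: each sequence draws a fresh random assignment of symbols to group elements, so the same symbol means different things across sequences. The group choice, vocabulary, and formatting are assumptions, not the paper's exact task construction.

```python
import random

def make_sequence(n: int = 5, num_facts: int = 12, vocab=None):
    """One training sequence over the cyclic group Z_n with a per-sequence symbol binding."""
    vocab = vocab or [f"s{i}" for i in range(n)]
    elements = list(range(n))
    random.shuffle(elements)
    symbol_of = {e: vocab[i] for i, e in enumerate(elements)}
    facts = []
    for _ in range(num_facts):
        a, b = random.randrange(n), random.randrange(n)
        facts.append(f"{symbol_of[a]} {symbol_of[b]} = {symbol_of[(a + b) % n]}")
    a, b = random.randrange(n), random.randrange(n)
    query = f"{symbol_of[a]} {symbol_of[b]} = "
    answer = symbol_of[(a + b) % n]
    return " ; ".join(facts) + " ; " + query, answer

seq, ans = make_sequence()
print(seq, "->", ans)
```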
[34] Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba
Main category: cs.CL
TL;DR: Constructive Circuit Amplification improves LLM mathematical reasoning by +11.4% accuracy while updating only ~1.6% of model components, with minimal impact on other abilities.
Details
Motivation: Prior research shows LLMs have sparse "circuits" for specific tasks, and fine-tuning strengthens existing circuits. This suggests direct intervention on task-specific circuits could enable precise, targeted updates without affecting other capabilities.Method: Constructive Circuit Amplification identifies pivotal tokens from model reasoning traces and model components responsible for desired tasks, then updates only those specific components rather than the entire model.
Result: Applied to mathematical reasoning, the method improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components. Minimal impact on other abilities measured by MMLU, TriviaQA, and TruthfulQA benchmarks.
Conclusion: Targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components, demonstrating the effectiveness of circuit-based interventions for precise model improvement.
Abstract: Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
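The "update only the identified components" step can be approximated by freezing everything else, as in this sketch; which parameter names to unfreeze is exactly what the paper's circuit analysis provides, so the selection here is a placeholder.

```python
from torch import nn

def restrict_updates(model: nn.Module, component_names: set) -> list:
    """Freeze every parameter except those belonging to the identified components.

    component_names are substrings of parameter names, e.g. 'layers.12.self_attn.o_proj';
    the choice of components is a placeholder for the circuit-identification step.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(c in name for c in component_names)
        if param.requires_grad:
            trainable.append(name)
    return trainable

# A standard fine-tuning loop then touches only the unfrozen ~1-2% of weights, e.g.
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```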
[35] The Emergence of Chunking Structures with Hierarchical RNN
Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, Lili Mou
Main category: cs.CL
TL;DR: Unsupervised chunking using Hierarchical RNN with two-stage training shows improved performance but transient emergence of chunking structures during downstream task training.
Details
Motivation: Traditional NLP approaches for predicting linguistic structures like parsing and chunking rely heavily on manual syntactic annotations, which are expensive and time-consuming to create. This paper aims to develop an unsupervised approach to chunking to reduce dependency on annotated data.Method: Proposes a Hierarchical Recurrent Neural Network (HRNN) that models word-to-chunk and chunk-to-sentence compositions. Uses a two-stage training process: 1) pretraining with an unsupervised parser, and 2) finetuning on downstream NLP tasks.
Result: Experiments on multiple datasets show significant improvement in unsupervised chunking performance during both pretraining and finetuning stages. However, the emergence of chunking structure is observed to be transient during downstream-task training.
Conclusion: The study advances unsupervised syntactic structure discovery and opens new research avenues in linguistic theory, particularly regarding the transient nature of syntactic structure emergence in neural models during task-specific training.
Abstract: In Natural Language Processing (NLP), predicting linguistic structures, such as parsing and chunking, has mostly relied on manual annotations of syntactic structures. This paper introduces an unsupervised approach to chunking, a syntactic task that involves grouping words in a non-hierarchical manner. We present a Hierarchical Recurrent Neural Network (HRNN) designed to model word-to-chunk and chunk-to-sentence compositions. Our approach involves a two-stage training process: pretraining with an unsupervised parser and finetuning on downstream NLP tasks. Experiments on multiple datasets reveal a notable improvement of unsupervised chunking performance in both pretraining and finetuning stages. Interestingly, we observe that the emergence of the chunking structure is transient during the neural model’s downstream-task training. This study contributes to the advancement of unsupervised syntactic structure discovery and opens avenues for further research in linguistic theory.
[36] Enhancing Long-term RAG Chatbots with Psychological Models of Memory Importance and Forgetting
Ryuichi Sumida, Koji Inoue, Tatsuya Kawahara
Main category: cs.CL
TL;DR: LUFY improves long-term RAG conversations by selectively retaining emotionally arousing memories (less than 10% of conversation) while forgetting unimportant parts, significantly enhancing user experience.
Details
Motivation: As conversations progress in RAG systems, increasing memory load degrades retrieval accuracy. The paper addresses this by drawing on psychological insights about memory retention.Method: Proposes LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation content.
Result: In extensive user experiments (2 hours over 4 sessions, 4x longer than existing benchmarks), LUFY significantly enhances user experience by prioritizing arousing memories while forgetting most conversation content.
Conclusion: Selective forgetting of unimportant conversation parts is crucial for long-term conversational AI, and focusing on emotionally arousing memories significantly improves performance.
Abstract: While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interacted with three types of RAG chatbots, each for 2 hours over 4 sessions, marking the most extensive assessment of a chatbot's long-term capabilities to date – more than four times longer than any existing benchmark. The results demonstrate that prioritizing arousing memories while forgetting the majority of the conversation significantly enhances user experience. This study pushes the frontier of long-term conversations and highlights the importance of forgetting unimportant parts of conversations. Code and dataset: https://github.com/ryuichi-sumida/LUFY, Hugging Face dataset: https://huggingface.co/datasets/RuiSumida/LUFY
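A toy sketch of arousal-weighted retention with a forgetting-curve style decay, keeping roughly the top 10% of memories; the decay constants and arousal scores are illustrative assumptions rather than LUFY's actual scoring.

```python
import math
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    arousal: float  # e.g. a 0-1 score from an emotion classifier (assumption)
    created_at: float = field(default_factory=time.time)

def retention_score(mem: Memory, now: float, half_life_hours: float = 24.0) -> float:
    """Importance decays exponentially with age but is boosted by emotional arousal,
    loosely following forgetting-curve models; the constants are illustrative."""
    age_hours = (now - mem.created_at) / 3600.0
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return mem.arousal * decay

def prune_memories(memories: list, keep_fraction: float = 0.1) -> list:
    """Keep only the most arousing, freshest ~10% of memories for future retrieval."""
    now = time.time()
    ranked = sorted(memories, key=lambda m: retention_score(m, now), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```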
[37] OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos
Main category: cs.CL
TL;DR: OpenNER 1.0 is a standardized collection of 36 NER datasets spanning 52 languages with uniform entity type representations, providing baselines with multilingual and large language models to facilitate multilingual/multi-ontology NER research.
Details
Motivation: To address the lack of standardized, openly-available NER datasets that span multiple languages and ontologies, which hinders research in multilingual and multi-ontology NER.Method: Collected 36 NER corpora across 52 languages, corrected annotation format issues, standardized datasets into uniform representation with consistent entity type names, and provided baseline results using three pretrained multilingual language models and two large language models.
Result: Created OpenNER 1.0 collection with standardized datasets, found no single model performs best across all languages, and identified significant work needed to achieve high performance from LLMs on NER tasks.
Conclusion: OpenNER provides a valuable standardized resource for multilingual and multi-ontology NER research, revealing current limitations in model performance across languages and highlighting the need for further work on LLM-based NER.
Abstract: We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
[38] Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training
Xunxin Cai, Chengrui Wang, Qingqing Long, Yuanchun Zhou, Meng Xiao
Main category: cs.CL
TL;DR: A framework that uses LLMs to automatically generate high-quality biomedical training data from scientific literature, guided by MeSH hierarchy, improving downstream QA performance and enabling smaller models to outperform larger ones.
Details
Motivation: There's a gap between LLMs' potential in biomedical applications and the limited scale/quality of available open-source annotated datasets. The complexity of biomedical knowledge hierarchy further hampers progress. The paper investigates whether LLMs themselves can help overcome this limitation.Method: Proposes a framework that automates distillation of high-quality textual training data from scientific literature. The approach self-evaluates and generates questions aligned with biomedical domain using medical subject headings (MeSH) as guidance. Establishes a fully automated workflow without manual intervention.
Result: The framework substantially improves question-answering tasks compared to pre-trained life sciences models and GPT-4. The generated AI-Ready dataset enabled Llama3-70B base model to outperform GPT-4 with MedPrompt despite having multiple times fewer parameters. Case studies and ablation experiments validate each component’s significance.
Conclusion: LLMs can indeed play a pivotal role in overcoming biomedical data limitations by automatically generating high-quality training data. The proposed framework successfully bridges the gap between LLM potential and available biomedical datasets, enabling smaller models to achieve superior performance through better training data.
Abstract: The rapid advancement of large language models (LLMs) in biological-medical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap. Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study. We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through medical subject headings (MeSH). This comprehensive framework establishes an automated workflow, thereby eliminating the need for manual intervention. Furthermore, we conducted comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain and powerful closed-source models represented by GPT-4. Notably, the generated AI-Ready dataset enabled the Llama3-70B base model to outperform GPT-4 using MedPrompt with multiple times the number of parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework.
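A schematic sketch of a self-evaluating, MeSH-guided question-generation loop along the lines described above; the `generate` and `judge` callables, the prompts, and the 1-5 rating threshold are hypothetical stand-ins for the framework's LLM calls.

```python
def distill_qa(abstract: str, mesh_terms: list, generate, judge, min_score: float = 4.0) -> list:
    """Generate MeSH-anchored question-answer pairs from one abstract and keep only
    those the model itself rates highly (self-evaluation)."""
    kept = []
    for term in mesh_terms:
        prompt = (
            f"Using only the abstract below, write one question and its answer about "
            f"the MeSH topic '{term}'.\n\nAbstract: {abstract}"
        )
        qa_pair = generate(prompt)
        rating = judge(
            f"Rate 1-5 how faithful and specific this Q/A is to the abstract:\n{qa_pair}"
        )
        if rating >= min_score:
            kept.append({"mesh_term": term, "qa": qa_pair, "self_score": rating})
    return kept
```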
[39] Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee
Main category: cs.CL
TL;DR: Speech-FT is a two-stage fine-tuning framework that maintains cross-task generalization while benefiting from task-specific fine-tuning by reducing representational drift and using weight-space interpolation.
Details
Motivation: Fine-tuning speech representation models improves task-specific performance but often degrades cross-task generalization due to excessive representational changes that lose pre-trained knowledge.Method: Two-stage approach: 1) Fine-tuning designed to reduce representational drift, 2) Weight-space interpolation with the pre-trained model to restore cross-task generalization.
Result: Speech-FT consistently improves performance across supervised, unsupervised, and multitask scenarios, achieves superior cross-task generalization compared to weight-space regularization and LoRA, and shows significant improvements on SUPERB benchmark (e.g., reducing phone error rate from 5.17% to 3.94% for HuBERT).
Conclusion: Speech-FT provides a simple yet powerful solution for refining speech representation models after pre-training while maintaining cross-task generalization capabilities.
Abstract: Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
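The second stage, weight-space interpolation with the pre-trained model, reduces to a per-parameter linear blend, sketched below; the interpolation coefficient is a knob, and 0.5 is only an illustrative default (floating-point parameters assumed).

```python
from torch import nn

def interpolate_weights(pretrained: nn.Module, finetuned: nn.Module, alpha: float = 0.5) -> dict:
    """Per-parameter blend: theta = (1 - alpha) * theta_pre + alpha * theta_ft.

    Assumes both models share the same architecture and have floating-point parameters;
    alpha = 0.5 is an illustrative default, not a value taken from the paper.
    """
    pre_state = pretrained.state_dict()
    ft_state = finetuned.state_dict()
    return {name: (1 - alpha) * pre_state[name] + alpha * ft_state[name] for name in pre_state}

# Usage: merged = interpolate_weights(pretrained_model, finetuned_model, alpha=0.5)
#        pretrained_model.load_state_dict(merged)
```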
[40] Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, Dongpo Cheng, Xingdong Feng, Huixia Judy Wang, Chengchun Shi, Liwen Zhang
Main category: cs.CL
TL;DR: Fin-R1 is a 7B parameter reasoning LLM specialized for finance, developed using a two-stage pipeline with curated financial CoT data and SFT+RL training, achieving competitive performance on financial benchmarks with practical applications.
Details
Motivation: General-purpose LLMs face challenges in finance due to fragmented data sources, opaque reasoning processes, and poor transferability to business applications, creating a need for specialized financial reasoning models.Method: Two-stage development: 1) Construct Fin-R1-Data (60,091 high-quality CoT samples from authoritative financial benchmarks), 2) Train Fin-R1 using supervised fine-tuning followed by reinforcement learning to enhance financial reasoning capabilities.
Result: Despite its compact 7B parameter size, Fin-R1 achieves competitive performance on established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory applications. The code repository has attracted over 700 stars.
Conclusion: Fin-R1 successfully addresses key challenges in applying LLMs to finance through specialized data curation and training, offering an efficient, interpretable solution for financial reasoning tasks with lower deployment costs.
Abstract: In recent years, general-purpose large language models (LLMs) such as GPT, Gemini, Claude, and DeepSeek have advanced at an unprecedented pace. Despite these achievements, their application to finance remains challenging, due to fragmented data sources, intransparent reasoning processes, and weak transferability to business applications. In response, we introduce Fin-R1, a reasoning LLM designed for financial scenarios. With a compact size of 7 billion parameters, Fin-R1 reduces deployment costs while addressing the aforementioned challenges. Its development follows a two-stage pipeline. First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability. Second, we train Fin-R1 using Fin-R1-Data through supervised fine-tuning (SFT), followed by reinforcement learning (RL). This stage substantially improves the model’s ability to solve complex financial reasoning tasks, yielding outputs that are both accurate and interpretable. Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory. Our code is publicly available at https://github.com/SUFE-AIFLM-Lab/Fin-R1, and has already attracted over 700 stars.
[41] Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Kabir Ahuja, Melanie Sclar, Yulia Tsvetkov
Main category: cs.CL
TL;DR: The paper introduces FlawedFictions, a benchmark for evaluating LLMs’ ability to detect plot holes in stories, revealing that current models struggle with this task and tend to introduce plot holes when generating or summarizing stories.
Details
Motivation: Existing benchmarks focus mainly on surface-level comprehension, but deeper narrative understanding requires nuanced reasoning skills. As LLMs increasingly generate and interpret text, rigorously assessing their narrative consistency becomes critical.Method: Proposes plot hole detection as a proxy for language understanding, introduces FlawedFictionsMaker algorithm to controllably synthesize plot holes in human-written stories, and constructs the FlawedFictions benchmark with human filtering for quality assurance.
Result: State-of-the-art LLMs struggle with plot hole detection regardless of reasoning effort, with performance degrading as story length increases. LLM-based story summarization and generation show 50%+ and 100%+ increases in plot hole rates compared to human originals.
Conclusion: Plot hole detection is a challenging task for LLMs that reveals limitations in deeper language understanding. The FlawedFictions benchmark provides a robust tool for evaluating narrative consistency and reasoning capabilities in language models.
Abstract: Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes – inconsistencies in a storyline that break the internal logic or rules of a story's world – requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy to evaluate language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct a benchmark to evaluate LLMs' plot hole detection abilities in stories – FlawedFictions – which is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless of the reasoning effort allowed, with performance significantly degrading as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with more than 50% and 100% increases in plot hole detection rates with respect to human-written originals.
[42] Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training
Meng Xiao, Xunxin Cai, Qingqing Long, Chengrui Wang, Yuanchun Zhou, Hengshu Zhu
Main category: cs.CL
TL;DR: A multi-agent framework using MeSH hierarchy to autonomously distill high-quality biomedical question-answer pairs from scientific literature, enabling LLMs to outperform proprietary models like GPT-4 in biomedical QA tasks.
Details
Motivation: Addressing the bottleneck of insufficient quantity and quality in open-source annotated biomedical corpora for LLM training, which hinders effective biomedical research applications.Method: Knowledge-driven, agentic framework with collaborative multi-agent architecture guided by Medical Subject Headings (MeSH) hierarchy. Specialized agents autonomously extract, synthesize, and self-evaluate textual data from scientific literature to generate domain-specific QA pairs.
Result: LLMs trained on multi-agent distilled datasets achieve notable improvements in biomedical QA tasks, outperforming strong life sciences LLM baselines and advanced proprietary models. Llama3-70B with the AI-Ready dataset surpasses GPT-4 with MedPrompt and Med-PaLM-2.
Conclusion: The multi-agent collaboration framework effectively addresses biomedical corpus distillation challenges, demonstrating significant potential for enhancing biomedical LLM training through automated, ontology-aligned data generation.
Abstract: Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
[43] A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
Main category: cs.CL
TL;DR: LRC is an efficient pre-training method that creates small language models by soft pruning teacher weights and aligning student activations using low-rank projection matrices, achieving high performance with 1000x training efficiency.
Details
Motivation: Existing methods for training small language models suffer from information loss from hard pruning, inefficient representation alignment, and underutilization of informative FFN activations, making training costly even with knowledge distillation.
Method: Low-Rank Clone (LRC) trains low-rank projection matrices that enable soft pruning by compressing teacher weights and activation clone by aligning student activations (including FFN signals) with teacher activations in a unified design without explicit alignment modules.
Result: LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency with open-source teachers like Llama-3.2-3B-Instruct and Qwen2.5-3B/7B-Instruct.
Conclusion: LRC provides an efficient approach to training high-performing small language models by maximizing knowledge transfer through unified soft pruning and activation alignment, significantly reducing training costs while maintaining performance.
Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens–while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
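To make the weight-projection and activation-clone ideas concrete, here is a minimal PyTorch sketch. The hidden sizes, the shared projection applied to both sides of a square teacher weight matrix, and the plain MSE clone loss are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the Low-Rank Clone idea: project teacher weights down to
# student dimensions ("soft pruning") and align student activations with projected
# teacher activations ("activation clone"). Shapes and loss choices are assumptions.
import torch
import torch.nn as nn

d_teacher, d_student = 3072, 1024  # hypothetical hidden sizes

class LowRankClone(nn.Module):
    def __init__(self, d_t: int, d_s: int):
        super().__init__()
        # Learnable low-rank projection that compresses teacher space into student space.
        self.proj = nn.Linear(d_t, d_s, bias=False)

    def init_student_weight(self, teacher_weight: torch.Tensor) -> torch.Tensor:
        # Soft pruning: a student-sized weight obtained by projecting a square
        # (d_t x d_t) teacher weight matrix on both sides.
        return self.proj.weight @ teacher_weight @ self.proj.weight.T

    def activation_clone_loss(self, h_teacher: torch.Tensor, h_student: torch.Tensor) -> torch.Tensor:
        # Align student activations (e.g., FFN outputs) with projected teacher activations.
        return nn.functional.mse_loss(self.proj(h_teacher), h_student)

clone = LowRankClone(d_teacher, d_student)
w_t = torch.randn(d_teacher, d_teacher)   # a teacher weight matrix
w_s = clone.init_student_weight(w_t)      # student-sized weight, shape (d_s, d_s)
h_t = torch.randn(8, 16, d_teacher)       # teacher activations (batch, seq, d_t)
h_s = torch.randn(8, 16, d_student)       # student activations
print(w_s.shape, clone.activation_clone_loss(h_t, h_s).item())
```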
[44] Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines
Guifeng Deng, Shuyin Rao, Tianyu Lin, Anlu Dai, Pan Wang, Junyi Xie, Haidong Song, Ke Zhao, Dongwu Xu, Zhengdong Cheng, Tao Li, Haiteng Jiang
Main category: cs.CL
TL;DR: LLMs show strong performance in crisis assessment tasks, achieving comparable or superior results to human operators in suicide plan identification and risk assessment, while humans retain advantages in mood status recognition and suicidal ideation detection.
Details
Motivation: Psychological hotlines face resource constraints, and LLMs could potentially support crisis assessments, but their effectiveness in real-world clinical settings needs systematic evaluation.
Method: Created PsyCrisisBench with 540 annotated transcripts from a real hotline, evaluated 64 LLMs across 15 families using zero-shot, few-shot, and fine-tuning approaches on four key tasks.
Result: LLMs achieved strong performance in suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907). Fine-tuned smaller models sometimes outperformed larger ones. Humans performed better on mood recognition and suicidal ideation detection.
Conclusion: LLMs demonstrate performance broadly comparable to trained human operators in text-based crisis assessment, with complementary strengths across different task types, suggesting potential for clinical support with appropriate ethical deployment.
Abstract: Psychological support hotlines serve as critical lifelines for crisis intervention but encounter significant challenges due to rising demand and limited resources. Large language models (LLMs) offer potential support in crisis assessments, yet their effectiveness in emotionally sensitive, real-world clinical settings remains underexplored. We introduce PsyCrisisBench, a comprehensive benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four key tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. 64 LLMs across 15 model families (including closed-source such as GPT, Claude, Gemini and open-source such as Llama, Qwen, DeepSeek) were evaluated using zero-shot, few-shot, and fine-tuning paradigms. LLMs showed strong results in suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), with notable gains from few-shot prompting and fine-tuning. Compared to trained human operators, LLMs achieved comparable or superior performance on suicide plan identification and risk assessment, while humans retained advantages on mood status recognition and suicidal ideation detection. Mood status recognition remained challenging (max F1=0.709), likely due to missing vocal cues and semantic ambiguity. Notably, a fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) outperformed larger models on mood and suicidal ideation tasks. LLMs demonstrate performance broadly comparable to trained human operators in text-based crisis assessment, with complementary strengths across task types. PsyCrisisBench provides a robust, real-world evaluation framework to guide future model development and ethical deployment in clinical mental health.
[45] On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
Stephen Obadinma, Xiaodan Zhu
Main category: cs.CL
TL;DR: LLM verbal confidence is vulnerable to adversarial attacks that can manipulate confidence scores and change answers, revealing current confidence mechanisms are not robust.
Details
Motivation: Robust verbal confidence in LLMs is crucial for transparency, trust, and safety in human-AI applications, but current confidence mechanisms may be vulnerable to manipulation.
Method: Introduced attack frameworks targeting verbal confidence through perturbation and jailbreak methods, tested various prompting strategies, model sizes, and application domains.
Result: Attacks significantly impair verbal confidence estimates and cause frequent answer changes; current verbal confidence is vulnerable and common defense techniques are ineffective or counterproductive.
Conclusion: There is a critical need to design robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
Abstract: Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
[46] InsurTech innovation using natural language processing
Panyi Dong, Zhiyu Quan
Main category: cs.CL
TL;DR: NLP transforms unstructured text into structured data for insurance analytics, enabling feature de-biasing, compression, and novel industry classification to enhance commercial insurance pricing and risk assessment.
Details
Motivation: Traditional insurance companies need to leverage alternative data sources and advanced technologies like NLP to maintain competitiveness in the InsurTech era, transforming unstructured text into actionable insights for actuarial analysis.
Method: Apply various NLP techniques to real-world alternative data from an InsurTech partner, focusing on feature de-biasing, feature compression, and industry classification in commercial insurance context.
Result: Enriched text-derived insights refine traditional rating factors for commercial insurance pricing and offer novel perspectives for risk assessment through innovative industry classification techniques.
Conclusion: NLP is a foundational element of modern, data-driven insurance analytics, not just a supplementary tool, demonstrating its essential role in transforming insurance operations.
Abstract: With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations, focusing on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate feature de-biasing, feature compression, and industry classification in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classification techniques. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element of modern, data-driven insurance analytics.
[47] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Main category: cs.CL
TL;DR: RLfR is a new preference-learning framework for machine translation that uses continuous feedback from GPT-4o instead of static triplets, achieving better performance on multilingual benchmarks.
Details
Motivation: Current preference-learning methods like DPO rely heavily on large, carefully curated triplet datasets and struggle to generalize beyond their tuning domains, creating a need for more flexible and effective approaches.
Method: RLfR frames translation as a micro-tutorial process: actor generates hypothesis → GPT-4o teacher refines it → actor is rewarded based on alignment with teacher’s refinement using two complementary signals: negative edit distance (lexical/structural fidelity) and COMET score (semantic adequacy).
Result: On FLORES-200 benchmark across multiple language pairs (English ↔ German, Spanish, Chinese, Korean, Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
Conclusion: RLfR demonstrates that continuous feedback from teacher models can overcome limitations of static triplet datasets, enabling more effective preference learning for machine translation through iterative, human-like learning processes.
Abstract: Preference-learning methods for machine translation (MT)–such as Direct Preference Optimization (DPO)–have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals–(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy–the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
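A rough sketch of the reward described above, assuming a token-level Levenshtein distance for the lexical term, a placeholder value standing in for the COMET score, and an equal weighting; all three are assumptions made for illustration rather than the paper's exact recipe.

```python
# Sketch of an RLfR-style reward: how closely the actor's hypothesis matches the
# teacher's refinement, combining (i) a normalized negative edit distance and
# (ii) a semantic-adequacy score (COMET in the actual setup, a constant here).
def edit_distance(a: list[str], b: list[str]) -> int:
    # Standard token-level Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[-1]

def rlfr_reward(hypothesis: str, teacher_refinement: str, semantic_score: float,
                alpha: float = 0.5) -> float:
    hyp, ref = hypothesis.split(), teacher_refinement.split()
    # Negative edit distance, normalized so that 1.0 means identical to the refinement.
    lexical = 1.0 - edit_distance(hyp, ref) / max(len(hyp), len(ref), 1)
    # semantic_score would come from a learned metric such as COMET in practice.
    return alpha * lexical + (1 - alpha) * semantic_score

print(rlfr_reward("the cat sat on mat", "the cat sat on the mat", semantic_score=0.92))
```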
[48] Beyond “Not Novel Enough”: Enriching Scholarly Critique with LLM-Assisted Feedback
Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych
Main category: cs.CL
TL;DR: Automated novelty assessment system for peer review that models expert reviewer behavior through content extraction, related work retrieval, and structured comparison, achieving high alignment with human judgments.
Details
Motivation: Novelty assessment is crucial but understudied in peer review, especially in high-volume fields like NLP where reviewer capacity is strained. Current approaches lack structured methods for evaluating novelty systematically.
Method: Three-stage structured approach: 1) content extraction from submissions, 2) retrieval and synthesis of related work, 3) structured comparison for evidence-based assessment. The method is informed by analysis of human novelty reviews and captures patterns like independent claim verification and contextual reasoning.
Result: Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. Produces detailed, literature-aware analyses and improves consistency over ad-hoc reviewer judgments.
Conclusion: Structured LLM-assisted approaches can support more rigorous and transparent peer review without displacing human expertise. The method demonstrates potential for improving novelty assessment consistency and quality in high-volume academic reviewing.
Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
[49] Are most sentences unique? An empirical examination of Chomskyan claims
Hiram Ring
Main category: cs.CL
TL;DR: The paper empirically tests the linguistic claim that most sentences are unique by analyzing corpus data, finding uniqueness varies by genre and duplicates are significant.
Details
Motivation: To empirically investigate the long-standing linguistic claim that "virtually every sentence is brand-new" using modern corpus data, moving beyond theoretical assertions.
Method: Used NLTK Python library to parse corpora across different genres, counting exact string matches to identify duplicate vs. unique sentences.
Result: While unique sentences often form the majority, this varies significantly by genre, and duplicate sentences constitute a non-insignificant portion of any corpus.
Conclusion: The claim about sentence uniqueness needs qualification - genre matters and duplicates are more common than theoretical linguistics suggests.
Abstract: A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that “virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe.” With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences are often the majority of corpora, this is highly constrained by genre, and that duplicate sentences are not an insignificant part of any individual corpus.
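The counting procedure is simple to reproduce in outline. The sketch below assumes a local plain-text corpus file and NLTK's sentence tokenizer; the paper's exact corpora and preprocessing may differ.

```python
# Rough reproduction of the counting idea: split a corpus into sentences and count
# exact string matches. The corpus file is a hypothetical placeholder.
from collections import Counter
import nltk

for pkg in ("punkt", "punkt_tab"):          # tokenizer data, name depends on NLTK version
    nltk.download(pkg, quiet=True)

text = open("corpus.txt", encoding="utf-8").read()   # hypothetical corpus file
sentences = nltk.sent_tokenize(text)

counts = Counter(sentences)
unique = sum(1 for s, c in counts.items() if c == 1)  # sentences occurring exactly once
duplicated = len(sentences) - unique                  # occurrences of repeated sentences
print(f"{len(sentences)} sentences, {unique} occur exactly once "
      f"({unique / len(sentences):.1%}); {duplicated} are repeats of some sentence.")
```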
[50] Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
Yash Saxena, Raviteja Bommireddy, Ankur Padia, Manas Gaur
Main category: cs.CL
TL;DR: Two citation paradigms for LLMs: Generation-Time Citation (G-Cite) produces answers with citations in one pass, while Post-hoc Citation (P-Cite) adds citations after drafting. P-Cite offers better coverage with competitive correctness, while G-Cite prioritizes precision at the cost of coverage and speed.
Details
Motivation: In high-stakes domains like healthcare, law, academia, and finance, LLMs must cite verifiable sources since small errors can have severe consequences. Practitioners need guidance on whether to generate citations during decoding or attach them after drafting answers.
Method: Introduced two citation paradigms: G-Cite (answer and citations in one pass) and P-Cite (adds/verifies citations after drafting). Conducted comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets.
Result: Consistent trade-off between coverage and citation correctness, with retrieval as main driver of attribution quality. P-Cite methods achieve high coverage with competitive correctness and moderate latency, while G-Cite methods prioritize precision at the cost of coverage and speed.
Conclusion: Recommend retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings like strict claim verification. Provides evidence-based recommendations weighing trade-offs across use cases.
Abstract: Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our codes and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/
[51] Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä, Duygu Ataman, Constantine Lignos
Main category: cs.CL
TL;DR: Resampling methods for quantifying uncertainty in multilingual/multitask NLP benchmarks, accounting for both model and data variation to avoid underestimating replication variability.
Details
Motivation: Current evaluation metrics in multilingual and multitask NLP benchmarks lack proper uncertainty quantification. Performance scores vary due to both model and data-related sources, and ignoring either source leads to underestimation of overall variability in hypothetical replications.
Method: Introduces resampling-based methods (likely bootstrap or similar techniques) to quantify uncertainty and statistical precision of evaluation metrics. The approach accounts for both model-related and data-related sources of variation in performance scores.
Result: Shows that accounting for both model and data variation is necessary to avoid substantially underestimating overall variability. Demonstrates the methods on multilingual question answering, machine translation, and named entity recognition tasks. The resampling methods effectively quantify replication uncertainty for leaderboard metrics like model rankings and pairwise differences between models.
Conclusion: Resampling methods provide a robust approach for uncertainty quantification in NLP benchmarks, enabling more reliable assessment of model performance differences and rankings by properly accounting for all sources of variation in evaluation metrics.
Abstract: We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards such as model rankings and pairwise differences between models.
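A minimal bootstrap sketch of the idea: resample evaluation examples for data variation and resample runs for model variation, then read off an interval for a pairwise difference. The score matrices are fabricated placeholders, not values from the paper, and the exact resampling scheme is an assumption.

```python
# Minimal bootstrap sketch for replication uncertainty of a leaderboard metric:
# resample evaluation examples (data variation) and, when several runs per model
# exist, resample runs as well (model variation).
import numpy as np

rng = np.random.default_rng(0)
# scores[r, i]: metric for run r of a model on example i (e.g., 3 seeds x 500 examples)
scores_a = rng.normal(0.72, 0.05, size=(3, 500)).clip(0, 1)
scores_b = rng.normal(0.70, 0.05, size=(3, 500)).clip(0, 1)

def bootstrap_mean_diff(a, b, n_boot=2000):
    diffs = []
    n_items = a.shape[1]
    for _ in range(n_boot):
        items = rng.integers(0, n_items, n_items)   # resample examples with replacement
        run_a = rng.integers(0, a.shape[0])          # resample a run of model A
        run_b = rng.integers(0, b.shape[0])          # resample a run of model B
        diffs.append(a[run_a, items].mean() - b[run_b, items].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi

lo, hi = bootstrap_mean_diff(scores_a, scores_b)
print(f"95% bootstrap interval for the A-B metric difference: [{lo:.3f}, {hi:.3f}]")
```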
[52] Who is In Charge? Dissecting Role Conflicts in Instruction Following
Siqi Zeng
Main category: cs.CL
TL;DR: LLMs often ignore hierarchical instructions (system > user) while obeying social cues like authority/consensus, revealing fragile system obedience and need for better alignment methods.
Details
Motivation: Large language models should follow hierarchical instructions where system prompts override user inputs, but recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. This creates a reliability problem for hierarchical instruction following.
Method: Used mechanistic interpretation on large-scale dataset with: 1) Linear probing to analyze conflict-decision signal encoding, 2) Direct Logit Attribution to examine internal conflict detection and resolution, and 3) Steering experiments to test how social cue vectors affect instruction following.
Result: System-user and social conflicts form distinct subspaces; stronger internal conflict detection for system-user cases but consistent resolution only for social cues; social cue vectors surprisingly amplify instruction following in role-agnostic way; explains fragile system obedience.
Conclusion: The findings explain why LLMs have fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods to improve reliable hierarchical instruction following.
Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.
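As a pointer to how the linear-probing step works in general, the sketch below fits a logistic-regression probe on synthetic stand-in hidden states; in the paper the probes run on actual LLM activations for prompts containing system-user or social-cue conflicts, so the data here is purely a placeholder.

```python
# Illustrative linear probe: test whether a conflict-decision signal is linearly
# decodable from hidden states at a given layer. Hidden states and labels below are
# synthetic stand-ins, so the probe should score near chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 768))      # residual-stream activations at one layer
labels = rng.integers(0, 2, size=2000)     # 1 = model followed the system prompt

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")   # ~0.5 on random data
```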
[53] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Shuzhou Yuan, Ercong Nie, Yinuo Sun, Chenxuan Zhao, William LaCroix, Michael Färber
Main category: cs.CL
TL;DR: LLMs often make false refusals on safe requests containing terms resembling unsafe content. The paper introduces two benchmarks (XSB and MS-XSB) to measure this issue, shows it persists across models, and proposes three lightweight mitigation strategies without retraining.
Details
Motivation: Large language models frequently produce exaggerated safety refusals, declining benign requests that contain terms similar to unsafe queries, which reduces their helpfulness and usability in real-world applications.
Method: 1) Created two benchmarks: XSB for single-turn prompts with annotated “Focus” keywords, and MS-XSB for multi-turn dialog scenarios. 2) Used post-hoc explanation methods to identify refusal triggers. 3) Deployed three model-agnostic approaches: ignore-word instructions, prompt rephrasing, and attention steering at inference time without retraining.
Result: Experiments on four instruction-tuned Llama models show that the proposed strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Exaggerated refusals persist across diverse LLMs and are especially pronounced in complex multi-turn scenarios.
Conclusion: The paper establishes a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments through lightweight, model-agnostic interventions at inference time.
Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with “Focus” keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
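Of the three mitigations, the ignore-word instruction is the simplest to illustrate. The wrapper below is a hypothetical rendering: the exact instruction wording, and the way trigger words would be identified by a post-hoc explanation method, are assumptions.

```python
# Minimal sketch of an "ignore-word instruction" mitigation: prepend an instruction
# telling the model that specific trigger words are benign in this request.
def wrap_with_ignore_words(user_prompt: str, trigger_words: list[str]) -> str:
    ignore_clause = ", ".join(f'"{w}"' for w in trigger_words)
    return (
        "The following request is benign. The words "
        f"{ignore_clause} appear in a harmless context; do not refuse solely because "
        "of them.\n\n"
        f"Request: {user_prompt}"
    )

prompt = wrap_with_ignore_words("How do I kill a background process on Linux?", ["kill"])
print(prompt)  # send to any instruction-tuned model at inference time, no retraining
```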
[54] LLM one-shot style transfer for Authorship Attribution and Verification
Pablo Miralles-GonzĂĄlez, Javier Huertas-Tato, Alejandro MartĂn, David Camacho
Main category: cs.CL
TL;DR: Unsupervised framework using LLM log-probabilities to measure style transferability between texts, outperforming existing methods and improving with model size.
Details
Motivation: Existing computational stylometry approaches suffer from spurious correlations between style and topic, and underutilize the pre-training of modern large language models for authorship analysis tasks.
Method: Unsupervised framework that uses LLM log-probabilities to measure style transferability between two texts, leveraging CLM pre-training and in-context capabilities without explicit supervision on spuriously correlated data.
Result: Substantially outperforms unsupervised prompting-based baselines at similar model sizes, exceeds contrastively trained models when controlling for topical overlap, and performance improves with model size. Additional mechanism for authorship verification enables flexible trade-offs between computational cost and accuracy.
Conclusion: The proposed unsupervised framework effectively leverages LLM pre-training for computational stylometry, avoiding spurious correlations and demonstrating scalable performance improvements with model size.
Abstract: Computational stylometry studies writing style through quantitative textual patterns, enabling applications such as authorship attribution, identity linking, and plagiarism detection. Existing supervised and contrastive approaches often rely on datasets with spurious correlations, conflating style with topic. Despite the relevance of language modeling to these tasks, the pre-training of modern large language models (LLMs) has been underutilized in general authorship analysis. We introduce an unsupervised framework that uses the log-probabilities of an LLM to measure style transferability between two texts. This framework takes advantage of the extensive CLM pre-training and in-context capabilities of modern LLMs. Our approach avoids explicit supervision with spuriously correlated data. Our method substantially outperforms unsupervised prompting-based baselines at similar model sizes and exceeds contrastively trained models when controlling for topical overlap. Our framework’s performance improves with model size. In the case of authorship verification, we present an additional mechanism that increases test-time computation to improve accuracy; enabling flexible trade-offs between computational cost and task performance.
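A small sketch of the underlying scoring idea, assuming gpt2 via Hugging Face transformers and a simple prompt template; the paper's prompt format, normalization, and scoring details are not specified here, so treat those choices as assumptions.

```python
# Sketch of a log-probability "style transferability" score: how much conditioning on
# text A (as an in-context example) raises the likelihood of text B, relative to a
# generic prefix. Higher scores suggest B matches A's style more closely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_logprob(prefix: str, continuation: str) -> float:
    # Average log-probability of the continuation tokens given the prefix.
    ids_prefix = tok(prefix, return_tensors="pt").input_ids
    ids_cont = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([ids_prefix, ids_cont], dim=1)
    logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[:, ids_prefix.shape[1] - 1:].mean().item()

def transferability(text_a: str, text_b: str) -> float:
    prompt = f"Here is a writing sample by the author:\n{text_a}\n\nAnother text:\n"
    baseline = "Another text:\n"
    return avg_logprob(prompt, text_b) - avg_logprob(baseline, text_b)

a = "I reckon the weather'll turn nasty before long, mark my words."
b = "I reckon we'd best be heading home before it gets dark."
print(f"style transferability A->B: {transferability(a, b):.3f}")
```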
[55] Think Twice: Branch-and-Rethink Reasoning Reward Model
Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
Main category: cs.CL
TL;DR: BR-RM is a two-turn reward model that applies think-twice reasoning to reward modeling, using adaptive branching to identify critical dimensions in turn 1 and targeted rethinking in turn 2 to reduce judgment diffusion and improve error detection.
Details
Motivation: Current reward models compress multiple quality dimensions into single scalar scores in one pass, causing judgment diffusion where attention spreads too thin across criteria, leading to diluted focus and shallow analysis. LLMs benefit from think-twice strategies, but RMs haven't adopted this approach.
Method: BR-RM uses a two-turn approach: Turn 1 performs adaptive branching to select instance-critical dimensions (like factuality, safety) and creates evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and focuses only on what matters most. Trained with GRPO-style RL over structured two-turn traces using binary outcome rewards with strict format checks.
Result: The model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. It reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable.
Conclusion: By converting all-at-once scoring into focused, second-look reasoning, BR-RM successfully transfers the think-twice principle to reward modeling, addressing judgment diffusion while maintaining compatibility with standard RLHF pipelines.
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
[56] Voice-Interactive Surgical Agent for Multimodal Patient Data Control
Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim
Main category: cs.CL
TL;DR: VISA is a voice-interactive surgical agent using hierarchical LLM-based agents to help surgeons access patient data hands-free during robotic surgery.
Details
Motivation: During robotic surgery, surgeons' hands and visual attention are fully occupied, making it difficult to access and manipulate multimodal patient data without interrupting the surgical workflow.
Method: Hierarchical multi-agent framework with orchestration agent and three task-specific agents driven by LLMs. Agents autonomously plan, refine, validate, and reason to interpret voice commands and execute tasks like retrieving clinical info, manipulating CT scans, or navigating 3D anatomical models.
Result: VISA achieves high stage-level accuracy and workflow-level success rates, enhances robustness by correcting transcription errors, resolving linguistic ambiguity, and interpreting diverse free-form expressions. Tested on dataset of 240 user commands with Multi-level Orchestration Evaluation Metric (MOEM).
Conclusion: VISA shows strong potential to support robotic surgery and has scalability for integrating new functions and agents, enabling hands-free access to patient data during procedures.
Abstract: In robotic surgery, surgeons fully engage their hands and visual attention in procedures, making it difficult to access and manipulate multimodal patient data without interrupting the workflow. To overcome this problem, we propose a Voice-Interactive Surgical Agent (VISA) built on a hierarchical multi-agent framework consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to interpret voice commands and execute tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models within surgical video. We construct a dataset of 240 user commands organized into hierarchical categories and introduce the Multi-level Orchestration Evaluation Metric (MOEM) that evaluates the performance and robustness at both the command and category levels. Experimental results demonstrate that VISA achieves high stage-level accuracy and workflow-level success rates, while also enhancing its robustness by correcting transcription errors, resolving linguistic ambiguity, and interpreting diverse free-form expressions. These findings highlight the strong potential of VISA to support robotic surgery and its scalability for integrating new functions and agents.
[57] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning
Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang, Haoyu Chen, Xiaoyong Zhu, Bo Zheng, Xuejiao Zhao
Main category: cs.CL
TL;DR: Online-PVLM enables real-time concept learning for personalized VLMs using hyperbolic representations without training, addressing scalability issues in large-scale scenarios.
Details
Motivation: Existing personalized VLM methods require separate embeddings for each new concept and fail to support real-time adaptation during testing, making them inefficient for large-scale scenarios where concept embedding retrieval is challenging.
Method: Proposes Online-PVLM framework using hyperbolic representations for online concept learning with a train-free paradigm for concept embeddings generation at test time, making personalized VLMs scalable and efficient.
Result: Extensive experiments demonstrate state-of-the-art performance. Also introduces OP-Eval benchmark with 1,292 concepts and over 30K high-quality instances with diverse question types for rigorous assessment of online concept learning.
Conclusion: Online-PVLM provides a scalable and efficient solution for real-time concept learning in personalized VLMs, with comprehensive evaluation benchmark supporting future research in this area.
Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in user-specific concepts aligned interactions (e.g., identifying a user’s bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To alleviate this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach makes a train-free paradigm for concept embeddings generation at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.
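For readers unfamiliar with hyperbolic representations, the toy sketch below maps features onto the Poincaré ball and matches a query to its nearest stored concept by hyperbolic distance, which is the flavor of train-free matching described above; the curvature, the feature source, and the nearest-neighbor retrieval rule are all assumptions, not the paper's pipeline.

```python
# Toy sketch of train-free concept matching in hyperbolic space: Euclidean features
# are mapped onto the Poincare ball (exponential map at the origin) and a query is
# matched to the closest stored concept by hyperbolic distance.
import numpy as np

C = 1.0  # ball curvature (assumption)

def expmap0(v, c=C):
    norm = np.linalg.norm(v) + 1e-9
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def mobius_add(x, y, c=C):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def hyper_dist(x, y, c=C):
    return 2 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

rng = np.random.default_rng(0)
concepts = {f"concept_{i}": expmap0(rng.normal(size=64) * 0.1) for i in range(5)}
query = expmap0(rng.normal(size=64) * 0.1)
best = min(concepts, key=lambda k: hyper_dist(concepts[k], query))
print("nearest concept:", best)
```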
[58] MindShift: Analyzing Language Models’ Reactions to Psychological Prompts
Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina, Anton Razzhigaev, Andrey Kuznetsov
Main category: cs.CL
TL;DR: The paper introduces MindShift, a benchmark using adapted MMPI psychometric tests to evaluate how well LLMs can adopt and reflect specified personality traits through persona-based prompts.
Details
Motivation: To investigate whether LLMs can effectively absorb and reflect personality traits specified by users, and to assess their psychological adaptability through robust psychometric measures.
Method: Adapted the Minnesota Multiphasic Personality Inventory (MMPI) to create personality-oriented prompts with detailed personas varying in trait intensity, then measured LLMs' ability to follow these roles.
Result: Found consistent improvement in LLMs’ role perception due to better training datasets and alignment techniques, with significant response differences across model types suggesting variability in emulating human-like personality traits.
Conclusion: LLMs show potential for psychological adaptability, with MindShift serving as a valuable benchmark for evaluating this capability, and the framework will be publicly available for further research.
Abstract: Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in the psychological literature, namely the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs’ behavior to identify traits. To assess the sensitivity of LLMs to prompts and psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs’ psychological adaptability. The results highlight a consistent improvement in LLMs’ role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
[59] Non-Resolution Reasoning (NRR): A Computational Framework for Contextual Identity and Ambiguity Preservation
Kei Saito
Main category: cs.CL
TL;DR: NRR is a framework that treats ambiguity retention as valid reasoning rather than a defect, challenging AI’s tendency to prematurely collapse multiple interpretations into single outputs.
Details
Motivation: Current AI systems prematurely resolve ambiguity through "semantic collapse" - forcing multiple valid interpretations into single outputs due to classical identity assumptions in neural architectures.
Method: NRR introduces three principles (Non-Identity, Approximate Identity, Non-Resolution) formalized through Multi-Vector Embeddings, Non-Collapsing Attention, and Contextual Identity Tracking to maintain ambiguity.
Result: NRR-lite model achieves 90.9% out-of-distribution accuracy vs 9.1% for standard architectures on synthetic context-shift tasks, demonstrating ambiguity preservation enables structural generalization.
Conclusion: NRR challenges the assumption that meaning must collapse to be useful, offering foundations for AI with sophisticated ambiguity handling and creative reasoning - questioning not whether but when/how/under whose control ambiguity should be resolved.
Abstract: Current artificial intelligence systems, despite remarkable capabilities in text generation and pattern recognition, exhibit a fundamental architectural limitation: they resolve ambiguity prematurely. This premature semantic collapse – the tendency to collapse multiple valid interpretations into a single output – stems from classical identity assumptions embedded in standard neural architectures. We propose Non-Resolution Reasoning (NRR), a computational framework that treats ambiguity retention as a valid reasoning mode rather than a defect to be eliminated. NRR introduces three core principles: (1) Non-Identity ($A \neq A$) – the same symbol refers to different entities across contexts; (2) Approximate Identity ($A \approx A$) – entities share partial structural overlap without being identical; and (3) Non-Resolution – conflicting interpretations can coexist without forced convergence. We formalize these principles through three architectural components: Multi-Vector Embeddings for context-dependent representation, Non-Collapsing Attention for parallel interpretation retention, and Contextual Identity Tracking (CIT) for maintaining $A \neq A$ across inference. We demonstrate NRR’s advantages through case studies in paradox handling, creative generation, and context-dependent reasoning. Crucially, we provide a minimal empirical validation on a synthetic context-shift task where an NRR-lite model achieves 90.9% out-of-distribution accuracy compared to 9.1% for standard architectures, demonstrating that ambiguity preservation enables structural generalization. NRR challenges the assumption that meaning must collapse to be useful, offering a foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
[60] A stylometric analysis of speaker attribution from speech transcripts
Cristina Aggazzotti, Elizabeth Allyn Smith
Main category: cs.CL
TL;DR: The paper introduces StyloSpeaker, a stylometric method for speaker attribution by analyzing transcribed speech content, showing better performance on normalized transcripts and competitive results compared to neural approaches.
Details
Motivation: Traditional speaker recognition relies on acoustic properties, which fail when voices are disguised or synthesized. Authorship attribution methods work for written text but haven't been systematically applied to transcribed speech. There's a need for content-based approaches to identify speakers when vocal features are unreliable.
Method: StyloSpeaker applies stylometric authorship attribution techniques to transcribed speech, incorporating character, word, token, sentence, and style features. The method is evaluated on two transcript formats: prescriptive (with capitalization/punctuation) and normalized (without these conventions), with varying degrees of topic control.
Result: Higher attribution performance generally achieved on normalized transcripts, except under strongest topic control where overall performance is highest. The explainable stylometric model performs competitively compared to black-box neural approaches on the same data, with investigation into which stylistic features most effectively distinguish speakers.
Conclusion: Stylometric methods can effectively attribute speakers from transcribed speech content, especially when transcripts are normalized. The approach provides an explainable alternative to neural methods and works when acoustic features are unavailable or unreliable due to voice disguise or synthesis.
Abstract: Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts’ conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
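A toy illustration of content-based speaker comparison with a handful of stylometric features and cosine similarity; StyloSpeaker's actual feature set and decision rule are far richer, so the features and the lack of any calibrated threshold below are simplifying assumptions.

```python
# Simplified stylometric sketch: extract a few character/word/sentence-level features
# from two transcripts and compare them with cosine similarity.
import re
import numpy as np

FILLERS = {"uh", "um", "like", "you", "know"}  # rough filler-word list (assumption)

def stylo_features(text: str) -> np.ndarray:
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return np.array([
        np.mean([len(w) for w in words]) if words else 0.0,   # average word length
        len(set(words)) / n_words,                            # type-token ratio
        n_words / max(len(sents), 1),                         # average sentence length
        sum(text.count(c) for c in ",;:") / n_words,          # punctuation rate
        sum(w in FILLERS for w in words) / n_words,           # filler-word rate
    ])

def same_speaker_score(a: str, b: str) -> float:
    fa, fb = stylo_features(a), stylo_features(b)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9))

print(same_speaker_score("well um i think we should go, you know",
                         "um yeah i think like we could maybe go"))
```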
[61] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition
Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun
Main category: cs.CL
TL;DR: EDU-based Context Compressor: A novel explicit compression framework that transforms text into structural relation trees of Elementary Discourse Units (EDUs) to preserve both global structure and fine-grained details, achieving state-of-the-art performance with reduced costs.
Details
Motivation: Managing extensive context is a critical bottleneck for LLMs in applications like long-document QA and autonomous agents, where lengthy inputs cause high computational costs and noise. Existing compression techniques either disrupt local coherence through discrete token removal or rely on implicit latent encoding with positional bias and API incompatibility issues.
Method: Two-step structure-then-select process: 1) LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) anchored to source indices to eliminate hallucination; 2) A lightweight ranking module selects query-relevant sub-trees for linearization. Also introduces StructBench dataset for evaluation.
Result: Achieves state-of-the-art structural prediction accuracy, significantly outperforms frontier LLMs while reducing costs. Structure-aware compression substantially enhances performance across downstream tasks including long-context tasks and complex Deep Search scenarios.
Conclusion: The EDU-based Context Compressor provides an effective explicit compression framework that preserves document structure while reducing computational costs, addressing key limitations of existing compression methods for LLM applications.
Abstract: Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
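The structure-then-select idea can be approximated very crudely with sentences standing in for EDUs and TF-IDF relevance standing in for the ranking module; both substitutions, and the fixed budget, are assumptions made only to illustrate the select-and-linearize step, not the LingoEDU tree construction.

```python
# Rough sketch of "structure-then-select": segment the context, score each unit
# against the query, keep the most relevant units, and linearize them in document order.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

def compress(context: str, query: str, keep: int = 3) -> str:
    units = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    vec = TfidfVectorizer().fit(units + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(units))[0]
    top = sorted(sorted(range(len(units)), key=lambda i: -sims[i])[:keep])  # keep doc order
    return " ".join(units[i] for i in top)

context = ("The plant opened in 1998. It employs 2,400 people. "
           "Its main product is lithium batteries. The cafeteria was renovated in 2019. "
           "Battery output doubled after the 2021 expansion.")
print(compress(context, "How many people work at the battery plant?"))
```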
[62] Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models
Caner Erden
Main category: cs.CL
TL;DR: MAHA is a novel attention mechanism that uses hierarchical decomposition and optimization-based aggregation to reduce computational complexity while maintaining global dependencies, achieving 81% FLOPs reduction at sequence length 4096.
Details
Motivation: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) is a fundamental bottleneck for scaling Large Language Models to long-context tasks. Existing sparse and linearized attention mechanisms often compromise global dependencies or fail to capture multi-scale semantic granularity effectively.
Method: MAHA reformulates attention through hierarchical decomposition and mathematically rigorous aggregation. It dynamically partitions input sequences into hierarchical scales via learnable downsampling operators. The core innovation is modeling scale-specific attention matrix fusion as a resource allocation problem, solved via convex optimization or Nash equilibrium-based game-theoretic approaches. Implemented within a hybrid dilated-convolutional transformer backbone with differentiable optimization layers for end-to-end training.
Result: MAHA achieves superior scalability with an 81% reduction in computational cost (FLOPs) at sequence length 4096 compared to standard attention, while maintaining a theoretically optimal balance between local nuance and global context fidelity.
Conclusion: This work bridges optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs that addresses the fundamental computational bottleneck of attention mechanisms while preserving multi-scale semantic representation.
Abstract: The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multi-scale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
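A toy PyTorch rendering of hierarchical multi-scale attention with learned fusion weights. Real MAHA derives the scale-fusion weights via convex optimization or a game-theoretic equilibrium; the softmax-weighted combination, the shared attention module across scales, and average-pool downsampling below are all simplifying assumptions.

```python
# Toy multi-scale attention: downsample the sequence at several scales, run attention
# at each scale, upsample the outputs, and fuse them with learned softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMAHA(nn.Module):
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mix = nn.Parameter(torch.zeros(len(scales)))  # aggregation logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, L, D)
        outs = []
        for s in self.scales:
            xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s).transpose(1, 2)
            ys, _ = self.attn(xs, xs, xs)                   # attention at this scale
            ys = F.interpolate(ys.transpose(1, 2), size=x.shape[1]).transpose(1, 2)
            outs.append(ys)
        w = torch.softmax(self.mix, dim=0)                  # fuse scales
        return sum(wi * oi for wi, oi in zip(w, outs))

x = torch.randn(2, 64, 128)
print(ToyMAHA(128)(x).shape)   # torch.Size([2, 64, 128])
```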
[63] Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
Main category: cs.CL
TL;DR: SCOPE improves test-time RL for LLMs by using confidence-weighted pseudo-labels and dynamic subgroup partitioning to reduce confirmation bias and sparse reward issues.
Details
Motivation: Current test-time RL methods using majority voting suffer from confirmation bias and sparse rewards, limiting LLM reasoning improvement despite being complementary to RLVR approaches.
Method: SCOPE integrates step-wise confidence into pseudo-label deduction (prioritizing high-quality reasoning over frequency) and dynamically partitions candidate outputs into subgroups balancing quality vs diversity, then derives local consensus via repeat sampling per subgroup.
Result: SCOPE consistently outperforms recent baselines across various models and benchmarks, achieving 13.1% relative improvement on AIME 2025 and 8.1% on AMC.
Conclusion: SCOPE effectively addresses confirmation bias and sparse reward issues in test-time RL through confidence-weighted pseudo-labels and dynamic subgroup partitioning, enabling better LLM reasoning improvement.
Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency counts. Furthermore, it dynamically partitions the pool of candidate outputs into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks; the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on the challenging AIME 2025 and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.
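The contrast between plain majority voting and confidence-weighted pseudo-label deduction can be shown in a few lines; the confidence values below are fabricated and the subgroup-partitioning step is omitted, so this is only a sketch of the core idea rather than the full SCOPE procedure.

```python
# Confidence-weighted pseudo-label deduction vs. plain majority voting: each sampled
# reasoning path contributes its final answer weighted by a per-path confidence
# (e.g., the mean step-wise confidence). Values here are fabricated placeholders.
from collections import Counter, defaultdict

samples = [  # (final answer, mean step-wise confidence) per sampled reasoning path
    ("42", 0.91), ("42", 0.88), ("42", 0.85),
    ("17", 0.35), ("17", 0.40), ("17", 0.32), ("17", 0.30),
]

majority = Counter(a for a, _ in samples).most_common(1)[0][0]

weights = defaultdict(float)
for answer, conf in samples:
    weights[answer] += conf
confidence_weighted = max(weights, key=weights.get)

print("majority vote:", majority)                   # "17" (four low-confidence paths)
print("confidence-weighted:", confidence_weighted)  # "42" (fewer but stronger paths)
```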
[64] Rakuten Data Release: A Large-Scale and Long-Term Reviews Corpus for Hotel Domain
Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate
Main category: cs.CL
TL;DR: Large-scale Rakuten Travel Reviews corpus with 7.3M reviews spanning 2009-2024, featuring comprehensive metadata and analysis of data drift factors between 2019-2024.
Details
Motivation: To provide a comprehensive, large-scale dataset of travel reviews for research purposes, enabling analysis of customer feedback trends, accommodation performance, and temporal data patterns over a 16-year period.Method: Collection of 7.3 million customer reviews from Rakuten Travel spanning 2009-2024, with detailed metadata extraction including review text, responses, user IDs, dates, accommodation details, room information, purpose, group composition, and multi-aspect ratings. Statistical analysis applied to examine data drift patterns between 2019-2024.
Result: Created a comprehensive corpus with rich metadata for each review record. Provided statistical insights into the dataset characteristics and identified factors driving data drift during the 2019-2024 period using statistical approaches.
Conclusion: The Rakuten Travel Reviews corpus represents a valuable resource for research in tourism, natural language processing, and data analysis, offering unprecedented scale and temporal coverage with detailed metadata for comprehensive analysis of travel review patterns and trends.
Abstract: This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews spanning 16 years, from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
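For readers planning to work with the corpus, one possible record layout mirroring the metadata listed in the abstract is sketched below; the field names and types are hypothetical, and the released dataset defines its own schema.

```python
# Illustrative record layout for the review corpus described above.
# Field names are hypothetical -- consult the dataset release for the real schema.
from dataclasses import dataclass

@dataclass
class HotelReview:
    review_text: str
    accommodation_response: str | None
    reviewer_id: str                 # anonymized
    review_date: str                 # e.g. "2019-07-14"
    accommodation_id: str
    plan_id: str
    plan_title: str
    room_type: str
    room_name: str
    purpose: str                     # e.g. leisure / business
    accompanying_group: str          # e.g. family / couple
    aspect_ratings: dict[str, int]   # per-aspect scores
    overall_score: int
```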
cs.CV
[65] Two-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real
Yan Yang, George Bebis, Mircea Nicolescu
Main category: cs.CV
TL;DR: A two-step generative data augmentation framework combining rule-based mask warping with GAN-based image translation to generate realistic masked-face samples for addressing data scarcity in face recognition.
Details
Motivation: Data scarcity and distribution shift pose major challenges for masked face detection and recognition, creating a need for better data augmentation methods to generate realistic masked-face samples beyond purely synthetic transformations.Method: Two-step framework: 1) Rule-based mask warping, 2) Unpaired image-to-image translation using GANs. Introduces non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity.
Result: The approach yields consistent qualitative improvements over rule-based warping alone and complements existing GAN-based methods like IAMGAN. Experimental observations highlight effectiveness of the proposed components.
Conclusion: The framework effectively addresses data scarcity for masked face recognition and suggests directions for future improvements in data-centric augmentation for face recognition tasks.
Abstract: Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
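One straightforward reading of the non-mask preservation loss mentioned above is an L1 penalty restricted to pixels outside the synthetic mask region, so the translator only alters the pasted mask while preserving the rest of the face. The paper's exact formulation may differ; this is a minimal sketch under that reading.

```python
# Minimal sketch of a "non-mask preservation" loss: penalize changes outside
# the region where the rule-based warping pasted the synthetic face mask.
import torch

def non_mask_preservation_loss(translated, source, mask_region):
    """
    translated, source: (B, 3, H, W) images in [0, 1]
    mask_region: (B, 1, H, W) binary map, 1 where the synthetic mask was pasted
    """
    outside = 1.0 - mask_region
    return ((translated - source).abs() * outside).sum() / (outside.sum() * 3 + 1e-8)

b, h, w = 2, 64, 64
src = torch.rand(b, 3, h, w)
fake = torch.rand(b, 3, h, w)
mask = torch.zeros(b, 1, h, w)
mask[:, :, 24:48, 16:48] = 1.0
print(non_mask_preservation_loss(fake, src, mask).item())
```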
[66] Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara
Main category: cs.CV
TL;DR: JARVIS is a JEPA-inspired framework that enhances MLLMs’ visual reasoning by integrating self-supervised visual learning through frozen vision foundation models, improving performance on vision-centric tasks without harming multimodal abilities.
Details
Motivation: Current MLLMs have limited visual reasoning capabilities because they learn visual understanding primarily from subjective/incomplete textual descriptions and overfit language priors due to the modest scale of multimodal instruction tuning compared to massive text-only pre-training.Method: Integrates I-JEPA learning paradigm into standard vision-language alignment pipeline. Uses frozen vision foundation models as context and target encoders, while training the predictor (implemented as early LLM layers) to learn structural and semantic regularities from images without relying exclusively on language supervision.
Result: Extensive experiments show JARVIS consistently improves performance on vision-centric benchmarks across different LLM families without degrading multimodal reasoning abilities.
Conclusion: JARVIS effectively addresses MLLMs’ visual reasoning limitations through self-supervised visual enhancement, demonstrating improved vision-centric performance while maintaining multimodal capabilities.
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
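The JEPA-style objective described above can be summarized as regressing the embeddings of hidden target patches from the visible context, with the vision encoder frozen and only the predictor trained. The sketch below is a minimal illustration under that reading; the toy encoder, shapes, and masking scheme are assumptions, not the JARVIS implementation.

```python
# Simplified JEPA-style training step: a frozen encoder produces patch
# embeddings, and a trainable predictor regresses the embeddings of
# masked target patches from the remaining context.
import torch
import torch.nn as nn

def jepa_step(frozen_encoder, predictor, image_patches, target_idx):
    with torch.no_grad():
        emb = frozen_encoder(image_patches)          # (B, N, D), no gradients
    context = emb.clone()
    context[:, target_idx] = 0.0                     # hide target patches from the predictor
    pred = predictor(context)                        # (B, N, D)
    # Regress only the hidden target-patch embeddings.
    return nn.functional.mse_loss(pred[:, target_idx], emb[:, target_idx])

B, N, D = 2, 16, 64
frozen = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D)).eval()
for p in frozen.parameters():
    p.requires_grad_(False)
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=1)
loss = jepa_step(frozen, predictor, torch.randn(B, N, D), target_idx=[3, 7, 11])
loss.backward()
print(loss.item())
```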
[67] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic
Main category: cs.CV
TL;DR: The paper introduces CityNav, a benchmark for evaluating MLLMs’ sequential decision-making in real-world city navigation without environmental annotations, and proposes Verbalization of Path (VoP) to improve performance.
Details
Motivation: Current evaluation benchmarks for MLLMs are too language-centric or simulation-based, lacking assessment of nuanced, knowledge-intensive reasoning needed for practical real-world embodied tasks like city navigation.Method: Created CityNav benchmark with 4 global cities requiring agents to navigate 50+ decision points using only visual inputs. Proposed Verbalization of Path (VoP) technique that grounds agent reasoning by extracting explicit cognitive maps (landmarks and directions) from MLLMs.
Result: State-of-the-art MLLMs and standard reasoning techniques significantly underperform in CityNav. VoP substantially enhances navigation success by improving spatial reasoning and planning capabilities.
Conclusion: CityNav reveals limitations of current MLLMs in knowledge-intensive real-world navigation. VoP demonstrates the importance of explicit cognitive mapping for improving embodied agent performance in complex spatial tasks.
Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/
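A minimal way to picture the Verbalization of Path idea is a prompt that asks the model to state its cognitive map (landmarks seen so far and the believed direction of the destination) before committing to an action. The wording and fields below are hypothetical, not the paper's actual prompt.

```python
# Illustrative VoP-style prompt construction; the exact wording used in the
# paper is not reproduced here.
def build_vop_prompt(destination: str, observed_landmarks: list[str], actions: list[str]) -> str:
    landmark_log = "\n".join(f"- {lm}" for lm in observed_landmarks) or "- (none yet)"
    return (
        f"You are navigating a real city toward: {destination}.\n"
        f"Landmarks observed so far:\n{landmark_log}\n\n"
        "Step 1 (cognitive map): list the key landmarks around you and state the "
        "compass direction you believe the destination lies in.\n"
        "Step 2 (decision): choose exactly one action from "
        f"{actions} and justify it using your cognitive map."
    )

print(build_vop_prompt("the central railway station",
                       ["a river on the left", "a church spire ahead"],
                       ["forward", "turn_left", "turn_right"]))
```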
[68] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
Main category: cs.CV
TL;DR: LinkedOut: A VLLM-based video representation that extracts world knowledge directly from raw video frames for fast, multi-video recommendation without language bottlenecks.
Details
Motivation: Current VLLMs face deployment challenges for video recommendation: high latency from decode-only generation, lack of multi-video input support, and loss of fine-grained visual details when constrained to language outputs. These limitations stem from missing representations that preserve pixel-level detail while leveraging world knowledge.Method: LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. It introduces a cross-layer knowledge fusion MoE (Mixture of Experts) that selects appropriate abstraction levels from rich VLLM features for personalized, interpretable recommendation.
Result: Achieves state-of-the-art results on standard benchmarks. First VLLM-based video recommendation method operating on raw frames without handcrafted labels. Interpretability studies confirm benefits of layer diversity and layer-wise fusion.
Conclusion: LinkedOut provides a practical path to fully leverage VLLM world-knowledge priors and visual reasoning for downstream vision tasks like recommendation, enabling fast inference, multi-video support, and preservation of visual details without language bottlenecks.
Abstract: Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
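The cross-layer fusion component can be pictured as a lightweight gate that mixes hidden states drawn from several VLLM layers per token, letting the recommender pick its level of abstraction. The averaged-state gating in the sketch below is an assumption for illustration, not LinkedOut's actual MoE design.

```python
# Toy per-token gating over hidden states from several VLLM layers.
import torch
import torch.nn as nn

class CrossLayerFusion(nn.Module):
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_layers)  # per-token routing over layers

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, B, T, D) hidden states from selected layers
        mean_state = layer_states.mean(dim=0)               # (B, T, D)
        weights = torch.softmax(self.gate(mean_state), -1)  # (B, T, num_layers)
        weights = weights.permute(2, 0, 1).unsqueeze(-1)    # (num_layers, B, T, 1)
        return (weights * layer_states).sum(dim=0)          # (B, T, D)

states = torch.randn(4, 2, 10, 32)  # 4 layers, batch 2, 10 tokens, dim 32
print(CrossLayerFusion(32, 4)(states).shape)  # torch.Size([2, 10, 32])
```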
[69] R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
Main category: cs.CV
TL;DR: R4 is a training-free framework that equips vision-language models with structured 4D spatio-temporal memory for retrieval-augmented reasoning in dynamic environments.
Details
Motivation: Inspired by human ability to build persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics for multimodal reasoning about surroundings.Method: Continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time. Natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations for VLM reasoning.
Result: R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines on embodied question answering and navigation benchmarks.
Conclusion: R4 advances a new paradigm for embodied 4D reasoning in dynamic environments through training-free, structured 4D memory that enables episodic and collaborative reasoning.
Abstract: Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM’s reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
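The retrieval step can be illustrated with a tiny in-memory store whose entries carry a semantic embedding, a metric position, and a timestamp, and whose queries filter by space and time before ranking by semantic similarity. The data structures and scoring below are illustrative only, not the R4 implementation.

```python
# Minimal 4D memory: filter by time window and spatial radius, rank by
# cosine similarity of semantic embeddings.
import numpy as np

class Memory4D:
    def __init__(self):
        self.entries = []  # (embedding, xyz, t, description)

    def add(self, embedding, xyz, t, description):
        self.entries.append((np.asarray(embedding, float), np.asarray(xyz, float), t, description))

    def query(self, sem_key, center_xyz=None, radius=np.inf, t_range=(-np.inf, np.inf), k=3):
        sem_key = np.asarray(sem_key, float)
        scored = []
        for emb, xyz, t, desc in self.entries:
            if not (t_range[0] <= t <= t_range[1]):
                continue
            if center_xyz is not None and np.linalg.norm(xyz - center_xyz) > radius:
                continue
            sim = emb @ sem_key / (np.linalg.norm(emb) * np.linalg.norm(sem_key) + 1e-8)
            scored.append((sim, desc))
        return [d for _, d in sorted(scored, reverse=True)[:k]]

mem = Memory4D()
mem.add([1, 0], [0, 0, 0], t=10, description="red mug on kitchen table")
mem.add([0, 1], [5, 2, 0], t=50, description="laptop on office desk")
print(mem.query(sem_key=[1, 0.1], center_xyz=np.array([0, 0, 0]), radius=2.0, t_range=(0, 30)))
```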
[70] The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs
Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta
Main category: cs.CV
TL;DR: The Perceptual Observatory is a framework that systematically evaluates multimodal LLMs’ visual grounding capabilities beyond just accuracy, testing robustness, attribution fidelity, and reasoning under controlled perturbations.
Details
Motivation: Current MLLM evaluations focus too much on end-task accuracy while overlooking whether progress reflects genuine visual understanding versus reliance on textual knowledge. There's a need to characterize perceptual capacities as many models scale language components while reusing similar vision encoders.Method: A framework with multiple evaluation verticals: (1) simple vision tasks like face matching and text-in-vision comprehension, (2) local-to-global understanding including image matching, grid pointing game, and attribute localization. Uses ground-truth datasets systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions.
Result: The framework provides principled analysis of how MLLMs preserve perceptual grounding and relational structure under perturbations, moving beyond leaderboard accuracy to reveal strengths and weaknesses in visual understanding.
Conclusion: The Perceptual Observatory offers a systematic foundation for analyzing MLLMs’ perceptual capacities, addressing concerns about whether progress reflects genuine visual grounding or just textual knowledge scaling.
Abstract: Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across complementary verticals: (i) simple vision tasks, such as face matching and text-in-vision comprehension; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.
[71] Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello
Main category: cs.CV
TL;DR: CAMP-VLM: A Vision Language Model framework for multi-human behavior prediction from third-person perspective using contextual features and scene graphs, fine-tuned with synthetic data and outperforming baselines by up to 66.9%.
Details
Motivation: Most prior research focuses on single-human behavior prediction from egocentric views, but robotic applications often require understanding multiple human behaviors from third-person perspectives, creating a gap in current capabilities.Method: Developed CAMP-VLM framework that incorporates contextual visual features and spatial awareness from scene graphs. Used synthetic human behavior data from photorealistic simulator for fine-tuning due to lack of suitable real datasets. Applied Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) techniques.
Result: CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy. The model was evaluated on both synthetic and real-world sequences to assess generalization capabilities.
Conclusion: The proposed CAMP-VLM framework effectively addresses the challenge of multi-human behavior prediction from third-person perspectives, demonstrating strong performance through synthetic data training and advanced fine-tuning techniques.
Abstract: Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of humans-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
[72] From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre
Main category: cs.CV
TL;DR: VLM-based detectors adapted for multispectral inputs outperform specialized models in few-shot regimes and achieve competitive results in fully supervised settings on FLIR and M3FD benchmarks.
Details
Motivation: Multispectral object detection is crucial for safety-critical applications like autonomous driving, but limited annotated data restricts deep detector training. Textual class information can provide semantic supervision in data-scarce scenarios.Method: Adapt two VLM-based detectors (Grounding DINO and YOLO-World) to handle multispectral inputs and propose an effective mechanism to integrate text, visual, and thermal modalities.
Result: VLM-based detectors excel in few-shot regimes, significantly outperforming specialized multispectral models with comparable data, and achieve competitive/superior results in fully supervised settings on FLIR and M3FD benchmarks.
Conclusion: Semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual, and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
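As a generic illustration of combining visible and thermal inputs, the sketch below concatenates the two feature maps and projects them back with a 1x1 convolution. The paper's actual integration inside Grounding DINO and YOLO-World (including the text modality) is its own mechanism; this only illustrates the basic fusion idea.

```python
# Generic visible/thermal feature fusion via concatenation + 1x1 convolution.
import torch
import torch.nn as nn

class VisibleThermalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, thermal_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, thermal_feat: (B, C, H, W) feature maps from a shared backbone
        return self.fuse(torch.cat([rgb_feat, thermal_feat], dim=1))

rgb = torch.randn(2, 256, 32, 32)
thermal = torch.randn(2, 256, 32, 32)
print(VisibleThermalFusion(256)(rgb, thermal).shape)  # torch.Size([2, 256, 32, 32])
```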
[73] Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models
Zhihao Zhang, Xuejun Yang, Weihua Liu, Mouquan Shen
Main category: cs.CV
TL;DR: EDN: A learning framework using encoder-decoder network to transform random Gaussian noise into high-quality initial noise for improved novel view synthesis with diffusion models.
Details
Motivation: Current single-view novel view synthesis models based on diffusion models lack dedicated learning frameworks to generate high-quality initial noise patterns, which are known to produce better generation results than random noise.Method: 1) Design discretized Euler inversion method to inject image semantic information into random noise, creating paired datasets of random and high-quality noise. 2) Propose encoder-decoder network (EDN) that directly transforms random noise into high-quality noise.
Result: EDN can be seamlessly integrated into various NVS models (SV3D, MV-Adapter) and achieves significant performance improvements across multiple datasets.
Conclusion: The proposed EDN framework effectively learns to generate high-quality initial noise for diffusion-based NVS models, leading to improved novel view synthesis performance without requiring modifications to existing model architectures.
Abstract: Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.
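The core of EDN is a learned map from random Gaussian noise to "high-quality" initial noise, supervised with pairs built by the discretized Euler inversion. The sketch below shows a toy encoder-decoder trained on such pairs, with the inversion-derived targets assumed to be given; architecture and shapes are illustrative only.

```python
# Toy encoder-decoder that maps random noise to target ("high-quality") noise.
# The Euler-inverted targets from the paper are stood in by random tensors here.
import torch
import torch.nn as nn

class NoiseEncoderDecoder(nn.Module):
    def __init__(self, channels: int = 4, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1),
        )

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(noise))

model = NoiseEncoderDecoder()
random_noise = torch.randn(2, 4, 64, 64)   # latent-shaped Gaussian noise
target_noise = torch.randn(2, 4, 64, 64)   # stand-in for Euler-inverted noise
loss = nn.functional.mse_loss(model(random_noise), target_noise)
loss.backward()
print(loss.item())
```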
[74] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos
Wenkang Zhang, Yan Zhao, Qiang Wang, Zhixin Xu, Li Song, Zhengxue Cheng
Main category: cs.CV
TL;DR: D-FCGS is a feedforward compression framework for dynamic 3D Gaussian Splatting that achieves 40x compression while maintaining visual quality, using standardized GoF structure and dual prior-aware entropy modeling.
Details
Motivation: Efficient compression of dynamic 3D representations for Free-Viewpoint Video (FVV) is challenging. Existing dynamic 3D Gaussian Splatting methods have limitations: they couple reconstruction with optimization-dependent compression, use customized motion formats, and lack generalization and standardization.Method: Proposes D-FCGS with three key innovations: (1) standardized Group-of-Frames structure with I-P coding using sparse control points to extract inter-frame motion tensors; (2) dual prior-aware entropy model combining hyperprior and spatial-temporal priors for accurate rate estimation; (3) control-point-guided motion compensation mechanism and refinement network for view-consistent fidelity enhancement.
Result: D-FCGS matches rate-distortion performance of optimization-based methods, achieves over 40 times compression compared to baseline while preserving visual quality across viewpoints. It generalizes across diverse scenes in zero-shot fashion after training on Gaussian frames from multi-view videos.
Conclusion: This work advances feedforward compression of dynamic 3D Gaussian Splatting, facilitating scalable Free-Viewpoint Video transmission and storage for immersive applications by providing a standardized, generalizable compression framework.
Abstract: Free-Viewpoint Video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representation remains a major challenge. Existing dynamic 3D Gaussian Splatting methods couple reconstruction with optimization-dependent compression and customized motion formats, limiting generalization and standardization. To address this, we propose D-FCGS, a novel Feedforward Compression framework for Dynamic Gaussian Splatting. Key innovations include: (1) a standardized Group-of-Frames (GoF) structure with I-P coding, leveraging sparse control points to extract inter-frame motion tensors; (2) a dual prior-aware entropy model that fuses hyperprior and spatial-temporal priors for accurate rate estimation; (3) a control-point-guided motion compensation mechanism and refinement network to enhance view-consistent fidelity. Trained on Gaussian frames derived from multi-view videos, D-FCGS generalizes across diverse scenes in a zero-shot fashion. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression compared to the baseline while preserving visual quality across viewpoints. This work advances feedforward compression of dynamic 3DGS, facilitating scalable FVV transmission and storage for immersive applications.
[75] Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Earl Ranario, Mason J. Earles
Main category: cs.CV
TL;DR: Current vision-language models underperform supervised baselines on agricultural classification tasks: the best VLM reaches roughly 62% accuracy under multiple-choice prompting, well below the supervised YOLO11 baseline, showing VLMs are not yet suitable as standalone agricultural diagnostic systems.
Details
Motivation: To evaluate the reliability of vision-language models for agricultural decision support, as VLMs are increasingly proposed as general-purpose solutions but their performance on agricultural tasks remains poorly understood.Method: Benchmarked diverse open-source and closed-source VLMs on 27 agricultural classification datasets from AgML collection (162 classes across plant disease, pest/damage, and plant/weed species identification) using zero-shot evaluation with multiple-choice and open-ended prompting, comparing against supervised YOLO11 baseline.
Result: Zero-shot VLMs substantially underperform supervised baseline; best VLM (Gemini-3 Pro) reaches ~62% average accuracy with multiple-choice prompting; open-ended prompting yields much lower performance (<25%); LLM-based semantic judging improves open-ended accuracy; plant/weed classification easier than pest/damage identification; Qwen-VL-72B best open-source model.
Conclusion: Current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
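The multiple-choice evaluation protocol compared above can be sketched as follows; `ask_vlm` is a hypothetical placeholder for any model API, and only the scaffolding around it is shown, not the benchmark's actual harness.

```python
# Sketch of zero-shot multiple-choice evaluation; the VLM call is a placeholder.
def multiple_choice_prompt(class_names: list[str]) -> str:
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(class_names))
    return f"Which class best describes this image? Answer with the number only.\n{options}"

def evaluate(dataset, class_names, ask_vlm):
    """dataset: iterable of (image, true_label_index)."""
    correct = 0
    for image, true_idx in dataset:
        reply = ask_vlm(image, multiple_choice_prompt(class_names))
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits and int(digits) == true_idx:
            correct += 1
    return correct / max(len(dataset), 1)

# Example with a dummy "model" that always answers "0":
fake_data = [("img_a", 0), ("img_b", 2)]
print(evaluate(fake_data, ["healthy leaf", "rust", "powdery mildew"],
               lambda img, prompt: "0"))  # 0.5
```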
[76] Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings
Lars Beckers, Arno Waes, Aaron Van Campenhout, Toon GoedemĂŠ
Main category: cs.CV
TL;DR: Robotic mowing system uses deep learning to preserve visually diverse vegetation patches, enhancing garden biodiversity without species-level supervision.
Details
Motivation: Current robotic mowers create ecologically worthless monocultural lawns. The paper aims to transform lawns into vibrant biotopes that boost urban biodiversity through active conservation rather than passive rewilding.Method: Uses ResNet50 pretrained on PlantNet300K for deep feature-space analysis to identify diverse vegetation patches. A global deviation metric estimates biodiversity from embeddings without species-level supervision. Selective mowing algorithm dynamically alternates between mowing and conservation based on these estimates.
Result: Strong correlation between embedding-space dispersion and expert biodiversity assessment. System effectively preserves visually diverse patches and demonstrates feasibility of deep visual diversity as proxy for ecological richness.
Conclusion: The framework successfully turns monocultural lawns into valuable biotopes, showing that widespread adoption could significantly boost urban biodiversity through intelligent, perception-driven robotic mowing.
Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
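A simple way to picture the embedding-dispersion idea above is to measure how far patch features spread around their centroid; larger spread suggests more visually diverse vegetation. The paper's global deviation metric and mowing thresholds may differ; this sketch only conveys the general idea.

```python
# Embedding-space dispersion as a proxy for visual (and, by proxy, ecological)
# diversity of a lawn patch.
import torch

def embedding_dispersion(patch_embeddings: torch.Tensor) -> float:
    """patch_embeddings: (num_patches, dim) features, e.g. from a frozen ResNet50."""
    centroid = patch_embeddings.mean(dim=0, keepdim=True)
    return (patch_embeddings - centroid).norm(dim=1).mean().item()

uniform_lawn = torch.randn(32, 128) * 0.1   # tightly clustered features
mixed_patch = torch.randn(32, 128) * 1.0    # widely spread features
print(embedding_dispersion(uniform_lawn) < embedding_dispersion(mixed_patch))  # True (w.h.p.)
```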
[77] CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada
Main category: cs.CV
TL;DR: A method to generate video-action pairs from text instructions using a dual diffusion model with cross-modal attention, enabling robotic policy learning without action annotations.
Details
Motivation: Overcoming limitations in existing methods: two-stage pipelines limit cross-modal sharing, and single-modal adaptations can't fully leverage pretrained video knowledge. Also addresses the common lack of action annotations in video datasets for robotic learning.Method: Three key components: (1) Extend pretrained video diffusion model with parallel action diffusion model to preserve pretrained knowledge, (2) Bridge Attention mechanism for effective cross-modal interaction, (3) Action refinement module to convert coarse actions into precise controls for low-resolution datasets.
Result: Extensive evaluations on multiple public benchmarks and real-world datasets show the method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines.
Conclusion: The approach offers a scalable framework for leveraging large-scale video data for robotic learning by automatically providing action labels for video diffusion models, overcoming the action annotation bottleneck.
Abstract: We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot’s joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.
[78] Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving
Jiaheng Geng, Jiatong Du, Xinyu Zhang, Ye Li, Panqu Wang, Yanjun Huang
Main category: cs.CV
TL;DR: A closed-loop evaluation platform for end-to-end autonomous driving that generates adversarial interactions in real-world scenes to create safety-critical corner cases for model testing.
Details
Motivation: Safety-critical corner cases are crucial for evaluating autonomous driving systems but are difficult to collect in the real world. Existing adversarial evaluation methods work in simplified simulation environments, but adversarial evaluation for real-world end-to-end autonomous driving has been little explored.Method: Proposes a platform with two key components: 1) A real-world image generator based on flow matching that efficiently generates realistic driving images from traffic environment information, and 2) An adversarial traffic policy that models challenging interactions to create corner cases that current systems struggle to handle. The platform evaluates end-to-end models like UniAD and VAD.
Result: The platform can generate realistic driving images efficiently. When evaluating models like UniAD and VAD, the adversarial policy successfully creates corner cases that cause performance degradation in the tested models, demonstrating the platform’s ability to detect potential issues.
Conclusion: The proposed platform effectively generates adversarial interactions in real-world scenes and can detect performance degradation in end-to-end autonomous driving models during corner cases. This will facilitate improved safety and robustness of autonomous driving systems by identifying potential issues before deployment.
Abstract: Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. Through evaluating the end-to-end models such as UniAD and VAD, we demonstrate that based on the adversarial policy, our platform evaluates the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model’s potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.
[79] FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution
Hao Tang, Hanyu Liu, Alessandro Perelli, Xi Chen, Chao Li
Main category: cs.CV
TL;DR: A 3D multi-channel patch diffusion model that predicts high angular resolution fiber orientation distribution (HAR-FOD) from low angular resolution dMRI (LAR-FOD) using anatomical priors and specialized attention mechanisms.
Details
Motivation: Single-shell low angular resolution dMRI (LAR-FOD) has limited accuracy, while multi-shell high angular resolution dMRI (HAR-FOD) requires long scanning times, limiting clinical applicability. Diffusion models show promise but face challenges due to the large number of spherical harmonic coefficients in FOD.Method: Proposes a 3D multi-channel patch diffusion model with: 1) FOD-patch adapter incorporating brain anatomy priors for efficient patch-based learning, 2) voxel-level conditional coordinating module for global understanding, and 3) SH attention module to learn complex correlations between spherical harmonic coefficients.
Result: The method achieves best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
Conclusion: The proposed diffusion model effectively addresses the challenge of generating high-quality HAR-FOD from LAR-FOD by incorporating anatomical priors and specialized attention mechanisms, offering a practical solution for clinical applications where scanning time is limited.
Abstract: Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
[80] Auto-Vocabulary 3D Object Detection
Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh
Main category: cs.CV
TL;DR: AV3DOD introduces auto-vocabulary 3D object detection that automatically generates class names without user input, achieving SOTA performance on both localization and semantic quality metrics.
Details
Motivation: Existing open-vocabulary 3D detection methods still require user-specified classes during training and inference, limiting their autonomy. The authors aim to create a truly autonomous system that can generate class names automatically without any human input.Method: Proposes AV3DOD framework that uses 2D vision-language models for: 1) image captioning to generate semantic candidates, 2) pseudo 3D box generation, and 3) feature-space semantics expansion. Introduces Semantic Score (SS) to evaluate generated class name quality.
Result: Achieves state-of-the-art performance on ScanNetV2 and SUNRGB-D datasets, surpassing CoDA by 3.48 overall mAP and achieving 24.5% relative improvement in Semantic Score on ScanNetV2.
Conclusion: AV3DOD successfully demonstrates autonomous 3D object detection with automatically generated class names, advancing beyond traditional open-vocabulary approaches that still require user input.
Abstract: Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
[81] LAPX: Lightweight Hourglass Network with Global Context
Haopeng Zhao, Marsha Mariya Kappan, Mahdi Bamdad, Francisco Cruz
Main category: cs.CV
TL;DR: LAPX is a lightweight human pose estimation network with self-attention that achieves competitive accuracy on MPII and COCO benchmarks with only 2.3M parameters while maintaining real-time performance suitable for edge devices.
Details
Motivation: Current SOTA human pose estimation methods are too computationally expensive for edge deployment, while existing lightweight models sacrifice too much accuracy. There's a need for models that balance efficiency and accuracy for edge devices.Method: LAPX builds on previous LAP work by incorporating self-attention to capture global context in an Hourglass network architecture. It advances stage design and refines lightweight attention modules for better efficiency.
Result: Achieves competitive results on MPII and COCO benchmark datasets with only 2.3M parameters, demonstrating real-time performance suitable for edge devices.
Conclusion: LAPX successfully addresses the trade-off between accuracy and efficiency in human pose estimation, providing a solution that maintains competitive accuracy while being lightweight enough for edge deployment with real-time performance.
Abstract: Human pose estimation is a crucial task in computer vision. Methods that achieve SOTA (State-of-the-Art) accuracy often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce their model size and computational cost. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, building on the previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refines the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.
[82] FARM: Fine-Tuning Geospatial Foundation Models for Intra-Field Crop Yield Regression
Shayan Nejadshamsi, Yuanyuan Zhang, Shadi Zaki, Brock Porth, Lysa Porth, Vahab Khoshdel
Main category: cs.CV
TL;DR: FARM is a deep learning framework that fine-tunes geospatial foundation models for high-resolution canola yield prediction, achieving state-of-the-art results with RMSE 0.44 and R² 0.81 on Canadian Prairie data.
Details
Motivation: Traditional crop yield prediction methods lack the scalability and granularity needed for precision farming. There's a need for high-resolution, intra-field yield prediction tools that can bridge the gap between large-scale Earth observation and on-farm decision-making.Method: FARM fine-tunes the pre-trained Prithvi-EO-2.0-600M geospatial foundation model for continuous regression tasks. It transforms multi-temporal satellite imagery into dense, pixel-level (30 m) yield maps, adapting the model specifically for agricultural applications.
Result: Achieved RMSE of 0.44 and R² of 0.81 on Canadian Prairies dataset. Fine-tuning on limited ground-truth labels outperformed training from scratch, and FARM surpassed baseline architectures like 3D-CNN and DeepYield.
Conclusion: FARM demonstrates that fine-tuning large-scale geospatial foundation models is effective for specialized agricultural applications, providing continuous, high-resolution yield predictions that are more actionable for precision agriculture than conventional methods.
Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces FARM: Fine-tuning Agricultural Regression Models, a deep learning framework designed for high-resolution, intra-field canola yield prediction. FARM leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level (30 m) yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, FARM achieves a Root Mean Squared Error (RMSE) of 0.44 and an R^2 of 0.81. Using an independent high-resolution yield monitor dataset, we further show that fine-tuning FARM on limited ground-truth labels outperforms training the same architecture from scratch, confirming the benefit of pre-training on large, upsampled county-level data for data-scarce precision agriculture. These results represent improvement over baseline architectures like 3D-CNN and DeepYield, which highlight the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, FARM offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
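Fine-tuning a foundation model for dense regression, as described above, typically amounts to freezing (most of) the backbone and training a pixel-wise regression head. The sketch below illustrates that pattern with a placeholder backbone; the Prithvi-EO-2.0-600M interface and FARM's actual training recipe are not reproduced here.

```python
# Schematic of dense yield regression with a frozen stand-in backbone.
import torch
import torch.nn as nn

backbone = nn.Sequential(                    # placeholder for a geospatial foundation model
    nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
for p in backbone.parameters():
    p.requires_grad_(False)                  # keep pretrained features frozen

regression_head = nn.Conv2d(64, 1, kernel_size=1)   # one yield value per pixel

x = torch.randn(2, 6, 32, 32)                # multi-temporal / multi-band input patch
y = torch.rand(2, 1, 32, 32)                 # ground-truth yield map (rescaled)

pred = regression_head(backbone(x))
loss = nn.functional.mse_loss(pred, y)
loss.backward()                              # gradients flow only into the head
print(loss.item())
```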
[83] Collimator-assisted high-precision calibration method for event cameras
Zibin Liu, Shunkun Liang, Banglei Guan, Dongcai Tan, Yang Shang, Qifeng Yu
Main category: cs.CV
TL;DR: Proposes a novel event camera calibration method using collimator with flickering star patterns for high-precision long-range measurement.
Details
Motivation: Event cameras have advantages like high dynamic range and temporal resolution, but their geometric calibration (especially for long-range measurement) remains challenging. Existing methods struggle with dual requirements of long-distance and high-precision measurement.Method: Uses collimator with flickering star-based patterns. First linearly solves camera parameters using sphere motion model of collimator, then performs nonlinear optimization to refine parameters with high precision.
Result: Comprehensive real-world experiments show the method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability across varying conditions.
Conclusion: The proposed collimator-based calibration method effectively addresses the challenges of event camera calibration for long-range, high-precision measurement scenarios.
Abstract: Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.
[84] TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Main category: cs.CV
TL;DR: TurboDiffusion is a video generation acceleration framework that achieves 100-200x speedup for diffusion-based video generation while maintaining quality, using attention acceleration, step distillation, and quantization techniques.
Details
Motivation: The motivation is to address the computational inefficiency and slow generation speed of diffusion-based video models, which limits their practical deployment despite their high-quality output capabilities.Method: TurboDiffusion uses three main components: (1) Attention acceleration with low-bit SageAttention and trainable Sparse-Linear Attention, (2) Step distillation using rCM for efficient step reduction, and (3) W8A8 quantization of model parameters and activations to 8 bits. Additional engineering optimizations are also incorporated.
Result: Experimental results on multiple Wan2 models show that TurboDiffusion achieves 100-200x speedup for video generation on a single RTX 5090 GPU while maintaining comparable video quality to the original models.
Conclusion: TurboDiffusion successfully demonstrates that diffusion-based video generation can be dramatically accelerated (100-200x) through systematic optimization techniques without sacrificing output quality, making high-quality video generation more practical and accessible.
Abstract: We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
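Among the components listed above, W8A8 quantization is the easiest to illustrate: weights and activations are rounded to 8-bit integers with per-tensor scales, and the matmul result is rescaled. The sketch below simulates this in floating point; real deployments (including TurboDiffusion's) rely on fused int8 kernels rather than this naive form.

```python
# Simulated W8A8 (8-bit weight and activation) quantization of a linear layer.
import torch

def quantize_sym(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    scale = x.abs().max() / qmax + 1e-12
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

w = torch.randn(256, 256)                       # weight matrix
a = torch.randn(8, 256)                         # activations
qw, sw = quantize_sym(w)
qa, sa = quantize_sym(a)
y_int8 = (qa @ qw.t()) * (sa * sw)              # integer-domain matmul, rescaled
y_fp = a @ w.t()
print((y_int8 - y_fp).abs().mean().item())      # small quantization error
```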
[85] Flexible Camera Calibration using a Collimator System
Shunkun Liang, Banglei Guan, Zhenbao Yu, Dongcai Tan, Pengju Sun, Zibin Liu, Qifeng Yu, Yang Shang
Main category: cs.CV
TL;DR: Novel camera calibration method using a collimator system with angle invariance constraint that reduces 6DOF motion to 3DOF rotation, enabling closed-form linear solvers and single-image calibration without camera motion.
Details
Motivation: Camera calibration is essential for photogrammetry and 3D vision, but existing methods may require complex setups or camera motion. The paper aims to provide a more reliable, controllable, and flexible calibration solution using a collimator system.Method: Uses a designed collimator system that creates an angle invariance constraint, proving that relative motion between target and camera follows a spherical motion model. This reduces 6DOF motion to 3DOF pure rotation. Develops closed-form linear solvers for multiple images, minimal solver for two images, and single collimator image calibration algorithm that eliminates camera motion requirements.
Result: Evaluated in synthetic and real-world experiments, verifying feasibility of collimator-based calibration and demonstrating superiority over existing baseline methods.
Conclusion: The collimator system provides a reliable calibration environment, and the proposed method offers flexible, fast calibration with reduced motion requirements, outperforming existing approaches.
Abstract: Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration
[86] Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space
Ren Nakagawa, Yang Yang, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Fumio Okura, Takenao Ohkawa
Main category: cs.CV
TL;DR: CattleAct: A data-efficient method for detecting behavioral interactions between grazing cattle from single images by decomposing interactions into individual cattle actions and using contrastive learning to embed rare interactions in a unified latent space.
Details
Motivation: Smart livestock management needs automated detection of cattle interactions (e.g., for estrus detection), but this is challenging due to the rarity of such interactions and lack of comprehensive behavioral datasets with interactions.Method: First learn an action latent space from large-scale cattle action dataset, then embed rare interactions via fine-tuning the pre-trained latent space using contrastive learning to create a unified latent space of actions and interactions.
Result: Experiments on commercial-scale pasture demonstrate accurate interaction detection compared to baselines, with a practical working system integrating video and GPS inputs.
Conclusion: CattleAct provides an effective, data-efficient solution for detecting rare cattle interactions, enabling practical smart livestock management applications with available implementation.
Abstract: This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.
[87] ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT
Ze Yuan, Wenbin Li, Shusen Zhao
Main category: cs.CV
TL;DR: Hybrid framework for dual-spectral CT combining iterative OPMT reconstruction with ResDynUNet++ neural network refinement for improved basis material decomposition.
Details
Motivation: To address challenges in dual-spectral CT reconstruction including channel imbalance and near-interface large artifacts, and to leverage both knowledge-driven and data-driven approaches for better image quality.Method: Two-phase hybrid approach: 1) Knowledge-driven phase using OPMT for fast intermediate reconstruction, 2) Data-driven phase using novel ResDynUNet++ network (UNet++ backbone with residual dynamic convolution blocks) to refine intermediate solution.
Result: Extensive experiments on synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.
Conclusion: The hybrid framework successfully integrates iterative methods with deep learning, producing clean and accurate final solutions for dual-spectral CT reconstruction.
Abstract: We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.
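The abstract describes replacing standard convolutions with residual dynamic convolution blocks. Below is a minimal PyTorch sketch of one such block, assuming a CondConv-style mixture of K kernels gated by global pooling; the authors' exact block design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDynamicConv(nn.Module):
    """Input-conditioned mixture of K 3x3 kernels with a residual skip connection."""
    def __init__(self, channels, k=4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(k, channels, channels, 3, 3) * 0.02)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, k), nn.Softmax(dim=1))

    def forward(self, x):
        attn = self.gate(x)                                   # (B, K) mixing weights
        w = torch.einsum('bk,koihw->boihw', attn, self.kernels)
        out = torch.cat([F.conv2d(x[i:i+1], w[i], padding=1)  # per-sample dynamic kernel
                         for i in range(x.size(0))], dim=0)
        return x + out                                        # residual connection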
[88] SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation
Yueyang Hu, Haiyong Jiang, Haoxuan Song, Jun Xiao, Hao Pan
Main category: cs.CV
TL;DR: SegGraph: A novel SAM segment graph-based propagation method for few-shot 3D part segmentation that aggregates 2D foundation model knowledge to 3D by learning geometric features from SAM segmentation masks.
Details
Motivation: Existing methods for few-shot 3D part segmentation either ignore geometric structures for 3D feature learning or neglect high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. There's a need to effectively aggregate 2D knowledge from foundation models to 3D.Method: SegGraph constructs a segment graph where nodes represent SAM segments and edges capture spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are propagated via a graph neural network to learn global geometric structures. A novel view-direction-weighted fusion maps segment features to 3D points while attenuating contributions from low-quality segments.
Result: Extensive experiments on PartNet-E show SegGraph outperforms all competing baselines by at least 6.9% mIoU. The method achieves particularly strong performance on small components and part boundaries, demonstrating superior geometric understanding.
Conclusion: SegGraph effectively addresses the challenge of aggregating 2D foundation model knowledge to 3D for few-shot part segmentation by explicitly learning geometric features from SAM segmentation masks through graph-based propagation, achieving state-of-the-art performance.
Abstract: This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, it remains an open problem how to effectively aggregate 2D knowledge from foundation models to 3D. Existing methods either ignore geometric structures for 3D feature learning or neglect the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM’s segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: https://github.com/YueyangHu2000/SegGraph.
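As a rough illustration of the segment-graph construction step (nodes are SAM segments, edges come from overlap/adjacency), a naive sketch is given below; the paper's actual node features and adjacency test are richer than this.

```python
import numpy as np

def build_segment_graph(masks):
    """masks: (N, H, W) boolean SAM masks. Connect two segments if their masks
    overlap; adjacency could be added by dilating each mask before the test."""
    n = masks.shape[0]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.logical_and(masks[i], masks[j]).any():
                edges.append((i, j))
    return edges
```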
[89] C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation
Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen
Main category: cs.CV
TL;DR: C-DGPA is a novel unsupervised domain adaptation method that uses dual alignment (marginal + conditional distribution alignment) with vision-language models to address domain discrepancies, achieving SOTA results on major benchmarks.
Details
Motivation: Existing prompt-tuning strategies for VLMs in UDA primarily align marginal distribution but neglect conditional distribution discrepancies, leading to class prototype misalignment and degraded semantic discriminability.Method: C-DGPA uses a dual-branch architecture: 1) marginal distribution alignment branch with dynamic adversarial training, and 2) conditional distribution alignment branch with Class Mapping Mechanism (CMM) to standardize semantic prompt understanding and prevent source domain over-reliance.
Result: Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate superiority, achieving new state-of-the-art results on all benchmarks.
Conclusion: C-DGPA effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations for UDA tasks.
Abstract: Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
[90] Towards Closing the Domain Gap with Event Cameras
M. Oltan Sevinc, Liao Wu, Francisco Cruz
Main category: cs.CV
TL;DR: Event cameras maintain consistent performance across day-night lighting domain gaps without requiring adjustments, outperforming traditional cameras.
Details
Motivation: Traditional cameras suffer from domain gap problems when deployment conditions don't match training data, particularly with day-night lighting differences. Event cameras offer a potential solution to maintain performance across lighting conditions.Method: Proposed using event cameras instead of traditional cameras for end-to-end driving systems. Compared performance across lighting conditions without requiring additional adjustments or domain adaptation techniques.
Result: Event cameras maintain more consistent performance across lighting conditions, showing domain-shift penalties comparable to or smaller than grayscale frames. They provide superior baseline performance in cross-domain scenarios.
Conclusion: Event cameras are a promising alternative to traditional cameras for end-to-end driving systems, as they can handle day-night lighting domain gaps effectively without requiring additional adjustments.
Abstract: Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on do not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras we propose event cameras as a potential alternative which can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and provide superior baseline performance in cross-domain scenarios.
[91] Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation
Jerrin Bright, Zhibo Wang, Dmytro Klepachevskyi, Yuhao Chen, Sirisha Rambhatla, David Clausi, John Zelek
Main category: cs.CV
TL;DR: Avatar4D is a pipeline for generating customizable synthetic human motion datasets with fine-grained control over pose, appearance, camera, and environment, validated on sports applications through Syn2Sport dataset.
Details
Motivation: Prior works focus on general motions with limited flexibility, lacking domain-specific customization. There's a need for controllable synthetic datasets for specialized applications like sports where unique motion patterns exist.Method: Developed Avatar4D pipeline for generating 4D human motion sequences with control over body pose, appearance, camera viewpoint, and environmental context. Created Syn2Sport dataset for sports applications (baseball, ice hockey) without manual annotations.
Result: Generated high-fidelity synthetic motion sequences. Benchmarking showed effectiveness for supervised learning, zero-shot transfer to real data, and cross-sport generalization. Feature space analysis showed alignment with real-world datasets.
Conclusion: Avatar4D enables scalable, controllable, and transferable synthetic human datasets for domain-specific tasks without requiring domain-specific real data, demonstrating potential for diverse applications.
Abstract: We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.
[92] Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar
Main category: cs.CV
TL;DR: VALOR introduces a reinforcement learning framework with Group-Relative Proximal Optimization to improve visual grounding and clinical accuracy in radiology report generation by aligning vision and language representations.
Details
Motivation: Current Large Medical Vision-Language Models for radiology report generation suffer from hallucinations due to poor cross-modal alignment between visual and linguistic representations, and existing approaches rely on costly labeled data or retrieval methods that don't adequately address this issue.Method: VALOR uses a two-stage reinforcement learning-based post-alignment framework with Group-Relative Proximal Optimization: 1) improving the Med-VLM with textual rewards for clinically precise terminology, and 2) aligning the vision projection module with disease findings to guide attention to relevant image regions.
Result: Extensive experiments on multiple benchmarks show VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
Conclusion: VALOR effectively addresses the hallucination problem in radiology report generation by improving cross-modal alignment through reinforcement learning, resulting in more clinically accurate and visually grounded reports.
Abstract: Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
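For readers unfamiliar with the group-relative step in GRPO-style training mentioned above, a minimal sketch of computing group-relative advantages follows; this is a generic GRPO ingredient, not the paper's reward design.

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: (num_studies, group_size) reward for each sampled report.
    Each sample is scored relative to the other samples drawn for the same study."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```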
[93] Open Ad-hoc Categorization with Contextualized Feature Learning
Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren
Main category: cs.CV
TL;DR: OAK introduces learnable context tokens in frozen CLIP to discover and expand ad-hoc categories through semantic extension and visual clustering, achieving SOTA performance on adaptive visual categorization tasks.
Details
Motivation: AI agents need adaptive visual scene categorization for changing tasks. Unlike fixed common categories, ad-hoc categories are dynamically created for specific goals, requiring discovery of underlying context and expansion through semantic extension and visual clustering.Method: OAK introduces a small set of learnable context tokens at the input of a frozen CLIP model, optimizing with both CLIP’s image-text alignment objective and GCD’s visual clustering objective to discover and expand ad-hoc categories.
Result: OAK achieves state-of-the-art accuracy and concept discovery on Stanford and Clevr-4 datasets, including 87.4% novel accuracy on Stanford Mood (surpassing CLIP and GCD by over 50%). It produces interpretable saliency maps focusing on task-relevant features.
Conclusion: OAK enables adaptive and generalizable categorization by leveraging similar perceptual mechanisms between ad-hoc and common categories, promoting transparency and trust through interpretable saliency maps while achieving superior performance.
Abstract: Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP’s image-text alignment objective and GCD’s visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.
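The core mechanism above, a small set of learnable context tokens prepended to the input of a frozen CLIP encoder, can be sketched as follows; the module name and wiring are illustrative assumptions, not OAK's actual code.

```python
import torch
import torch.nn as nn

class ContextTokens(nn.Module):
    """Learnable context tokens prepended to the token sequence fed into a
    frozen backbone; only these parameters receive gradients."""
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, seq_len, dim) from the frozen CLIP patchifier
        ctx = self.tokens.unsqueeze(0).expand(patch_embeddings.size(0), -1, -1)
        return torch.cat([ctx, patch_embeddings], dim=1)
```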
[94] Enhanced 3D Shape Analysis via Information Geometry
Amit Vishwakarma, K. S. Subrahamanian Moosath
Main category: cs.CV
TL;DR: This paper introduces a stable, bounded KL divergence variant (MSKL) for comparing 3D point clouds represented as Gaussian Mixture Models, overcoming limitations of traditional geometric metrics.
Details
Motivation: Traditional point cloud comparison methods (Hausdorff, Chamfer distances) fail to capture global statistical structure and are sensitive to outliers, while existing KL divergence approximations for GMMs can produce unbounded or numerically unstable values.Method: Proposes an information geometric framework representing point clouds as Gaussian Mixture Models on a statistical manifold, and introduces Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds for numerical stability.
Result: MSKL provides stable, monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations on human pose discrimination (MPI-FAUST) and animal shape comparison (G-PCD) datasets.
Conclusion: The information geometric framework with MSKL divergence offers a robust, numerically stable approach for 3D point cloud shape analysis that better captures statistical structure and geometric variation compared to existing methods.
Abstract: Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
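The exact MSKL definition, with its guaranteed bounds, is given in the paper. For orientation, the quantity it modifies, the symmetric KL divergence between two GMMs, has no closed form and is commonly estimated by Monte Carlo, e.g. with scikit-learn's fitted GaussianMixture models:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_symmetric_kl(gmm_p, gmm_q, n_samples=10000):
    """Monte Carlo estimate of KL(p||q) + KL(q||p) for two fitted GaussianMixture
    models (illustrative baseline; the paper's MSKL adds boundedness guarantees)."""
    def kl(p, q):
        x, _ = p.sample(n_samples)
        return float(np.mean(p.score_samples(x) - q.score_samples(x)))
    return kl(gmm_p, gmm_q) + kl(gmm_q, gmm_p)
```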
[95] Image Compression Using Singular Value Decomposition
Justin Jiang
Main category: cs.CV
TL;DR: SVD-based low-rank matrix approximation for image compression performs worse than standard formats (JPEG, JPEG2000, WEBP) in compression efficiency, sometimes even producing larger files than originals at low error levels.
Details
Motivation: Images constitute a large portion of internet data, making efficient compression crucial for reducing storage and bandwidth requirements. The study aims to explore SVD and low-rank approximations as potential compression methods.Method: Uses Singular Value Decomposition (SVD) and low-rank matrix approximations for image compression. Evaluates performance using relative Frobenius error and compression ratio metrics. Tests approach on both grayscale and multichannel images to assess generality.
Result: Low-rank approximations often produce visually similar images to originals, but compression efficiency is consistently worse than established formats (JPEG, JPEG2000, WEBP) at comparable error levels. At low tolerated error levels, SVD compression can even produce files larger than the original image.
Conclusion: SVD-based low-rank approximation method is not competitive with industry-standard codecs for practical image compression applications.
Abstract: Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
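The rank-k truncation and the two reported metrics (relative Frobenius error and compression ratio) can be reproduced in a few lines of NumPy; this is a generic sketch, not the study's exact evaluation code.

```python
import numpy as np

def svd_compress(image, rank):
    """Rank-k SVD approximation of a 2D grayscale image (float array)."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    approx = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    rel_err = np.linalg.norm(image - approx) / np.linalg.norm(image)  # Frobenius norm
    m, n = image.shape
    # Values stored by the factors (U_k, s_k, V_k) versus the raw pixel count
    compression_ratio = (m * n) / (rank * (m + n + 1))
    return approx, rel_err, compression_ratio
```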
[96] ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
Main category: cs.CV
TL;DR: ARMFlow is an autoregressive framework for real-time 3D human reaction generation that addresses motion fidelity, real-time inference, and autoregressive adaptability challenges simultaneously.
Details
Motivation: Existing methods for 3D human reaction generation fail to simultaneously achieve high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios. There's a need for a solution that can handle all three challenges together.Method: ARMFlow uses a MeanFlow-based autoregressive framework with causal context encoder and MLP-based velocity predictor. It introduces Bootstrap Contextual Encoding (BSCE) to encode generated history instead of ground-truth to reduce error accumulation. Also includes a global contextual encoder for semantic alignment.
Result: ARMFlow achieves over 40% improvement in FID on InterHuman and InterX datasets compared to existing online methods. It matches offline state-of-the-art performance while using only partial sequence conditions. The offline variant ReMFlow achieves fastest inference among offline methods.
Conclusion: ARMFlow successfully addresses the three key challenges of 3D human reaction generation, enabling high-quality real-time inference with autoregressive adaptability for online scenarios while reducing error accumulation through innovative training techniques.
Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding the generated history instead of the ground truth, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
[97] AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection
Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer
Main category: cs.CV
TL;DR: AI framework combines skin images with family history data to improve dermatology diagnosis accuracy, especially for hereditary conditions like melanoma and psoriasis.
Details
Motivation: Dermatological conditions affect billions globally, but accurate diagnosis is challenging due to limited specialist availability and complex presentations. Family history significantly influences skin disease risk and treatment response but is often underutilized in diagnosis.Method: Multi-modal AI framework combining deep learning image analysis with structured clinical data including family history patterns. Uses interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. Includes validation with healthcare professionals.
Result: AI system shows enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions (melanoma, psoriasis, atopic dermatitis). Expert feedback indicates potential for improved early detection and personalized recommendations.
Conclusion: The framework is designed for clinical workflow integration while maintaining interpretability through explainable AI. Formal clinical trials are planned to validate AI-assisted diagnosis against traditional assessment across diverse healthcare settings.
Abstract: Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.
[98] Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models
Qi Zhang, Yunfei Gong, Zhidan Xie, Zhizi Wang, Antoni B. Chan, Hui Huang
Main category: cs.CV
TL;DR: Proposes two semi-supervised multi-view crowd counting frameworks using model ranking constraints to address limited labeled data, with one ranking predictions and another ranking uncertainties.
Details
Motivation: Multi-view crowd counting suffers from limited datasets due to difficulty in collecting and annotating multi-view images. Existing solutions include synthetic data or semi/weakly-supervised methods, but more effective approaches are needed for limited labeled data scenarios.Method: Two semi-supervised frameworks: 1) Vanilla model ranks multi-view fusion models’ predictions (fewer views ⤠more views), 2) Uncertainty-based method ranks model uncertainties guided by prediction errors (more views ⤠fewer views). Both use ranking constraints in semi-supervised training.
Result: Experiments show advantages over other semi-supervised counting methods, demonstrating effectiveness of the proposed model ranking approaches for multi-view counting with limited labeled data.
Conclusion: The proposed model ranking methods provide effective semi-supervised solutions for multi-view crowd counting when labeled data is limited, outperforming existing semi-supervised approaches.
Abstract: Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models’ prediction results of different numbers of camera-view inputs, namely, the model’s predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models’ prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
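The first (vanilla) ranking constraint, that a count predicted from fewer camera views should not exceed the count predicted from more views, can be written as a simple hinge penalty on unlabeled frames. A minimal sketch, with an assumed margin parameter, follows.

```python
import torch

def view_ranking_loss(pred_fewer_views, pred_more_views, margin=0.0):
    """Penalize unlabeled frames where the crowd count predicted from fewer
    camera views exceeds the count predicted from more views."""
    return torch.relu(pred_fewer_views - pred_more_views + margin).mean()
```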
[99] Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning
Paloma Casteleiro Costa, Parnian Ghapandar Kashani, Xuhui Liu, Alexander Chen, Ary Portes, Julien Bec, Laura Marcu, Aydogan Ozcan
Main category: cs.CV
TL;DR: FLIM_PSR_k is a deep learning framework that uses cGAN to reconstruct high-resolution FLIM images from low-resolution data, enabling 5x super-resolution and 25x space-bandwidth improvement for faster, higher-quality clinical FLIM imaging.
Details
Motivation: Clinical adoption of FLIM is limited by long pixel dwell times and low SNR, creating strict resolution-speed trade-offs that hinder real-time diagnostics.Method: A conditional GAN (cGAN) framework for multi-channel pixel super-resolution that reconstructs high-resolution FLIM images from data acquired with up to 5x larger pixel sizes, chosen over diffusion models for robustness and faster inference.
Result: Achieves reliable 5x super-resolution factor (25x space-bandwidth increase), reveals fine architectural features lost in low-resolution inputs, with statistically significant improvements across image quality metrics in blind testing on patient tumor tissue.
Conclusion: FLIM_PSR_k advances FLIM toward faster, higher-resolution implementations compatible with low-NA and miniaturized platforms, better positioning FLIM for translational clinical applications.
Abstract: Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM’s effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
[100] TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang
Main category: cs.CV
TL;DR: TextEditBench: A benchmark for evaluating text editing in images, focusing on reasoning-intensive scenarios and introducing Semantic Expectation as a new evaluation dimension.
Details
Motivation: Text editing in images is largely unexplored despite being a challenging frontier in visual generation. Current approaches lack evaluation benchmarks that focus on text-centric regions and reasoning-intensive editing scenarios requiring understanding of physical plausibility, linguistic meaning, and cross-modal dependencies.Method: Introduces TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Proposes a novel evaluation dimension called Semantic Expectation (SE) that measures reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing.
Result: Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they struggle with context-dependent reasoning, physical consistency, and layout-aware integration.
Conclusion: TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation by focusing evaluation on this long-overlooked yet fundamental capability.
Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
[101] GFLAN: Generative Functional Layouts
Mohamed Abouagour, Eleftherios Garyfallidis
Main category: cs.CV
TL;DR: GFLAN is a two-stage generative framework for floor plan synthesis that separates topological planning from geometric realization, using specialized neural architectures to address architectural reasoning challenges.
Details
Motivation: Current deep learning approaches for automated floor plan generation struggle with architectural reasoning - capturing topological relationships, functional constraint propagation, and circulation patterns. There's a need for a unified computational treatment that addresses the combinatorial search, geometric constraint satisfaction, and functional design requirements.Method: Two-stage decomposition: Stage A uses a specialized convolutional architecture with dual encoders to sequentially allocate room centroids via discrete probability maps. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented GNN to jointly regress room boundaries.
Result: The paper introduces GFLAN as a principled framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization, departing from direct pixel-to-pixel or wall-tracing generation.
Conclusion: GFLAN addresses fundamental challenges in floor plan generation by separating topological relationships from geometric instantiation, enabling better architectural reasoning through a structured two-stage approach.
Abstract: Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements – a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders – separating invariant spatial context from evolving layout state – to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
[102] MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval
Amna Amir, Erchan Aptoula
Main category: cs.CV
TL;DR: MACL introduces adaptive contrastive learning with label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to address semantic overlap, imbalanced distributions, and complex co-occurrence patterns in multi-label remote-sensing image retrieval.
Details
Motivation: The paper addresses three key challenges in multi-label remote-sensing image retrieval: semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns.Method: Multi-Label Adaptive Contrastive Learning (MACL) extends contrastive learning with three components: label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across common and rare categories.
Result: Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance.
Conclusion: MACL successfully addresses the challenges of multi-label remote-sensing image retrieval through adaptive contrastive learning, providing a robust solution for large-scale remote-sensing archives with code and models to be released publicly.
Abstract: Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD) show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.
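The abstract names three ingredients (label-aware sampling, frequency-sensitive weighting, dynamic-temperature scaling) without giving formulas. The sketch below shows one plausible way frequency-derived weights and per-sample temperatures could enter an InfoNCE-style loss; it is purely an illustration of the idea, and the weighting and temperature schedule are assumptions, not MACL's actual formulation.

```python
import torch
import torch.nn.functional as F

def frequency_weighted_infonce(emb_a, emb_b, label_freq, base_temp=0.07):
    """emb_a, emb_b: (B, D) paired embeddings; label_freq: (B,) relative frequency
    in (0, 1] of each sample's rarest label. Rare labels receive larger loss
    weights and a hypothetical frequency-dependent temperature."""
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    weights = 1.0 / (label_freq + 1e-6)
    temps = base_temp * torch.sqrt(label_freq + 1e-6)   # assumed temperature schedule
    logits = (emb_a @ emb_b.t()) / temps.unsqueeze(1)
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    per_sample = F.cross_entropy(logits, targets, reduction='none')
    return (weights * per_sample).sum() / weights.sum()
```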
[103] PixelArena: A benchmark for Pixel-Precision Visual Intelligence
Feng Liang, Sizhe Cheng, Chenqi Yi
Main category: cs.CV
TL;DR: PixelArena uses semantic segmentation tasks to benchmark fine-grained image generation capabilities of multi-modal LLMs, finding Gemini 3 Pro Image shows emergent zero-shot mask generation with high fidelity.
Details
Motivation: Current image generation benchmarks focus too much on aesthetics rather than fine-grained generation capabilities. There's a need for objective evaluation of multi-modal LLMs' pixel-level generative intelligence.Method: Proposes PixelArena benchmark using semantic segmentation tasks to examine fine-grained generative intelligence with pixel precision. Evaluates models in zero-shot settings and compares results qualitatively and quantitatively.
Result: Gemini 3 Pro Image demonstrates emergent image generation capabilities, generating semantic masks with high fidelity in zero-shot settings. Shows visual intelligence and true generalization in new image generation tasks.
Conclusion: The findings signal exciting progress in multi-modal LLMs and provide insights for future research in multimodality, reasoning, interpretability, and benchmarking.
Abstract: Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
[104] LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation
Haiyu Zhao, Yiwen Shan, Yuanbiao Gou, Xi Peng
Main category: cs.CV
TL;DR: LaverNet is a lightweight all-in-one video restoration network with only 362K parameters that handles multiple degradations while addressing challenges of time-varying artifacts through selective degradation-agnostic feature propagation.
Details
Motivation: Current all-in-one video restoration methods face two key challenges: 1) time-varying degradations can dominate temporal modeling, confusing models to focus on artifacts rather than content, and 2) existing approaches rely on large models that conceal underlying difficulties rather than solving them.Method: Proposes LaverNet, a lightweight network with only 362K parameters, featuring a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames to mitigate degradation impact on temporal modeling.
Result: Despite having less than 1% of parameters compared to existing models, LaverNet achieves comparable and even superior performance across benchmarks, demonstrating strong all-in-one restoration capability with a compact network.
Conclusion: Strong all-in-one video restoration can be achieved with a compact network, challenging the assumption that large models are necessary for handling multiple degradations, while the selective degradation-agnostic feature propagation effectively addresses time-varying degradation challenges.
Abstract: Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.
[105] Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs
Huayu Huang, Chen Chen, Banglei Guan, Ze Tan, Yang Shang, Zhang Li, Qifeng Yu
Main category: cs.CV
TL;DR: Proposes ridge estimation-based fusion localization combining UAV imagery and laser ranging to improve accuracy and robustness under challenging conditions with multicollinearity issues.
Details
Motivation: Traditional least squares estimation suffers from multicollinearity in design matrices under limited observation conditions (long distances, small intersection angles, large inclination angles), leading to ill-conditioned problems, instability, and low robustness in UAV-based target localization.Method: Fusion localization method based on ridge estimation that combines sequential imagery (rich scene information) with laser ranging (high precision). Ridge estimation is introduced to mitigate multicollinearity issues in the design matrix under limited observation conditions.
Result: The method achieves higher localization accuracy compared to ground localization algorithms based on single information. Ridge estimation effectively enhances robustness, particularly under limited observation conditions.
Conclusion: Ridge estimation-based fusion of UAV imagery and laser ranging provides improved localization accuracy and robustness, especially in challenging conditions where traditional least squares suffers from multicollinearity problems.
Abstract: Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on single information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
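The stabilizing effect of ridge estimation over plain least squares can be seen in a two-function comparison. This generic NumPy sketch (with an assumed regularization weight) is not the paper's full fusion pipeline, only the estimation step it builds on.

```python
import numpy as np

def least_squares(A, b):
    """Ordinary least squares via the normal equations; unstable when the
    columns of A are nearly collinear (ill-conditioned A^T A)."""
    return np.linalg.solve(A.T @ A, A.T @ b)

def ridge_estimate(A, b, lam=1e-3):
    """Ridge estimation: the lam * I term bounds the condition number of
    A^T A, stabilizing the solution under multicollinearity."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
```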
[106] QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing
Nan Zhou, Zuxin Li, Fanhang Man, Xuecheng Chen, Susu Xu, Fan Dang, Chaopeng Hong, Yunhao Liu, Xiao-Ping Zhang, Xinlei Chen
Main category: cs.CV
TL;DR: QUIDS is a quality-informed incentive system for non-dedicated vehicular crowdsensing that improves aggregated sensing quality by 38% over non-dispatching baselines through a novel ASQ metric and belief-aware dispatching algorithm.
Details
Motivation: Non-dedicated vehicular mobile crowdsensing faces challenges in achieving optimal Quality of Information due to interrelated issues of sensing coverage, reliability, and dynamic vehicle participation under budget constraints.
Method: Proposes QUIDS with Aggregated Sensing Quality metric integrating coverage and reliability, and a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates reliability and allocates incentives under uncertainty.
Result: QUIDS improves ASQ by 38% over non-dispatching scenarios and 10% over state-of-the-art methods, while reducing reconstruction map errors by 39-74% across algorithms using real-world metropolitan deployment data.
Conclusion: By jointly optimizing coverage and reliability through quality-informed incentives, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure for smart-city applications like traffic and environmental sensing.
Abstract: This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces reconstruction map errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.
[107] Collaborative Edge-to-Server Inference for Vision-Language Models
Soochang Song, Yongjune Kim
Main category: cs.CV
TL;DR: A collaborative edge-to-server VLM framework that reduces communication cost by selectively requesting high-resolution RoI images only when needed, maintaining accuracy.
Details
Motivation: Current VLM deployments transmit entire images from edge to server, but resizing loses fine details, causing accuracy degradation. The goal is to reduce communication cost while preserving accuracy.
Method: Two-stage framework: 1) Server processes the global image, identifies an RoI using VLM attention, and computes min-entropy confidence; 2) If confidence is low (min-entropy high), the server requests a detail-preserved local RoI image from the edge, then refines inference using both global and local images.
Result: Significantly reduces communication cost while maintaining inference accuracy across multiple VLM architectures.
Conclusion: Selective retransmission of essential visual content enables efficient edge-to-server VLM inference with minimal accuracy loss.
Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder’s input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM’s internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
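A minimal sketch of the retransmission decision described above, assuming per-token output probabilities are available on the server; the aggregation over tokens and the threshold value are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def min_entropy(token_probs):
    """Min-entropy of one categorical distribution: -log(max probability)."""
    return -np.log(np.max(token_probs))

def needs_roi_retransmission(per_token_probs, threshold=0.5):
    """Decide whether to request the detail-preserved RoI crop from the edge device.

    per_token_probs: list of probability vectors, one per generated output token.
    High min-entropy means low confidence; above the threshold, the server asks
    the edge device for the local RoI image. Averaging over tokens is only one
    plausible aggregation.
    """
    scores = [min_entropy(p) for p in per_token_probs]
    return float(np.mean(scores)) > threshold

if __name__ == "__main__":
    confident = [np.array([0.90, 0.05, 0.05]), np.array([0.80, 0.10, 0.10])]
    uncertain = [np.array([0.40, 0.30, 0.30]), np.array([0.35, 0.35, 0.30])]
    print(needs_roi_retransmission(confident))  # False: global image suffices
    print(needs_roi_retransmission(uncertain))  # True: request the RoI crop
```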
[108] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction
Tao Hu, Weiyu Zhou, Yanjie Tu, Peng Wu, Wei Dong, Qingsen Yan, Yanning Zhang
Main category: cs.CV
TL;DR: GMODiff is a one-step diffusion framework for HDR reconstruction that uses gain maps instead of full HDR content, achieving 100× speedup over previous LDM methods while maintaining quality.
Details
Motivation: Pre-trained LDMs show promise for HDR reconstruction but face three challenges: limited dynamic-range representation from 8-bit latent compression, high inference cost from multi-step denoising, and content hallucination from generative nature.
Method: Reformulates HDR reconstruction as gain map estimation, initializes denoising from regression-based estimates for one-step generation, and uses regression priors to guide both denoising and latent decoding to suppress hallucinations while preserving structure.
Result: GMODiff performs favorably against state-of-the-art methods and is 100× faster than previous LDM-based HDR reconstruction approaches.
Conclusion: The gain map-driven one-step diffusion framework effectively addresses the limitations of applying LDMs to HDR reconstruction, achieving both efficiency and quality improvements.
Abstract: Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100× faster than previous LDM-based methods.
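To illustrate what a gain map buys, here is a toy NumPy round trip under a common log-domain formulation (HDR recovered from an LDR base plus an 8-bit gain map); GMODiff's exact gain-map parameterization is not specified here, so treat this only as background intuition.

```python
import numpy as np

def encode_gain_map(hdr, ldr, eps=1e-6):
    """Store the per-pixel HDR/LDR ratio as an 8-bit log-domain gain map."""
    log_gain = np.log2((hdr + eps) / (ldr + eps))
    lo, hi = float(log_gain.min()), float(log_gain.max())
    gm8 = np.round(255.0 * (log_gain - lo) / (hi - lo + eps)).astype(np.uint8)
    return gm8, (lo, hi)

def decode_hdr(ldr, gm8, lo_hi, eps=1e-6):
    """Reconstruct HDR from the LDR base image and the quantized gain map."""
    lo, hi = lo_hi
    log_gain = gm8.astype(np.float64) / 255.0 * (hi - lo) + lo
    return (ldr + eps) * np.exp2(log_gain) - eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hdr = rng.uniform(0.0, 16.0, size=(4, 4))   # toy linear HDR radiance values
    ldr = np.clip(hdr, 0.0, 1.0)                # clipped LDR exposure
    gm8, lo_hi = encode_gain_map(hdr, ldr)
    rec = decode_hdr(ldr, gm8, lo_hi)
    print("max abs reconstruction error:", float(np.abs(rec - hdr).max()))
```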
[109] EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation
Haotian Ling, Zequn Chen, Qiuying Chen, Donglin Di, Yongjia Ma, Hao Li, Chen Wei, Zhulin Tao, Xun Yang
Main category: cs.CV
TL;DR: EverybodyDance introduces a systematic solution for multi-character animation with position swaps, focusing on Identity Correspondence correctness using graph matching and targeted optimization strategies.
Details
Motivation: Existing pose-driven character animation works well for single characters but struggles with multi-character scenarios involving position swaps, where maintaining correct Identity Correspondence (IC) between characters across frames is challenging.
Method: Proposes Identity Matching Graph (IMG) to model characters as nodes in a bipartite graph, uses Mask-Query Attention (MQA) to compute affinities, formalizes IC correctness as graph structural metric, and introduces identity-embedded guidance, multi-scale matching, and pre-classified sampling strategies.
Result: EverybodyDance substantially outperforms state-of-the-art baselines in both Identity Correspondence correctness and visual fidelity, validated through extensive experiments on the curated Identity Correspondence Evaluation benchmark.
Conclusion: The paper presents a comprehensive solution for multi-character animation with position swaps, successfully addressing the Identity Correspondence challenge through graph-based modeling and synergistic optimization strategies.
Abstract: Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swap is involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
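The identity-assignment idea can be made concrete with a few lines of SciPy: given an affinity matrix between characters in the generated and reference frames (in the paper these affinities come from Mask-Query Attention), a maximum-weight one-to-one matching on the bipartite graph recovers the identity correspondence. The Hungarian algorithm and the affinity values below are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_identities(affinity):
    """Maximum-affinity one-to-one assignment on a complete bipartite graph.

    affinity[i, j]: affinity between generated-frame character i and
    reference-frame character j. Returns (generated, reference) index pairs.
    """
    rows, cols = linear_sum_assignment(-affinity)  # negate to maximize total affinity
    return list(zip(rows.tolist(), cols.tolist()))

if __name__ == "__main__":
    # Two characters that swap positions: the strongest affinities are off-diagonal.
    affinity = np.array([[0.2, 0.9],
                         [0.8, 0.3]])
    print(match_identities(affinity))  # [(0, 1), (1, 0)]: identities tracked across the swap
```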
[110] Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi
Main category: cs.CV
TL;DR: FVG decomposes T2V generation into three specialized stages: LLM reasoning for initial scene description, T2I composition for anchor frame, and temporal synthesis for animation, achieving SOTA results with 70% faster sampling.
Details
Motivation: Current T2V models often fail at complex scene composition and logical temporal instructions, with many errors stemming from inability to create semantically correct initial frames.
Method: Three-stage pipeline: 1) LLM rewrites video prompt to describe only initial scene, 2) T2I model synthesizes compositionally-correct anchor frame, 3) Video model finetuned to understand anchor focuses on animation.
Result: Sets new SOTA on T2V CompBench benchmark, significantly improves all tested models on VBench2, and enables 70% reduction in sampling steps without performance loss.
Conclusion: Factorized Video Generation provides a practical path toward more efficient, robust, and controllable video synthesis through task decomposition and visual anchoring.
Abstract: State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model’s inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
[111] Adaptive Frequency Domain Alignment Network for Medical image segmentation
Zhanwei Li, Liang Li, Jiawan Zhang
Main category: cs.CV
TL;DR: AFDAN is a novel domain adaptation framework for medical image segmentation that aligns features in the frequency domain to address data scarcity, achieving state-of-the-art performance on vitiligo and retinal vessel segmentation tasks.
Details
Motivation: Medical image segmentation suffers from data scarcity due to time-consuming manual annotation. The paper aims to address this challenge by developing a domain adaptation framework that can transfer knowledge from labeled source domains to unlabeled target domains.
Method: AFDAN (Adaptive Frequency Domain Alignment Network) integrates three core components: 1) Adversarial Domain Learning Module for feature transfer, 2) Source-Target Frequency Fusion Module for blending frequency representations, and 3) Spatial-Frequency Integration Module for combining frequency and spatial features to enhance segmentation accuracy.
Result: AFDAN achieves 90.9% IoU for vitiligo segmentation on the new VITILIGO2025 dataset and 82.6% IoU on the DRIVE retinal vessel segmentation benchmark, surpassing existing state-of-the-art approaches.
Conclusion: The proposed AFDAN framework effectively addresses data scarcity in medical image segmentation through frequency domain alignment and cross-domain knowledge transfer, demonstrating superior performance on challenging segmentation tasks.
Abstract: High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN)–a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
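The Source-Target Frequency Fusion Module blends frequency representations across domains; as background, here is a minimal NumPy sketch of one fixed Fourier-domain mixing operation (swapping low-frequency amplitude between a source and a target image). AFDAN learns its fusion rather than using this hand-crafted swap, so the sketch is illustrative only.

```python
import numpy as np

def fuse_low_freq_amplitude(source, target, beta=0.05):
    """Replace the low-frequency amplitude of `source` with that of `target`.

    source, target: 2D float arrays (single-channel images) of equal shape.
    beta: half-width of the low-frequency band around DC, as a fraction of the image size.
    """
    fs = np.fft.fftshift(np.fft.fft2(source))
    ft = np.fft.fftshift(np.fft.fft2(target))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)

    h, w = source.shape
    bh, bw = int(h * beta), int(w * beta)
    ch, cw = h // 2, w // 2
    amp_s[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1] = \
        amp_t[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1]

    fused = amp_s * np.exp(1j * pha_s)
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.random((64, 64)), rng.random((64, 64))
    out = fuse_low_freq_amplitude(src, tgt)
    print(out.shape)  # (64, 64): source content with target low-frequency statistics
```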
[112] Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture
Haodi He, Jihun Yu, Ronald Fedkiw
Main category: cs.CV
TL;DR: 3D Gaussian Splatting enables unified facial reconstruction from uncalibrated images, producing structured geometry and neural textures compatible with standard graphics pipelines.
Details
Motivation: To create consistent 3D facial reconstructions from uncalibrated images using explicit neural representations that are compatible with standard graphics workflows, overcoming limitations of NeRFs and video requirements.
Method: Uses Gaussian Splatting with semantic segmentation alignment for facial regions, constrains Gaussians to triangulated surfaces, transforms splats into view-dependent neural textures, and employs relightable models for albedo extraction.
Result: Achieves neutral pose reconstruction from only 11 images (vs. long videos), produces structured geometry usable in standard pipelines, creates high-fidelity neural textures, and enables text-driven asset creation.
Conclusion: The approach successfully bridges neural representations with traditional graphics pipelines, enabling high-quality 3D facial assets from sparse images while maintaining compatibility with existing workflows.
Abstract: We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
[113] BrepLLM: Native Boundary Representation Understanding with Large Language Models
Liyuan Deng, Hao Guo, Yunpeng Bai, Yongkang Dai, Huaxi Huang, Yilei Shi
Main category: cs.CV
TL;DR: BrepLLM enables LLMs to process 3D Boundary Representation models by bridging the modality gap between structured 3D geometry and natural language through a two-stage training pipeline.
Details
Motivation: Current token-sequence-based LLMs cannot directly process 3D Brep models containing complex geometric and topological information, creating a modality gap between 3D geometry and natural language.
Method: Two-stage pipeline: 1) Cross-modal Alignment Pre-training with adaptive UV sampling to convert Breps to graphs, hierarchical BrepEncoder for feature extraction, and contrastive learning with CLIP. 2) Multi-stage LLM Fine-tuning with semantic mapping, LLM fine-tuning, and Mixture-of-Query Experts for geometric diversity.
Result: BrepLLM achieves state-of-the-art results on 3D object classification and captioning tasks, demonstrating effective bridging of the 3D geometry-natural language gap.
Conclusion: BrepLLM successfully enables LLMs to parse and reason over raw Brep data, establishing a framework for processing structured 3D geometry with natural language models.
Abstract: Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graph representations with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors. (2) performing fine-tuning of the LLM. (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.
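The first-stage alignment between the global Brep token and the frozen CLIP text embedding is described as contrastive learning; below is a compact PyTorch sketch of the standard symmetric InfoNCE objective such alignment typically uses. The tensor shapes, temperature, and projection assumptions are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(brep_tokens, text_embeds, temperature=0.07):
    """CLIP-style contrastive loss between Brep global tokens and text embeddings.

    brep_tokens: (B, D) global tokens from a Brep encoder, projected to dimension D.
    text_embeds: (B, D) caption embeddings from a frozen CLIP text encoder;
                 row i of both tensors describes the same shape.
    """
    b = F.normalize(brep_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = b @ t.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(b.size(0), device=b.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    torch.manual_seed(0)
    print(symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512)).item())
```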
[114] CountZES: Counting via Zero-Shot Exemplar Selection
Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
Main category: cs.CV
TL;DR: CountZES is a training-free framework for zero-shot object counting that progressively discovers diverse exemplars through three synergistic stages to address limitations of existing methods.
Details
Motivation: Existing zero-shot object counting methods have limitations: open-vocabulary detectors produce multi-instance candidates, while random patch sampling fails to accurately delineate object instances. There's a need for better exemplar selection in complex scenes for unseen categories.
Method: CountZES uses three progressive stages: 1) Detection-Anchored Exemplar (DAE) refines open-vocabulary detections to isolate single-instance exemplars; 2) Density-Guided Exemplar (DGE) uses density-driven self-supervised paradigm to find statistically consistent exemplars; 3) Feature-Consensus Exemplar (FCE) reinforces visual coherence through feature-space clustering.
Result: Experiments on diverse datasets show CountZES achieves superior performance among zero-shot object counting methods while effectively generalizing across natural, aerial, and medical domains.
Conclusion: CountZES provides an effective training-free framework for zero-shot object counting that balances textual grounding, count consistency, and feature representativeness through progressive exemplar discovery.
Abstract: Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.
[115] Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Shangxun Li, Youngjung Uh
Main category: cs.CV
TL;DR: A training-free method that refines text embeddings from a geometric perspective to improve subject consistency in text-to-image diffusion models for visual storytelling, without requiring fine-tuning or per-subject optimization.
Details
Motivation: Current text-to-image diffusion models struggle with preserving subject consistency across multiple outputs for visual storytelling. Existing approaches require computationally expensive fine-tuning or per-subject optimization, while training-free methods like 1Prompt1Story suffer from semantic leakage and text misalignment.
Method: A training-free approach that refines text embeddings from a geometric perspective to suppress unwanted semantics. The method addresses semantic entanglement by adjusting embeddings to maintain subject consistency across frames without requiring model fine-tuning or image conditioning.
Result: Extensive experiments show the approach significantly improves both subject consistency and text alignment over existing baselines, outperforming methods like 1Prompt1Story that suffer from semantic leakage.
Conclusion: The proposed geometric refinement of text embeddings provides an effective, training-free solution for improving subject consistency in text-to-image diffusion models for visual storytelling applications.
Abstract: Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
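The abstract describes refining text embeddings "from a geometric perspective" to suppress unwanted semantics; one simple geometric operation of that kind is projecting an embedding onto the orthogonal complement of an unwanted semantic direction, sketched below with NumPy. The paper's actual refinement may differ, so this is intuition rather than the method.

```python
import numpy as np

def suppress_direction(embedding, unwanted, strength=1.0):
    """Remove (or attenuate) the component of `embedding` along `unwanted`.

    embedding: (D,) prompt or token embedding.
    unwanted:  (D,) direction encoding semantics that should not leak in
               (e.g., another frame's description).
    strength:  1.0 removes the component entirely; smaller values attenuate it.
    """
    u = unwanted / (np.linalg.norm(unwanted) + 1e-12)
    return embedding - strength * float(np.dot(embedding, u)) * u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e, leak = rng.normal(size=768), rng.normal(size=768)
    cleaned = suppress_direction(e, leak)
    cos = np.dot(cleaned, leak) / (np.linalg.norm(cleaned) * np.linalg.norm(leak))
    print("cosine with unwanted direction after refinement:", round(float(cos), 6))
```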
[116] Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach
Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito, Dima Damen
Main category: cs.CV
TL;DR: A diffusion-based model generates gaze-primed human reaching motions by fine-tuning on curated datasets, achieving 60% prime success and 89% reach success.
Details
Motivation: To generate realistic human motion that imitates natural gaze-priming behavior, where people spot objects or locations from a distance before approaching and reaching them.
Method: Curated 23.7K gaze-primed motion sequences from 5 public datasets, pre-trained a text-conditioned diffusion model, then fine-tuned it conditioned on goal pose/location.
Result: On HD-EPIC dataset, achieved 60% prime success (new metric) and 89% reach success when conditioned on goal object location.
Conclusion: The approach successfully generates realistic gaze-primed reaching motions, with evaluation showing strong performance on both priming and reaching aspects of natural human movement.
Abstract: Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down – that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the ‘Reach Success’ and a newly introduced ‘Prime Success’ metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
[117] SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
Main category: cs.CV
TL;DR: SNOW is a training-free framework that integrates VLM semantics with 3D geometry and temporal dynamics for unified 4D scene understanding, using object proposals, multimodal token encoding, and a 4D Scene Graph for embodied reasoning.
Details
Motivation: Current robotic systems face limitations: Vision-Language Models lack 3D geometric grounding and temporal understanding, while geometric perception methods are semantically sparse. There's a need for unified 4D scene understanding that combines semantic priors with spatio-temporal dynamics for reliable autonomous navigation and interaction.
Method: SNOW processes synchronized RGB images and 3D point clouds, uses HDBSCAN clustering for object proposals, guides SAM2-based segmentation, encodes regions via Spatio-Temporal Tokenized Patch Encoding (STEP) to create multimodal tokens, builds a 4D Scene Graph (4DSG), and uses lightweight SLAM for spatial anchoring and global reference alignment.
Result: SNOW achieves state-of-the-art performance on diverse benchmarks, enabling precise 4D scene understanding and spatially grounded inference. It demonstrates the importance of structured 4D priors for embodied reasoning and autonomous robotics applications.
Conclusion: The proposed SNOW framework successfully integrates VLM semantics with geometric and temporal information to create a unified 4D world model, providing queryable scene understanding that enhances robotic perception and reasoning capabilities without requiring training.
Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
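The proposal stage clusters the 3D point cloud with HDBSCAN before prompting SAM2; a minimal sketch of that clustering step on synthetic points, using scikit-learn's HDBSCAN (available since scikit-learn 1.3), is shown below. The minimum cluster size and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

def point_cloud_proposals(points, min_cluster_size=20):
    """Cluster an (N, 3) point cloud into object-level proposals.

    Returns one index array per discovered cluster; HDBSCAN marks noise points
    with label -1, which are dropped. In SNOW these clusters would then guide
    SAM2-based segmentation of the synchronized RGB frame.
    """
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(points)
    return [np.flatnonzero(labels == k) for k in sorted(set(labels)) if k != -1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obj_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.1, size=(200, 3))
    obj_b = rng.normal(loc=[3.0, 1.0, 0.0], scale=0.1, size=(150, 3))
    background = rng.uniform(-5, 5, size=(50, 3))
    clusters = point_cloud_proposals(np.vstack([obj_a, obj_b, background]))
    print("proposals:", len(clusters), "sizes:", [len(c) for c in clusters])
```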
[118] StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang
Main category: cs.CV
TL;DR: StageVAR is a stage-aware acceleration framework for Visual Autoregressive (VAR) models that reduces computational complexity by strategically pruning/approximating later generation stages while preserving early critical stages for semantic consistency.
Details
Motivation: VAR models suffer from sharply increased computational complexity and runtime at large-scale steps. Existing acceleration methods rely on manual step selection and overlook varying importance of different stages in the generation process.
Method: StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations. Analysis shows that early steps are critical for semantic/structural consistency and should remain intact, while later steps mainly refine details and can be pruned/approximated.
Result: StageVAR achieves up to 3.4x speedup with only minimal quality degradation (0.01 drop on GenEval and 0.26 decrease on DPG), consistently outperforming existing acceleration baselines.
Conclusion: Stage-aware design is a powerful principle for efficient visual autoregressive image generation, demonstrating that strategic stage-based acceleration can significantly reduce computational overhead while maintaining generation quality.
Abstract: Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
[119] Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment
Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu, Shin’ya Nishida
Main category: cs.CV
TL;DR: The paper proposes a reinforcement learning approach for blind image quality assessment that mimics human perception-reasoning processes, achieving competitive performance while generating human-like explanations.
Details
Motivation: Current BIQA models lack human-like interpretable reasoning capabilities. Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. The authors aim to develop models that can acquire both human-like and self-consistent reasoning capabilities.
Method: 1) Collect human evaluation data capturing aspects of human perception-reasoning pipeline. 2) Use reinforcement learning with human annotations as reward signals to guide the model toward human-like perception and reasoning. 3) Design a reward that drives the model to infer image quality purely from self-generated descriptions to internalize self-consistent reasoning capability.
Result: The approach achieves score prediction performance comparable to state-of-the-art BIQA systems under standard metrics (Pearson and Spearman correlation coefficients). Additionally, using ROUGE-1 to measure similarity between model-generated and human perception-reasoning chains, the model reaches a ROUGE-1 score of 0.512 (vs. 0.443 for baseline) on over 1,000 human-annotated samples.
Conclusion: The proposed method marks a step toward human-like interpretable reasoning in BIQA by enabling models to generate explanations that substantially cover human reasoning patterns while maintaining competitive quality assessment performance.
Abstract: Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
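Since the evaluation leans on Pearson and Spearman correlations between predicted scores and human opinion scores, here is a minimal SciPy sketch of that part of the protocol; the data are synthetic stand-ins, and ROUGE-1 scoring of the reasoning chains (which needs a separate package) is omitted.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def biqa_correlations(predicted, human_mos):
    """PLCC and SRCC between model-predicted quality scores and human MOS."""
    plcc, _ = pearsonr(predicted, human_mos)
    srcc, _ = spearmanr(predicted, human_mos)
    return plcc, srcc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.uniform(1.0, 5.0, size=100)          # hypothetical mean opinion scores
    pred = human + rng.normal(scale=0.3, size=100)   # hypothetical model predictions
    plcc, srcc = biqa_correlations(pred, human)
    print(f"PLCC={plcc:.3f}  SRCC={srcc:.3f}")
```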
[120] Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors
Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan, Zijing Chen, Zhe Chen
Main category: cs.CV
TL;DR: This paper introduces EMER, a new multimodal emotion recognition dataset that incorporates eye behavior data to bridge the gap between facial expression recognition and genuine emotion recognition, along with EMERT, a Transformer-based model that outperforms state-of-the-art methods.
Details
Motivation: Facial expressions are often used as social tools rather than genuine emotional manifestations, creating a gap between facial expression recognition (FER) and true emotion recognition (ER). The authors aim to address this by incorporating eye behaviors as important emotional cues.
Method: The authors construct the EMER dataset using spontaneous emotion induction with stimulus material, capturing eye movement sequences, eye fixation maps, and facial expression videos. They also design EMERT, a Transformer-based model using modality-adversarial feature decoupling and multitask learning to model eye behaviors as a complement to facial expressions.
Result: EMERT outperforms other state-of-the-art multimodal methods by a significant margin. Seven multimodal benchmark protocols are introduced for comprehensive evaluation of the EMER dataset, demonstrating the importance of modeling eye behaviors for robust emotion recognition.
Conclusion: Eye behaviors provide crucial emotional cues that bridge the gap between facial expression recognition and genuine emotion recognition. The EMER dataset and EMERT model advance robust emotion recognition research, with both resources made publicly available.
Abstract: Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.
[121] YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images
Huma Hafeez, Matthew Garratt, Jo Plested, Sankaran Iyer, Arcot Sowmya
Main category: cs.CV
TL;DR: YOLO11-4K is an efficient real-time object detection framework specifically designed for 4K panoramic images, achieving 0.95 mAP with 28.3ms inference time - a 75% latency reduction compared to YOLO11 while improving accuracy.
Details
Motivation: Omnidirectional 360-degree images present challenges for object detection due to spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors like YOLO struggle with computational demands of 4K+ resolution imagery typical in 360-degree vision applications.
Method: The framework incorporates a novel multi-scale detection head with a P2 layer to improve small object detection, and uses a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. The authors also manually annotated the CVIP360 dataset with 6,876 frame-level bounding boxes for evaluation.
Result: YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75% latency reduction compared to YOLO11 (112.3ms) while improving accuracy (0.95 vs 0.908 mAP). The framework maintains efficiency while handling 4K panoramic scenes.
Conclusion: The proposed framework balances efficiency and precision for robust object detection in expansive 360-degree environments, making it suitable for real-world high-resolution panoramic applications in autonomous navigation, surveillance, and augmented reality.
Abstract: The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
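For readers unfamiliar with the GhostConv backbone choice, a short PyTorch sketch of the usual GhostNet-style block (a primary convolution plus cheap depthwise operations that generate the remaining channels) is below; the channel counts and activation are illustrative and not taken from YOLO11-4K.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """GhostNet-style convolution: half the output channels come from a standard
    convolution, the other half from a cheap depthwise convolution applied to
    those primary features, roughly halving the cost of a full convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=1, stride=1):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, out_ch - primary_ch, 5, 1, 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

if __name__ == "__main__":
    block = GhostConv(64, 128)
    print(block(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```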
[122] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations
Kiran Shahi, Anup Bagale
Main category: cs.CV
TL;DR: Weakly supervised deep learning framework using Grad-CAM for pneumonia classification and localization from chest X-rays, achieving 96-98% accuracy with image-level labels only.
Details
Motivation: Pixel-level annotations for pneumonia localization are costly and time-consuming to obtain, creating a need for methods that can work with only image-level labels while still providing clinically meaningful localization.
Method: Proposed weakly supervised framework using Gradient-weighted Class Activation Mapping (Grad-CAM) with image-level labels. Evaluated seven pre-trained models (including Vision Transformer) under identical conditions with focal loss and patient-wise splits to prevent data leakage.
Result: All models achieved high classification accuracy (96-98%). ResNet-18 and EfficientNet-B0 showed best overall performance, while MobileNet-V3 provided efficient lightweight alternative. Grad-CAM heatmaps successfully highlighted clinically relevant lung regions.
Conclusion: Weakly supervised, explainable AI models can effectively classify and localize pneumonia using only image-level labels, enhancing transparency and clinical trust in AI-assisted radiological diagnostics.
Abstract: Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia-affected regions typically requires detailed pixel-level annotations, which are costly and time consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia-affected regions. Furthermore, we evaluate seven pre-trained deep learning models, including a Vision Transformer, under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high classification accuracy (96–98%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V3 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.
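Because the localization comes entirely from Grad-CAM on an image-level classifier, a compact PyTorch sketch of Grad-CAM on a ResNet-18 is given below; the untrained two-class head, the chosen target layer, and the random input stand in for the paper's trained pneumonia models.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def grad_cam(model, target_layer, image, class_idx=None):
    """Grad-CAM: gradients of the class score, pooled per channel, weight the
    target layer's activations; ReLU and upsampling give the heatmap."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1)
    model.zero_grad()
    logits[torch.arange(image.size(0)), class_idx].sum().backward()
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # (B, C, 1, 1)
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))   # (B, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam - cam.amin(dim=(2, 3), keepdim=True)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)     # heatmaps in [0, 1]

if __name__ == "__main__":
    model = resnet18(num_classes=2).eval()        # stand-in normal-vs-pneumonia classifier
    x = torch.randn(1, 3, 224, 224)
    print(grad_cam(model, model.layer4[-1], x).shape)  # torch.Size([1, 1, 224, 224])
```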
[123] PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation
Mengyuan Liu, Jiajie Liu, Jinyan Zhang, Wenhao Li, Junsong Yuan
Main category: cs.CV
TL;DR: PoseMoE: A Mixture-of-Experts network that disentangles 2D pose and depth feature encoding to improve monocular 3D human pose estimation by reducing depth uncertainty’s negative impact on 2D pose features.
Details
Motivation: Current lifting-based methods for monocular 3D human pose estimation encode detected 2D poses and unknown depth in an entangled feature space, which introduces depth uncertainty to the 2D pose features and limits overall accuracy. The paper reveals that depth representation is crucial: joint encoding is harmful when depth is completely unknown but beneficial when depth is initially refined.
Method: Proposes PoseMoE with two key components: (1) A mixture-of-experts network where specialized expert modules separately refine well-detected 2D pose features and learn depth features, disentangling their encoding to reduce uncertain depth’s influence on 2D pose. (2) A cross-expert knowledge aggregation module that aggregates spatio-temporal contextual information through bidirectional mapping between 2D pose and depth experts.
Result: Extensive experiments show PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
Conclusion: The proposed PoseMoE framework successfully addresses the limitation of entangled feature encoding in lifting-based methods by disentangling 2D pose and depth feature learning through specialized experts, leading to improved 3D human pose estimation performance across multiple benchmark datasets.
Abstract: The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
[124] VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen
Main category: cs.CV
TL;DR: VenusBench-GD is a comprehensive bilingual GUI grounding benchmark spanning multiple platforms with hierarchical evaluation, showing general multimodal models match specialized GUI models on basic tasks but advanced tasks still favor specialized models.
Details
Motivation: Existing GUI grounding benchmarks have limitations: insufficient data volume, narrow domain coverage, single-platform focus, or reliance on specialized domain knowledge. There's a need for comprehensive, multi-platform evaluation frameworks.
Method: Created VenusBench-GD with: 1) large-scale cross-platform benchmark covering diverse applications and UI elements, 2) high-quality data construction pipeline with better annotation accuracy, 3) hierarchical task taxonomy dividing grounding into basic and advanced categories with six subtasks.
Result: General-purpose multimodal models now match or surpass specialized GUI models on basic grounding tasks. Advanced tasks still favor GUI-specialized models, but they show significant overfitting and poor robustness.
Conclusion: Comprehensive, multi-tiered evaluation frameworks are necessary for GUI grounding. The benchmark reveals current limitations and provides a foundation for developing more robust GUI agents.
Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
[125] Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi
Main category: cs.CV
TL;DR: Self-supervised pretraining with snippet discrimination for skeleton-based temporal action localization, improving contrastive learning methods and achieving SOTA transfer learning performance.
Details
Motivation: Current self-supervised pretraining works well for skeleton-based action recognition but struggles with temporal action localization, which requires temporally sensitive features to detect subtle boundary changes between adjacent frames.
Method: 1) Formulates snippet discrimination pretext task that densely projects skeleton sequences into non-overlapping segments and distinguishes them via contrastive learning. 2) Enhances feature resolution for frame-level localization by fusing intermediate features with a U-shaped module on top of strong skeleton-based action recognition backbones.
Result: Consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. Achieves state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
Conclusion: The proposed snippet discrimination pretext task and U-shaped feature fusion module effectively address the challenges of skeleton-based temporal action localization, demonstrating superior performance over existing contrastive learning methods and strong transfer learning capabilities.
Abstract: The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
[126] Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images
Hossein Javidnia
Main category: cs.CV
TL;DR: MAGINet is a multi-scale attention-guided network that predicts high-resolution light-normalized diffuse albedo from single RGB portraits, with refinement and additional rendering passes for complete intrinsic decomposition.
Details
Motivation: Accurate intrinsic decomposition of face images under unconstrained lighting is essential for photorealistic relighting, digital doubles, and AR effects, but existing methods have limitations in albedo boundary sharpness and lighting invariance.
Method: Uses hierarchical residual encoding, spatial-and-channel attention in bottleneck, adaptive multi-scale feature fusion in decoder, followed by RefinementNet for upsampling, and Pix2PixHD-based translator for five additional rendering passes (ambient occlusion, normal, specular, translucency, raw diffuse).
Result: Achieves state-of-the-art performance for diffuse albedo estimation and significantly improved fidelity for complete rendering stack compared to prior methods, enabling high-quality relighting and material editing.
Conclusion: MAGINet provides a comprehensive solution for intrinsic face decomposition with six physically based rendering passes that enable practical applications in relighting and material editing of real faces.
Abstract: Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
[127] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
Main category: cs.CV
TL;DR: TTP is a lightweight test-time defense framework for CLIP models that detects adversarial inputs via cosine similarity shifts after spatial padding, then uses trainable padding and similarity-aware ensemble to restore robustness without compromising clean accuracy.
Details
Motivation: Current VLMs like CLIP are vulnerable to adversarial attacks but existing defenses have limitations: training-time defenses require labeled data and costly retraining, while test-time strategies fail to reliably distinguish clean from adversarial inputs, preventing optimal robustness and accuracy.
Method: TTP performs adversarial detection by measuring cosine similarity shift between CLIP feature embeddings before and after spatial padding. For detected adversarial cases, it uses trainable padding to restore disrupted attention patterns and similarity-aware ensemble for robust predictions. Clean inputs remain unchanged or can be enhanced with existing test-time adaptation techniques.
Result: Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy.
Conclusion: TTP provides an effective lightweight defense framework that addresses the limitations of existing approaches by enabling reliable adversarial detection and targeted adaptation at inference time, achieving both robustness and accuracy without requiring labeled data or costly retraining.
Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
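As a rough illustration of the detection step (the cosine-similarity shift of CLIP embeddings before and after spatial padding), here is a minimal sketch; the `encode_image` callable, padding size, resizing step, and threshold are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of padding-shift detection for a CLIP-like encoder: embed the image,
# embed a padded-and-resized copy, and flag inputs whose embedding moves too much.
import torch
import torch.nn.functional as F

@torch.no_grad()
def is_adversarial(image, encode_image, pad=16, threshold=0.9):
    """image: [B, 3, H, W] preprocessed tensor; encode_image: callable -> [B, D]."""
    feat_orig = F.normalize(encode_image(image), dim=-1)
    padded = F.pad(image, (pad, pad, pad, pad), mode="constant", value=0.0)
    padded = F.interpolate(padded, size=image.shape[-2:], mode="bilinear",
                           align_corners=False)       # back to the model's input size
    feat_pad = F.normalize(encode_image(padded), dim=-1)
    cos = (feat_orig * feat_pad).sum(dim=-1)          # cosine similarity per image
    return cos < threshold                            # large shift -> flag as adversarial
```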
[128] N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
Main category: cs.CV
TL;DR: N3D-VLM is a unified framework that integrates native 3D object perception with 3D-aware visual reasoning, enabling precise 3D grounding and interpretable spatial understanding beyond conventional 2D-based multimodal models.
Details
Motivation: Current multimodal models lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. They can only answer questions based on 2D images without true 3D understanding.
Method: Proposes N3D-VLM framework that: 1) Enables native 3D object perception to localize objects in 3D space based on textual descriptions, 2) Performs explicit 3D reasoning for interpretable spatial understanding, 3) Uses scalable data construction pipeline that lifts 2D annotations to 3D space via depth estimation, creating datasets 6x larger than existing ones, 4) Generates spatial QA datasets for chain-of-thought reasoning in 3D.
Result: Achieves state-of-the-art performance on 3D grounding tasks and consistently surpasses existing methods in 3D spatial reasoning in vision-language models.
Conclusion: The unified framework successfully integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding, addressing limitations of current 2D-based multimodal models.
Abstract: While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage for 3D object grounding data, yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.
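The 2D-to-3D lifting step in the data pipeline can be pictured as a standard pinhole back-projection; the sketch below is a generic illustration under that assumption (a metric depth map plus known intrinsics), not the paper's implementation.

```python
# Back-project a 2D annotation into 3D camera coordinates using a depth map and
# pinhole intrinsics; this illustrates the general lifting step only.
import numpy as np

def lift_point_to_3d(u, v, depth_map, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a 3D point in camera coordinates."""
    z = float(depth_map[int(v), int(u)])   # metric depth at the pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def lift_box_center(box_xyxy, depth_map, intrinsics):
    """box_xyxy: (x1, y1, x2, y2); intrinsics: dict with fx, fy, cx, cy."""
    u = 0.5 * (box_xyxy[0] + box_xyxy[2])
    v = 0.5 * (box_xyxy[1] + box_xyxy[3])
    return lift_point_to_3d(u, v, depth_map, **intrinsics)
```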
[129] 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
Kirill Mazur, Marwan Taher, Andrew J. Davison
Main category: cs.CV
TL;DR: Dynamic 4D scene reconstruction from monocular RGB video that tracks both visible and previously-viewed objects using rigid 3D primitives with motion extrapolation.
Details
Motivation: To create complete and persistent 3D reconstructions from casual monocular videos that maintain object permanence - reconstructing not just currently visible parts but all previously viewed parts of the scene.
Method: Decomposes scene into rigid 3D primitives, uses estimated dense 2D correspondences to jointly infer rigid motion through optimization, and introduces motion extrapolation for invisible objects using motion-grouping techniques.
Result: Achieves 4D spatio-temporal awareness enabling replayable 3D reconstructions of articulated objects, multi-object scanning, and object permanence. Significantly outperforms existing methods on object scanning and multi-object datasets both quantitatively and qualitatively.
Conclusion: The system successfully creates persistent 4D reconstructions from monocular video by tracking rigid primitives and extrapolating motion for occluded objects, enabling comprehensive scene understanding over time.
Abstract: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
[130] Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation
Yin Zhang, Yongqiang Zhang, Yaoyue Zheng, Bogdan Raducanu, Dan Liu
Main category: cs.CV
TL;DR: Causal-Tune: A frequency-domain fine-tuning method for vision foundation models that identifies and disentangles causal vs. non-causal factors to improve domain generalization in semantic segmentation.
Details
Motivation: Existing fine-tuning methods for vision foundation models overlook artifacts from long-term pre-training that hinder domain generalization. These artifacts are associated with non-causal factors residing in frequency components, limiting the utilization of valuable representations.
Method: Proposes Causal-Tune: 1) Extract frequency spectrum using DCT, 2) Separate causal/non-causal components with Gaussian band-pass filter, 3) Refine causal components with learnable tokens in frequency domain, 4) Discard non-causal components, 5) Transform back via inverse DCT.
Result: Achieves superior performance in cross-domain tasks, especially under adverse weather conditions (+4.8% mIoU improvement over baseline in snow conditions). Extensive experiments demonstrate effectiveness.
Conclusion: Causal-Tune effectively identifies and disentangles causal vs. non-causal factors in vision foundation models, enabling more robust domain generalization for semantic segmentation through frequency-domain fine-tuning.
Abstract: Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
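As a concrete picture of the frequency-domain separation Causal-Tune performs, the sketch below applies a 2D DCT to a feature map, keeps a mid-frequency band through a radial Gaussian band-pass mask, and inverts the transform; the mask center and bandwidth are illustrative placeholders, not the paper's settings.

```python
# Separate a feature map into mid-frequency ("causal") and low/high-frequency
# ("non-causal") components via DCT and a Gaussian band-pass mask.
import numpy as np
from scipy.fft import dctn, idctn

def gaussian_bandpass_mask(h, w, center=0.3, sigma=0.15):
    """Radial Gaussian band-pass over normalized DCT frequencies in [0, 1]."""
    fy = np.linspace(0.0, 1.0, h)[:, None]
    fx = np.linspace(0.0, 1.0, w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2) / np.sqrt(2.0)
    return np.exp(-((radius - center) ** 2) / (2.0 * sigma ** 2))

def split_causal_noncausal(feature_map):
    """feature_map: [H, W] array; returns (causal, non_causal) spatial components."""
    spec = dctn(feature_map, norm="ortho")
    mask = gaussian_bandpass_mask(*feature_map.shape)
    causal = idctn(spec * mask, norm="ortho")
    non_causal = idctn(spec * (1.0 - mask), norm="ortho")
    return causal, non_causal
```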
[131] CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series
Nico Albert Disch, Saikat Roy, Constantin Ulrich, Yannick Kirchhoff, Maximilian Rokuss, Robin Peretzke, David Zimmerer, Klaus Maier-Hein
Main category: cs.CV
TL;DR: CRONOS is a unified framework for 3D medical scan forecasting that supports both discrete and continuous timestamps, learning spatio-temporal velocity fields to predict future scans from multiple past scans.
Details
Motivation: Existing models for 3D medical scan forecasting have limitations: they rely on single prior scans, fixed grid times, or target global labels, which restricts voxel-level forecasting under irregular sampling patterns common in medical imaging.
Method: CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, operating directly in 3D voxel space. It supports both discrete (grid-based) and continuous (real-valued) timestamps in a unified model.
Result: CRONOS outperforms other baselines across three public datasets (Cine-MRI, perfusion CT, and longitudinal MRI) while remaining computationally competitive.
Conclusion: CRONOS represents the first continuous sequence-to-image forecasting framework for 3D medical data, enabling reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting in medical imaging.
Abstract: Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.
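The velocity-field idea can be illustrated with a generic flow-matching training step: interpolate between a context volume and the target volume at a random path time, then regress the constant velocity that transports one to the other. This is a simplifying sketch; the `velocity_net` interface, the linear path, and the single-context conditioning are assumptions, not the exact CRONOS formulation.

```python
# Generic flow-matching step for learning a velocity field between 3D volumes.
import torch
import torch.nn.functional as F

def velocity_training_step(velocity_net, context_vol, target_vol, optimizer):
    """context_vol, target_vol: [B, C, D, H, W]; learns v such that dx/ds = v."""
    b = context_vol.shape[0]
    s = torch.rand(b, device=context_vol.device).view(b, 1, 1, 1, 1)  # random path time
    x_s = (1.0 - s) * context_vol + s * target_vol      # linear interpolation path
    v_target = target_vol - context_vol                 # constant velocity along the path
    v_pred = velocity_net(x_s, s.view(b), context_vol)  # assumed network interface
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```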
[132] Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, Ruixuan Li
Main category: cs.CV
TL;DR: SkiLa enables MLLMs to generate visual embeddings as “latent sketch tokens” during reasoning, allowing unified visual-text thinking without external tools.
Details
Motivation: Current MLLMs lack visual imagination capabilities that humans possess - humans can form flexible visual-text interactions during thinking without predefined toolkits, as they process both modalities in a unified brain space. Since MLLMs already encode visual and text information in the same feature space, visual tokens should be seamlessly integrated into reasoning processes.
Method: Proposes Sketch-in-Latents (SkiLa) paradigm that expands MLLMs’ auto-regressive capabilities to generate continuous visual embeddings (latent sketch tokens) as visual thoughts. The model dynamically alternates between textual thinking mode (generating text tokens) and visual sketching mode (generating latent sketch tokens). Includes a latent visual semantics reconstruction mechanism to ensure semantic grounding of latent sketch tokens.
Result: Extensive experiments show SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks.
Conclusion: SkiLa successfully enables unified multi-modal reasoning by allowing MLLMs to natively generate visual embeddings during reasoning, mimicking human-like visual-text imagination capabilities without requiring external toolkits.
Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that take predefined external toolkits or generate images during thinking, however, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits, where one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at https://github.com/TungChintao/SkiLa.
[133] Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks
Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao
Main category: cs.CV
TL;DR: Yuan-TecSwin is a text-conditioned diffusion model that replaces CNN blocks with Swin-transformer blocks to improve long-range semantic understanding, achieving state-of-the-art FID score of 1.37 on ImageNet.
Details
Motivation: The locality of convolution operations in CNN-based diffusion models limits their ability to understand long-range semantic information, which is crucial for high-quality image generation.
Method: Proposes Yuan-TecSwin with Swin-transformer blocks replacing CNN blocks in encoder/decoder, improved text-image alignment through better text encoder and embedding utilization, and adapted time step search for different diffusion stages.
Result: Achieves state-of-the-art FID score of 1.37 on ImageNet generation benchmark, improves inference performance by 10% with adapted time step search, and generates images that are difficult for humans to distinguish from human-painted ones.
Conclusion: Swin-transformer architecture effectively addresses CNN’s locality limitations in diffusion models, enabling superior long-range semantic understanding and state-of-the-art image generation quality without requiring additional models at different denoising stages.
Abstract: Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model’s ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.
[134] Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment
Ayush Bhavsar
Main category: cs.CV
TL;DR: Hazedefy: Lightweight dehazing pipeline for real-time video on consumer hardware using optimized Dark Channel Prior approach.
Details
Motivation: Need for practical, real-time dehazing solutions that can run efficiently on consumer-grade hardware without GPU acceleration, particularly for mobile and embedded applications.
Method: Builds on Dark Channel Prior and atmospheric scattering model with key optimizations: gamma-adaptive reconstruction, fast transmission approximation with lower bounds, stabilized atmospheric light estimator using fractional top-pixel averaging, and optional color balance stage.
Result: Experimental demonstrations on real-world images and videos show improved visibility and contrast while maintaining computational efficiency suitable for real-time applications.
Conclusion: Hazedefy provides a practical, deployable solution for real-time video dehazing on consumer hardware, balancing quality with computational simplicity for mobile and embedded use cases.
Abstract: This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
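For readers unfamiliar with the Dark Channel Prior components the pipeline builds on, here is a minimal reference sketch of the classical steps (dark channel, atmospheric light from the brightest fraction of dark-channel pixels, lower-bounded transmission, radiance recovery); the parameter values are the usual textbook defaults and do not reflect Hazedefy's specific additions such as gamma-adaptive reconstruction or the colour balance stage.

```python
# Classical Dark Channel Prior dehazing sketch with a lower-bounded transmission.
import numpy as np
from scipy.ndimage import minimum_filter

def dehaze_dcp(img, patch=15, omega=0.95, t_min=0.1, top_frac=0.001):
    """img: float RGB image in [0, 1], shape [H, W, 3]."""
    dark = minimum_filter(img.min(axis=2), size=patch)           # dark channel
    n_top = max(1, int(top_frac * dark.size))                    # brightest dark-channel pixels
    idx = np.argsort(dark.ravel())[-n_top:]
    atmosphere = img.reshape(-1, 3)[idx].mean(axis=0)            # atmospheric light estimate
    norm = img / np.maximum(atmosphere, 1e-6)
    transmission = 1.0 - omega * minimum_filter(norm.min(axis=2), size=patch)
    transmission = np.clip(transmission, t_min, 1.0)[..., None]  # lower bound for stability
    radiance = (img - atmosphere) / transmission + atmosphere
    return np.clip(radiance, 0.0, 1.0)
```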
[135] Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan
Main category: cs.CV
TL;DR: LLSA introduces a hierarchical sparse attention mechanism that reduces DiT computation from quadratic to log-linear complexity for extremely long token sequences, enabling efficient high-resolution image generation without quality loss.
Details
Motivation: Current DiTs suffer from quadratic self-attention costs that limit scaling to long token sequences. Existing sparse attention approaches still have quadratic selection costs and require increasing K values as sequences grow, making them inefficient for long sequences.
Method: LLSA uses hierarchical Top-K selection that progressively applies sparse selection with indices from previous levels, plus Hierarchical KV Enrichment to preserve global context with fewer tokens of different granularity. Includes high-performance GPU implementation using only sparse indices without dense masks.
Result: LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences while maintaining generation quality. Enables high-resolution pixel-space image generation without patchification or VAE encoding.
Conclusion: LLSA provides an efficient log-linear sparse attention mechanism for training long-sequence DiTs, offering a promising direction for scaling visual generation models to extremely long token sequences while maintaining computational efficiency and quality.
Abstract: Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
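To give intuition for the Top-K selection that LLSA applies recursively, the sketch below shows a single-level version: keys are pooled into blocks, each query scores the block summaries, and full attention is computed only over the selected blocks. The hierarchy, KV enrichment, and fused GPU kernels are omitted; shapes and k are illustrative.

```python
# Single-level block-wise Top-K attention: score pooled key blocks per query,
# then attend densely only within the selected blocks.
import torch

def block_topk_attention(q, k, v, block_size=64, topk=4):
    """q: [N, D]; k, v: [N, D] with N divisible by block_size."""
    n, d = k.shape
    nb = n // block_size
    k_blocks = k.view(nb, block_size, d)
    v_blocks = v.view(nb, block_size, d)
    summaries = k_blocks.mean(dim=1)                        # [nb, D] coarse block keys
    block_scores = q @ summaries.t()                        # [N, nb]
    sel = block_scores.topk(topk, dim=-1).indices           # [N, topk] selected blocks
    k_sel = k_blocks[sel].reshape(n, topk * block_size, d)  # gather selected keys
    v_sel = v_blocks[sel].reshape(n, topk * block_size, d)
    attn = torch.softmax((q.unsqueeze(1) @ k_sel.transpose(1, 2)) / d ** 0.5, dim=-1)
    return (attn @ v_sel).squeeze(1)                        # [N, D]
```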
[136] Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation
Kanwal Aftab, Graham Adams, Mark Scanlon
Main category: cs.CV
TL;DR: A pipeline using electric sockets as indoor geolocation markers achieves high accuracy in detecting, classifying, and mapping sockets to countries for forensic applications.
Details
Motivation: Indoor multimedia geolocation is underdeveloped but crucial for fighting human trafficking and child exploitation, as outdoor methods don't work well indoors due to similar layouts, renovations, lighting issues, and unreliable GPS.
Method: Three-stage deep learning pipeline: 1) YOLOv11 for socket detection (mAP@0.5=0.843), 2) Xception for classifying 12 socket types (accuracy=0.912), 3) mapping socket types to countries (accuracy=0.96 at >90% confidence). Created two datasets to address data scarcity.
Result: High accuracy across all stages: 84.3% mAP for detection, 91.2% for socket type classification, and 96% for country mapping. Evaluated on realistic TraffickCam hotel images with poor lighting and amateur angles.
Conclusion: The electric socket-based pipeline demonstrates practical potential for real-world digital forensic applications, with all code, models, and data released open source.
Abstract: Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
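Structurally, the pipeline amounts to detect, classify, then map; the sketch below shows that control flow with the detector and classifier passed in as callables (stand-ins for the trained YOLOv11 and Xception models) and a deliberately tiny, illustrative socket-type-to-country table.

```python
# Three-stage control flow: detect sockets, classify socket type, map type to countries.
SOCKET_TO_COUNTRIES = {
    "type_g": ["United Kingdom", "Ireland", "Malta"],   # example entries only
    "type_f": ["Germany", "Spain", "Netherlands"],
}

def geolocate_from_sockets(image, detect_sockets, classify_socket, min_conf=0.9):
    """Return candidate countries from sockets found in a single indoor image."""
    candidates = []
    for box in detect_sockets(image):                   # [(x1, y1, x2, y2), ...]
        x1, y1, x2, y2 = map(int, box)
        crop = image[y1:y2, x1:x2]
        socket_type, conf = classify_socket(crop)       # e.g. ("type_g", 0.97)
        if conf >= min_conf:                            # keep only confident predictions
            candidates.extend(SOCKET_TO_COUNTRIES.get(socket_type, []))
    return sorted(set(candidates))
```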
[137] DeContext as Defense: Safe Image Editing in Diffusion Transformers
Linghui Shen, Mingyue Cui, Xingyi Yang
Main category: cs.CV
TL;DR: DeContext is a defense method that uses targeted perturbations to weaken cross-attention pathways in diffusion models, preventing unauthorized in-context image editing while preserving visual quality.
Details
Motivation: In-context diffusion models enable powerful image manipulation but raise serious privacy concerns as personal images can be easily modified for identity impersonation, misinformation, or malicious uses without consent. Existing defenses focus on personalized text-to-image generation, leaving modern large-scale DiT-based models vulnerable.
Method: DeContext injects small, targeted perturbations that weaken multimodal attention layers where contextual information propagates from source to output. The method identifies that early denoising steps and specific transformer blocks dominate context propagation, allowing concentration of perturbations where they matter most to break cross-attention pathways.
Result: Experiments on Flux Kontext and Step1X-Edit show DeContext consistently blocks unwanted image edits while preserving visual quality. The method demonstrates effectiveness of attention-based perturbations as a defense against image manipulation.
Conclusion: DeContext provides an efficient and robust defense mechanism against unauthorized in-context image editing by targeting the fundamental attention mechanisms that enable context propagation in diffusion models, offering protection against privacy violations and malicious manipulation.
Abstract: In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner’s consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decouples the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.
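The protection step can be pictured as a small projected-gradient optimization of the input: the sketch below minimizes a placeholder `context_coupling_loss` (standing in for a term computed from the model's multimodal attention) under an L-infinity budget. The budget, step size, iteration count, and loss definition are assumptions for illustration, not the paper's objective.

```python
# PGD-style sketch: optimize a bounded perturbation that weakens how strongly the
# editing model's attention couples the output to the protected source image.
import torch

def decontext_perturb(image, context_coupling_loss, epsilon=8 / 255, alpha=2 / 255, steps=20):
    """image: [1, 3, H, W] in [0, 1]; returns a protected copy of the image."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = context_coupling_loss(image + delta)          # how much context still propagates
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()               # descend: weaken the coupling
            delta.clamp_(-epsilon, epsilon)                  # stay within the L-inf budget
            delta.add_(image).clamp_(0.0, 1.0).sub_(image)   # keep the result a valid image
        delta.grad.zero_()
    return (image + delta).detach()
```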
[138] SARMAE: Masked Autoencoder for SAR Representation Learning
Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang, Yilin Cheng, Haonan Guo, Wei Cui, Jing Zhang
Main category: cs.CV
TL;DR: SARMAE: A noise-aware masked autoencoder for self-supervised SAR representation learning using million-scale dataset and speckle noise injection.
Details
Motivation: SAR imagery suffers from data scarcity and speckle noise that hampers semantic representation learning, limiting deep learning applications in SAR remote sensing.
Method: 1) Construct SAR-1M million-scale dataset with paired optical images; 2) Speckle-Aware Representation Enhancement (SARE) injects SAR-specific speckle noise into masked autoencoders; 3) Semantic Anchor Representation Constraint (SARC) uses optical priors to align SAR features.
Result: Achieves state-of-the-art performance on multiple SAR datasets for classification, detection, and segmentation tasks.
Conclusion: SARMAE effectively addresses SAR data scarcity and speckle noise challenges through noise-aware self-supervised learning with optical alignment, enabling robust SAR representation learning.
Abstract: Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
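The speckle-aware idea can be sketched as multiplicative, unit-mean gamma noise applied before an ordinary masked-autoencoding step; the `masked_autoencode` callable and the number of looks are placeholders for illustration, not SARMAE's exact design.

```python
# Inject SAR-style multiplicative speckle (gamma, mean 1) before a standard MAE step.
import torch

def inject_speckle(x, looks=4.0):
    """x: [B, C, H, W] SAR intensity tensor; multiplicative gamma noise with mean 1."""
    gamma = torch.distributions.Gamma(float(looks), float(looks))  # mean 1, variance 1/L
    noise = gamma.sample(x.shape).to(x.device)
    return x * noise

def sarmae_step(x, masked_autoencode, mask_ratio=0.75):
    """masked_autoencode: callable returning (loss, reconstruction) for a masked input."""
    noisy = inject_speckle(x)
    loss, _reconstruction = masked_autoencode(noisy, mask_ratio=mask_ratio)
    return loss
```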
[139] REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, Ioannis Kakogeorgiou
Main category: cs.CV
TL;DR: REGLUE improves latent diffusion models by jointly modeling VAE latents, local patch-level VFM semantics, and global image-level CLS tokens within a single SiT backbone, using nonlinear compression and external alignment to accelerate convergence and improve image quality.
Details
Motivation: Current latent diffusion models have limitations: their reconstruction-style denoising provides only indirect semantic supervision, requiring longer training and limiting sample quality. Existing approaches either use external representation alignment or only model a narrow slice of VFM features, underutilizing rich multi-layer spatial semantics from Vision Foundation Models.
Method: REGLUE introduces a unified framework that jointly models VAE image latents, compact local patch-level VFM semantics, and global image-level CLS tokens within a single SiT backbone. It uses a lightweight convolutional semantic compressor to nonlinearly aggregate multi-layer VFM features into low-dimensional spatially structured representations, which are entangled with VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets.
Result: On ImageNet 256x256, REGLUE consistently improves FID scores and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG methods. Experiments show that spatial VFM semantics are crucial, nonlinear compression is key to unlocking their full benefit, and global tokens with external alignment act as complementary enhancements.
Conclusion: REGLUE demonstrates that jointly modeling VAE latents with rich spatial VFM semantics through nonlinear compression and global-local-latent integration significantly improves latent diffusion model performance, offering a unified framework that better utilizes Vision Foundation Model capabilities for image synthesis.
Abstract: Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .
[140] FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering
Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch
Main category: cs.CV
TL;DR: FrameDiffuser is an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames from G-buffer data for interactive applications, addressing limitations of existing diffusion-based approaches.
Details
Motivation: Current diffusion-based neural rendering approaches have critical limitations: single-image models lack temporal consistency, while video models are too computationally expensive for interactive applications and require complete sequences upfront, making them unsuitable for real-time applications where future frames depend on user input.
Method: FrameDiffuser uses an autoregressive framework that generates frames by conditioning on G-buffer data (geometry, materials, surface properties) and the model’s own previous output. It features a dual-conditioning architecture combining ControlNet for structural guidance with ControlLoRA for temporal coherence, and employs a three-stage training strategy for stable autoregressive generation. The model is specialized to individual environments rather than aiming for broad generalization.
Result: The framework maintains stable, temporally consistent generation over hundreds to thousands of frames, achieving superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches, while being suitable for interactive applications.
Conclusion: FrameDiffuser successfully addresses the limitations of existing neural rendering approaches by providing an autoregressive framework that balances temporal consistency, photorealistic quality, and computational efficiency for interactive applications, demonstrating that environment-specific training yields better results than generalized approaches.
Abstract: Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model’s own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
[141] Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray
Gonçalo Gaspar Alves, Shekoufeh Gorgi Zadeh, Andreas Husch, Ben Bausch
Main category: cs.CV
TL;DR: Subject fingerprinting method maps medical images of the same subject to distinct latent regions to detect data leakage in combined datasets, achieving high re-identification accuracy on MRI and X-ray data.
Details
Motivation: Combining open-source medical imaging datasets risks data leakage when the same subject appears in multiple sets, artificially inflating model performance metrics. Need a method to identify and prevent such leakage.
Method: Use subject fingerprinting with ResNet-50 trained with triplet margin loss to map all images of a subject to distinct regions in latent space. Evaluate few-shot fingerprinting on 3D MRI (BraTS-2021) and 2D X-ray (ChestXray-14) data in standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios.
Result: Achieved high Mean-Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.
Conclusion: Subject fingerprinting effectively enables subject re-identification to detect data leakage in medical imaging datasets, with strong performance across different modalities and challenging few-shot scenarios.
Abstract: Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean-Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS-2021.
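A minimal sketch of the training setup described above: a ResNet-50 backbone with its classification head replaced by an embedding layer, trained with a standard triplet margin loss so that images of the same subject map to nearby points in latent space. The embedding size and margin are illustrative choices.

```python
# ResNet-50 fingerprinting sketch trained with a triplet margin loss.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FingerprintNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)  # embedding head
        self.backbone = backbone

    def forward(self, x):
        return nn.functional.normalize(self.backbone(x), dim=-1)

def training_step(model, anchor, positive, negative, optimizer,
                  criterion=nn.TripletMarginLoss(margin=0.2)):
    """anchor/positive share a subject; negative comes from a different subject."""
    loss = criterion(model(anchor), model(positive), model(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, re-identification then reduces to nearest-neighbour search over the normalized embeddings of the gallery images.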
[142] Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?
Serafino Pandolfini, Lorenzo Pellegrini, Matteo Ferrara, Davide Maltoni
Main category: cs.CV
TL;DR: Systematic evaluation shows that deepfake detectors trained on fully synthetic images have partial transferability to localized inpainting detection, performing well on medium/large-area manipulations and regeneration-style inpainting.
Details
Motivation: With the rise of realistic image manipulations like inpainting and region-level editing being exploited in cybersecurity threats, there's a need to understand how well existing deepfake detectors generalize from fully synthetic images to localized manipulations.
Method: Conducted systematic evaluation of state-of-the-art deepfake detectors (originally trained for fully synthetic images) on localized inpainting detection using multiple datasets with diverse generators, mask sizes, and inpainting techniques.
Result: Models trained on diverse generators show partial transferability to inpainting-based edits, reliably detecting medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.
Conclusion: Deepfake detectors have meaningful generalization capabilities to localized manipulations, particularly for certain types and scales of inpainting, suggesting potential for broader application beyond their original training objectives.
Abstract: The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for the deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.
[143] SDFoam: Signed-Distance Foam for explicit surface reconstruction
Antonella Rech, Nicola Conci, Nicola Garau
Main category: cs.CV
TL;DR: SDFoam improves mesh reconstruction in neural rendering by combining explicit Voronoi Diagrams with implicit Signed Distance Fields, achieving better surfaces while maintaining rendering efficiency.
Details
Motivation: Current neural rendering methods (NeRF, 3DGS, RadiantFoam) struggle with precise mesh reconstruction despite advances in view synthesis and rendering speed.
Method: Jointly learns explicit Voronoi Diagram with implicit Signed Distance Field, optimizes via ray tracing with Eikonal regularization, using SDF to bias Voronoi cell faces to align with zero level set.
Result: Produces crisper, view-consistent surfaces with fewer floaters and improved topology, maintains photometric quality and training speed comparable to RadiantFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance metrics (PSNR, SSIM).
Conclusion: SDFoam’s hybrid implicit-explicit formulation successfully addresses mesh reconstruction limitations in neural rendering while preserving efficiency and appearance quality.
Abstract: Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
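The Eikonal regularizer referenced above is a standard term and can be written down directly: the gradient of the learned signed distance field should have unit norm at sampled points. The sketch below assumes a differentiable `sdf` network mapping [N, 3] points to [N, 1] distances.

```python
# Standard Eikonal loss: penalize deviation of the SDF gradient norm from 1.
import torch

def eikonal_loss(sdf, points):
    """points: [N, 3] sample locations; sdf: callable returning [N, 1] distances."""
    points = points.requires_grad_(True)
    distances = sdf(points)
    grads, = torch.autograd.grad(
        outputs=distances, inputs=points,
        grad_outputs=torch.ones_like(distances),
        create_graph=True)                      # keep the graph so the loss can backprop
    return ((grads.norm(dim=-1) - 1.0) ** 2).mean()
```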
[144] A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry
Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano
Main category: cs.CV
TL;DR: First public multi-center, multi-device fetal ultrasound dataset with expert landmark annotations for all primary biometric measurements, enabling robust AI development for fetal growth assessment.
Details
Motivation: Manual landmarking in fetal ultrasound is time-consuming, operator-dependent, and suffers from scanner/site variability, limiting reproducibility. Need multi-source annotated datasets for AI-assisted fetal growth assessment.
Method: Created open benchmark dataset with 4,513 de-identified US images from 1,904 subjects across 3 clinical sites using 7 different US devices. Provided standardized train/test splits, evaluation code, and baseline results.
Result: Dataset enables quantification of domain shift - single-center training/testing substantially overestimates performance compared to multi-center evaluation. First public multi-center, multi-device landmark-annotated dataset covering all primary fetal biometry measures.
Conclusion: Provides robust benchmark for domain adaptation and multi-center generalization in fetal biometry, enabling more reliable AI-assisted fetal growth assessment across different clinical centers and devices.
Abstract: Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
[145] OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han, Haoyang Zhang, Liang Xie, Hongbo Chen, Erwei Yin
Main category: cs.CV
TL;DR: OMG-Bench is the first large-scale public benchmark for skeleton-based online micro gesture recognition, featuring 40 classes with 13,948 instances. The paper introduces HMATr, a hierarchical memory-augmented transformer that outperforms SOTA methods by 7.6% in detection rate.
Details
Motivation: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, making dataset construction with precise skeletons and frame-level annotations difficult.
Method: Developed a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Proposed Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification using hierarchical memory banks storing frame-level details and window-level semantics, with learnable position-aware queries initialized from memory.
Result: Created OMG-Bench with 40 fine-grained gesture classes, 13,948 instances across 1,272 sequences. HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition.
Conclusion: The paper presents the first large-scale public benchmark for skeleton-based online micro gesture recognition and proposes an effective HMATr framework that significantly advances the state-of-the-art, providing a solid foundation for future research in VR/AR interaction.
Abstract: Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/
[146] Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong
Main category: cs.CV
TL;DR: TODSynth framework uses multimodal diffusion transformer with triple attention and task-guided sampling to generate high-quality synthetic data for remote sensing semantic segmentation, outperforming SOTA methods.
Details
Motivation: Current synthetic data generation for remote sensing faces challenges with semantic mask control complexity and sampling quality uncertainty, limiting utility for downstream semantic segmentation tasks.
Method: Proposes TODSynth framework with Multimodal Diffusion Transformer (MM-DiT) using text-image-mask joint attention, and control-rectify flow matching (CRFM) that dynamically adjusts sampling using semantic loss feedback during early generation stages.
Result: The approach significantly enhances RS semantic segmentation data synthesis, especially in few-shot and complex-scene scenarios, consistently outperforming state-of-the-art controllable generation methods.
Conclusion: TODSynth produces more stable and task-oriented synthetic data for RS semantic segmentation by effectively addressing control complexity and sampling uncertainty through multimodal attention and task-guided sampling.
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
[147] TreeNet: A Light Weight Model for Low Bitrate Image Compression
Mahadev Prasad Panda, Purnachandra Rao Makkena, Srivatsa Prativadibhayankaram, Siegfried Fößel, André Kaup
Main category: cs.CV
TL;DR: TreeNet: A low-complexity image compression model using binary tree-structured encoder-decoder with attentional feature fusion, achieving 4.83% BD-rate improvement over JPEG AI at low bitrates while reducing complexity by 87.82%.
Details
Motivation: Reducing computational complexity is critical for widespread adoption of learning-based image compression techniques, as current methods are often too computationally expensive for practical deployment.
Method: Proposes TreeNet with binary tree-structured encoder-decoder architecture and attentional feature fusion mechanism to integrate features from multiple branches efficiently.
Result: TreeNet achieves 4.83% average BD-rate improvement over JPEG AI at low bitrates while reducing model complexity by 87.82%. Evaluated on three benchmark datasets with extensive ablation studies on latent representations.
Conclusion: TreeNet demonstrates that tree-structured architectures with attentional feature fusion can significantly reduce computational complexity while maintaining competitive compression performance compared to state-of-the-art methods like JPEG AI.
Abstract: Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
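The abstract mentions an attentional feature fusion mechanism for combining features from multiple branches of the binary tree. The snippet below is a generic gated-fusion sketch of that idea, not TreeNet's actual module; the two-branch convolutional gate is an assumed design.

```python
# Hedged sketch of attention-based fusion of two branch features, the kind of
# mechanism the abstract calls "attentional feature fusion". The gating design
# (a small conv producing per-pixel weights) is a generic choice for
# illustration, not TreeNet's exact module.
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        # left, right: (B, C, H, W) features from the two child branches.
        w = self.gate(torch.cat([left, right], dim=1))
        return w * left + (1.0 - w) * right  # convex, per-position combination

if __name__ == "__main__":
    fuse = AttentionalFusion(32)
    out = fuse(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
    assert out.shape == (1, 32, 16, 16)
```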
[148] VAEER: Visual Attention-Inspired Emotion Elicitation Reasoning
Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen
Main category: cs.CV
TL;DR: VAEER is an interpretable multi-label framework for Visual Emotion Elicitation (VEE) that predicts emotions evoked by images using attention-based cue extraction and knowledge-grounded reasoning, achieving state-of-the-art performance across diverse benchmarks.
Details
Motivation: Images shared online significantly impact emotions and public well-being, especially during crises. Understanding emotional responses to images is crucial for fostering healthier digital communities and sustainable online ecosystems.
Method: VAEER combines attention-inspired cue extraction with knowledge-grounded reasoning. It isolates salient visual foci and contextual signals, aligns them with structured affective knowledge, and performs per-emotion inference to generate transparent, emotion-specific rationales.
Result: Across three heterogeneous benchmarks (social imagery and disaster-related photos), VAEER achieves state-of-the-art results with up to 19% per-emotion improvements and 12.3% average gain over strong CNN and VLM baselines.
Conclusion: Interpretable multi-label emotion elicitation provides a scalable foundation for responsible visual media analysis and emotionally sustainable online ecosystems, with VAEER demonstrating strong performance and transparency.
Abstract: Images shared online strongly influence emotions and public well-being. Understanding the emotions an image elicits is therefore vital for fostering healthier and more sustainable digital communities, especially during public crises. We study Visual Emotion Elicitation (VEE), predicting the set of emotions that an image evokes in viewers. We introduce VAEER, an interpretable multi-label VEE framework that combines attention-inspired cue extraction with knowledge-grounded reasoning. VAEER isolates salient visual foci and contextual signals, aligns them with structured affective knowledge, and performs per-emotion inference to yield transparent, emotion-specific rationales. Across three heterogeneous benchmarks, including social imagery and disaster-related photos, VAEER achieves state-of-the-art results with up to 19% per-emotion improvements and a 12.3% average gain over strong CNN and VLM baselines. Our findings highlight interpretable multi-label emotion elicitation as a scalable foundation for responsible visual media analysis and emotionally sustainable online ecosystems.
[149] Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation
Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Wengang Zhou, Houqiang Li
Main category: cs.CV
TL;DR: Make-It-Poseable is a novel feed-forward framework that reformulates 3D character posing as latent-space transformation, using a latent posing transformer to manipulate shape tokens based on skeletal motion, achieving superior posing quality and enabling 3D editing applications.
Details
Motivation: Existing methods like auto-rigging and pose-conditioned generation struggle with inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting robustness and generalizability in 3D character posing.
Method: Reformulates character posing as latent-space transformation instead of mesh vertex deformation. Uses a latent posing transformer to manipulate shape tokens based on skeletal motion, facilitated by dense pose representation. Includes latent-space supervision strategy and adaptive completion module for high-fidelity geometry and topological changes.
Result: Demonstrates superior performance in posing quality compared to existing methods. Naturally extends to 3D editing applications like part replacement and refinement.
Conclusion: Make-It-Poseable provides a robust and generalizable solution for 3D character posing by operating in latent space, overcoming limitations of traditional mesh deformation approaches and enabling broader 3D editing capabilities.
Abstract: Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.
[150] FlowDet: Unifying Object Detection and Generative Transport Flows
Enis Baty, C. P. Bridges, Simon Hadfield
Main category: cs.CV
TL;DR: FlowDet reformulates object detection using Conditional Flow Matching instead of diffusion, achieving faster performance scaling and better results than DiffusionDet.
Details
Motivation: To generalize object detection beyond diffusion-based generative approaches, enabling simpler, straighter transport paths for better performance scaling with inference steps.
Method: Reformulates object detection as a generative transport problem using Conditional Flow Matching, maintaining ability to vary number of boxes and inference steps without retraining.
Result: Outperforms DiffusionDet and non-generative baselines across various experiments, with gains up to +3.6% AP on COCO and +4.2% AP_rare on LVIS, especially effective in recall-constrained settings.
Conclusion: Conditional Flow Matching provides a superior formulation for generative object detection compared to diffusion, offering simpler transport paths and better performance scaling.
Abstract: We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
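For readers unfamiliar with Conditional Flow Matching, the sketch below shows a generic linear-path flow matching objective applied to box coordinates, which is the kind of generative transport FlowDet builds on. The velocity network, its conditioning on image features, and the box parameterization are placeholders, not the paper's architecture.

```python
# Hedged sketch of a conditional flow matching objective over bounding boxes,
# in the spirit of FlowDet's generative-transport framing. The velocity network
# and its conditioning are illustrative placeholders.
import torch
import torch.nn as nn

class BoxVelocityNet(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 + 1 + feat_dim, 256), nn.ReLU(), nn.Linear(256, 4)
        )

    def forward(self, boxes_t, t, img_feat):
        # boxes_t: (N, 4) noisy boxes, t: (N, 1) time, img_feat: (N, feat_dim)
        return self.mlp(torch.cat([boxes_t, t, img_feat], dim=-1))

def flow_matching_loss(model, gt_boxes, img_feat):
    """Linear-path flow matching: x_t = (1 - t) * noise + t * target,
    with target velocity (target - noise)."""
    noise = torch.randn_like(gt_boxes)
    t = torch.rand(gt_boxes.size(0), 1)
    x_t = (1 - t) * noise + t * gt_boxes
    target_velocity = gt_boxes - noise
    pred_velocity = model(x_t, t, img_feat)
    return ((pred_velocity - target_velocity) ** 2).mean()

if __name__ == "__main__":
    model = BoxVelocityNet()
    loss = flow_matching_loss(model, torch.rand(16, 4), torch.randn(16, 256))
    loss.backward()
```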
[151] Kling-Omni Technical Report
Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu
Main category: cs.CV
TL;DR: Kling-Omni is a unified generative framework that creates high-quality videos from multimodal inputs (text, images, videos) through end-to-end integration of generation, editing, and reasoning tasks.
Details
Motivation: To overcome the functional separation between different video generation, editing, and reasoning tasks by creating a holistic system that can handle diverse multimodal inputs and produce cinematic-quality video content.
Method: End-to-end framework that processes text instructions, reference images, and video contexts into unified multimodal representations, supported by a comprehensive data system, large-scale pre-training strategies, and infrastructure optimizations for inference.
Result: Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following, producing high-fidelity, intelligent video content.
Conclusion: Kling-Omni represents a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating, and interacting with dynamic and complex worlds, moving beyond just content creation tools.
Abstract: We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
[152] R3ST: A Synthetic 3D Dataset With Realistic Trajectories
Simone Teglia, Claudia Melis Tonti, Francesco Pro, Leonardo Russo, Andrea Alfarano, Leonardo Pentassuglia, Irene Amerini
Main category: cs.CV
TL;DR: R3ST is a synthetic 3D dataset that combines synthetic environments with real-world trajectories from drone footage to create realistic vehicle motion data for traffic analysis.
Details
Motivation: Existing real datasets lack precise ground-truth annotations, while synthetic datasets lack realistic vehicle motion. There's a need for datasets that combine accurate annotations with authentic human-driven trajectories.
Method: Generate synthetic 3D environment and integrate real-world trajectories derived from SinD (bird’s-eye-view dataset recorded from drone footage). This combines synthetic data generation with real trajectory data.
Result: R3ST dataset closes the gap between synthetic data and realistic trajectories, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.
Conclusion: The proposed dataset advances research in trajectory forecasting of road vehicles by providing realistic synthetic data with both accurate annotations and authentic motion patterns.
Abstract: Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors, however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird’s-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.
[153] KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals
Shuting Zhao, Zeyu Xiao, Xinrong Chen
Main category: cs.CV
TL;DR: KineST is a kinematics-guided state space model for efficient full-body motion tracking from sparse HMD signals, achieving better accuracy and temporal consistency than existing methods.
Details
Motivation: Existing full-body pose reconstruction methods from sparse HMD signals face challenges in balancing accuracy, temporal coherence, and computational efficiency, often requiring separate modeling of spatial and temporal dependencies.
Method: Proposes KineST with kinematics-guided bidirectional scanning in State Space Duality framework, mixed spatiotemporal representation learning, and geometric angular velocity loss for physical constraints.
Result: Extensive experiments show KineST achieves superior performance in both accuracy and temporal consistency within a lightweight framework.
Conclusion: KineST effectively addresses the trade-off between accuracy, smoothness, and efficiency in full-body motion tracking for AR/VR applications.
Abstract: Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
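The summary above mentions a geometric angular velocity loss that constrains rotational variation between frames. The sketch below is a generic geodesic-angle variant of such a loss, written for illustration; the paper's exact formulation may differ.

```python
# Hedged sketch of an angular-velocity-style loss on joint rotations: the
# relative rotation between consecutive frames is compared for predicted and
# ground-truth motion via the geodesic angle. A generic illustrative variant,
# not the paper's exact loss.
import torch

def geodesic_angle(R_a: torch.Tensor, R_b: torch.Tensor) -> torch.Tensor:
    """Angle of the relative rotation R_a^T R_b for batches of 3x3 matrices."""
    R_rel = R_a.transpose(-1, -2) @ R_b
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-6, 1.0 - 1e-6)
    return torch.acos(cos)

def angular_velocity_loss(pred_R: torch.Tensor, gt_R: torch.Tensor) -> torch.Tensor:
    """
    pred_R, gt_R: (T, J, 3, 3) rotation matrices over T frames and J joints.
    Penalises mismatch in frame-to-frame rotational change.
    """
    pred_vel = geodesic_angle(pred_R[:-1], pred_R[1:])  # (T-1, J)
    gt_vel = geodesic_angle(gt_R[:-1], gt_R[1:])
    return (pred_vel - gt_vel).abs().mean()

if __name__ == "__main__":
    eye = torch.eye(3).expand(8, 22, 3, 3).contiguous()
    print(angular_velocity_loss(eye, eye))  # zero for identical motions
```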
[154] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang
Main category: cs.CV
TL;DR: GeoPredict enhances Vision-Language-Action models with predictive 3D geometric reasoning for better robotic manipulation performance.
Details
Motivation: Current VLA models are largely reactive and 2D-centric, making them unreliable for tasks requiring precise 3D reasoning and spatial awareness in robotic manipulation.
Method: Introduces a geometry-aware VLA framework with two predictive modules: 1) trajectory-level module encoding motion history and predicting multi-step 3D keypoint trajectories, and 2) predictive 3D Gaussian geometry module forecasting workspace geometry with track-guided refinement along future trajectories. Uses depth-based rendering for training-time supervision only.
Result: GeoPredict consistently outperforms strong VLA baselines on RoboCasa Human-50, LIBERO, and real-world manipulation tasks, especially in geometry-intensive and spatially demanding scenarios.
Conclusion: Predictive 3D geometric reasoning significantly improves VLA model performance in robotic manipulation, addressing limitations of reactive 2D-centric approaches while maintaining efficient inference.
Abstract: Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
[155] DenseBEV: Transforming BEV Grid Cells into 3D Objects
Marius Dähling, Sebastian Krebs, J. Marius Zöllner
Main category: cs.CV
TL;DR: DenseBEV proposes using BEV feature cells directly as anchors for multi-camera 3D object detection, eliminating random queries and enabling efficient two-stage anchor generation with temporal modeling.
Details
Motivation: Traditional BEV-based transformers use random queries as anchors, which are less intuitive and efficient. Recent approaches add auxiliary networks, but a more direct approach using BEV features themselves as anchors could be more effective.
Method: 1) Uses BEV feature cells directly as anchors (each cell as potential object query), 2) Two-stage anchor generation for multi-camera 3D detection, 3) BEV-based Non-Maximum Suppression for gradient flow through non-suppressed objects, 4) Hybrid temporal modeling integrating prior detections with embedded temporal BEV information.
Result: Significant improvements on nuScenes (NDS and mAP) even with sparser BEV grids, particularly effective for small objects (3.8% mAP increase for pedestrians on nuScenes, 8% LET-mAP on Waymo). State-of-the-art on Waymo Open dataset with 60.7% LET-mAP, surpassing previous best by 5.4%.
Conclusion: DenseBEV demonstrates that using BEV features directly as anchors provides a more intuitive and efficient approach for multi-camera 3D object detection, achieving state-of-the-art performance with better efficiency and effectiveness, especially for small objects.
Abstract: In current research, Bird’s-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
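To make the "BEV cells as anchors" idea concrete, the sketch below scores every cell of a BEV feature map as a candidate object and prunes candidates with BEV-space NMS; the heads, box decoding, and grid resolution are illustrative placeholders rather than DenseBEV's implementation.

```python
# Hedged sketch of the "BEV cells as object queries" idea: every cell of a
# bird's-eye-view feature map is scored as a candidate object, and BEV-space NMS
# keeps only the strongest candidates. Shapes and heads are illustrative.
import torch
import torch.nn as nn
from torchvision.ops import nms

class BEVCellProposer(nn.Module):
    def __init__(self, channels: int = 256, cell_size: float = 0.5):
        super().__init__()
        self.score_head = nn.Linear(channels, 1)
        self.size_head = nn.Linear(channels, 2)  # per-cell (width, length) in metres
        self.cell_size = cell_size

    def forward(self, bev_feat: torch.Tensor, iou_thr: float = 0.5, topk: int = 300):
        # bev_feat: (C, H, W) -> (H*W, C) so each grid cell becomes one query.
        c, h, w = bev_feat.shape
        cells = bev_feat.permute(1, 2, 0).reshape(h * w, c)
        scores = self.score_head(cells).squeeze(-1).sigmoid()
        sizes = self.size_head(cells).exp()  # positive box sizes

        # Each cell's centre is its grid location scaled to metres.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        centres = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=-1).float() * self.cell_size
        boxes = torch.cat([centres - sizes / 2, centres + sizes / 2], dim=-1)  # (x1, y1, x2, y2)

        keep = nms(boxes, scores, iou_thr)[:topk]
        return boxes[keep], scores[keep], keep  # surviving cells act as object queries

if __name__ == "__main__":
    proposer = BEVCellProposer()
    boxes, scores, keep = proposer(torch.randn(256, 50, 50))
```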
[156] Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang
Main category: cs.CV
TL;DR: DMLR is a dynamic multimodal latent reasoning framework that uses confidence-guided optimization and dynamic visual injection to improve multimodal reasoning efficiency and performance.
Details
Motivation: Current multimodal reasoning methods rely on explicit step-by-step reasoning, have unstable perception-reasoning interaction, and incur high computational overhead. Inspired by human cognition where thinking involves dynamic interleaving of reasoning and perception, the authors aim to create a more efficient and effective multimodal reasoning framework.
Method: DMLR uses confidence-guided latent policy gradient optimization to refine latent think tokens for deeper reasoning. It introduces a Dynamic Visual Injection Strategy that retrieves the most relevant visual features at each latent think token, updates the best visual patches, and injects them into latent think tokens to achieve dynamic visual-textual interleaving.
Result: Experiments across seven multimodal reasoning benchmarks and various model architectures show that DMLR significantly improves both reasoning and perception performance while maintaining high inference efficiency.
Conclusion: DMLR successfully addresses limitations of existing multimodal reasoning methods by enabling dynamic interleaving of reasoning and perception, leading to improved performance and efficiency across diverse multimodal reasoning tasks.
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
[157] Next-Generation License Plate Detection and Recognition System using YOLOv8
Arslan Amin, Rafia Mumtaz, Muhammad Jawad Bashir, Syed Mohammad Hassan Zaidi
Main category: cs.CV
TL;DR: YOLOv8 variants achieve high precision for license plate and character recognition, with Nano variant (0.964 precision) for LPR and Small variant (0.92 precision) for character recognition, using custom character sequencing and optimized pipeline for edge deployment.
Details
Motivation: Efficient license plate detection and recognition are crucial for intelligent transportation systems, but existing methods struggle with consistent real-time accuracy in diverse environments. There's a need for robust solutions that work well on edge devices.
Method: Used YOLOv8 variants (Nano and Small) for license plate recognition and character recognition tasks on two distinct datasets. Introduced custom character sequencing method based on x-axis positions. Proposed optimized pipeline combining YOLOv8 Nano for LPR and YOLOv8 Small for character recognition.
Result: YOLOv8 Nano achieved 0.964 precision and 0.918 mAP50 on LPR task. YOLOv8 Small achieved 0.92 precision and 0.91 mAP50 on character recognition. The custom character sequencing method effectively organized detected characters. The combined pipeline maintains computational efficiency while ensuring high accuracy.
Conclusion: The proposed YOLOv8-based pipeline provides a robust solution for license plate recognition that balances accuracy and computational efficiency, making it suitable for real-world deployment on edge devices in intelligent transportation systems, advancing smarter urban infrastructure.
Abstract: In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
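The character sequencing step described in the abstract amounts to ordering detected character boxes by their horizontal position. A minimal sketch, assuming each detection is a (character, box) pair from the YOLOv8 character model:

```python
# Minimal sketch of x-axis character sequencing: detected character boxes are
# ordered left-to-right to form the plate string. The detection format
# (class label plus box) is assumed, not taken from the paper's code.
def sequence_characters(detections):
    """detections: list of (char, (x1, y1, x2, y2)) from the character detector."""
    # Sort by the horizontal centre of each box.
    ordered = sorted(detections, key=lambda d: (d[1][0] + d[1][2]) / 2.0)
    return "".join(char for char, _ in ordered)

if __name__ == "__main__":
    dets = [("B", (40, 5, 60, 30)), ("A", (10, 5, 30, 30)), ("C", (70, 5, 90, 30))]
    assert sequence_characters(dets) == "ABC"
```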
[158] Radiology Report Generation with Layer-Wise Anatomical Attention
Emmanuel D. Muñiz-De-León, Jorge A. Rosales-de-Golferichs, Ana S. Muñoz-Rodríguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta, Antonio Martínez-Torteya
Main category: cs.CV
TL;DR: Compact image-to-text model generates chest X-ray findings from single frontal image using frozen DINOv3 encoder + GPT-2 decoder with anatomical attention, outperforming SOTA models without large-scale multimodal training.
Details
Motivation: Current SOTA radiology report generation systems (MAIRA-2, MedPaLM-M) require large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most clinical settings. There's a need for simpler, more accessible approaches.
Method: Combines frozen DINOv3 Vision Transformer encoder with GPT-2 decoder enhanced by layer-wise anatomical attention. Uses lung and heart segmentation masks with hierarchical Gaussian smoothing to bias attention toward clinically relevant regions without adding trainable parameters. Generates Findings section from single frontal chest X-ray image.
Result: Achieved substantial gains on MIMIC-CXR dataset: CheXpert Macro-F1 for five key pathologies increased 168% (0.083 -> 0.238), Micro-F1 increased 146% (0.137 -> 0.337). Broader performance across 14 observations improved 86% (0.170 -> 0.316). Structural coherence (RadGraph F1) improved 9.7%.
Conclusion: Despite small size and purely image-conditioned design, decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions, demonstrating that effective radiology report generation doesn’t require large-scale multimodal training.
Abstract: Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.
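The layer-wise anatomical attention described above biases attention toward segmented lung and heart regions without adding parameters. The sketch below illustrates one parameter-free way to do this, adding a log-space mask bias to cross-attention logits over patch tokens; the paper's hierarchical Gaussian smoothing and per-layer schedule are not reproduced, and the tensor shapes are assumptions.

```python
# Hedged sketch of anatomy-biased cross-attention: a (smoothed) organ mask,
# resampled to the patch grid, is added as a log-space bias to the attention
# logits so text tokens attend preferentially to lung/heart patches.
import torch
import torch.nn.functional as F

def anatomy_biased_attention(q, k, v, organ_mask, patch_grid=(14, 14), eps=1e-6):
    """
    q: (T, d) decoder queries, k, v: (P, d) image patch keys/values,
    organ_mask: (H, W) soft mask in [0, 1] from lung/heart segmentation.
    """
    # Downsample the mask to one value per patch token.
    mask_small = F.interpolate(
        organ_mask[None, None], size=patch_grid, mode="bilinear", align_corners=False
    ).reshape(-1)                                            # (P,)
    logits = q @ k.t() / q.size(-1) ** 0.5                   # (T, P)
    logits = logits + torch.log(mask_small + eps)            # bias toward organ regions
    attn = logits.softmax(dim=-1)
    return attn @ v

if __name__ == "__main__":
    T, P, d = 8, 196, 64
    out = anatomy_biased_attention(
        torch.randn(T, d), torch.randn(P, d), torch.randn(P, d),
        organ_mask=torch.rand(224, 224),
    )
    assert out.shape == (T, d)
```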
[159] OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction
Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang
Main category: cs.CV
TL;DR: OpenTouch is the first in-the-wild egocentric full-hand tactile dataset with synchronized video-touch-pose data, enabling multimodal perception research.
Details
Motivation: Despite the hand being our primary interface with the physical world, current egocentric perception lacks understanding of when, where, and how forcefully hands make contact. There's a gap between visual perception and physical interaction, with no existing in-the-wild datasets aligning first-person video with full-hand touch.
Method: The authors present OpenTouch, a dataset containing 5.1 hours of synchronized video-touch-pose data with 2,900 curated clips and detailed text annotations. They introduce retrieval and classification benchmarks to probe how touch grounds perception and action.
Result: Tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries.
Conclusion: By releasing this annotated vision-touch-pose dataset and benchmark, the authors aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
Abstract: The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
[160] GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad
Main category: cs.CV
TL;DR: GenEval benchmark has drifted from human judgment over time, showing up to 17.7% error for current models. Authors introduce GenEval 2 with better visual concept coverage and compositionality, plus Soft-TIFA evaluation method to reduce future drift.
Details
Motivation: Automated T2I evaluation faces benchmark drift problems where static benchmarks fail to keep up with evolving model capabilities, as shown by GenEval's declining alignment with human judgment over time.
Method: 1) Analyze benchmark drift in GenEval via human study showing 17.7% absolute error. 2) Create GenEval 2 with improved primitive visual concept coverage and higher compositionality. 3) Develop Soft-TIFA evaluation combining judgments for visual primitives.
Result: GenEval has significantly drifted from human judgment, indicating saturation. GenEval 2 is more challenging for current models, and Soft-TIFA shows better alignment with human judgment and potentially less future drift than holistic judges like VQAScore.
Conclusion: Benchmark drift is a serious problem in T2I evaluation requiring continual audits. GenEval 2 and Soft-TIFA provide improved benchmarking, but ongoing vigilance is needed to maintain alignment with human judgment as models evolve.
Abstract: Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time – resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
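The summary does not spell out how Soft-TIFA combines per-primitive judgments, so the snippet below is only a guess at the general shape of such an evaluator: decompose a prompt into primitive questions, score each with a soft VQA-style judge, and aggregate (a plain mean here). The decomposition, the judge, and the aggregation rule are all assumptions for illustration.

```python
# Hedged illustration of combining per-primitive judgments into a single image
# score, in the spirit of TIFA-style evaluation. The paper's Soft-TIFA may
# differ in decomposition, judge, and aggregation.
from typing import Callable, Dict, List

def soft_primitive_score(
    image_path: str,
    primitive_questions: List[str],
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """judge(image_path, question) returns a soft correctness score in [0, 1]."""
    per_question = {q: judge(image_path, q) for q in primitive_questions}
    per_question["overall"] = sum(per_question.values()) / len(primitive_questions)
    return per_question

if __name__ == "__main__":
    # Toy judge that pretends colour is right but count is wrong.
    toy_judge = lambda img, q: 0.9 if "color" in q else 0.2
    scores = soft_primitive_score(
        "sample.png",
        ["Is the cube's color red?", "Are there exactly three cubes?"],
        toy_judge,
    )
    print(scores)
```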
[161] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
Main category: cs.CV
TL;DR: RePlan is a plan-then-execute framework for complex image editing that uses a vision-language planner to decompose instructions and ground them to regions, then applies parallel multi-region edits via attention-region injection without iterative inpainting.
Details
Motivation: Existing instruction-based image editing models struggle with Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes, leading to poor performance in fine-grained grounding and knowledge-intensive edits.
Method: RePlan couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and grounds them to target regions. The editor then applies changes using a training-free attention-region injection mechanism, enabling precise parallel multi-region edits without iterative inpainting. GRPO-based reinforcement learning is applied using 1K instruction-only examples to improve planning.
Result: RePlan consistently outperforms strong baselines trained on far larger datasets across IV-Complex settings, improving regional precision and overall fidelity. The authors also present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits.
Conclusion: The RePlan framework effectively addresses IV-Complexity in instruction-based image editing through its plan-then-execute approach with explicit region grounding and parallel editing, demonstrating superior performance despite using less training data.
Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
[162] Pixel Seal: Adversarial-only training for invisible image and video watermarking
Tomáš Souček, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu, Tuan Tran, Tom Sander, Alexandre Mourachko
Main category: cs.CV
TL;DR: Pixel Seal introduces a new SOTA image/video watermarking method that addresses three key issues in existing approaches: unreliable perceptual losses, optimization instability, and poor scaling to high resolutions.
Details
Motivation: Current watermarking methods struggle to balance robustness and true imperceptibility, relying on proxy losses that don't mimic human perception well, causing optimization instability, and failing to scale effectively to high-resolution images and videos.
Method: 1) Adversarial-only training paradigm eliminating unreliable pixel-wise imperceptibility losses; 2) Three-stage training schedule decoupling robustness and imperceptibility for stable convergence; 3) High-resolution adaptation using JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts.
Result: Pixel Seal achieves clear improvements over state-of-the-art methods in both robustness and imperceptibility across different image types and transformations, and efficiently adapts to video via temporal watermark pooling.
Conclusion: Pixel Seal provides a practical and scalable solution for reliable provenance tracing in real-world image and video settings, addressing fundamental limitations of existing watermarking approaches.
Abstract: Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
[163] Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
Main category: cs.CV
TL;DR: ReMeDI-SAM3 is a training-free memory-enhanced extension of SAM3 for surgical instrument segmentation that improves occlusion handling and identity recovery through relevance-aware memory filtering, piecewise interpolation, and feature-based re-identification.
Details
Motivation: Surgical instrument segmentation in endoscopic videos is challenging due to occlusions, rapid motion, specular artefacts, and instrument re-entry. While SAM3 provides a spatio-temporal framework, it suffers from indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions.
Method: Three key components: (1) relevance-aware memory filtering with occlusion-aware memory for storing pre-occlusion frames, (2) piecewise interpolation scheme to expand effective memory capacity, and (3) feature-based re-identification module with temporal voting for post-occlusion identity disambiguation.
Result: Evaluations on EndoVis17 and EndoVis18 under zero-shot setting show absolute mcIoU improvements of ~7% and ~16% respectively over vanilla SAM3, outperforming even prior training-based approaches.
Conclusion: ReMeDI-SAM3 effectively addresses SAM3’s limitations for surgical scenes through memory enhancement techniques, enabling reliable recovery after occlusions and reducing error accumulation without requiring training.
Abstract: Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
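As an illustration of feature-based re-identification with temporal voting, the sketch below matches post-occlusion embeddings against pre-occlusion memory entries and majority-votes the identity; the embedding source, window length, and similarity threshold are placeholders, not ReMeDI-SAM3's settings.

```python
# Hedged sketch of feature-based re-identification with temporal voting, as the
# summary describes for post-occlusion identity recovery. Values are placeholders.
from collections import Counter
import torch
import torch.nn.functional as F

def reidentify(query_feats, memory_feats, memory_ids, sim_threshold=0.5):
    """
    query_feats: (K, d) embeddings of the re-appearing object over the last K frames.
    memory_feats: (M, d) embeddings stored before the occlusion.
    memory_ids: list of length M with the identity of each stored embedding.
    Returns the majority-voted identity, or None if similarity is too low.
    """
    votes = []
    for q in query_feats:
        sims = F.cosine_similarity(q[None], memory_feats, dim=-1)  # (M,)
        best = int(sims.argmax())
        if sims[best] >= sim_threshold:
            votes.append(memory_ids[best])
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    mem = F.normalize(torch.randn(6, 128), dim=-1)
    ids = ["needle_driver"] * 3 + ["grasper"] * 3
    query = mem[:3] + 0.05 * torch.randn(3, 128)  # looks like the needle driver
    print(reidentify(query, mem, ids))
```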
[164] M-PhyGs: Multi-Material Object Dynamics from Video
Norika Wada, Kohei Yamashita, Ryo Kawahara, Ko Nishino
Main category: cs.CV
TL;DR: M-PhyGs estimates material composition and mechanical parameters of complex multi-material objects (like flowers) from video, using 3D Gaussian representations and novel training losses.
Details
Motivation: Real-world objects often have complex material compositions and geometries that existing methods can't handle, as they assume homogeneous single-material objects, pre-learned dynamics, or simple topologies. Flowers serve as a representative example of such complex natural objects.
Method: Multi-material Physical Gaussians (M-PhyGs) uses 3D Gaussian representations to jointly segment objects into similar materials and recover continuum mechanical parameters from short videos. It employs cascaded 3D and 2D losses, temporal mini-batching, and accounts for gravity.
Result: The method is evaluated on the novel Phlowers dataset of people interacting with flowers. Experimental results demonstrate the accuracy and effectiveness of M-PhyGs and its components for multi-material physical parameter estimation.
Conclusion: M-PhyGs successfully addresses the challenging task of estimating physical material parameters for complex multi-material objects from visual data, overcoming limitations of existing approaches that can’t handle real-world object complexity.
Abstract: Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.
[165] Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation
Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello, Koki Nagano
Main category: cs.CV
TL;DR: Instant4D: A method that distills knowledge from 2D diffusion models into a feed-forward encoder to create fast, 3D-consistent, and expressive portrait animation from single images.
Details
Motivation: Current 2D video diffusion models lack 3D consistency and speed, while 3D-aware methods have inferior expression details. There's a need to combine the strengths of both approaches for real-world applications like digital twins and telepresence.
Method: Distills knowledge from 2D diffusion models into a feed-forward encoder that converts single images into 3D-consistent animatable representations. Uses decoupled animation representation that learns motion implicitly from data, eliminating dependency on parametric models. Employs efficient lightweight local fusion instead of computationally intensive global fusion mechanisms.
Result: Achieves 107.31 FPS for animation and pose control while maintaining comparable animation quality to state-of-the-art methods. Successfully balances speed and quality without trading one for the other.
Conclusion: The method successfully combines the strengths of 2D diffusion models (expression detail) and 3D-aware methods (consistency and speed), creating a practical solution for real-time portrait animation applications.
Abstract: Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods – built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting – ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face’s 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d
[166] FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu
Main category: cs.CV
TL;DR: FlashPortrait is a video diffusion transformer that generates ID-consistent long portrait animations with 6x speed acceleration using normalized facial expression alignment and dynamic sliding windows.
Details
Motivation: Current diffusion-based methods for long portrait animation fail to maintain identity consistency while being computationally slow. There's a need for an approach that preserves identity across long sequences while accelerating inference.
Method: Uses off-the-shelf facial expression extractor, Normalized Facial Expression Block to align features with diffusion latents, dynamic sliding-window scheme with weighted blending for smooth transitions, and higher-order latent derivatives to skip denoising steps for 6x acceleration.
Result: Achieves identity-preserving infinite-length video generation with 6x inference speed acceleration, demonstrated through qualitative and quantitative experiments on benchmarks.
Conclusion: FlashPortrait effectively addresses identity consistency in long portrait animation while significantly accelerating inference through innovative normalization techniques and dynamic windowing with latent derivative prediction.
Abstract: Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
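The step-skipping idea in the abstract, which extrapolates future latents from higher-order derivatives of the current latent trajectory, can be illustrated with a small finite-difference sketch. This is a minimal illustration under assumptions (uniform timestep spacing, a simple Taylor-style second-order correction), not the paper's implementation.
```python
import torch

def extrapolate_latent(z_prev, z_curr, dt_prev, dt_next, z_prev2=None):
    """Predict the latent at the next denoising timestep from recent latents.

    z_prev2, z_prev, z_curr: cached latents at consecutive timesteps.
    dt_prev: spacing between the previous and current timestep.
    dt_next: spacing between the current and the predicted timestep.
    """
    # First-order (velocity) estimate from the last two latents.
    d1 = (z_curr - z_prev) / dt_prev
    z_next = z_curr + d1 * dt_next
    if z_prev2 is not None:
        # Second-order (acceleration) correction from three latents.
        d1_prev = (z_prev - z_prev2) / dt_prev
        d2 = (d1 - d1_prev) / dt_prev
        z_next = z_next + 0.5 * d2 * dt_next ** 2
    return z_next

# Toy usage: three cached latents from earlier denoising steps.
z2, z1, z0 = (torch.randn(1, 4, 64, 64) for _ in range(3))
z_pred = extrapolate_latent(z1, z0, dt_prev=1.0, dt_next=1.0, z_prev2=z2)
```
In a real sampler, the predicted latent would replace one or more skipped denoising calls whenever the observed latent variation rate makes the extrapolation trustworthy.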
[167] Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao
Main category: cs.CV
TL;DR: Alchemist is a meta-gradient-based framework that automatically selects high-quality subsets from large-scale text-image datasets to improve T2I model training efficiency and visual quality.
Details
Motivation: Current T2I models are limited by low-quality training data from web-crawled and synthetic sources, which degrade visual fidelity and cause unstable training. Existing data selection methods rely on costly manual curation or simplistic heuristics, and meta-learning approaches haven't been adapted for image modalities.
Method: Two-stage framework: 1) Data rating using a lightweight rater that estimates sample influence via gradient information with multi-granularity perception, and 2) Data pruning using Shift-Gsampling strategy to select informative subsets for efficient training.
Result: Alchemist consistently improves visual quality and downstream performance. Training on just 50% of data selected by Alchemist can outperform training on the full dataset across both synthetic and web-crawled datasets.
Conclusion: Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for T2I training that effectively addresses data quality issues and improves training efficiency without manual intervention.
Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning-based methods have been explored for LLMs, there is no adaptation for image modalities. To this end, we propose Alchemist, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample’s influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
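A common way to turn gradient information into a per-sample influence score, as the rating stage describes, is to measure how well a sample's loss gradient aligns with the gradient of a trusted reference batch. The sketch below uses that proxy on a toy classifier; the reference batch, the stand-in model, and the 50% keep ratio are illustrative assumptions rather than Alchemist's exact recipe.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(128, 10)                      # stand-in for a T2I backbone

def flat_grad(loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters into one vector."""
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.flatten() for g in grads])

# Reference ("meta") batch believed to be high quality.
x_ref, y_ref = torch.randn(32, 128), torch.randint(0, 10, (32,))
g_ref = flat_grad(F.cross_entropy(model(x_ref), y_ref))

# Score each candidate sample by gradient alignment with the reference batch.
scores = []
for x_i, y_i in zip(torch.randn(200, 128), torch.randint(0, 10, (200,))):
    g_i = flat_grad(F.cross_entropy(model(x_i[None]), y_i[None]))
    scores.append(torch.dot(g_i, g_ref).item())

scores = torch.tensor(scores)
keep = torch.topk(scores, k=len(scores) // 2).indices   # keep the top-scoring half
```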
[168] VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma
Main category: cs.CV
TL;DR: VIVA is a scalable framework for instruction-based video editing that uses VLM-guided encoding and reward optimization to improve generalization to diverse real-world instructions.
Details
Motivation: Existing diffusion-based video editing methods are limited by training on simple editing operations, which restricts their ability to generalize to complex, real-world instructions.
Method: 1) VLM-based instructor encodes text instructions, first video frame, and optional reference image into visually-grounded representations. 2) Edit-GRPO post-training adapts Group Relative Policy Optimization for video editing to optimize for instruction-faithful, content-preserving, and aesthetically pleasing edits. 3) Synthetic data pipeline generates diverse, high-fidelity video-instruction pairs.
Result: VIVA achieves superior instruction following, generalization, and editing quality compared to state-of-the-art methods in extensive experiments.
Conclusion: The proposed VIVA framework effectively addresses the generalization gap in instruction-based video editing through VLM-guided encoding and reward optimization, enabling better handling of diverse real-world instructions.
Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
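The "relative rewards" in Edit-GRPO follow the usual GRPO pattern: each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that step is below; the composite reward terms and group size are made-up values for illustration, not the paper's reward design.
```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each reward within its group.

    rewards: tensor of shape (num_groups, group_size), one group per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical composite reward for 4 edited videos sampled from one instruction.
instruction_score = torch.tensor([[0.9, 0.4, 0.7, 0.6]])   # instruction faithfulness
preserve_score    = torch.tensor([[0.8, 0.9, 0.5, 0.7]])   # content preservation
aesthetic_score   = torch.tensor([[0.6, 0.5, 0.8, 0.7]])   # visual quality
rewards = instruction_score + preserve_score + aesthetic_score
advantages = group_relative_advantages(rewards)  # weights the policy-gradient update
```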
[169] Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
Main category: cs.CV
TL;DR: EgoMAN: A reasoning-to-motion framework for 3D hand trajectory prediction using a new large-scale egocentric dataset with semantic QA pairs.
Details
Motivation: Existing 3D hand trajectory prediction methods are limited by datasets that separate motion from semantic supervision and models that weakly connect reasoning with action.
Method: Introduces EgoMAN dataset (219K 6DoF trajectories + 3M QA pairs) and EgoMAN model - a reasoning-to-motion framework linking vision-language reasoning to motion generation via trajectory-token interface with progressive training.
Result: The approach produces accurate and stage-aware trajectories with generalization across real-world scenes.
Conclusion: EgoMAN addresses limitations of prior work by integrating semantic reasoning with motion prediction through a novel dataset and framework.
Abstract: Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
[170] SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem
Main category: cs.CV
TL;DR: SceneDiff: A new benchmark and training-free method for multiview object change detection that identifies added, removed, or moved objects between scene captures, achieving significant performance improvements over existing approaches.
Details
Motivation: Detecting object changes between scene captures is important for applications like robotic tidying and construction monitoring, but varying viewpoints can cause false positives. Existing benchmarks lack multiview scenarios with object instance annotations.
Method: A training-free approach that leverages pretrained 3D, segmentation, and image encoding models. It aligns captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes.
Result: Outperforms existing approaches by large margins (94% and 37.4% relative AP improvements) on multiview and two-view benchmarks. Introduces SceneDiff Benchmark with 350 diverse video pairs and thousands of changed objects.
Conclusion: The SceneDiff method and benchmark provide effective solutions for multiview object change detection, addressing viewpoint variations and enabling robust change detection across different applications.
Abstract: We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
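The training-free comparison step, matching object regions across two already-aligned captures by spatial proximity and semantic feature similarity, can be sketched as follows. The thresholds, the use of 3D centroids, and the unit-norm image-encoder features are assumptions for illustration, not SceneDiff's exact matching rule.
```python
import numpy as np

def detect_changes(regions_a, regions_b, sem_thresh=0.8, dist_thresh=0.3):
    """Flag removed/added objects between two aligned captures of the same scene.

    Each region is a dict with a 3D centroid (in the shared frame after alignment)
    and a unit-norm semantic feature from an image encoder.
    """
    changed_a, matched_b = [], set()
    for i, ra in enumerate(regions_a):
        best_j, best_sim = None, -1.0
        for j, rb in enumerate(regions_b):
            dist = np.linalg.norm(ra["centroid"] - rb["centroid"])
            sim = float(ra["feat"] @ rb["feat"])
            if dist < dist_thresh and sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is None or best_sim < sem_thresh:
            changed_a.append(i)          # no spatially close, semantically similar match
        else:
            matched_b.add(best_j)
    changed_b = [j for j in range(len(regions_b)) if j not in matched_b]
    return changed_a, changed_b          # (removed-or-moved, added) region indices

def make_region(rng):
    feat = rng.normal(size=16)
    return {"centroid": rng.normal(size=3), "feat": feat / np.linalg.norm(feat)}

rng = np.random.default_rng(0)
removed, added = detect_changes([make_region(rng) for _ in range(3)],
                                [make_region(rng) for _ in range(4)])
```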
[171] MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
Main category: cs.CV
TL;DR: MomaGraph is a unified scene representation for mobile manipulators that integrates spatial-functional relationships and part-level interactive elements, with a dataset, benchmark, and 7B vision-language model achieving SOTA results.
Details
Motivation: Mobile manipulators in households need compact, semantically rich scene representations that capture object locations, functions, and actionable parts. Prior scene graph approaches separate spatial/functional relations, treat scenes as static snapshots without object states or temporal updates, and overlook task-relevant information.
Method: Introduces MomaGraph unified scene representation, creates MomaGraph-Scenes dataset (large-scale annotated task-driven scene graphs), develops MomaGraph-Bench evaluation suite (6 reasoning capabilities), and trains MomaGraph-R1 (7B vision-language model with reinforcement learning) that predicts task-oriented scene graphs and serves as zero-shot task planner under Graph-then-Plan framework.
Result: Achieves state-of-the-art results among open-source models with 71.6% accuracy on benchmark (+11.4% over best baseline), generalizes across public benchmarks, and transfers effectively to real-robot experiments.
Conclusion: MomaGraph provides a comprehensive solution for embodied agents with unified scene representation, dataset, benchmark, and model that advances household mobile manipulation capabilities through integrated spatial-functional understanding and task-oriented reasoning.
Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
[172] SFTok: Bridging the Performance Gap in Discrete Tokenizers
Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: SFTok is a novel discrete image tokenizer with multi-step iterative reconstruction that achieves state-of-the-art performance at high compression rates.
Details
Motivation: Discrete tokenizers lag behind continuous ones in multimodal systems despite their natural alignment with autoregressive paradigms, limiting their adoption in high-resolution image generation.
Method: SFTok incorporates a multi-step iterative mechanism with self-forcing guided visual reconstruction and debias-and-fitting training strategy to resolve training-inference inconsistency.
Result: At 64 tokens per image compression, SFTok achieves SOTA reconstruction quality on ImageNet (rFID = 1.21) and excellent class-to-image generation (gFID = 2.29).
Conclusion: SFTok bridges the gap between discrete and continuous tokenizers, enabling efficient high-resolution image generation while maintaining superior reconstruction quality.
Abstract: Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
[173] Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
Main category: cs.CV
TL;DR: A panoramic metric depth foundation model that generalizes across diverse scene distances using a data-in-the-loop paradigm with comprehensive data collection and novel optimization techniques.
Details
Motivation: To create a robust panoramic depth estimation model that works across diverse scene distances and domains (indoor/outdoor, synthetic/real), addressing the challenge of domain gaps and limited training data for panoramic depth estimation.
Method: Three-stage pseudo-label curation pipeline to generate reliable ground truth from unlabeled images; uses DINOv3-Large backbone with plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization for robustness to varying distances and geometric consistency.
Result: Strong performance on multiple benchmarks (Stanford2D3D, Matterport3D, Deep360) with zero-shot generalization capability, showing robust and stable metric predictions in diverse real-world scenes.
Conclusion: The proposed panoramic metric depth foundation model effectively generalizes across diverse scene distances through comprehensive data collection and novel optimization techniques, demonstrating practical applicability in real-world scenarios.
Abstract: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/
[174] StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen
Main category: cs.CV
TL;DR: UniStereo dataset and StereoPilot model for efficient monocular-to-stereo video conversion without depth maps or diffusion sampling.
Details
Motivation: High demand for stereo video content but production is costly/complex; existing multi-stage DWI pipeline has error propagation, depth ambiguity, and format inconsistency issues between parallel and converged stereo configurations.
Method: 1) Create UniStereo - first large-scale unified dataset covering both stereo formats; 2) Propose StereoPilot - efficient feed-forward model that directly synthesizes target view without explicit depth maps or iterative diffusion sampling; uses learnable domain switcher and cycle consistency loss for format adaptation.
Result: StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency.
Conclusion: The proposed approach addresses limitations of traditional DWI pipeline and enables high-quality, efficient stereo video conversion adaptable to different stereo formats.
Abstract: The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.
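One way to picture the cycle consistency term and the learnable domain switcher is the toy sketch below: a tiny stand-in view synthesizer is conditioned on a stereo-format embedding and a direction embedding, maps left to right and back, and is penalized for not recovering the original left view. The architecture, the direction conditioning, and the L1 form of the loss are placeholders, not StereoPilot's design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStereoSynth(nn.Module):
    """Stand-in feed-forward view synthesizer with a learnable domain switcher."""
    def __init__(self):
        super().__init__()
        self.domain_emb = nn.Embedding(2, 8)   # 0: parallel, 1: converged format
        self.dir_emb = nn.Embedding(2, 8)      # 0: left->right, 1: right->left
        self.net = nn.Sequential(nn.Conv2d(3 + 16, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, img, domain, direction):
        cond = torch.cat([self.domain_emb(domain), self.dir_emb(direction)], dim=1)
        cond = cond[:, :, None, None].expand(-1, -1, *img.shape[2:])
        return self.net(torch.cat([img, cond], dim=1))

model = TinyStereoSynth()
left = torch.rand(2, 3, 64, 64)
domain = torch.tensor([0, 1])                        # per-sample stereo format
l2r = torch.zeros(2, dtype=torch.long)
r2l = torch.ones(2, dtype=torch.long)

right_hat = model(left, domain, l2r)                 # synthesize the right view
left_cyc = model(right_hat, domain, r2l)             # map it back to the left view
cycle_loss = F.l1_loss(left_cyc, left)               # cycle consistency term
```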
[175] DVGT: Driving Visual Geometry Transformer
Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu
Main category: cs.CV
TL;DR: DVGT is a driving-targeted dense geometry perception model that reconstructs global 3D point maps from unposed multi-view visual inputs without explicit camera parameters or 3D geometric priors.
Details
Motivation: There's a lack of driving-targeted dense geometry perception models that can adapt to different scenarios and camera configurations. Current methods rely on precise camera parameters and external sensors for alignment.
Method: Uses DINO backbone for visual feature extraction, then employs alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations. Multiple heads decode global point maps in ego coordinates and ego poses for each frame.
Result: DVGT significantly outperforms existing models on various scenarios, trained on a large mixture of driving datasets (nuScenes, OpenScene, Waymo, KITTI, DDAD). It directly predicts metric-scaled geometry without post-alignment with external sensors.
Conclusion: DVGT provides a flexible, camera-agnostic solution for dense 3D geometry perception in autonomous driving that eliminates dependency on precise camera parameters and external sensor alignment.
Abstract: Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
[176] AdaTooler-V: Adaptive Tool-Use for Images and Videos
Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue
Main category: cs.CV
TL;DR: AdaTooler-V is a multimodal LLM that adaptively decides when to use vision tools, avoiding unnecessary tool invocations through a novel RL algorithm (AT-GRPO) and specialized training datasets, achieving state-of-the-art performance on visual reasoning benchmarks.
Details
Motivation: Existing MLLMs often exhibit "blind tool-use" patterns, invoking vision tools even when unnecessary, which increases inference overhead and degrades performance. There's a need for models that can intelligently decide when visual tools are genuinely beneficial.
Method: Proposes AdaTooler-V with two key components: 1) AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on Tool Benefit Scores to encourage tool use only when beneficial, and 2) Two training datasets: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL training with verifiable rewards across single-image, multi-image, and video data.
Result: Achieves strong performance across twelve benchmarks, with AdaTooler-V-7B reaching 89.8% accuracy on the high-resolution V* benchmark, surpassing commercial models like GPT-4o and Gemini 1.5 Pro. The model demonstrates efficient adaptive tool-use reasoning.
Conclusion: AdaTooler-V successfully addresses the blind tool-use problem in MLLMs through adaptive tool-use reasoning and specialized training approaches, achieving state-of-the-art performance while reducing unnecessary computation. The work provides open-source code, models, and data for community use.
Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
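A rough picture of how a Tool Benefit Score could scale the reward signal: estimate, per sample, how much accuracy the tool actually buys, then make the tool-use bonus proportional to that estimate. The exact shaping below (a correctness base reward plus a benefit-scaled bonus) is an assumption for illustration, not AT-GRPO's formula.
```python
def tool_benefit_score(acc_with_tool, acc_without_tool):
    """Per-sample estimate of how much a vision tool actually helps (range [-1, 1])."""
    return acc_with_tool - acc_without_tool

def shaped_reward(correct, used_tool, benefit, base=1.0, tool_weight=0.5):
    """Reward correctness; scale the tool-use bonus or penalty by the benefit score."""
    r = base if correct else 0.0
    if used_tool:
        r += tool_weight * benefit            # bonus only when the tool genuinely helps
    else:
        r -= tool_weight * min(benefit, 0.0)  # small bonus for skipping a useless tool
    return r

# Toy rollouts: correct answers, tool invoked, under benefits of +0.4 vs. -0.2.
print(shaped_reward(True, True, benefit=0.4))    # tool was worth calling
print(shaped_reward(True, True, benefit=-0.2))   # blind tool use is discouraged
```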
[177] EasyV2V: A High-quality Instruction-based Video Editing Framework
Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei
Main category: cs.CV
TL;DR: EasyV2V is a simple yet effective framework for instruction-based video editing that addresses challenges in consistency, control, and generalization through systematic design of data, architecture, and control mechanisms.
Details
Motivation: Video editing has lagged behind image editing due to challenges in maintaining temporal consistency, providing flexible control, and generalizing across diverse editing tasks. The authors aim to create a comprehensive framework that addresses these limitations.
Method: The approach has three key components: 1) Data: compose existing experts with fast inverses, lift image edit pairs into videos via single-frame supervision, mine dense-captioned clips, and add transition supervision; 2) Architecture: leverage pretrained text-to-video models’ inherent editing capability with simple sequence concatenation and light LoRA fine-tuning; 3) Control: unify spatiotemporal control via single mask mechanism and support optional reference images.
Result: EasyV2V achieves state-of-the-art video editing results, surpassing both concurrent research systems and commercial solutions. It supports flexible input combinations including video+text, video+mask+text, and video+mask+reference+text.
Conclusion: The paper demonstrates that a carefully designed framework addressing data, architecture, and control can significantly advance video editing capabilities, making instruction-based video editing more accessible and effective while maintaining consistency and control.
Abstract: While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
[178] Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
Main category: cs.CV
TL;DR: AuditDM is an automated framework that discovers and fixes MLLM failure modes by training an auditor model to generate challenging questions and counterfactual images that maximize disagreement among target models.
Details
Motivation: Current multimodal LLM evaluation methods lack interpretability and fail to fully reveal significant capability gaps between models, limiting effective model diagnosis and improvement.
Method: Fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models, uncovering diverse failure types.
Result: Applied to Gemma-3 and PaliGemma-2, discovered 20+ distinct failure types. Fine-tuning on these discoveries improved all models across 16 benchmarks, enabling a 3B model to surpass its 28B counterpart.
Conclusion: As data scaling hits diminishing returns, targeted model auditing offers an effective path for model diagnosis and improvement, providing interpretable failure exemplars and annotation-free data for rectification.
Abstract: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
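The auditor's reward is tied to how much the target models disagree on its generated question. One simple disagreement measure over the models' answers is sketched below; the string normalization and majority-vote definition are assumptions for illustration, not AuditDM's exact reward.
```python
from collections import Counter

def disagreement_reward(answers):
    """Fraction of models that deviate from the majority answer (0 = full agreement)."""
    counts = Counter(a.strip().lower() for a in answers)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(answers)

# Answers from several target MLLMs to one auditor-generated question.
print(disagreement_reward(["a red cube", "a red cube", "a blue sphere"]))  # ~0.33
print(disagreement_reward(["two dogs"] * 4))                               # 0.0
```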
[179] Next-Embedding Prediction Makes Strong Vision Learners
Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu
Main category: cs.CV
TL;DR: NEPA (Next-Embedding Predictive Autoregression) is a simple self-supervised visual learning method that trains models to predict future patch embeddings from past ones, achieving strong performance on ImageNet and segmentation without complex designs.
Details
Motivation: Inspired by generative pretraining success in NLP, the paper explores whether similar principles can create strong self-supervised visual learners by shifting from learning representations to learning models that directly perform predictive tasks.
Method: NEPA trains models to predict future patch embeddings conditioned on past ones using causal masking and stop gradient. It uses simple Transformer architecture pretrained on ImageNet-1k with next embedding prediction as the sole objective, avoiding pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads.
Result: Achieves 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transfers effectively to semantic segmentation on ADE20K.
Conclusion: Generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning, retaining architectural simplicity without requiring additional design complexity.
Abstract: Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
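The core objective, predicting the next patch embedding from past ones with causal masking and a stop gradient on the targets, fits in a short sketch. Positional embeddings, the exact loss (cosine is assumed here), and the backbone size are simplifications, not the paper's configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPredictor(nn.Module):
    """Minimal causal Transformer trained to predict the next patch embedding."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, emb):
        T = emb.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.encoder(emb, mask=causal)       # each position sees only the past

patch_embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)   # non-overlapping patches
model = NextEmbeddingPredictor()

imgs = torch.rand(2, 3, 224, 224)
emb = patch_embed(imgs).flatten(2).transpose(1, 2)            # (B, 196, 256)

pred = model(emb[:, :-1])                                     # predict from past patches
target = emb[:, 1:].detach()                                  # stop gradient on targets
loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()   # next-embedding loss
loss.backward()
```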
[180] Generative Refocusing: Flexible Defocus Control from a Single Image
Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
Main category: cs.CV
TL;DR: A two-step generative approach for single-image refocusing that recovers all-in-focus images and creates controllable bokeh effects using semi-supervised training with real optical data.
Details
Motivation: Depth-of-field control is essential in photography but difficult to achieve with single images. Current methods require all-in-focus inputs, rely on synthetic data, and offer limited aperture control.
Method: Two-step process: DeblurNet recovers all-in-focus images from various inputs, then BokehNet creates controllable bokeh. Uses semi-supervised training combining synthetic paired data with unpaired real bokeh images, leveraging EXIF metadata to capture real optical characteristics.
Result: Achieves state-of-the-art performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Enables text-guided adjustments and custom aperture shapes.
Conclusion: Generative Refocusing provides an effective solution for single-image refocusing with realistic bokeh effects and enhanced control capabilities, overcoming limitations of previous methods.
Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
[181] The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen
Main category: cs.CV
TL;DR: WorldCanvas is a multimodal framework for generating controllable world events using text prompts, motion trajectories, and reference images, enabling rich user-directed simulations with temporal coherence and object consistency.
Details
Motivation: Existing approaches have limitations: text-only methods lack visual control, and trajectory-controlled image-to-video methods don't support semantic intent or object identity grounding. There's a need for a framework that enables rich, user-directed simulation of complex world events with both motion control and semantic understanding.
Method: A multimodal approach combining three components: 1) trajectories encoding motion, timing, and visibility; 2) natural language for semantic intent; and 3) reference images for visual grounding of object identity. This enables generation of coherent, controllable events including multi-agent interactions, object entry/exit, appearance guidance, and counterintuitive events.
Result: The framework generates videos with temporal coherence and emergent consistency, preserving object identity and scene despite temporary disappearance. It supports expressive world events generation and advances world models from passive predictors to interactive, user-shaped simulators.
Conclusion: WorldCanvas represents an advancement in world modeling, transforming them from passive prediction tools to interactive, user-shaped simulators capable of generating rich, controllable world events through multimodal prompting.
Abstract: We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories – encoding motion, timing, and visibility – with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
[182] Low-Resolution Action Recognition for Tiny Actions Challenge
Boyu Chen, Yu Qiao, Yali Wang
Main category: cs.CV
TL;DR: The paper presents a solution for Tiny Actions Challenge focusing on small-resolution, long-tailed activity recognition in surveillance videos, achieving top performance through data balance, dual-resolution distillation, and model ensemble.
Details
Motivation: Two main challenges in real-world surveillance activity recognition: 1) Activities recorded at distance appear in small resolution with limited discriminative clues, 2) Natural long-tailed distribution creates heavy category imbalance and data bias.
Method: Three-stage solution: 1) Train video backbones with data balance to alleviate overfitting, 2) Design dual-resolution distillation framework to guide low-resolution recognition using super-resolution knowledge, 3) Apply model ensemble with post-processing for long-tailed categories.
Result: The solution achieves Top-1 ranking on the leaderboard, demonstrating effectiveness in handling small-resolution, long-tailed activity recognition in surveillance scenarios.
Conclusion: The proposed comprehensive approach effectively addresses both resolution and distribution challenges in real-world surveillance activity recognition, providing a robust solution for the Tiny Actions Challenge.
Abstract: Tiny Actions Challenge focuses on understanding human activities in real-world surveillance. Basically, there are two main difficulties for activity recognition in this scenario. First, human activities are often recorded at a distance, and appear in a small resolution without much discriminative clue. Second, these activities are naturally distributed in a long-tailed way. It is hard to alleviate data bias for such heavy category imbalance. To tackle these problems, we propose a comprehensive recognition solution in this paper. First, we train video backbones with data balance, in order to alleviate overfitting in the challenge benchmark. Second, we design a dual-resolution distillation framework, which can effectively guide low-resolution action recognition by super-resolution knowledge. Finally, we apply model ensemble with post-processing, which can further boost performance on the long-tailed categories. Our solution ranks Top-1 on the leaderboard.
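Guiding a low-resolution branch with super-resolution knowledge is commonly done with a knowledge-distillation loss: cross-entropy on the low-resolution student plus a temperature-scaled KL term toward a teacher fed super-resolved clips. The sketch below shows that standard form; the temperature, mixing weight, and class count are assumptions rather than values from the paper.
```python
import torch
import torch.nn.functional as F

def dual_resolution_distillation_loss(student_logits, teacher_logits, labels,
                                      T=2.0, alpha=0.5):
    """Cross-entropy on low-res predictions plus KL to the super-resolution teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd

# Toy example: 8 clips, 26 activity classes.
student_logits = torch.randn(8, 26, requires_grad=True)   # low-resolution branch
teacher_logits = torch.randn(8, 26)                       # super-resolved branch (frozen)
labels = torch.randint(0, 26, (8,))
loss = dual_resolution_distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```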
[183] BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion
Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li
Main category: cs.CV
TL;DR: BoostDream is a plug-and-play 3D refining method that transforms coarse 3D assets into high-quality ones by combining feed-forward generation speed with SDS-based fidelity through three key processes: 3D model distillation, multi-view SDS loss, and prompt/normal map guidance.
Details
Motivation: Current text-to-3D generation has two main approaches: fast but coarse feed-forward methods, and slow but high-fidelity SDS-based methods. The paper aims to combine the strengths of both paradigms to achieve both speed and quality in 3D generation.
Method: BoostDream uses three processes: 1) 3D model distillation to fit differentiable representations from coarse 3D assets, 2) novel multi-view SDS loss using multi-view aware 2D diffusion models, and 3) prompt and multi-view consistent normal maps as guidance during refinement.
Result: The method generates high-quality 3D assets rapidly while overcoming the Janus problem (multiple faces) compared to conventional SDS-based methods. Extensive experiments across different 3D representations demonstrate its effectiveness.
Conclusion: BoostDream represents a substantial advancement in both efficiency and quality of 3D generation processes by synergistically integrating feed-forward generation speed with SDS-based refinement quality.
Abstract: Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement. Our extensive experiments are conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods. This breakthrough signifies a substantial advancement in both the efficiency and quality of 3D generation processes.
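For readers unfamiliar with SDS, the standard single-view surrogate loss looks like the sketch below; BoostDream's multi-view variant would evaluate something similar with a multi-view-aware diffusion prior across several rendered views. The dummy noise predictor, the linear alpha-bar schedule, and the weighting are placeholders for illustration only.
```python
import torch

def sds_loss(rendered_latents, t, alphas_cumprod, noise_predictor, cond):
    """Generic Score Distillation Sampling surrogate loss for one rendered view."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered_latents)
    noisy = a.sqrt() * rendered_latents + (1 - a).sqrt() * noise
    eps_pred = noise_predictor(noisy, t, cond)       # frozen 2D diffusion prior
    w = 1 - a                                        # a common weighting choice
    grad = w * (eps_pred - noise)
    # Surrogate whose gradient w.r.t. the latents equals `grad`.
    return (grad.detach() * rendered_latents).sum()

# Stand-ins: a dummy noise predictor and a linear alpha-bar schedule.
noise_predictor = lambda x, t, c: torch.randn_like(x)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
latents = torch.randn(1, 4, 64, 64, requires_grad=True)   # differentiable render output
t = torch.randint(0, 1000, (1,))
sds_loss(latents, t, alphas_cumprod, noise_predictor, cond=None).backward()
```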
[184] Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition
Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
Main category: cs.CV
TL;DR: PCA framework transfers multimodal knowledge from foundation models to boost open-world video recognition through Percept-Chat-Adapt pipeline.
Details
Motivation: Traditional networks struggle with generalization in complex open-world video environments, while foundation models possess rich knowledge but their application to video recognition remains underexplored.
Method: Three-stage PCA pipeline: 1) Percept - reduces video domain gap and extracts external visual knowledge; 2) Chat - generates rich linguistic semantics as textual knowledge; 3) Adapt - blends multimodal knowledge by inserting adaptation modules into networks.
Result: Achieves state-of-the-art performance on three challenging open-world video benchmarks: TinyVIRAT, ARID, and QV-Pipe.
Conclusion: Progressive knowledge transfer from foundation models effectively boosts open-world video recognition performance through multimodal knowledge integration.
Abstract: Open-world video recognition is challenging since traditional networks do not generalize well to complex environment variations. Alternatively, foundation models with rich knowledge have recently shown their generalization power. However, how to apply such knowledge has not been fully explored for open-world video recognition. To this end, we propose a generic knowledge transfer pipeline, which progressively exploits and integrates external multimodal knowledge from foundation models to boost open-world video recognition. We name it PCA, based on three stages of Percept, Chat, and Adapt. First, we perform Percept process to reduce the video domain gap and obtain external visual knowledge. Second, we generate rich linguistic semantics as external textual knowledge in Chat stage. Finally, we blend external multimodal knowledge in Adapt stage, by inserting multimodal knowledge adaptation modules into networks. We conduct extensive experiments on three challenging open-world video benchmarks, i.e., TinyVIRAT, ARID, and QV-Pipe. Our approach achieves state-of-the-art performance on all three datasets.
[185] Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
Ting Liu, Xuyang Liu, Liangtao Shi, Zunnan Xu, Yue Hu, Siteng Huang, Yi Xin, Bineng Zhong, Donglin Wang
Main category: cs.CV
TL;DR: Sparse-Tuning: A PEFT framework combining token sparsification with dense adapters to reduce computational/memory overhead while maintaining performance for ViT fine-tuning.
Details
Motivation: Current PEFT methods focus on parameter efficiency during fine-tuning but overlook computational and memory efficiency during inference, which is crucial for practical deployment.
Method: Two-component approach: 1) Token sparsification (TS) to reduce information redundancy in images/videos, 2) Dense Adapters (DA) to compensate for information loss from TS by integrating shallow layer token information into deeper retained tokens.
Result: Achieves 66% of original ViT-B GFLOPs while maintaining state-of-the-art performance on VTAB-1K, three image datasets, and two video datasets, outperforming full fine-tuning and other PEFT baselines.
Conclusion: Sparse-Tuning successfully addresses the inference efficiency gap in PEFT methods by combining token sparsification with information compensation, making ViT fine-tuning more practical for real-world applications.
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications by updating only a small subset of parameters. While current PEFT methods have achieved fine-tuning efficiency, they overlook the efficiency of computation and GPU memory during inference, falling short of practical requirements. To address this limitation, we propose Sparse-Tuning, an efficient and effective framework that leverages popular token sparsification (TS) techniques to reduce information redundancy in images and videos, thereby significantly improving computational and memory efficiency. However, TS often compromises performance due to inevitable information loss. To address this limitation, we further introduce Dense Adapters (DA) to compensate for the information losses incurred by token sparsification. DA integrates comprehensive token information from shallow layers into the retained tokens of deeper layers, ensuring minimal performance degradation. Through the integration of TS techniques and DA, Sparse-Tuning achieves a significant reduction in computation and memory overhead while maintaining performance. Empirical results on VTAB-1K, three image datasets, and two video datasets show that Sparse-Tuning reduces GFLOPs to 66% of the original ViT-B while achieving state-of-the-art performance compared to full fine-tuning and other PEFT baselines.
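A rough sketch of the two ingredients: drop redundant patch tokens (here by keeping the ones the [CLS] token attends to most) and let a small bottleneck adapter inject a summary of the shallow-layer tokens back into the retained deep tokens. The keep ratio, the CLS-attention criterion, and the pooling inside the adapter are assumptions for illustration, not Sparse-Tuning's exact design.
```python
import torch
import torch.nn as nn

def sparsify_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the CLS token plus the patch tokens it attends to most."""
    B, N, D = tokens.shape                       # token 0 is [CLS]
    k = int((N - 1) * keep_ratio)
    idx = cls_attn[:, 1:].topk(k, dim=1).indices + 1
    idx = torch.cat([torch.zeros(B, 1, dtype=torch.long), idx], dim=1)
    return tokens.gather(1, idx[..., None].expand(-1, -1, D))

class DenseAdapter(nn.Module):
    """Bottleneck adapter that injects pooled shallow-layer context into kept tokens."""
    def __init__(self, dim=192, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, kept_tokens, shallow_tokens):
        context = shallow_tokens.mean(dim=1, keepdim=True)    # summary of dropped detail
        return kept_tokens + self.up(torch.relu(self.down(kept_tokens + context)))

tokens = torch.randn(2, 197, 192)            # ViT-Tiny-like sequence: CLS + 196 patches
cls_attn = torch.rand(2, 197)                # CLS attention over all tokens
kept = sparsify_tokens(tokens, cls_attn)     # (2, 99, 192): half the patches retained
fused = DenseAdapter()(kept, shallow_tokens=torch.randn(2, 197, 192))
```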
[186] MMRel: Benchmarking Relation Understanding in Multi-Modal Large Language Models
Jiahao Nie, Gongjie Zhang, Wenbin An, Yun Xing, Yap-Peng Tan, Alex C. Kot, Shijian Lu
Main category: cs.CV
TL;DR: MMRel is a large-scale, high-quality benchmark for evaluating and improving MLLMs’ understanding of inter-object relations across diverse domains.
Details
Motivation: Current MLLMs struggle with understanding diverse and complex inter-object relations due to lack of large-scale, high-quality relation data, hindering progress in vision-language perception tasks.
Method: Created MMRel benchmark with 22,500 QA pairs spanning 3 domains and ~400 relations, featuring manually verified labels and adversarial cases with unusual relations for hallucination evaluation.
Result: Extensive experiments on 28 MLLMs demonstrate MMRel’s effectiveness in both evaluating and enhancing MLLMs’ relation understanding capabilities.
Conclusion: MMRel provides an ideal benchmark for evaluating and fine-tuning MLLMs on relation understanding, with insights for future research and public availability.
Abstract: Though Multi-modal Large Language Models (MLLMs) have recently achieved significant progress, they often struggle to understand diverse and complicated inter-object relations. Specifically, the lack of large-scale and high-quality relation data has greatly hindered the progress of MLLMs in various vision-language perception tasks. We attempt to address this challenge by contributing the Multi-Modal Relation Understanding benchmark (MMRel), which features large-scale, high-quality, and diverse data on inter-object relations. MMRel has three distinctive attributes: (i) it contains 22,500 question-answer pairs spanning three distinct domains and around 400 relations, ensuring both scale and diversity; (ii) it provides manually verified, high-quality labels to ensure exceptional annotation accuracy; and (iii) it includes adversarial cases with highly unusual relations, offering a challenging setting for evaluating relation hallucination. These features make MMRel ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability. Extensive experiments on 28 MLLMs demonstrate the effectiveness of MMRel in both evaluating and enhancing MLLMs’ relation understanding, and the accompanying analyses provide insights for future research. The benchmark has been made publicly available at: https://niejiahao1998.github.io/MMRel
[187] Vulnerabilities in AI-generated Image Detection: The Challenge of Adversarial Attacks
Yunfeng Diao, Naixin Zhai, Changtao Miao, Zitong Yu, Xingxing Wei, Xun Yang, Meng Wang
Main category: cs.CV
TL;DR: FPBA is a novel adversarial attack method that targets AI-generated image detectors by combining frequency-domain perturbations with Bayesian modeling to achieve effective black-box attacks across diverse detectors.
Details
Motivation: As AI-generated images become more realistic and widespread, there are growing concerns about disinformation. While AIGI detectors have been developed, their adversarial robustness remains poorly understood and rarely investigated, creating a critical security gap.
Method: FPBA combines two key strategies: 1) Frequency-domain perturbations to push images away from their original frequency distribution, exploiting differences between real and fake images in frequency space; 2) A post-train Bayesian strategy that transforms a single surrogate model into a Bayesian model to simulate diverse victim models without retraining, enabling effective transfer attacks across different architectures (CNNs and ViTs).
Result: FPBA successfully demonstrates that adversarial attacks pose a real threat to AIGI detectors. The method achieves successful black-box attacks across various detectors, generators, defense methods, and even evades cross-generator detection and compressed image detection, which are crucial real-world scenarios.
Conclusion: The vulnerability of state-of-the-art AIGI detectors to adversarial attacks is significant and previously underestimated. FPBA reveals critical security weaknesses that need to be addressed to ensure reliable detection of AI-generated content in real-world applications.
Abstract: Recent advancements in image synthesis, particularly with the advent of GAN and Diffusion models, have amplified public concerns regarding the dissemination of disinformation. To address such concerns, numerous AI-generated Image (AIGI) Detectors have been proposed and achieved promising performance in identifying fake images. However, there still lacks a systematic understanding of the adversarial robustness of AIGI detectors. In this paper, we examine the vulnerability of state-of-the-art AIGI detectors against adversarial attack under white-box and black-box settings, which has been rarely investigated so far. To this end, we propose a new method to attack AIGI detectors. First, inspired by the obvious difference between real images and fake images in the frequency domain, we add perturbations under the frequency domain to push the image away from its original frequency distribution. Second, we explore the full posterior distribution of the surrogate model to further narrow this gap between heterogeneous AIGI detectors, e.g., transferring adversarial examples across CNNs and ViTs. This is achieved by introducing a novel post-train Bayesian strategy that turns a single surrogate into a Bayesian one, capable of simulating diverse victim models using one pre-trained surrogate, without the need for re-training. We name our method as Frequency-based Post-train Bayesian Attack, or FPBA. Through FPBA, we demonstrate that adversarial attacks pose a real threat to AIGI detectors. FPBA can deliver successful black-box attacks across various detectors, generators, defense methods, and even evade cross-generator and compressed image detection, which are crucial real-world detection scenarios. Our code is available at https://github.com/onotoa/fpba.
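To make the frequency-domain step concrete, the sketch below perturbs an image's 2D spectrum and maps it back to pixel space. It is a minimal illustration, not the paper's FPBA attack: the perturbation here is random rather than gradient-guided, the post-train Bayesian surrogate is omitted, and the function name and epsilon value are assumptions.

```python
# Illustrative only: frequency-domain image perturbation (simplified stand-in
# for the spectral component of FPBA; no surrogate model or attack loop shown).
import numpy as np

def frequency_perturb(image, epsilon=0.05, seed=0):
    """image: float array in [0, 1], shape (H, W) or (H, W, C)."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fft2(image, axes=(0, 1))            # per-channel 2D FFT
    noise = rng.standard_normal(spectrum.shape) + 1j * rng.standard_normal(spectrum.shape)
    perturbed = spectrum + epsilon * np.abs(spectrum) * noise   # push away from original spectrum
    adv = np.real(np.fft.ifft2(perturbed, axes=(0, 1)))   # back to pixel space
    return np.clip(adv, 0.0, 1.0)

x = np.random.rand(224, 224, 3)
x_adv = frequency_perturb(x)
print("max pixel change:", float(np.abs(x_adv - x).max()))
```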
[188] WildFit: Autonomous In-situ Model Adaptation for Resource-Constrained IoT Systems
Mohammad Mehdi Rastikerdar, Jin Huang, Hui Guan, Deepak Ganesan
Main category: cs.CV
TL;DR: WildFit enables resource-constrained IoT devices to autonomously adapt deep learning models to changing environmental conditions using background-aware synthesis and drift-aware fine-tuning, achieving high accuracy with minimal energy consumption.
Details
Motivation: IoT devices running deep learning models suffer accuracy drops due to domain shifts (lighting, weather, seasonal changes), but cloud retraining is impractical due to limited connectivity and energy constraints, especially in wildlife monitoring applications.Method: WildFit combines: 1) Background-aware synthesis that generates training samples on-device by leveraging that backgrounds change more frequently than species characteristics, and 2) Drift-aware fine-tuning that triggers model updates only when necessary to conserve resources.
Result: Background-aware synthesis outperforms efficient baselines by 7.3% and diffusion models by 3.0% while being orders of magnitude faster. Drift-aware fine-tuning achieves Pareto optimality with 50% fewer updates and 1.5% higher accuracy. End-to-end system outperforms domain adaptation approaches by 20-35% while consuming only 11.2 Wh over 37 days.
Conclusion: WildFit enables battery-powered IoT deployment by providing autonomous in-situ adaptation that maintains accuracy across changing environmental conditions with minimal resource consumption, making it practical for real-world wildlife monitoring and similar applications.
Abstract: Resource-constrained IoT devices increasingly rely on deep learning models; however, these models experience significant accuracy drops due to domain shifts when encountering variations in lighting, weather, and seasonal conditions. While cloud-based retraining can address this issue, many IoT deployments operate with limited connectivity and energy constraints, making traditional fine-tuning approaches impractical. We explore this challenge through the lens of wildlife ecology, where camera traps must maintain accurate species classification across changing seasons, weather, and habitats without reliable connectivity. We introduce WildFit, an autonomous in-situ adaptation framework that leverages the key insight that background scenes change more frequently than the visual characteristics of monitored species. WildFit combines background-aware synthesis to generate training samples on-device with drift-aware fine-tuning that triggers model updates only when necessary to conserve resources. Our background-aware synthesis surpasses efficient baselines by 7.3% and diffusion models by 3.0% while being orders of magnitude faster, our drift-aware fine-tuning achieves Pareto optimality with 50% fewer updates and 1.5% higher accuracy, and the end-to-end system outperforms domain adaptation approaches by 20-35% while consuming only 11.2 Wh over 37 days, enabling battery-powered deployment.
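The drift-aware idea of fine-tuning only when accuracy-relevant statistics shift can be pictured with a simple confidence-based gate. This is a minimal sketch under assumed statistics and thresholds, not the WildFit trigger itself.

```python
# Sketch of a "fine-tune only when drift is detected" gate; window size,
# threshold, and the confidence statistic are illustrative assumptions.
from collections import deque
import numpy as np

class DriftGate:
    def __init__(self, reference_scores, window=200, threshold=0.15):
        self.reference_mean = float(np.mean(reference_scores))  # e.g. validation confidence
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, confidence):
        """Record one prediction confidence; return True if an update should run."""
        self.window.append(confidence)
        if len(self.window) < self.window.maxlen:
            return False
        drift = self.reference_mean - float(np.mean(self.window))
        return drift > self.threshold          # confidence dropped enough -> fine-tune

gate = DriftGate(reference_scores=[0.92, 0.90, 0.94], window=3, threshold=0.1)
for conf in [0.91, 0.72, 0.70, 0.69]:
    if gate.observe(conf):
        print("drift detected: trigger on-device fine-tuning")
```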
[189] Matérn Kernels for Tunable Implicit Surface Reconstruction
Maximilian Weiherer, Bernhard Egger
Main category: cs.CV
TL;DR: Matérn kernels outperform arc-cosine kernels for implicit surface reconstruction, offering better performance, easier implementation, faster computation, and scalability while maintaining competitive accuracy.
Details
Motivation: To improve upon existing kernel methods for 3D surface reconstruction by leveraging Matérn kernels' appealing theoretical and practical properties that address limitations of current approaches like arc-cosine kernels.Method: Proposes using Matérn kernel family for implicit surface reconstruction, analyzing their theoretical properties, demonstrating tunable reconstruction similar to Fourier feature mappings, and presenting data-dependent Matérn kernels based on Neural Kernel Fields framework.
Result: Matérn kernels outperform state-of-the-art arc-cosine methods, with Laplace kernel (part of Matérn family) performing almost on par with state-of-the-art in noise-free cases while having >5x shorter training time.
Conclusion: Matérn kernels are particularly well-suited for surface reconstruction, offering superior performance, efficiency, and scalability compared to existing kernel methods, with the Laplace kernel being especially competitive.
Abstract: We propose to use the family of Matérn kernels for implicit surface reconstruction, building upon the recent success of kernel methods for 3D reconstruction of oriented point clouds. As we show from a theoretical and practical perspective, Matérn kernels have some appealing properties which make them particularly well suited for surface reconstruction – outperforming state-of-the-art methods based on the arc-cosine kernel while being significantly easier to implement, faster to compute, and scalable. Being stationary, we demonstrate that Matérn kernels allow for tunable surface reconstruction in the same way as Fourier feature mappings help coordinate-based MLPs overcome spectral bias. Moreover, we theoretically analyze Matérn kernels' connection to SIREN networks as well as their relation to previously employed arc-cosine kernels. Finally, based on recently introduced Neural Kernel Fields, we present data-dependent Matérn kernels and conclude that especially the Laplace kernel (being part of the Matérn family) is extremely competitive, performing almost on par with state-of-the-art methods in the noise-free case while having a more than five times shorter training time.
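For reference, the Matérn kernels discussed above have simple closed forms for the common smoothness values; the sketch below evaluates them, with nu = 0.5 giving the Laplace (exponential) kernel highlighted in the abstract. The reconstruction pipeline itself (orientation handling, the linear solve, Neural Kernel Fields) is not shown, and the length scale is an assumed parameter.

```python
# Matérn kernel evaluation for nu in {0.5, 1.5, 2.5}; nu = 0.5 is the Laplace kernel.
import numpy as np

def matern_kernel(x, y, length_scale=1.0, nu=0.5):
    r = np.linalg.norm(np.asarray(x) - np.asarray(y)) / length_scale
    if nu == 0.5:
        return np.exp(-r)                                              # Laplace / exponential
    if nu == 1.5:
        return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
    if nu == 2.5:
        return (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)
    raise ValueError("sketch only supports nu in {0.5, 1.5, 2.5}")

print(matern_kernel([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], length_scale=0.5, nu=0.5))
```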
[190] From Engineering Diagrams to Graphs: Digitizing P&IDs with Transformers
Jan Marius StĂźrmer, Marius Graumann, Tobias Koch
Main category: cs.CV
TL;DR: Transformer-based Relationformer jointly extracts symbols and connections from P&IDs, outperforming modular approaches by 25%+ in edge detection accuracy, with a new public benchmark dataset.
Details
Motivation: Previous P&ID digitization methods use separate steps for symbol and line detection, which limits their ability to capture diagram structure. There's a need for joint extraction and a public benchmark for evaluation.Method: Transformer-based Relationformer approach that jointly extracts symbols and their interconnections from P&IDs in a unified framework rather than separate modular steps.
Result: Method significantly outperforms modular baseline with over 25% improvement in edge detection accuracy on real-world diagrams. First public benchmark dataset for P&ID digitization created.
Conclusion: Transformer models are effective for structural understanding of complex engineering diagrams. Research provides reproducible evaluation framework and demonstrates superiority of joint extraction over modular approaches.
Abstract: Digitizing engineering diagrams like Piping and Instrumentation Diagrams (P&IDs) plays a vital role in maintainability and operational efficiency of process and hydraulic systems. Previous methods typically decompose the task into separate steps such as symbol detection and line detection, which can limit their ability to capture the structure in these diagrams. In this work, we introduce a transformer-based approach leveraging the Relationformer that addresses this limitation by jointly extracting symbols and their interconnections from P&IDs. To evaluate our approach and compare it to a modular digitization approach, we present the first publicly accessible benchmark dataset for P&ID digitization, annotated with graph-level ground truth. Experimental results on real-world diagrams show that our method significantly outperforms the modular baseline, achieving over 25% improvement in edge detection accuracy. This research contributes a reproducible evaluation framework and demonstrates the effectiveness of transformer models for structural understanding of complex engineering diagrams. The dataset is available under https://zenodo.org/records/14803338.
[191] UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, Luc Van Gool
Main category: cs.CV
TL;DR: UniDepthV2 is a universal monocular metric depth estimation model that predicts metric 3D points from single images across domains without additional information, improving over its predecessor with better edge localization, simplified architecture, and uncertainty estimation.
Details
Motivation: Current monocular metric depth estimation methods fail to generalize to unseen domains, limiting their practical applicability. There's a need for a universal solution that works across different domains without requiring additional information at inference time.Method: UniDepthV2 implements a self-promptable camera module predicting dense camera representation to condition depth features, uses pseudo-spherical output representation to disentangle camera and depth, employs geometric invariance loss, edge-guided loss for better edge localization, simplified architecture, and adds uncertainty-level output.
Result: Thorough evaluations on ten depth datasets in zero-shot regime demonstrate superior performance and generalization compared to existing methods, showing consistent improvement across different domains.
Conclusion: UniDepthV2 provides a universal and flexible solution for monocular metric depth estimation that generalizes well across domains, addressing the domain gap limitation of previous methods and enabling practical applications in 3D perception and modeling.
Abstract: Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
[192] Radar-Guided Polynomial Fitting for Metric Depth Estimation
Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong
Main category: cs.CV
TL;DR: POLAR uses radar-guided polynomial fitting to convert scaleless depth predictions from pretrained monocular depth estimation models into accurate metric depth maps, outperforming existing methods by significant margins.
Details
Motivation: Existing monocular depth estimation (MDE) models produce reasonable local depth structure but often misalign different regions relative to each other, making simple linear scale and shift transformations insufficient for accurate metric depth estimation.Method: POLAR introduces polynomial fitting using coefficients predicted from radar data to non-uniformly adjust depth predictions across ranges, with a novel training objective that enforces local monotonicity through first-derivative regularization to preserve structural consistency.
Result: Achieves state-of-the-art performance across three datasets, outperforming existing methods by average 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in latency and computational cost.
Conclusion: POLAR demonstrates that polynomial transformations guided by radar data can effectively correct region misalignments in monocular depth estimation, providing a more accurate and efficient solution than affine transformations or complex architectures.
Abstract: We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
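A toy version of the polynomial mapping and the first-derivative monotonicity penalty reads as follows. The coefficients are hard-coded here, whereas POLAR predicts them from radar, and the polynomial degree, sampling grid, and loss weight are illustrative assumptions rather than the paper's settings.

```python
# Sketch: map scaleless depth to metric depth with a polynomial, plus a
# penalty that discourages a negative first derivative (local monotonicity).
import torch

def apply_polynomial(rel_depth, coeffs):
    """Evaluate sum_k coeffs[k] * rel_depth**k elementwise."""
    powers = torch.stack([rel_depth**k for k in range(coeffs.numel())], dim=0)
    return torch.einsum("k,k...->...", coeffs, powers)

def monotonicity_penalty(rel_depth, coeffs, samples=64):
    """Penalize negative derivative of the polynomial over the observed depth range."""
    d = torch.linspace(float(rel_depth.min()), float(rel_depth.max()), samples)
    k = torch.arange(coeffs.numel(), dtype=coeffs.dtype)
    deriv = (coeffs[1:] * k[1:] * d.unsqueeze(-1) ** (k[1:] - 1)).sum(dim=-1)
    return torch.relu(-deriv).mean()

rel = torch.rand(2, 4, 4)                  # scaleless MDE output (toy)
coeffs = torch.tensor([0.3, 8.0, 1.5])     # in POLAR these come from the radar branch
metric = apply_polynomial(rel, coeffs)     # metric depth map
gt = torch.rand(2, 4, 4) * 10              # toy metric ground truth
loss = torch.nn.functional.l1_loss(metric, gt) + 0.1 * monotonicity_penalty(rel, coeffs)
print(metric.shape, float(loss))
```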
[193] Core-Set Selection for Data-efficient Land Cover Segmentation
Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hansch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakus, Paul L. Rosin
Main category: cs.CV
TL;DR: Core-set selection methods can identify high-quality subsets for remote sensing semantic segmentation that maintain or even surpass performance of full datasets, with some subsets using only 25% of data outperforming training on all data.
Details
Motivation: Traditional deep learning for Earth Observation relies on large datasets, but this overlooks issues of data redundancy, noise, and computational costs. The paper argues for focusing on data quality over quantity through data-centric approaches.Method: Introduces six core-set selection approaches using imagery only, labels only, or both. Benchmarks against two traditional baselines on three land-cover classification datasets (DFC2022, Vaihingen, Potsdam) using SegFormer and U-Net architectures.
Result: All proposed methods consistently outperform baselines across multiple subset sizes. Some approaches select core sets that surpass training on all available data. On DFC2022, a 25% subset yields slightly higher SegFormer performance than full dataset training.
Conclusion: Demonstrates the importance and potential of data-centric learning for remote sensing, showing that carefully selected subsets can maintain or improve performance while reducing computational costs and addressing data quality issues.
Abstract: The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook issues related to data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. Towards this, in this paper, we introduce six basic core-set selection approaches – that rely on imagery only, labels only, or a combination of both – and investigate whether they can identify high-quality subsets of data capable of maintaining – or even surpassing – the performance achieved when using full datasets for remote sensing semantic segmentation. We benchmark such approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future works. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
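As a generic illustration of core-set selection over image features, the sketch below runs a farthest-point (k-center greedy) selector. It is not one of the six strategies proposed in the paper; the feature source and 25% budget are assumptions chosen to mirror the reported setting.

```python
# Generic k-center greedy core-set selection over per-image feature vectors.
import numpy as np

def kcenter_greedy(features, budget, seed=0):
    """Pick `budget` indices so the selected points cover the feature space."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(np.argmax(dist))                     # farthest from the current set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(features - features[idx], axis=1))
    return selected

feats = np.random.rand(1000, 128)          # stand-in for pooled encoder features per tile
subset = kcenter_greedy(feats, budget=250) # keep 25% of the data
print(len(subset), subset[:5])
```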
[194] MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models
Shiji Zhao, Qihui Zhu, Shukun Xiong, Shouwei Ruan, Maoxun Yuan, Jialing Tao, Jiexi Liu, Ranjie Duan, Jie Zhang, Jie Zhang, Xingxing Wei
Main category: cs.CV
TL;DR: MoAPT improves VLM robustness against adversarial attacks by learning mixture text prompts with conditional weight routing instead of single prompts.
Details
Motivation: VLMs are vulnerable to adversarial examples despite good generalization. Existing adversarial prompt tuning methods use single prompts that overfit and lack generalization across different attacks.Method: Mixture of Adversarial Prompt Tuning (MoAPT) learns multiple text prompts and uses a conditional weight router based on adversarial images to predict mixture weights, creating sample-specific text features.
Result: Extensive experiments on 11 datasets show MoAPT achieves better adversarial robustness than state-of-the-art approaches across different settings.
Conclusion: Learning mixture prompts with conditional routing effectively enhances VLM robustness against various adversarial attacks by preventing overfitting and improving generalization.
Abstract: Large pre-trained Vision Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which ultimately results in overfitting. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts yields greater robustness improvements than simply extending the length of a single prompt. Building on this observation, we propose an adversarial tuning method named Mixture of Adversarial Prompt Tuning (MoAPT) to enhance the generalization against various adversarial attacks for VLMs. MoAPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the adversarial images to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific mixture text features aligning with different adversarial image features. Extensive experiments across 11 datasets under different settings show that our method can achieve better adversarial robustness than state-of-the-art approaches.
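The mixture-of-prompts idea can be sketched as a softmax-weighted combination of several learnable prompt tensors, with weights produced by a router conditioned on the image feature. The dimensions and the single linear router below are illustrative assumptions, not the MoAPT architecture.

```python
# Sketch: mix several learned prompt tensors with image-conditioned weights.
import torch
import torch.nn as nn

class PromptMixture(nn.Module):
    def __init__(self, num_prompts=4, prompt_len=16, dim=512, img_dim=512):
        super().__init__()
        # num_prompts learnable context prompts, as in prompt tuning
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(img_dim, num_prompts)   # conditional weight router

    def forward(self, image_feat):
        """image_feat: (B, img_dim) -> sample-specific mixed prompt (B, prompt_len, dim)."""
        weights = torch.softmax(self.router(image_feat), dim=-1)   # (B, num_prompts)
        return torch.einsum("bp,pld->bld", weights, self.prompts)

mixer = PromptMixture()
img_feat = torch.randn(8, 512)        # e.g. embedding of a (possibly adversarial) image
mixed_prompt = mixer(img_feat)
print(mixed_prompt.shape)             # torch.Size([8, 16, 512])
```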
[195] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Main category: cs.CV
TL;DR: ViStoryBench is a comprehensive benchmark for evaluating story visualization models across diverse narratives, styles, and character settings with automated metrics validated by human studies.
Details
Motivation: Existing story visualization benchmarks are too narrow - limited to short prompts, lacking character references, or single-image cases - failing to capture real-world storytelling complexity and hindering nuanced understanding of model capabilities.Method: Created richly annotated multi-shot scripts from curated stories (literature, film, folklore) using LLM-assisted story summarization and script generation with human verification. Curated character references for intra-story consistency across artistic styles. Developed automated metrics for character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts.
Result: ViStoryBench provides a comprehensive evaluation suite with validated metrics that benchmark a broad range of open-source and commercial models, enabling systematic analysis of story visualization capabilities.
Conclusion: ViStoryBench offers a multi-dimensional evaluation framework that addresses limitations of existing benchmarks and facilitates future progress in visual storytelling through systematic model assessment.
Abstract: Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, lacking character references, or single-image cases, and fail to capture real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs human-verified to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
[196] Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model
Changwei Wu, Yifei Chen, Yuxin Du, Jinying Zong, Jie Dong, Mingxuan Liu, Feiwei Qin, Yong Peng, Jin Fan, Changmiao Wang
Main category: cs.CV
TL;DR: FasterSNN is a hybrid spiking neural network that combines LIF neurons with region-adaptive convolution and multi-scale spiking attention for efficient, accurate Alzheimer's Disease diagnosis from 3D MRI data.
Details
Motivation: Early Alzheimer's diagnosis faces barriers including subjective assessments and expensive imaging. Deep learning solutions have high energy/computational demands, especially in resource-limited settings. SNNs offer energy-efficient alternatives but suffer from limited expressiveness and training instability for complex medical tasks.Method: FasterSNN combines biologically inspired Leaky Integrate-and-Fire (LIF) neurons with region-adaptive convolution and multi-scale spiking attention mechanisms to enable efficient, sparse processing of 3D MRI data while maintaining diagnostic accuracy.
Result: Experimental results on benchmark datasets show FasterSNN delivers competitive performance with significantly enhanced efficiency and training stability compared to existing approaches.
Conclusion: FasterSNN demonstrates potential for practical application in AD screening by addressing SNN limitations while maintaining energy efficiency and diagnostic accuracy.
Abstract: Early diagnosis of Alzheimer's Disease (AD), particularly at the mild cognitive impairment stage, is essential for timely intervention. However, this process faces significant barriers, including reliance on subjective assessments and the high cost of advanced imaging techniques. While deep learning offers automated solutions to improve diagnostic accuracy, its widespread adoption remains constrained due to high energy requirements and computational demands, particularly in resource-limited settings. Spiking neural networks (SNNs) provide a promising alternative, as their brain-inspired design is well-suited to model the sparse and event-driven patterns characteristic of neural degeneration in AD. These networks offer the potential for developing interpretable, energy-efficient diagnostic tools. Despite their advantages, existing SNNs often suffer from limited expressiveness and challenges in stable training, which reduce their effectiveness in handling complex medical tasks. To address these shortcomings, we introduce FasterSNN, a hybrid neural architecture that combines biologically inspired Leaky Integrate-and-Fire (LIF) neurons with region-adaptive convolution and multi-scale spiking attention mechanisms. This approach facilitates efficient, sparse processing of 3D MRI data while maintaining high diagnostic accuracy. Experimental results on benchmark datasets reveal that FasterSNN delivers competitive performance with significantly enhanced efficiency and training stability, highlighting its potential for practical application in AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.
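For readers unfamiliar with spiking neurons, the sketch below shows the discrete-time Leaky Integrate-and-Fire update that such architectures build on. FasterSNN's region-adaptive convolution and multi-scale spiking attention are not represented, and the time constant and threshold are assumed values.

```python
# Basic discrete-time LIF layer: leaky integration, threshold firing, hard reset.
import torch

def lif_forward(inputs, tau=2.0, v_th=1.0):
    """inputs: (T, B, N) input currents over T timesteps -> binary spikes (T, B, N)."""
    v = torch.zeros_like(inputs[0])
    spikes = []
    for t in range(inputs.shape[0]):
        v = v + (inputs[t] - v) / tau        # leaky integration of the membrane potential
        spike = (v >= v_th).float()          # fire when the threshold is crossed
        v = v * (1.0 - spike)                # hard reset after a spike
        spikes.append(spike)
    return torch.stack(spikes)

x = torch.rand(8, 4, 32)                     # 8 timesteps, batch 4, 32 neurons
print(lif_forward(x).mean())                 # average firing rate
```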
[197] Scene-aware SAR ship detection guided by unsupervised sea-land segmentation
Han Ke, Xiao Ke, Ye Yan, Rui Liu, Jinpeng Yang, Tianwen Zhang, Xu Zhan, Xiaowo Xu
Main category: cs.CV
TL;DR: A scene-aware SAR ship detection method using unsupervised sea-land segmentation to improve detection accuracy by reducing attention on land areas.
Details
Motivation: DL-based SAR ship detection lacks prior knowledge (sea-land segmentation information), which seriously affects detection accuracy, especially in distinguishing ships from land clutter in inshore scenes.Method: Two-stage framework with two enhancement models: 1) Unsupervised Land and Sea Segmentation Module (ULSM) that classifies scenes as inshore/offshore and performs sea-land segmentation for inshore scenes without labeled data, and 2) Land Attention Suppression Module (LASM) that uses segmentation information as prior knowledge to reduce network attention on land areas.
Result: Experiments on the SSDD dataset demonstrated the effectiveness of the proposed network in improving ship detection accuracy and enhancing model interpretability.
Conclusion: The scene-aware SAR ship detection method with unsupervised sea-land segmentation successfully addresses the prior knowledge limitation, reduces false alarms from land clutter, and improves offshore detection performance while increasing model interpretability.
Abstract: DL based Synthetic Aperture Radar (SAR) ship detection has tremendous advantages in numerous areas. However, it still faces some problems, such as the lack of prior knowledge, which seriously affects detection accuracy. In order to solve this problem, we propose a scene-aware SAR ship detection method based on unsupervised sea-land segmentation. This method follows a classical two-stage framework and is enhanced by two models: the unsupervised land and sea segmentation module (ULSM) and the land attention suppression module (LASM). ULSM and LASM can adaptively guide the network to reduce attention on land according to the type of scenes (inshore scene and offshore scene) and add prior knowledge (sea-land segmentation information) to the network, thereby reducing the network's attention to land directly and enhancing offshore detection performance relatively. This increases the accuracy of ship detection and enhances the interpretability of the model. Specifically, in consideration of the lack of sea-land segmentation labels in existing deep learning-based SAR ship detection datasets, ULSM uses an unsupervised approach to classify the input data scene into inshore and offshore types and performs sea-land segmentation for inshore scenes. LASM uses the sea-land segmentation information as prior knowledge to reduce the network's attention to land. We conducted our experiments using the publicly available SSDD dataset, which demonstrated the effectiveness of our network.
[198] An Efficient Deep Learning Framework for Brain Stroke Diagnosis Using Computed Tomography Images
Md. Sabbir Hossen, Eshat Ahmed Shuvo, Shibbir Ahmed Arif, Pabon Shaha, Anichur Rahman, Md. Saiduzzaman, Fahmid Al Farid, Hezerul Abdul Karim, Abu Saleh Musa Miah
Main category: cs.CV
TL;DR: This paper proposes a novel ML approach for brain stroke detection using CT scans, combining pre-trained deep learning models with feature engineering and classification algorithms, achieving 97.93% accuracy with MobileNetV2+LDA+SVC.
Details
Motivation: Brain stroke is a leading cause of mortality and disability worldwide, requiring precise and rapid prediction. Traditional CT scan diagnosis relies on manual slice selection by radiologists, creating a need for automated ML approaches to supplement traditional diagnostic techniques.Method: The study uses pre-trained deep learning models (DenseNet201, InceptionV3, MobileNetV2, ResNet50, Xception) for feature extraction from CT scans. Feature engineering techniques (BFO, PCA, LDA) optimize features, which are then classified using ML algorithms (SVC, RF, XGB, DT, LR, KNN, GNB).
Result: The combination of MobileNetV2 for feature extraction, LDA for feature engineering, and SVC for classification achieved the highest accuracy of 97.93%, significantly outperforming other model-optimizer-classifier combinations.
Conclusion: The research demonstrates that integrating lightweight pre-trained models with robust optimization and classification techniques is highly effective for brain stroke diagnosis, offering a promising automated approach to supplement traditional diagnostic methods.
Abstract: Brain stroke is a leading cause of mortality and long-term disability worldwide, underscoring the need for precise and rapid prediction techniques. Computed Tomography (CT) scan is considered one of the most effective methods for diagnosing brain strokes. Most stroke classification techniques use a single slice-level prediction mechanism, requiring radiologists to manually select the most critical CT slice from the original CT volume. Although clinical evaluations are often used in traditional diagnostic procedures, machine learning (ML) has opened up new avenues for improving stroke diagnosis. To supplement traditional diagnostic techniques, this study investigates machine learning models for early brain stroke prediction using CT scan images. This research proposes a novel machine learning approach to brain stroke detection, focusing on optimizing classification performance with pre-trained deep learning models and advanced optimization strategies. Pre-trained models, including DenseNet201, InceptionV3, MobileNetV2, ResNet50, and Xception, are used for feature extraction. Feature engineering techniques, including BFO, PCA, and LDA, further enhance model performance. These features are then classified using machine learning algorithms, including SVC, RF, XGB, DT, LR, KNN, and GNB. Our experiments demonstrate that the combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy of 97.93%, significantly outperforming other model-optimizer-classifier combinations. The results underline the effectiveness of integrating lightweight pre-trained models with robust optimization and classification techniques for brain stroke diagnosis.
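The winning combination reported above (CNN features, LDA, SVC) follows a standard scikit-learn pattern. The sketch below reproduces that pattern on synthetic features standing in for MobileNetV2 embeddings; shapes and hyperparameters are placeholders rather than the paper's settings.

```python
# Sketch of a "pretrained-CNN features -> LDA -> SVC" classification pipeline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(400, 1280)           # stand-in for MobileNetV2 pooled features per CT slice
y = np.random.randint(0, 2, size=400)   # toy stroke vs. no-stroke labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),  # feature-engineering step (LDA)
    SVC(kernel="rbf", C=1.0),                    # final classifier (SVC)
)
clf.fit(X_tr, y_tr)
print("toy accuracy:", clf.score(X_te, y_te))
```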
[199] SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis
Youngmin Kim, Giyeong Oh, Kwangsoo Youm, Youngjae Yu
Main category: cs.CV
TL;DR: SlumpGuard: AI vision system analyzes concrete discharge flow from mixer-truck chute using single camera for automated slump testing, eliminating manual intervention.
Details
Motivation: Traditional slump testing is manual, time-consuming, operator-dependent, and unsuitable for continuous/real-time monitoring during concrete placement, creating quality control limitations.Method: AI-powered vision system with single fixed camera analyzes natural discharge flow from mixer-truck chute. System performs automatic chute detection, pouring-event identification, and video-based slump classification without sensors or hardware installation.
Result: System evaluated on site-replicated dataset of 6,000+ video clips, demonstrating reliable chute localization, accurate pouring detection, and robust slump prediction under diverse field conditions. Expert study revealed significant disagreement in human visual estimates.
Conclusion: SlumpGuard provides automated, reliable concrete workability assessment, addressing limitations of traditional manual slump testing and highlighting need for objective, automated quality monitoring in construction.
Abstract: Concrete workability is essential for construction quality, with the slump test being the most widely used on-site method for its assessment. However, traditional slump testing is manual, time-consuming, and highly operator-dependent, making it unsuitable for continuous or real-time monitoring during placement. To address these limitations, we present SlumpGuard, an AI-powered vision system that analyzes the natural discharge flow from a mixer-truck chute using a single fixed camera. The system performs automatic chute detection, pouring-event identification, and video-based slump classification, enabling quality monitoring without sensors, hardware installation, or manual intervention. We introduce the system design, construct a site-replicated dataset of over 6,000 video clips, and report extensive evaluations demonstrating reliable chute localization, accurate pouring detection, and robust slump prediction under diverse field conditions. An expert study further reveals significant disagreement in human visual estimates, highlighting the need for automated assessment.
[200] Automated Building Heritage Assessment Using Street-Level Imagery
Kristina Dabrock, Tim Johansson, Anna Donarelli, Mikael Mangold, Noah Pflugradt, Jann Michael Weinand, Jochen Linßen
Main category: cs.CV
TL;DR: Using GPT to detect heritage values from facade images, combined with building register data, achieves 0.71 F1-score for classifying buildings in Stockholm, improving efficiency over traditional heritage inventories.
Details
Motivation: Traditional heritage value registration is cumbersome and time-consuming. AI tools can improve efficiency in identifying heritage values in buildings compared to costly traditional inventories.Method: Used OpenAI's GPT to detect cultural heritage values from facade images. Combined GPT-derived data with building register data to train machine learning models for classifying multi-family and non-residential buildings in Stockholm.
Result: Achieved macro F1-score of 0.71 using combination of register data and GPT features, and 0.60 using only GPT-derived data when validated against expert-created heritage inventory.
Conclusion: The methods can contribute to higher-quality heritage datasets and support decision making in heritage preservation, offering more efficient alternatives to traditional inventory processes.
Abstract: Registration of heritage values in buildings is important to safeguard heritage values that can be lost in renovation and energy efficiency projects. However, registering heritage values is a cumbersome process. Novel artificial intelligence tools may improve efficiency in identifying heritage values in buildings compared to costly and time-consuming traditional inventories. In this study, OpenAI's large language model GPT was used to detect various aspects of cultural heritage value in facade images. Using GPT derived data and building register data, machine learning models were trained to classify multi-family and non-residential buildings in Stockholm, Sweden. Validation against a heritage expert-created inventory shows a macro F1-score of 0.71 using a combination of register data and features retrieved from GPT, and a score of 0.60 using only GPT-derived data. The methods presented can contribute to higher-quality datasets and support decision making.
[201] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation
Vipooshan Vipulananthan, Kumudu Mohottala, Kavindu Chinthana, Nimsara Paramulla, Charith D Chitraranjan
Main category: cs.CV
TL;DR: STAGNet improves accident prediction from dash-cam videos using better spatio-temporal features aggregated through recurrent networks, outperforming previous graph neural network methods.
Details
Motivation: Accident prediction is crucial for road safety and ADAS systems. While many systems use expensive sensors like LiDAR and radar, dash-cam video offers a more cost-effective and easily deployable solution, though more challenging to work with.Method: Proposes STAGNet model that incorporates improved spatio-temporal features and aggregates them through a recurrent network to enhance accident prediction from dash-cam videos, building upon state-of-the-art graph neural networks.
Result: Experiments on three public datasets show STAGNet achieves higher average precision and mean time-to-collision values than previous methods, both in cross-validation and cross-dataset evaluation.
Conclusion: The proposed STAGNet model effectively improves accident prediction performance from dash-cam videos, offering a practical and cost-effective solution for road safety applications.
Abstract: Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input presents a more challenging but a more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.
[202] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiao-Xiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Main category: cs.CV
TL;DR: SpatialVID is a large-scale dataset of 7,089 hours of in-the-wild videos with dense 3D annotations including camera poses, depth maps, and motion instructions to address data scarcity in spatial intelligence research.
Details
Motivation: Current spatial intelligence models are limited by scarce, small-scale training data lacking diversity and rich annotations, especially for real-world dynamic scenes with ground-truth camera motion.Method: Collected 21,000+ hours of raw videos, filtered to 2.7 million clips (7,089 hours) using hierarchical filtering, then annotated with camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.
Result: Created SpatialVID - a diverse, large-scale dataset with rich spatial and semantic annotations that analysis shows improves model generalization and performance for video and 3D vision research.
Conclusion: SpatialVID addresses critical data scarcity in spatial intelligence, providing a key asset for advancing video and 3D vision research through its scale, diversity, and comprehensive annotations.
Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw videos, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly fosters improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
[203] Improved Segmentation of Polyps and Visual Explainability Analysis
Akwasi Asare, Thanh-Huy Nguyen, Ulas Bagci
Main category: cs.CV
TL;DR: PolypSeg-GradCAM: An explainable deep learning framework combining U-Net with ResNet-34 and Grad-CAM for transparent polyp segmentation in colonoscopy, achieving high accuracy with interpretable visualizations.
Details
Motivation: Colorectal cancer is a major global health issue, with GI polyps as critical precursors. Early and accurate polyp segmentation is essential but manual methods are labor-intensive and subjective. While deep learning shows promise for automation, its limited interpretability hinders clinical adoption.Method: PolypSeg-GradCAM integrates U-Net architecture with pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for explainable polyp segmentation. The model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images.
Result: Achieved mean Dice coefficient of 0.8902 ± 0.0125, mean IoU of 0.8023, and AUC-ROC of 0.9722. With optimal threshold: Sensitivity of 0.9058 and Precision of 0.9083. Grad-CAM visualizations confirmed predictions were guided by clinically relevant regions, providing model interpretability.
Conclusion: Integrating segmentation accuracy with interpretability supports development of trustworthy AI-assisted colonoscopy tools. The framework demonstrates that explainable AI can bridge the gap between technical performance and clinical adoption.
Abstract: Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates a U-Net architecture with a pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. To ensure rigorous benchmarking, the model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images. Experimental results show a mean Dice coefficient of 0.8902 +/- 0.0125, a mean Intersection-over-Union (IoU) of 0.8023, and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9722. Advanced quantitative analysis using an optimal threshold yielded a Sensitivity of 0.9058 and Precision of 0.9083. Additionally, Grad-CAM visualizations confirmed that predictions were guided by clinically relevant regions, offering insight into the model's decision-making process. This study demonstrates that integrating segmentation accuracy with interpretability can support the development of trustworthy AI-assisted colonoscopy tools.
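The Grad-CAM component can be illustrated independently of the segmentation model: pooled gradients of the output score weight the activation maps of a chosen layer. The toy CNN below only keeps the sketch self-contained; the paper applies the same idea to a U-Net with a ResNet-34 backbone, and for segmentation the score would typically be the sum of foreground logits.

```python
# Minimal Grad-CAM over a toy CNN: hook activations and gradients at a target
# layer, weight the activations by the spatially pooled gradients, apply ReLU.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)
target_layer = model[2]                              # second conv as the CAM layer

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.rand(1, 3, 64, 64, requires_grad=True)
score = model(x).sum()                               # scalar score to explain
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
cam = torch.relu((weights * acts["v"]).sum(dim=1))   # weighted activation map
cam = cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
print(cam.shape)                                     # torch.Size([1, 64, 64])
```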
[204] CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng
Main category: cs.CV
TL;DR: CompareBench is a new benchmark for evaluating visual comparison reasoning in VLMs, covering quantity, temporal, geometric, and spatial tasks. Current models show systematic failures in temporal ordering, spatial relations, and basic counting despite scaling trends.
Details
Motivation: Visual comparison reasoning is a fundamental but understudied skill in vision-language models. There's a need for controlled, diverse, and diagnostic evaluation to identify systematic blind spots in current VLMs' multimodal reasoning capabilities.Method: Created CompareBench with 1000 QA pairs across four tasks (quantity:600, temporal:100, geometric:200, spatial:100). Built from two auxiliary datasets: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). Evaluated both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series).
Result: Models show clear scaling trends but have critical limitations: consistently fail at temporal ordering and spatial relations, make mistakes in basic counting and geometric comparisons that are trivial for humans. Visual comparison remains a systematic blind spot for current VLMs.
Conclusion: CompareBench provides a foundation for advancing more reliable multimodal reasoning by offering controlled, diverse, and diagnostic evaluation of visual comparison capabilities in VLMs, revealing systematic weaknesses that need to be addressed.
Abstract: We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
[205] From Frames to Clips: Training-free Adaptive Key Clip Selection for Long-Form Video Understanding
Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
Main category: cs.CV
TL;DR: F2C: A training-free method that selects temporally coherent key clips instead of isolated frames for video understanding, dynamically balancing clip length and frame resolution to maintain fixed token count.
Details
Motivation: Current Video LLMs suffer from excessive visual tokens from raw video frames that exhaust context windows. Existing frame-wise selection methods discard essential temporal dynamics, leading to poor reasoning about motion and event continuity in long-form videos.Method: Proposes F2C which extends selection from isolated key frames to temporally coherent key clips. Introduces frame resolution as a controllable factor to trade-off between spatial resolution and clip length. Uses an adaptive clip length module to dynamically balance these factors while maintaining constant token count per video.
Result: Outperforms uniform sampling by up to 8.1% on Video-MME, 5.6% on LongVideoBench, and 10.3% on MLVU benchmarks. Demonstrates importance of preserving temporal coherence in frame selection for long-form video understanding.
Conclusion: Temporally coherent clip selection is crucial for video understanding, and balancing clip length with frame resolution provides a practical pathway for scaling VLMs to real-world video applications. The training-free F2C approach effectively addresses token explosion while preserving essential temporal dynamics.
Abstract: Video Large Language Models (VLMs) have achieved strong performance on various vision-language tasks, yet their practical use is limited by the massive number of visual tokens produced from raw video frames, which quickly exhausts the model's context window. Existing solutions mitigate this issue by selecting a sparse set of frames, but such frame-wise selection discards essential temporal dynamics in long-form videos, leading to suboptimal reasoning about motion and event continuity. In this work, we systematically examine the role of temporal information and show that extending selection from isolated key frames to temporally coherent key clips improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we introduce frame resolution as a controllable factor in frame selection, enabling a trade-off between spatial resolution and clip length. Building on this idea, we propose an adaptive clip length module that dynamically balances these factors to ensure a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling VLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
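The fixed-token-budget trade-off is simple arithmetic: if each frame costs roughly (resolution / patch)^2 visual tokens, lowering the resolution frees budget for longer clips. The sketch below works through this with an assumed patch size and budget; the numbers are illustrative, not taken from the paper.

```python
# Back-of-the-envelope view of the resolution-vs-clip-length trade-off under a
# fixed visual-token budget (patch size and budget are assumed values).
def frames_within_budget(token_budget, resolution, patch=14):
    """How many frames fit if each frame costs (resolution // patch)**2 visual tokens."""
    tokens_per_frame = (resolution // patch) ** 2
    return token_budget // tokens_per_frame

budget = 16384                       # fixed visual-token budget per video (assumed)
for res in (448, 336, 224):
    n = frames_within_budget(budget, res)
    print(f"resolution {res}px -> {n} frames per video within the same budget")
```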
[206] DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, Zhaoxiang Zhang
Main category: cs.CV
TL;DR: DriveVLA-W0 introduces world modeling to address supervision deficit in Vision-Language-Action models for autonomous driving, using future image prediction to create dense self-supervised signals and improve data scaling efficiency.
Details
Motivation: Current Vision-Language-Action (VLA) models suffer from "supervision deficit" - their large capacity is only supervised by sparse, low-dimensional actions, leaving much representational power underutilized. There's a need to better leverage model capacity for driving intelligence.Method: Proposes DriveVLA-W0 training paradigm using world modeling to predict future images, creating dense self-supervised signals. Implements two variants: autoregressive world model for VLAs with discrete visual tokens, and diffusion world model for those with continuous visual features. Adds lightweight action expert for real-time inference.
Result: Significantly outperforms BEV and VLA baselines on NAVSIM v1/v2 benchmarks and 680x larger in-house dataset. Demonstrates amplified data scaling law - performance gains accelerate as training dataset size increases.
Conclusion: World modeling effectively addresses supervision deficit in VLA models for autonomous driving, enabling better utilization of model capacity and improved data scaling efficiency, making it a promising approach for generalized driving intelligence.
Abstract: Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
[207] Deep generative priors for 3D brain analysis
Ana Lawry Aguila, Dina Zemlyanker, You Cheng, Sudeshna Das, Daniel C. Alexander, Oula Puonti, Annabel Sorby-Adams, W. Taylor Kimberly, Juan Eugenio Iglesias
Main category: cs.CV
TL;DR: Diffusion models used as priors for medical imaging inverse problems, achieving SOTA performance on brain MRI tasks without paired training data.
Details
Motivation: Diffusion models are powerful generative tools but lack domain knowledge integration. Bayesian inverse problems incorporate domain knowledge but use simplistic anatomical priors. Need to combine diffusion models' generative power with Bayesian framework's domain knowledge.Method: Use score-based diffusion prior trained on diverse brain MRI data, paired with flexible forward models for various image processing tasks (super-resolution, bias field correction, inpainting). Framework can also refine outputs from existing deep learning methods.
Result: Achieves state-of-the-art performance on heterogeneous clinical and research MRI data, producing consistent, high-quality solutions without requiring paired training datasets.
Conclusion: Diffusion priors are versatile tools for brain MRI analysis, successfully combining data-driven models with domain knowledge for robust medical imaging inverse problems.
Abstract: Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
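The abstract describes pairing a pretrained diffusion prior with explicit forward models. A common way to realize this pattern is to interleave denoising steps with data-consistency gradient steps under the forward operator; the sketch below illustrates that generic recipe with toy stand-ins (the denoiser, forward operator A, and step sizes are all placeholders, not the authors' sampler).

```python
# Sketch: diffusion prior + data-consistency steps for an inverse problem.
import torch

def denoiser(x, t):
    # placeholder for a pretrained score-based prior on brain MRI
    return x * (1.0 - 0.01 * t)

def A(x):
    # toy forward model: 2x average-pool "acquisition" (super-resolution setting)
    return torch.nn.functional.avg_pool2d(x, 2)

def reconstruct(y, steps=50, lam=1.0, shape=(1, 1, 64, 64)):
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = denoiser(x, t)                      # prior (denoising) step
        x = x.detach().requires_grad_(True)
        data_fit = ((A(x) - y) ** 2).sum()      # likelihood under the forward model
        grad, = torch.autograd.grad(data_fit, x)
        x = (x - lam / steps * grad).detach()   # data-consistency step
    return x

y = torch.randn(1, 1, 32, 32)  # low-resolution observation
x_hat = reconstruct(y)
print(x_hat.shape)
```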
[208] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng, Zhengqin Xu, Qingyang Liu, Xiaokang Yang, Wei Shen
Main category: cs.CV
TL;DR: HyperET is an efficient training paradigm for multi-modal LLMs that uses hyperbolic space with dynamic radius adjustment to align visual and textual representations at arbitrary granularity levels, achieving significant improvements with minimal parameter overhead.
Details
Motivation: Current MLLMs require massive computational resources (thousands of GPUs) for training due to inefficient cross-modal alignment. The core problem is that widely-used vision encoders like CLIP and SAM lack proper alignment with language at multi-granularity levels, creating a granularity gap between visual and textual modalities.
Method: HyperET leverages hyperbolic space, which naturally models hierarchical structures, to bridge the granularity gap. It uses dynamic hyperbolic radius adjustment to align visual representations with textual counterparts at arbitrary granularity levels. The approach employs learnable matrices with Möbius multiplication operations implemented through three configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices for flexible yet efficient parametrization.
Result: Comprehensive experiments across multiple MLLM benchmarks show that HyperET consistently improves both existing pre-training and fine-tuning MLLMs with less than 1% additional parameters. The method achieves clear performance gains while maintaining computational efficiency.
Conclusion: HyperET provides a principled and efficient solution to the granularity alignment problem in MLLMs by leveraging hyperbolic geometry. The approach enables better cross-modal understanding with minimal parameter overhead, addressing the computational inefficiency of current MLLM training methods.
Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1% additional parameters. Code is available at https://github.com/godlin-sjtu/HyperET
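As a rough illustration of the Möbius operations the paper builds on, the snippet below implements a standard Möbius matrix-vector multiplication on the Poincaré ball with a learnable diagonal scaling matrix (the cheapest of the three configurations mentioned). Curvature, dimensions, and how the radius is tied to textual granularity are assumptions for illustration only.

```python
# Sketch: Möbius matrix-vector multiplication with a diagonal scaling matrix.
import torch

def mobius_matvec(M, x, c=1.0, eps=1e-6):
    sqrt_c = c ** 0.5
    x_norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    Mx = x @ M.T
    Mx_norm = Mx.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.tanh(Mx_norm / x_norm *
                       torch.atanh((sqrt_c * x_norm).clamp(max=1 - 1e-5)))
    return scale * Mx / (sqrt_c * Mx_norm)

d = 8
diag_scale = torch.nn.Parameter(torch.ones(d))  # learnable diagonal configuration
M = torch.diag(diag_scale)
x = torch.randn(4, d) * 0.1                     # points near the origin of the ball
y = mobius_matvec(M, x)
print(y.norm(dim=-1))  # the scaling changes the hyperbolic radius of each embedding
```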
[209] V-Thinker: Interactive Thinking with Images
Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Jing Lyu, Honggang Zhang
Main category: cs.CV
TL;DR: V-Thinker is a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning, outperforming existing LMMs on vision-interactive reasoning tasks.
Details
Motivation: Current Large Multimodal Models (LMMs) lack deep integration of image interaction with long-horizon reasoning capabilities. While recent "Thinking with Images" paradigms show promise, they remain constrained by limited visual tool spaces and task-specific workflow designs.
Method: V-Thinker uses two key components: (1) Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across diversity, quality, and difficulty dimensions; (2) Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework.
Result: V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios. The paper also introduces VTBench, an expert-verified benchmark for vision-centric interactive reasoning tasks.
Conclusion: V-Thinker provides a general-purpose solution for advancing image-interactive reasoning applications, demonstrating the effectiveness of end-to-end reinforcement learning combined with automated data evolution and progressive training for vision-centric thinking.
Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
[210] ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search
Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang
Main category: cs.CV
TL;DR: ConsistTalk is a novel talking head generation framework that addresses flickering, identity drift, and poor audio-visual sync through optical flow-guided temporal modeling, audio-to-intensity transformation, and diffusion noise search inference.
Details
Motivation: Current video diffusion models for audio-driven portrait animation suffer from flickering, identity drift, and poor audio-visual synchronization due to entangled appearance-motion representations and unstable inference strategies.
Method: Three key components: 1) Optical flow-guided temporal module (OFT) decouples motion from appearance using facial optical flow; 2) Audio-to-Intensity (A2I) model transforms audio and facial velocity into frame-wise intensity sequences; 3) Diffusion noise initialization strategy (IC-Init) with constraints on background coherence and motion continuity.
Result: Extensive experiments show ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos with fine-grained motion control.
Conclusion: ConsistTalk provides a robust framework for intensity-controllable and temporally consistent talking head generation that addresses key limitations of current approaches through decoupled representations and improved inference strategies.
Abstract: Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce \textbf{ConsistTalk}, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose \textbf{an optical flow-guided temporal module (OFT)} that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an \textbf{Audio-to-Intensity (A2I) model} obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a \textbf{diffusion noise initialization strategy (IC-Init)}. By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
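A minimal sketch of what an inference-time diffusion noise search in the spirit of IC-Init could look like: sample several candidate initial noises, score them with proxy constraints for background coherence and motion continuity, and keep the best. The scoring functions and weighting are illustrative stand-ins, not the paper's constraints.

```python
# Sketch: score candidate initial noises and keep the best one.
import torch

def background_coherence(noise, prev_frame, mask):
    # prefer noise whose masked (background) region stays close to the last frame
    return -((noise - prev_frame) * mask).pow(2).mean()

def motion_continuity(noise, prev_noise):
    # prefer noise that does not jump too far from the previous chunk's noise
    return -(noise - prev_noise).pow(2).mean()

def ic_init(prev_frame, prev_noise, mask, n_candidates=8, alpha=0.5):
    best, best_score = None, -float("inf")
    for _ in range(n_candidates):
        cand = torch.randn_like(prev_frame)
        score = background_coherence(cand, prev_frame, mask) \
                + alpha * motion_continuity(cand, prev_noise)
        if score > best_score:
            best, best_score = cand, score
    return best

prev_frame = torch.randn(1, 3, 64, 64)
prev_noise = torch.randn(1, 3, 64, 64)
bg_mask = torch.ones(1, 1, 64, 64)   # 1 = background region
init_noise = ic_init(prev_frame, prev_noise, bg_mask)
```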
[211] DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization
Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori
Main category: cs.CV
TL;DR: Introduces DKDS dataset for degraded Kuzushiji documents with seals, providing benchmarks for text/seal detection and document binarization tasks.
Details
Motivation: Existing OCR methods for Kuzushiji (pre-modern Japanese cursive) perform well on clean documents but fail on noisy ones with degradation and seals. No dataset specifically addresses these challenges.
Method: Created DKDS dataset with expert assistance, defined two benchmark tracks: (1) text and seal detection using YOLO models, (2) document binarization using traditional algorithms, K-means clustering, SOTA GAN methods, and their Conditional GAN baseline.
Result: Provides baseline results for both tracks: YOLO models for detection tasks and various binarization methods including their cGAN baseline. Dataset and implementation code are publicly available.
Conclusion: DKDS dataset fills the gap for noisy Kuzushiji document analysis, enabling better OCR performance on degraded documents with seals through standardized benchmarks.
Abstract: Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using several recent versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, two state-of-the-art (SOTA) Generative Adversarial Network (GAN) methods, as well as our Conditional GAN (cGAN) baseline. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.
[212] $\mathrm{D}^\mathrm{3}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction
Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping
Main category: cs.CV
TL;DR: D³-Predictor: A noise-free deterministic framework that reformulates pretrained diffusion models for dense prediction tasks by eliminating stochastic noise and aggregating timestep-dependent visual priors.
Details
Motivation: Diffusion models with strong visual priors are powerful for dense prediction, but their core stochastic noise is inherently misaligned with deterministic dense prediction tasks. This noise corrupts fine-grained spatial cues and pushes models toward timestep-specific noise objectives, destroying meaningful geometric structure mappings.
Method: Introduces D³-Predictor, a noise-free deterministic framework that reformulates pretrained diffusion models without stochastic noise. Instead of relying on noisy inputs, it views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Task-specific supervision is used to adapt this noise-free prior to dense prediction tasks.
Result: Extensive experiments on various dense prediction tasks demonstrate competitive or state-of-the-art performance in diverse scenarios. The method requires less than half the training data previously used and efficiently performs inference in a single step.
Conclusion: D³-Predictor successfully addresses the misalignment between stochastic diffusion sampling and deterministic dense prediction by creating a noise-free deterministic framework that effectively leverages diffusion priors for geometric understanding tasks with improved efficiency and reduced data requirements.
Abstract: Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^\mathrm{3}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^\mathrm{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^\mathrm{3}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.
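The core idea of treating a pretrained diffusion network as an ensemble of timestep-dependent experts can be sketched as follows: the clean image is passed through the network at several timestep embeddings without adding noise, and the resulting features are aggregated with learned weights. The tiny backbone and aggregation scheme below are placeholders, not the paper's implementation.

```python
# Sketch: aggregate timestep-dependent features of a clean (noise-free) input.
import torch
import torch.nn as nn

class TinyTimestepNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.conv = nn.Conv2d(3, ch, 3, padding=1)
        self.t_embed = nn.Embedding(1000, ch)

    def forward(self, x, t):
        h = self.conv(x)
        return h + self.t_embed(t)[:, :, None, None]

class D3StyleAggregator(nn.Module):
    def __init__(self, backbone, timesteps=(50, 250, 500, 750)):
        super().__init__()
        self.backbone = backbone
        self.timesteps = timesteps
        self.weights = nn.Parameter(torch.zeros(len(timesteps)))  # softmax-mixed

    def forward(self, x):
        feats = []
        for t in self.timesteps:
            t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
            feats.append(self.backbone(x, t_batch))   # no noise is added to x
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * fi for wi, fi in zip(w, feats))

model = D3StyleAggregator(TinyTimestepNet())
prior_feat = model(torch.randn(2, 3, 32, 32))
print(prior_feat.shape)  # (2, 16, 32, 32), fed to a task-specific head
```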
[213] MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering
Seokwon Song, Minsu Park, Gunhee Kim
Main category: cs.CV
TL;DR: MAVIS is the first benchmark for evaluating multimodal source attribution systems that handle visual questions, retrieve multimodal evidence, and generate cited long-form answers.
Details
Motivation: Existing source attribution work focuses on text-only scenarios and overlooks multimodality, creating a gap in evaluating systems that need to handle visual questions with proper citations.
Method: Created MAVIS benchmark with 157K visual QA instances annotated with fact-level citations to multimodal documents. Developed fine-grained automatic metrics for informativeness, groundedness, and fluency that correlate with human judgments.
Result: Three key findings: (1) LVLMs with multimodal RAG produce more informative/fluent answers than unimodal RAG but have weaker groundedness for images vs text; (2) Trade-off between informativeness and groundedness across prompting methods; (3) Need to mitigate contextual bias in interpreting image documents.
Conclusion: MAVIS enables evaluation of multimodal source attribution systems, revealing important challenges in groundedness for visual content and highlighting contextual bias in image interpretation as a crucial research direction.
Abstract: Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenario and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research.
[214] PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu, Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li
Main category: cs.CV
TL;DR: PerTouch is a diffusion-based image retouching framework that balances controllability and subjectivity by supporting semantic-level editing while maintaining global aesthetics through parameter maps and VLM-driven instruction handling.
Details
Motivation: The paper addresses the challenge of balancing controllability and subjectivity in image retouching, where users want both fine-grained control over specific semantic regions and alignment with their personalized aesthetic preferences.
Method: Proposes PerTouch framework using parameter maps containing attribute values in semantic regions as input to construct explicit parameter-to-image mapping. Introduces semantic replacement and parameter perturbation mechanisms for better boundary perception. Develops a VLM-driven agent with feedback-driven rethinking and scene-aware memory to handle user instructions and capture long-term preferences.
Result: Extensive experiments demonstrate each component’s effectiveness and superior performance of PerTouch in personalized image retouching compared to existing methods.
Conclusion: PerTouch provides a unified solution for personalized image retouching that effectively balances fine-grained control with subjective aesthetic preferences through its diffusion-based framework and intelligent instruction handling mechanisms.
Abstract: Image retouching aims to enhance visual quality while aligning with users’ personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms during training. To connect natural language instructions with visual control, we develop a VLM-driven agent to handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component’s effectiveness and the superior performance of PerTouch in personalized image retouching. Code Pages: https://github.com/Auroral703/PerTouch.
[215] CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring
Mingchen Zhong, Xin Lu, Dong Li, Senyan Xu, Ruixuan Jiang, Xueyang Fu, Baocai Yin
Main category: cs.CV
TL;DR: CompEvent is a complex neural network framework for low-light video deblurring that performs holistic full-process fusion of event camera data and RGB frames using complex-valued processing for temporal alignment and space-frequency learning.
Details
Motivation: Low-light video deblurring is challenging for nighttime surveillance and autonomous driving due to dim lighting and long exposures. Existing staged fusion methods are limited against combined low-light and motion blur degradations.
Method: CompEvent uses complex-valued neural networks with two core components: 1) Complex Temporal Alignment GRU using complex-valued convolutions and GRU processing for temporal alignment and continuous fusion, and 2) Complex Space-Frequency Learning module performing unified complex-valued signal processing in both spatial and frequency domains.
Result: Extensive experiments demonstrate that CompEvent outperforms state-of-the-art methods in low-light video deblurring, achieving superior performance through full-process spatiotemporal fusion and complementary learning between modalities.
Conclusion: CompEvent’s holistic full-process fusion approach using complex-valued neural networks effectively addresses the challenging task of low-light video deblurring by maximizing complementary learning between event data and RGB frames.
Abstract: Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task.
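For readers unfamiliar with complex-valued networks, the building block is straightforward: a complex convolution applied to (a + ib) with complex weights (w_r + i w_i) decomposes into four real convolutions. The sketch below shows this standard construction; treating the RGB frame as the real part and an event representation as the imaginary part is an illustrative assumption, not necessarily how CompEvent pairs the modalities.

```python
# Sketch: a complex-valued 2D convolution built from real convolutions.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, real, imag):
        # (real + i*imag) * (w_r + i*w_i) = (real*w_r - imag*w_i) + i(imag*w_r + real*w_i)
        out_real = self.conv_r(real) - self.conv_i(imag)
        out_imag = self.conv_r(imag) + self.conv_i(real)
        return out_real, out_imag

# e.g. an RGB frame as the real part, an event voxel grid as the imaginary part
frame = torch.randn(1, 3, 64, 64)
events = torch.randn(1, 3, 64, 64)
real, imag = ComplexConv2d(3, 16)(frame, events)
print(real.shape, imag.shape)
```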
[216] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Yikun Wang, Zuyan Liu, Ziyi Wang, Han Hu, Pengfei Liu, Yongming Rao
Main category: cs.CV
TL;DR: GeoVista is an agentic model for geolocalization that integrates image zooming and web search tools within reasoning loops, trained via SFT and RL with hierarchical rewards, achieving state-of-the-art performance comparable to closed-source models.
Details
Motivation: Current agentic visual reasoning models focus too narrowly on image manipulation tools, lacking general-purpose capabilities. Geolocalization requires both nuanced visual grounding and web search for hypothesis refinement, but existing benchmarks lack high-resolution imagery and sufficient challenge for deep agentic reasoning.
Method: 1) Created GeoBench benchmark with worldwide photos, panoramas, and city satellite images; 2) Developed GeoVista agentic model with integrated tool invocation (image-zoom-in for region magnification and web-search for information retrieval); 3) Two-stage training: cold-start SFT for reasoning patterns and tool-use priors, followed by RL with hierarchical rewards leveraging multi-level geographical information.
Result: GeoVista greatly surpasses other open-source agentic models on geolocalization tasks and achieves performance comparable to closed-source models like Gemini-2.5-flash and GPT-5 on most metrics.
Conclusion: The proposed GeoVista model demonstrates effective integration of tool invocation within reasoning loops for geolocalization, with the hierarchical reward strategy and two-stage training pipeline enabling strong performance that bridges the gap between open-source and closed-source agentic models.
Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
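The tool-in-the-loop pattern described here (zoom into a region, query the web, then commit to a location) can be sketched as a simple dispatch loop. The policy, tool signatures, and message format below are illustrative placeholders rather than GeoVista's actual interface.

```python
# Sketch: an agentic reasoning loop with zoom and web-search tools.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def image_zoom_in(image, box):
    return f"<crop of {image} at {box}>"

def web_search(query):
    return f"<search results for '{query}'>"

TOOLS = {"image_zoom_in": image_zoom_in, "web_search": web_search}

def policy(context):
    # stand-in for the VLM: zoom once, search once, then answer
    n_tool_msgs = sum(1 for m in context if m.startswith("TOOL:"))
    if n_tool_msgs == 0:
        return ToolCall("image_zoom_in", {"image": "query.jpg", "box": (10, 10, 200, 200)})
    if n_tool_msgs == 1:
        return ToolCall("web_search", {"query": "blue tram city europe"})
    return "FINAL: Lisbon, Portugal"

def run_agent(image, max_steps=6):
    context = [f"USER: where was {image} taken?"]
    for _ in range(max_steps):
        step = policy(context)
        if isinstance(step, str):          # final geolocalization answer
            return step
        result = TOOLS[step.name](**step.args)
        context.append(f"TOOL:{step.name} -> {result}")
    return "FINAL: unknown"

print(run_agent("query.jpg"))
```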
[217] NeAR: Coupled Neural Asset-Renderer Stack
Hong Li, Chongjie Ye, Houyuan Chen, Weiqing Xiao, Ziyang Yan, Lixing Xiao, Zhaoxi Chen, Jianfeng Xiang, Shaocong Xu, Xuhui Liu, Yikai Wang, Baochang Zhang, Xiaoguang Han, Jiaolong Yang, Hao Zhao
Main category: cs.CV
TL;DR: NeAR introduces a coupled neural asset-renderer stack that co-designs asset representation and rendering for end-to-end optimization, enabling single-image to relightable 3D Gaussian splats in real-time.
Details
Motivation: Traditional neural asset authoring and neural rendering have evolved as disjoint paradigms, limiting potential for end-to-end optimization in fidelity and consistency. The paper aims to bridge this gap by treating assets and renderers as co-designed components rather than independent entities.
Method: NeAR consists of two co-designed components: 1) Lighting-Homogenized SLAT (LH-SLAT) asset representation that uses rectified-flow models to lift casually lit single images into canonical, illumination-invariant latent space, and 2) a lighting-aware neural decoder that interprets these homogenized latents to synthesize relightable 3D Gaussian splats in real-time when conditioned on HDR environment maps and camera views.
Result: The method outperforms state-of-the-art baselines on four tasks: G-buffer-based forward rendering, random-lit reconstruction, unknown-lit relighting, and novel-view relighting. Extensive experiments show superior performance in both quantitative metrics and perceptual quality.
Conclusion: The coupled asset-renderer perspective creates a robust “contract” for superior generation and should inspire future graphics stacks to view neural assets and renderers as co-designed components rather than independent entities.
Abstract: Neural asset authoring and neural rendering have traditionally evolved as disjoint paradigms: one generates digital assets for fixed graphics pipelines, while the other maps conventional assets to images. However, treating them as independent entities limits the potential for end-to-end optimization in fidelity and consistency. In this paper, we bridge this gap with NeAR, a Coupled Neural Asset–Renderer Stack. We argue that co-designing the asset representation and the renderer creates a robust “contract” for superior generation. On the asset side, we introduce the Lighting-Homogenized SLAT (LH-SLAT). Leveraging a rectified-flow model, NeAR lifts casually lit single images into a canonical, illumination-invariant latent space, effectively suppressing baked-in shadows and highlights. On the renderer side, we design a lighting-aware neural decoder tailored to interpret these homogenized latents. Conditioned on HDR environment maps and camera views, it synthesizes relightable 3D Gaussian splats in real-time without per-object optimization. We validate NeAR on four tasks: (1) G-buffer-based forward rendering, (2) random-lit reconstruction, (3) unknown-lit relighting, and (4) novel-view relighting. Extensive experiments demonstrate that our coupled stack outperforms state-of-the-art baselines in both quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires future graphics stacks that view neural assets and renderers as co-designed components instead of independent entities.
[218] UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era
Ziqiang Zhu, Bowei Yang
Main category: cs.CV
TL;DR: UniVCD is an unsupervised, open-vocabulary change detection method using frozen SAM2 and CLIP models with lightweight feature alignment, achieving strong performance without labeled data.
Details
Motivation: Supervised CD methods are dataset-dependent, expensive to annotate, limited to predefined categories, and generalize poorly. Vision foundation models (SAM2, CLIP) offer new opportunities to overcome these limitations.
Method: Uses frozen SAM2 and CLIP models with a lightweight feature alignment module to bridge SAM2’s spatial details and CLIP’s semantic priors. Includes streamlined post-processing to suppress noise and pseudo-changes.
Result: Achieves consistently strong performance on BCD and SCD benchmarks, matching or surpassing existing open-vocabulary CD methods in key metrics (F1, IoU).
Conclusion: Unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD.
Abstract: Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
[219] Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao
Main category: cs.CV
TL;DR: Markov-VAR reformulates visual autoregressive modeling as a non-full-context Markov process using sliding window compression to reduce computational overhead while maintaining performance.
Details
Motivation: Traditional VAR models use full-context dependency (modeling all previous scales) which causes computational inefficiency and high memory overhead, limiting practical scalability despite better representation learning.
Method: Reformulate VAR as Markov process with Markovian Scale Prediction: treat each scale as Markov state, use sliding window to compress previous scales into compact history vector, combine with current state to create dynamic state evolving under Markov process.
Result: Markov-VAR reduces FID by 10.5% on ImageNet 256×256 and decreases peak memory consumption by 83.8% on 1024×1024 compared to original VAR, achieving better performance with much lower resource usage.
Conclusion: Markov-VAR provides an efficient and effective foundation for visual autoregressive generation, balancing performance and computational efficiency through Markov process reformulation.
Abstract: Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR’s practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
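A hedged sketch of Markovian Scale Prediction's state construction: keep only a short sliding window of previous scales, pool them into a compact history vector, and fuse it with the current scale's tokens to form the dynamic state. Pooling and fusion choices below are assumptions for illustration.

```python
# Sketch: sliding-window history vector fused with the current scale's tokens.
import torch
import torch.nn as nn

class MarkovScaleState(nn.Module):
    def __init__(self, dim=256, window=2):
        super().__init__()
        self.window = window
        self.compress = nn.Linear(dim, dim)   # history vector projection
        self.fuse = nn.Linear(2 * dim, dim)   # combine current state + history

    def forward(self, scales):
        # scales: list of token tensors, one per scale, each (B, N_k, dim)
        current = scales[-1]
        past = scales[-1 - self.window:-1] or [torch.zeros_like(current[:, :1])]
        pooled = torch.cat([s.mean(dim=1, keepdim=True) for s in past], dim=1)
        history = self.compress(pooled.mean(dim=1, keepdim=True))      # (B, 1, dim)
        history = history.expand(-1, current.shape[1], -1)
        return self.fuse(torch.cat([current, history], dim=-1))        # dynamic state

m = MarkovScaleState()
scales = [torch.randn(2, n, 256) for n in (1, 4, 16, 64)]  # coarse-to-fine token maps
state = m(scales)
print(state.shape)  # (2, 64, 256): input for predicting the next scale
```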
[220] Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather
Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Luo, Jintao Cheng, Xieyuanli Chen, Xiaoyu Tang
Main category: cs.CV
TL;DR: DiffFusion: A diffusion-based framework for robust multi-modal 3D object detection in adverse weather conditions using restoration and adaptive fusion.
Details
Motivation: Multi-modal 3D object detection suffers in adverse weather due to weather-induced distortions and misalignment between different data modalities (images and LiDAR).
Method: 1. Diffusion-IR: Restores weather-degraded images using diffusion models. 2. Point Cloud Restoration (PCR): Compensates corrupted LiDAR data using image object cues. 3. Bidirectional Adaptive Fusion and Alignment Module (BAFAM): Enables dynamic multi-modal fusion and bidirectional BEV alignment.
Result: Achieves state-of-the-art robustness under adverse weather on three public datasets while preserving strong clean-data performance. Zero-shot results on real-world DENSE dataset validate generalization.
Conclusion: DiffFusion effectively addresses weather challenges in multi-modal 3D object detection through diffusion-based restoration and adaptive cross-modal fusion, demonstrating strong robustness and generalization.
Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird’s-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
[221] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Seogkyu Jeon, Kibeom Hong, Hyeran Byun
Main category: cs.CV
TL;DR: DPMFormer: A domain generalization framework for semantic segmentation that addresses semantic misalignment between visual and textual contexts through domain-aware prompt learning, contrastive learning with texture perturbation, and domain-robust consistency learning.
Details
Motivation: Existing domain generalized semantic segmentation methods that use Vision-Language Models overlook semantic misalignment between visual and textual contexts caused by fixed context prompts learned on single source domains.
Method: 1) Domain-aware prompt learning for semantic alignment between visual and textual cues; 2) Domain-aware contrastive learning with texture perturbation to diversify observable domains; 3) Domain-robust consistency learning to minimize prediction discrepancies between original and augmented images.
Result: Establishes new state-of-the-art performance on various DGSS benchmarks, demonstrating superiority over existing methods.
Conclusion: DPMFormer effectively addresses semantic misalignment in domain generalized semantic segmentation through domain-aware prompt learning and robust consistency mechanisms, achieving superior generalization across diverse domains.
Abstract: Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with the texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we have proposed the domain-robust consistency learning which guides the model to minimize discrepancies of prediction from original and the augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
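The domain-robust consistency idea is easy to sketch: predictions on a texture-perturbed copy of an image are pulled toward the predictions on the original. The augmentation, divergence, and tiny "segmenter" below are toy stand-ins, not the paper's components.

```python
# Sketch: consistency loss between predictions on original and perturbed images.
import torch
import torch.nn.functional as F

def texture_perturb(x):
    # toy stand-in for style/texture augmentation (e.g., color jitter + noise)
    return (x * (0.8 + 0.4 * torch.rand_like(x[:, :1]))) + 0.05 * torch.randn_like(x)

def consistency_loss(model, images):
    logits_orig = model(images)                    # (B, C, H, W) class logits
    logits_aug = model(texture_perturb(images))
    p = F.log_softmax(logits_aug, dim=1)
    q = F.softmax(logits_orig.detach(), dim=1)     # original prediction as target
    return F.kl_div(p, q, reduction="batchmean")

model = torch.nn.Conv2d(3, 19, 1)                  # trivial "segmenter", 19 classes
images = torch.randn(2, 3, 64, 64)
loss = consistency_loss(model, images)
loss.backward()
```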
[222] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
Main category: cs.CV
TL;DR: Live Avatar: A real-time streaming avatar generation system using a 14B diffusion model with novel parallelism and consistency mechanisms to achieve 20 FPS infinite-length generation.
Details
Motivation: Existing diffusion-based video generation methods suffer from sequential computation bottlenecks and long-horizon inconsistency, making them impractical for real-time streaming audio-driven avatar synthesis applications.
Method: 1) Timestep-forcing Pipeline Parallelism (TPP) - pipelines denoising steps across multiple GPUs to break autoregressive bottlenecks; 2) Rolling Sink Frame Mechanism (RSFM) - maintains temporal consistency by dynamically recalibrating appearance using cached reference images; 3) Self-Forcing Distribution Matching Distillation - enables causal, streamable adaptation of large-scale models without quality loss.
Result: Achieves state-of-the-art performance with 20 FPS end-to-end generation on 5 H800 GPUs, enabling practical real-time high-fidelity avatar generation at scale - the first system to do so.
Conclusion: Live Avatar establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications, overcoming previous limitations of sequential computation and inconsistency for real-time streaming avatar generation.
Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
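The scheduling intuition behind Timestep-forcing Pipeline Parallelism can be shown without any distributed code: if each device owns one denoising step, consecutive video chunks flow through the pipeline and all devices stay busy once it fills. The toy schedule below illustrates that idea; chunk and stage counts are arbitrary, and this is a scheduling illustration rather than a distributed implementation.

```python
# Sketch: pipeline schedule where GPU k applies denoising step k to streaming chunks.
def tpp_schedule(num_chunks=6, num_stages=4):
    # chunk c enters stage 0 at tick c and advances one stage per tick
    timeline = []
    for tick in range(num_chunks + num_stages - 1):
        active = []
        for stage in range(num_stages):
            chunk = tick - stage
            if 0 <= chunk < num_chunks:
                active.append((stage, chunk))
        timeline.append(active)
    return timeline

for tick, active in enumerate(tpp_schedule()):
    print(f"tick {tick}: " + ", ".join(f"GPU{g} denoises chunk {c} (step {g})"
                                       for g, c in active))
```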
[223] Generation is Required for Data-Efficient Perception
Jack Brady, Bernhard SchĂślkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel
Main category: cs.CV
TL;DR: Generative methods with decoder inversion enable compositional generalization, while non-generative encoder methods struggle without large-scale pretraining or extra supervision.
Details
Motivation: To investigate whether generative approaches (with decoder inversion) are necessary for human-level visual perception, specifically for achieving compositional generalization - a key aspect of human perception that current non-generative vision models may lack.
Method: Theoretical analysis formalizing inductive biases needed for compositional generalization in both decoder-based (generative) and encoder-based (non-generative) methods, followed by empirical evaluation on photorealistic image datasets comparing various generative and non-generative approaches.
Result: Generative methods can enforce necessary inductive biases through decoder constraints and inversion (via gradient-based search or generative replay), enabling compositional generalization. Non-generative methods generally cannot enforce these biases through regularization or architecture, requiring large-scale pretraining or added supervision to improve generalization.
Conclusion: Generative approaches with decoder inversion are crucial for achieving compositional generalization in visual perception, supporting the hypothesis that human-level perception requires generative mechanisms, while non-generative methods face fundamental limitations in this regard.
Abstract: It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today’s most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
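Decoder inversion via gradient-based search, one of the two inversion routes discussed, amounts to optimizing a latent code so the decoded output matches the observation. The sketch below uses a toy frozen decoder and plain Adam; everything about the decoder and objective is an illustrative assumption.

```python
# Sketch: obtain a representation by inverting a frozen decoder with gradient search.
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 3 * 16 * 16))
for p in decoder.parameters():
    p.requires_grad_(False)

def invert(decoder, image, steps=200, lr=0.05, latent_dim=8):
    z = torch.zeros(image.shape[0], latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z).view_as(image)
        loss = (recon - image).pow(2).mean()
        loss.backward()
        opt.step()
    return z.detach()

x = torch.randn(4, 3, 16, 16)
z_hat = invert(decoder, x)   # representation obtained by inverting the decoder
print(z_hat.shape)
```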
[224] VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference
Shengling Qin, Hao Yu, Chenxin Wu, Zheng Li, Yizhong Cao, Zhengyang Zhuge, Yuxin Zhou, Wentao Yao, Yi Zhang, Zhengheng Wang, Shuai Bai, Jianwei Zhang, Junyang Lin
Main category: cs.CV
TL;DR: VLCache is a cache reuse framework for multimodal LLMs that reuses both KV cache and encoder cache from previous inputs to avoid recomputation when same multimodal inputs recur, achieving near-perfect accuracy with only 2-5% token computation and 1.2x-16x speedups.
Details
Motivation: Multimodal LLMs incur high computational costs when processing the same multimodal inputs repeatedly. Current approaches use heuristic methods that don't address cumulative reuse errors effectively, leading to accuracy degradation or inefficient computation.
Method: 1) Formal identification of cumulative reuse error effect and techniques to minimize non-prefix cache reuse error. 2) Analysis of varying importance across model layers and proposal of dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. 3) Implementation based on SGLang framework.
Result: Achieves accuracy on par with full recomputation while requiring only 2-5% of tokens to compute, yielding 1.2x-16x TTFT (Time To First Token) speedups. Experimental implementation shows significantly faster inference in practical deployments.
Conclusion: VLCache effectively eliminates costly recomputation for recurring multimodal inputs through formal error minimization and layer-aware strategies, enabling efficient multimodal LLM inference with minimal accuracy loss.
Abstract: This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. We develop an experimental implementation of the proposed VLCache pipeline based on SGLang, enabling significantly faster inference in practical deployments.
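The reuse pattern can be sketched as a content-addressed cache holding both the vision-encoder output and per-layer KV tensors, plus a layer-dependent policy that decides how few tokens to recompute. The cache layout, hashing, and recompute schedule below are illustrative assumptions, not VLCache's implementation.

```python
# Sketch: reuse cached encoder/KV tensors for recurring multimodal inputs,
# recomputing only a small, layer-dependent fraction of tokens.
import hashlib
import torch

class MultimodalCache:
    def __init__(self):
        self.store = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes):
        return self.store.get(self.key(image_bytes))

    def put(self, image_bytes, encoder_out, kv_per_layer):
        self.store[self.key(image_bytes)] = {"enc": encoder_out, "kv": kv_per_layer}

def recompute_ratio(layer_idx, num_layers, base=0.02, top=0.05):
    # toy layer-aware policy: recompute slightly more tokens in later layers
    return base + (top - base) * layer_idx / max(num_layers - 1, 1)

cache = MultimodalCache()
img = b"...same image bytes as before..."
cache.put(img, torch.randn(1, 256, 1024),
          [(torch.randn(1, 8, 256, 64), torch.randn(1, 8, 256, 64)) for _ in range(4)])

hit = cache.get(img)
if hit is not None:
    for layer, (k, v) in enumerate(hit["kv"]):
        n_recompute = int(recompute_ratio(layer, len(hit["kv"])) * k.shape[2])
        print(f"layer {layer}: reuse {k.shape[2] - n_recompute} tokens, "
              f"recompute {n_recompute}")
```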
[225] Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin
Main category: cs.CV
TL;DR: Video Reality Test is a new benchmark for evaluating AI-generated video detection, focusing on immersive ASMR videos with tight audio-visual coupling, revealing that current VLMs struggle to distinguish realistic AI videos while humans perform much better.
Details
Motivation: Current AI-generated video detection benchmarks lack evaluation of audio-paired videos and focus only on classification. There's a need to test whether state-of-the-art video generation models can produce immersive, audio-paired videos that can deceive both humans and VLMs, especially in tightly coupled audio-visual scenarios like ASMR content.
Method: The authors introduce Video Reality Test, an ASMR-sourced video benchmark with two key components: (1) Immersive ASMR video-audio sources built on curated real ASMR videos targeting fine-grained action-object interactions, and (2) A peer-review evaluation protocol where video generation models act as creators trying to fool reviewers, while VLMs serve as reviewers trying to detect fakeness.
Result: The best creator model (Veo3.1-Fast) fools most VLMs, with the strongest reviewer (Gemini 2.5-Pro) achieving only 56% accuracy (random is 50%), far below human expert performance of 81.25%. Adding audio improves real-fake discrimination, but superficial cues like watermarks can still significantly mislead models.
Conclusion: The findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency, highlighting the gap between AI and human performance in detecting realistic AI-generated videos.
Abstract: Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-Review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
[226] Unified Semantic Transformer for 3D Scene Understanding
Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
Main category: cs.CV
TL;DR: UNITE is a unified neural network that performs multiple 3D semantic understanding tasks from RGB images in seconds, outperforming task-specific models.
Details
Motivation: Existing 3D scene understanding models are task-specific and limited due to real-world complexity. There's a need for a unified approach that can handle multiple semantic tasks efficiently.
Method: A Unified Semantic Transformer (UNITE) that processes unseen scenes end-to-end from RGB images, using 2D distillation with self-supervision and novel multi-view losses for 3D consistency.
Result: Achieves state-of-the-art performance on multiple semantic tasks, often outperforming task-specific models and even methods using ground truth 3D geometry.
Conclusion: UNITE demonstrates that a unified model can effectively handle diverse 3D semantic understanding tasks from RGB images, offering efficient and comprehensive scene parsing.
Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io
[227] Null-LoRA: Low-Rank Adaptation on Null Space
Yi Zhang, Yulei Kang, Haoxuan Chen, Jinxuan Li, Jian-Fang Hu
Main category: cs.CV
TL;DR: Null-LoRA: A parameter-efficient fine-tuning method that performs low-rank adaptation within the null space of pre-trained models, achieving better performance with fewer parameters.
Details
Motivation: Existing LoRA methods perform adaptation over the full parameter space, but fine-tuning within a subspace can achieve comparable effectiveness. Pre-trained models possess non-trivial null spaces that can be leveraged for more efficient adaptation.
Method: Null-LoRA freezes portions of low-rank matrices to reduce redundancy and enhance effective rank. It constrains the entire incremental update within the null space of pre-trained models, maximizing utilization of incremental updates for new task adaptation.
Result: Null-LoRA surpasses state-of-the-art methods with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
Conclusion: Null-space based adaptation provides an effective approach for parameter-efficient fine-tuning, achieving better performance while reducing parameter requirements compared to existing LoRA variants.
Abstract: Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
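To make the null-space idea concrete, here is a minimal sketch that takes the small-singular-value directions of a frozen weight matrix as an approximate null space and constrains a low-rank increment to lie in their span. The dimensions, the number of retained directions, and which factors are frozen are illustrative assumptions; the paper's exact construction (which portions of the low-rank matrices are frozen) differs.

```python
import torch

torch.manual_seed(0)
d_out, d_in, r, k = 64, 64, 8, 16       # k = number of null-space directions kept (assumption)

W = torch.randn(d_out, d_in)            # frozen pretrained weight
U, S, Vh = torch.linalg.svd(W)          # right-singular vectors span the input space
N = Vh[-k:].T                           # directions with the smallest singular values act as
                                        # an (approximate) null space of W, shape (d_in, k)

A = torch.randn(r, k) * 0.01            # trainable low-rank factor acting in the k-dim null basis
B = torch.zeros(d_out, r)               # trainable low-rank factor, zero-initialized as in LoRA

def adapted_forward(x):
    # frozen path plus a low-rank update confined to span(N)
    delta_W = B @ A @ N.T               # (d_out, d_in) increment living in the null-space directions
    return x @ (W + delta_W).T

x = torch.randn(2, d_in)
print(adapted_forward(x).shape)         # torch.Size([2, 64])
```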
[228] SemanticBridge - A Dataset for 3D Semantic Segmentation of Bridges and Domain Gap Analysis
Maximilian Kellner, Mariana Ferrandon Cervantes, Yuandong Pan, Ruodan Lu, Ioannis Brilakis, Alexander Reiterer
Main category: cs.CV
TL;DR: A novel 3D semantic segmentation dataset for bridges with domain gap analysis across different sensors, showing up to 11.4% mIoU performance drop due to sensor variations.
Details
Motivation: Addresses the critical need for infrastructure inspection and maintenance by providing a specialized dataset for 3D semantic segmentation of bridge structures, which is essential for advancing structural health monitoring practices.
Method: Created a novel dataset of high-resolution 3D scans of diverse bridge structures from various countries with detailed semantic labels. Evaluated three state-of-the-art 3D deep learning architectures and analyzed domain gaps caused by different sensors.
Result: All three architectures demonstrated robust performance on bridge segmentation. However, sensor variations caused a domain gap that can lead to performance degradation of up to 11.4% mIoU.
Conclusion: The proposed dataset enables accurate automated bridge component segmentation, but sensor-induced domain gaps must be addressed for reliable real-world deployment in infrastructure monitoring applications.
Abstract: We propose a novel dataset that has been specifically designed for 3D semantic segmentation of bridges and the domain gap analysis caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high-resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing the structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state-of-the-art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can potentially lead to a decline in the performance of up to 11.4% mIoU.
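The domain gap is quantified in mIoU; the sketch below shows a standard confusion-matrix mean-IoU computation that could be run separately on scans from each sensor to reproduce this kind of comparison. The class count and random labels are placeholders, and this is not the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU from a confusion matrix, averaged into mIoU."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean(), iou

# placeholder predictions/labels; evaluating the same model on scans from two
# different sensors would expose the domain gap as a drop in this score
pred = np.random.randint(0, 5, size=(1000,))
gt = np.random.randint(0, 5, size=(1000,))
miou, per_class = mean_iou(pred, gt, num_classes=5)
print(f"mIoU: {miou:.3f}")
```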
[229] Stylized Synthetic Augmentation further improves Corruption Robustness
Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock, Andrey Morozov
Main category: cs.CV
TL;DR: Combining synthetic data with neural style transfer improves corruption robustness in image classifiers, achieving SOTA results on CIFAR-10-C, CIFAR-100-C, and TinyImageNet-C benchmarks.
Details
Motivation: Deep vision models are vulnerable to common corruptions, and existing data augmentation methods need improvement for better corruption robustness.
Method: Proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer, systematically analyzing augmentation effects and hyperparameters.
Result: Achieves state-of-the-art corruption robustness: 93.54% on CIFAR-10-C, 74.9% on CIFAR-100-C, and 50.86% on TinyImageNet-C. Stylization and synthetic data complement each other and work with TrivialAugment but not other rule-based augmentations.
Conclusion: The combination of synthetic data and neural style transfer effectively improves model robustness to common corruptions, despite style transfer degrading synthetic image quality according to FID metrics.
Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively
cs.AI
[230] Anubuddhi: A Multi-Agent AI System for Designing and Simulating Quantum Optics Experiments
S. K. Rithvik
Main category: cs.AI
TL;DR: Anubuddhi is a multi-agent AI system that designs and simulates quantum optics experiments from natural language prompts without requiring programming expertise, achieving high design-simulation alignment for diverse quantum optics experiments.
Details
Motivation: To democratize computational experiment design in quantum optics by enabling researchers and students to create and simulate experiments through natural language without specialized programming knowledge, bridging the gap between conceptual ideas and computational implementation.
Method: Multi-agent AI system with intent routing, knowledge-augmented generation, and semantic retrieval from a three-tier toolbox to compose optical layouts. Uses dual-mode validation with QuTiP and FreeSim physics simulation engines with convergent refinement. The system arranges components via semantic matching and validates designs through physics simulation.
Result: Achieved design-simulation alignment scores of 8-9/10 across 13 diverse quantum optics experiments. Free-form simulation outperformed constrained frameworks for 11/13 experiments. System successfully modeled fundamental optics, quantum information protocols, and advanced technologies. Critical distinction found between structural correctness (high alignment confirms correct physics architecture) and quantitative accuracy (requires expert review).
Conclusion: Anubuddhi democratizes quantum optics experiment design for research and pedagogy, producing strong initial designs that users can iteratively refine through conversation. The system reveals that quantum optics diversity demands flexible mathematical representations, with free-form simulation outperforming constrained frameworks for most experiments.
Abstract: We present Anubuddhi, a multi-agent AI system that designs and simulates quantum optics experiments from natural language prompts without requiring specialized programming knowledge. The system composes optical layouts by arranging components from a three-tier toolbox via semantic retrieval, then validates designs through physics simulation with convergent refinement. The architecture combines intent routing, knowledge-augmented generation, and dual-mode validation (QuTiP and FreeSim). We evaluated 13 experiments spanning fundamental optics (Hong-Ou-Mandel interference, Michelson/Mach-Zehnder interferometry, Bell states, delayed-choice quantum eraser), quantum information protocols (BB84 QKD, Franson interferometry, GHZ states, quantum teleportation, hyperentanglement), and advanced technologies (boson sampling, electromagnetically induced transparency, frequency conversion). The system achieves design-simulation alignment scores of 8–9/10, with simulations faithfully modeling intended physics. A critical finding distinguishes structural correctness from quantitative accuracy: high alignment confirms correct physics architecture, while numerical predictions require expert review. Free-form simulation outperformed constrained frameworks for 11/13 experiments, revealing that quantum optics diversity demands flexible mathematical representations. The system democratizes computational experiment design for research and pedagogy, producing strong initial designs users can iteratively refine through conversation.
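QuTiP is one of the two validation engines, and Hong-Ou-Mandel interference is among the benchmarked experiments; as a flavor of the kind of physics check such a validation stage might run, here is a self-contained QuTiP sketch showing the HOM coincidence dip at a 50:50 beam splitter. The beam-splitter convention and Fock-space cutoff are assumptions, and this is not the paper's simulation code.

```python
import numpy as np
from qutip import basis, destroy, qeye, tensor

N = 3                                        # Fock-space cutoff per mode
a = tensor(destroy(N), qeye(N))              # annihilation operator, input mode a
b = tensor(qeye(N), destroy(N))              # annihilation operator, input mode b

# 50:50 beam splitter unitary U = exp(theta * (a†b - ab†)) with theta = pi/4
theta = np.pi / 4
U = (theta * (a.dag() * b - a * b.dag())).expm()

psi_in = tensor(basis(N, 1), basis(N, 1))    # one photon in each input port
psi_out = U * psi_in

# coincidence probability: one photon at each output (HOM dip -> ~0 at 50:50)
p_coincidence = abs(psi_out.overlap(tensor(basis(N, 1), basis(N, 1)))) ** 2
print(f"coincidence probability: {p_coincidence:.3f}")
```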
[231] The Principle of Proportional Duty: A Knowledge-Duty Framework for Ethical Equilibrium in Human and Artificial Systems
Timothy Prescher
Main category: cs.AI
TL;DR: The paper introduces the Principle of Proportional Duty (PPD), a framework showing how ethical responsibility scales with epistemic uncertainty, transforming Action Duty into Repair Duty as uncertainty increases.
Details
Motivation: Traditional ethical frameworks struggle to model decision-making under uncertainty, treating it as a simple constraint rather than a dynamic component of moral responsibility.
Method: Develops the PPD framework with mathematical formulation (D_total = K[(1-HI) + HI * g(C_signal)]), uses Monte Carlo simulations to demonstrate system behavior, and applies the framework across four domains: clinical ethics, recipient-rights law, economic governance, and artificial intelligence.
Result: Systems with baseline humility coefficients (lambda > 0) produce more stable duty allocations and reduce overconfident decision-making. The framework demonstrates cross-disciplinary validity and shows proportional duty serves as a stabilizing principle in complex systems.
Conclusion: The PPD offers a mathematically tractable approach to moral responsibility that could inform auditable AI decision systems, preventing both overreach and omission by dynamically balancing epistemic confidence against contextual risk.
Abstract: Traditional ethical frameworks often struggle to model decision-making under uncertainty, treating it as a simple constraint on action. This paper introduces the Principle of Proportional Duty (PPD), a novel framework that models how ethical responsibility scales with an agent’s epistemic state. The framework reveals that moral duty is not lost to uncertainty but transforms: as uncertainty increases, Action Duty (the duty to act decisively) is proportionally converted into Repair Duty (the active duty to verify, inquire, and resolve uncertainty). This dynamic is expressed by the equation D_total = K[(1-HI) + HI * g(C_signal)], where Total Duty is a function of Knowledge (K), Humility/Uncertainty (HI), and Contextual Signal Strength (C_signal). Monte Carlo simulations demonstrate that systems maintaining a baseline humility coefficient (lambda > 0) produce more stable duty allocations and reduce the risk of overconfident decision-making. By formalizing humility as a system parameter, the PPD offers a mathematically tractable approach to moral responsibility that could inform the development of auditable AI decision systems. This paper applies the framework across four domains, clinical ethics, recipient-rights law, economic governance, and artificial intelligence, to demonstrate its cross-disciplinary validity. The findings suggest that proportional duty serves as a stabilizing principle within complex systems, preventing both overreach and omission by dynamically balancing epistemic confidence against contextual risk.
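Since the duty equation is given explicitly, a small numeric sketch can make the Action-to-Repair conversion concrete. The split of D_total into the two duty components and the choice of g(·) = tanh are illustrative assumptions; the paper states the aggregate equation but not this particular g.

```python
import numpy as np

def total_duty(K, HI, C_signal, lam=0.05, g=lambda c: np.tanh(c)):
    """D_total = K[(1 - HI) + HI * g(C_signal)], as stated in the abstract.

    The action/repair split and g = tanh are illustrative assumptions,
    not the paper's exact specification.
    """
    HI = max(HI, lam)                    # baseline humility coefficient lambda > 0
    action_duty = K * (1.0 - HI)         # duty to act decisively
    repair_duty = K * HI * g(C_signal)   # duty to verify, inquire, and resolve uncertainty
    return action_duty + repair_duty, action_duty, repair_duty

# higher uncertainty (HI) shifts duty from action to repair
for HI in (0.1, 0.5, 0.9):
    d, act, rep = total_duty(K=1.0, HI=HI, C_signal=0.8)
    print(f"HI={HI:.1f}  D_total={d:.2f}  action={act:.2f}  repair={rep:.2f}")
```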
[232] Prompt-to-Parts: Generative AI for Physical Assembly and Scalable Instructions
David Noever
Main category: cs.AI
TL;DR: A framework that generates physically realizable assembly instructions from natural language using LDraw as intermediate representation, producing valid construction sequences for brick-based prototypes with over 3000 parts.
Details
Motivation: Bridging the gap between semantic design intent and manufacturable output by creating a method that enforces geometric validity, connection constraints, and buildability ordering - addressing limitations of previous pixel-based diffusion methods and CAD models that fail to support complex assembly instructions.
Method: Uses LDraw as a text-rich intermediate representation to guide large language models with tools, operating within a discrete parts vocabulary to produce valid step-by-step construction sequences. Introduces a Python library for programmatic model generation and evaluates on complex domains like satellites, aircraft, and architecture.
Result: Demonstrates scalability, modularity, and fidelity with physical prototyping from natural language specifications. The “bag of bricks” method functions as a physical API connecting brick locations to a “bag of words” vocabulary, enabling compilation of arbitrary functional requirements into material reality across four original designs.
Conclusion: The approach provides a novel elemental lingua franca that opens new design options while guiding natural language implementations in manufacturing and engineering prototyping, creating a consistent and repeatable AI representation for physical assembly.
Abstract: We present a framework for generating physically realizable assembly instructions from natural language descriptions. Unlike unconstrained text-to-3D approaches, our method operates within a discrete parts vocabulary, enforcing geometric validity, connection constraints, and buildability ordering. Using LDraw as a text-rich intermediate representation, we demonstrate that large language models can be guided with tools to produce valid step-by-step construction sequences and assembly instructions for brick-based prototypes of more than 3000 assembly parts. We introduce a Python library for programmatic model generation and evaluate buildable outputs on complex satellites, aircraft, and architectural domains. The approach aims for demonstrable scalability, modularity, and fidelity that bridges the gap between semantic design intent and manufacturable output. Physical prototyping follows from natural language specifications. The work proposes a novel elemental lingua franca as a key missing piece from the previous pixel-based diffusion methods or computer-aided design (CAD) models that fail to support complex assembly instructions or component exchange. Across four original designs, this novel “bag of bricks” method thus functions as a physical API: a constrained vocabulary connecting precisely oriented brick locations to a “bag of words” through which arbitrary functional requirements compile into material reality. Given such a consistent and repeatable AI representation opens new design options while guiding natural language implementations in manufacturing and engineering prototyping.
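LDraw is a plain-text format, which is what makes it workable as an LLM intermediate representation; the sketch below hand-writes a tiny valid .ldr model (a stack of 2x4 bricks) to show what that representation looks like. The helper function is hypothetical and unrelated to the paper's Python library; part 3001.dat, colour code 4, and the 24-LDU brick height follow standard LDraw conventions.

```python
# hypothetical helper for emitting LDraw type-1 lines:
# "1 <colour> x y z  a b c d e f g h i  <part>", identity rotation matrix here.
def brick_line(colour, x, y, z, part="3001.dat"):   # 3001.dat = standard 2x4 brick
    return f"1 {colour} {x} {y} {z} 1 0 0 0 1 0 0 0 1 {part}"

lines = ["0 Simple tower generated programmatically"]   # type-0 comment/title line
for level in range(5):
    # one brick is 24 LDraw units tall; -Y points up in LDraw coordinates
    lines.append(brick_line(colour=4, x=0, y=-24 * level, z=0))

with open("tower.ldr", "w") as f:
    f.write("\n".join(lines) + "\n")
```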
[233] Emergence: Overcoming Privileged Information Bias in Asymmetric Embodied Agents via Active Querying
Shaun Baek, Sam Liu, Joseph Ukpong
Main category: cs.AI
TL;DR: LLMs struggle with symbol grounding in asymmetric environments due to Privileged Information Bias, where knowledgeable agents fail to guide sensor-limited partners. A novel framework reveals a roughly 50% success gap and shows that active-querying (Pull-based) protocols are markedly more robust than standard instruction, with successful episodes containing twice as many clarification requests.
Details
Motivation: To investigate how LLMs' reasoning capabilities break down in embodied environments with asymmetric information distribution, particularly the "Curse of Knowledge" phenomenon where knowledgeable agents fail to account for their partners' limited perception.
Method: Proposed an Asymmetric Assistive Reasoning framework within AI2-THOR environment, comparing “Pull-based” (active querying) vs “Push-based” (standard instruction) communication protocols between Leader and Follower agents.
Result: Found significant “Success Gap”: Leader perceives target in 35.0% of episodes but team succeeds only 17.0% of the time. Pull-based protocol was significantly more robust, with successful episodes featuring 2x the frequency of clarification requests.
Conclusion: Active uncertainty reduction through querying protocols is crucial for overcoming symbolic grounding failures in asymmetric environments, providing a prerequisite for safe human-AI and robot-robot collaboration.
Abstract: Large Language Models (LLMs) act as powerful reasoning engines but struggle with “symbol grounding” in embodied environments, particularly when information is asymmetrically distributed. We investigate the Privileged Information Bias (or “Curse of Knowledge”), where a knowledgeable “Leader” agent fails to guide a sensor-limited “Follower” due to a lack of Theory of Mind. To quantify this phenomenon, we propose a novel Asymmetric Assistive Reasoning framework within AI2-THOR. Our experiments reveal a significant “Success Gap”: while the Leader successfully perceives the target in 35.0% of episodes, the collaborative team succeeds only 17.0% of the time, implying that nearly 50% of feasible plans fail solely due to communicative grounding errors. We demonstrate that a “Pull-based” protocol (active querying) is significantly more robust than standard “Push-based” instruction, with successful episodes featuring 2x the frequency of clarification requests. This research isolates the mechanism of active uncertainty reduction as a prerequisite for safe human-AI and robot-robot collaboration.
[234] AI Epidemiology: achieving explainable AI through expert oversight patterns
Kit Tempest-Walters
Main category: cs.AI
TL;DR: AI Epidemiology applies public health surveillance methods to AI governance by tracking population-level statistical patterns in AI outputs rather than analyzing internal model mechanics.
Details
Motivation: Current AI interpretability methods (like SHAP and mechanistic interpretability) struggle with model complexity at deployment scale. There's a need for practical governance frameworks that don't require deep ML expertise and can work across model updates and vendor changes.
Method: Standardizes capture of AI-expert interactions into structured assessment fields (risk level, alignment score, accuracy score) as exposure variables. Uses statistical associations to predict output failures, validated against expert overrides and real-world outcomes. Passively tracks expert convergence/divergence with AI recommendations.
Result: Provides automatic audit trails, zero burden on experts, governance continuity across model updates/vendor changes, reliability scores, and semantic assessments. Enables detection of unreliable AI outputs before harm occurs.
Conclusion: AI Epidemiology democratizes AI oversight by enabling domain experts to govern AI systems without requiring machine learning expertise, using epidemiological surveillance principles to provide practical, scalable governance.
Abstract: AI Epidemiology is a framework for governing and explaining advanced AI systems by applying population-level surveillance methods to AI outputs. The approach mirrors the way in which epidemiologists enable public health interventions through statistical evidence before molecular mechanisms are understood. This bypasses the problem of model complexity which plagues current interpretability methods (such as SHAP and mechanistic interpretability) at the scale of deployed models. AI Epidemiology achieves this population-level surveillance by standardising capture of AI-expert interactions into structured assessment fields: risk level, alignment score, and accuracy score. These function as exposure variables which predict output failure through statistical associations, much like cholesterol and blood pressure act as exposure variables predicting cardiac events. Output-failure associations are subsequently validated against expert overrides and real-world outcomes. The framework places zero burden on experts and provides automatic audit trails by passively tracking expert convergence and divergence with AI recommendations. Since it analyses outputs rather than internal model computations, it also provides governance continuity when institutions update models and switch vendors. Finally, by providing reliability scores and semantic assessments (e.g. ’this recommendation resembles 500 cases overridden by experts due to guideline violations’), it enables experts and institutions to detect unreliable AI outputs before they cause harm. This democratises AI oversight by enabling domain experts to govern AI systems without requiring machine learning expertise.
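Because the framework predicts output failure from the structured assessment fields via statistical association, the surveillance step can be illustrated with an ordinary regression; the sketch below fits a logistic model on synthetic AI-expert interactions. The data generation, field encodings, and choice of logistic regression are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# structured assessment fields captured per AI-expert interaction (exposure variables)
risk_level = rng.integers(0, 3, n)          # 0=low, 1=medium, 2=high
alignment = rng.uniform(0, 1, n)            # expert-AI alignment score
accuracy = rng.uniform(0, 1, n)             # retrospective accuracy score
X = np.column_stack([risk_level, alignment, accuracy])

# synthetic outcome: output failure (e.g. expert override) is more likely with
# high risk and low alignment; purely illustrative data, not from the paper
logits = 1.5 * risk_level - 3.0 * alignment - 1.0 * accuracy
y = rng.random(n) < 1 / (1 + np.exp(-logits))

surveillance_model = LogisticRegression().fit(X, y)
print("association of each exposure with failure:", surveillance_model.coef_)
print("predicted failure risk of a new output:",
      surveillance_model.predict_proba([[2, 0.2, 0.5]])[0, 1])
```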
[235] Beyond Training: Enabling Self-Evolution of Agents with MOBIMEM
Zibin Liu, Cheng Zhang, Xi Zhao, Yunfei Feng, Bingyu Bai, Dahu Feng, Erhu Feng, Yubin Xia, Haibo Chen
Main category: cs.AI
TL;DR: MOBIMEM is a memory-centric LLM agent system that enables self-evolution without model retraining through specialized memory primitives and OS-inspired services, achieving significant improvements in profile alignment, task success rates, and latency reduction on mobile devices.
Details
Motivation: Current LLM agent architectures struggle with post-deployment self-evolution, requiring continuous model retraining/fine-tuning that incurs high computational costs and suffers from accuracy-efficiency trade-offs. There's a need for agents that can evolve iteratively without model retraining.
Method: Proposes MOBIMEM with three specialized memory primitives: (1) Profile Memory using lightweight DisGraph structure for user preference alignment, (2) Experience Memory with multi-level templates for task generalization, and (3) Action Memory recording fine-grained interactions. Integrates OS-inspired services including a scheduler, AgentRR mechanism for action reuse, and context-aware exception handling.
Result: Achieves 83.1% profile alignment with 23.83 ms retrieval time (280x faster than GraphRAG baselines), improves task success rates by up to 50.3%, and reduces end-to-end latency by up to 9x on mobile devices in evaluations on AndroidWorld and top-50 apps.
Conclusion: MOBIMEM demonstrates that memory-centric architectures can enable LLM agents to self-evolve post-deployment without model retraining, achieving significant improvements in personalization, capability, and efficiency through specialized memory primitives and OS-inspired orchestration services.
Abstract: Large Language Model (LLM) agents are increasingly deployed to automate complex workflows in mobile and desktop environments. However, current model-centric agent architectures struggle to self-evolve post-deployment: improving personalization, capability, and efficiency typically requires continuous model retraining/fine-tuning, which incurs prohibitive computational overheads and suffers from an inherent trade-off between model accuracy and inference efficiency. To enable iterative self-evolution without model retraining, we propose MOBIMEM, a memory-centric agent system. MOBIMEM first introduces three specialized memory primitives to decouple agent evolution from model weights: (1) Profile Memory uses a lightweight distance-graph (DisGraph) structure to align with user preferences, resolving the accuracy-latency trade-off in user profile retrieval; (2) Experience Memory employs multi-level templates to instantiate execution logic for new tasks, ensuring capability generalization; and (3) Action Memory records fine-grained interaction sequences, reducing the reliance on expensive model inference. Building upon this memory architecture, MOBIMEM further integrates a suite of OS-inspired services to orchestrate execution: a scheduler that coordinates parallel sub-task execution and memory operations; an agent record-and-replay (AgentRR) mechanism that enables safe and efficient action reuse; and a context-aware exception handling that ensures graceful recovery from user interruptions and runtime errors. Evaluation on AndroidWorld and top-50 apps shows that MOBIMEM achieves 83.1% profile alignment with 23.83 ms retrieval time (280x faster than GraphRAG baselines), improves task success rates by up to 50.3%, and reduces end-to-end latency by up to 9x on mobile devices.
[236] AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli
Main category: cs.AI
TL;DR: AMUSE benchmark tests multimodal LLMs on agentic audio-video reasoning tasks, showing current models struggle with multi-speaker dialogue understanding. RAFT framework improves performance by 39.52% through reward optimization and selective parameter adaptation.
Details
Motivation: Current MLLMs like GPT-4o and Qwen3-Omni have strong perception but struggle with agentic reasoning in multi-speaker, dialogue-centric audio-video scenarios that require tracking speakers, maintaining roles, and grounding events across time - crucial for applications like conversational video assistants and meeting analytics.
Method: 1) AMUSE benchmark with tasks requiring agentic decomposition into planning, grounding, and reflection steps, evaluated across three modes (zero-shot, guided, agentic) and six task families. 2) RAFT framework: data-efficient agentic alignment integrating reward optimization with intrinsic multimodal self-evaluation as reward signal and selective parameter adaptation for efficient updates.
Result: Current models exhibit weak multi-speaker reasoning and inconsistent behavior across all evaluation modes. RAFT achieves up to 39.52% relative improvement in accuracy on the AMUSE benchmark, demonstrating significant performance gains through agentic alignment.
Conclusion: AMUSE and RAFT provide a practical platform for examining and improving agentic reasoning in multimodal models, addressing critical gaps in multi-speaker audio-video understanding through benchmark evaluation and data-efficient alignment framework.
Abstract: Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes zero-shot, guided, and agentic and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data and parameter efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.
[237] State-Augmented Graphs for Circular Economy Triage
Richard Fox, Rui Li, Gustav Jonsson, Farzaneh Goli, Miying Yang, Emel Aktas, Yongjing Wang
Main category: cs.AI
TL;DR: A novel decision-making framework for circular economy triage using state-augmented disassembly sequencing planning with Markov property enforcement for optimal recursive evaluation of product end-of-life pathways.
Details
Motivation: Circular economy triage requires adaptive decisions balancing retained value against processing costs and labor constraints, but existing approaches lack a unified framework for optimal decision-making across diverse products and operational contexts.
Method: Develops a deterministic solver over a state-augmented Disassembly Sequencing Planning (DSP) graph that encodes disassembly history into states to enforce the Markov property, enabling optimal recursive evaluation where each decision depends only on the previous state. Integrates condition-aware utility based on diagnostic health scores and operational constraints.
Result: Demonstrates framework flexibility with electric vehicle battery triage example, showing how recursive valuation of components accommodates varying mechanical complexity, safety requirements, and economic drivers in decision-making.
Conclusion: The unified formalism provides a tractable and generalizable foundation for optimizing circular economy triage decisions across diverse products and operational contexts, enabling adaptive decisions between continuing disassembly or committing to circular economy options.
Abstract: Circular economy (CE) triage is the assessment of products to determine which sustainable pathway they can follow once they reach the end of their usefulness as they are currently being used. Effective CE triage requires adaptive decisions that balance retained value against the costs and constraints of processing and labour. This paper presents a novel decision-making framework as a simple deterministic solver over a state-augmented Disassembly Sequencing Planning (DSP) graph. By encoding the disassembly history into the state, our framework enforces the Markov property, enabling optimal, recursive evaluation by ensuring each decision only depends on the previous state. The triage decision involves choices between continuing disassembly or committing to a CE option. The model integrates condition-aware utility based on diagnostic health scores and complex operational constraints. We demonstrate the framework’s flexibility with a worked example: the hierarchical triage of electric vehicle (EV) batteries, where decisions are driven by the recursive valuation of components. The example illustrates how a unified formalism enables the accommodation of varying mechanical complexity, safety requirements, and economic drivers. This unified formalism therefore provides a tractable and generalisable foundation for optimising CE triage decisions across diverse products and operational contexts.
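To illustrate the recursive evaluation that the Markov-property state augmentation enables, here is a minimal dynamic-programming sketch over a toy state-augmented disassembly graph, where a state records which parts have been removed so far. The part names, CE option values, and removal costs are invented for illustration and are not taken from the paper.

```python
from functools import lru_cache

# toy state-augmented disassembly graph: a state is the frozenset of parts removed so far
PARTS = ("pack_lid", "module_A", "module_B")
CE_VALUE = {                                          # value of committing to a CE option now
    frozenset(): 40,                                  # resell the whole pack
    frozenset({"pack_lid"}): 55,                      # refurbish the opened pack
    frozenset({"pack_lid", "module_A"}): 80,          # reuse module A, recycle the rest
    frozenset(PARTS): 95,                             # full material recovery
}
REMOVAL_COST = {"pack_lid": 10, "module_A": 20, "module_B": 25}

@lru_cache(maxsize=None)
def best_value(state: frozenset) -> float:
    # Markov property: the decision depends only on the current (history-encoding) state
    commit_now = CE_VALUE.get(state, 0)
    continue_options = [
        best_value(state | {p}) - REMOVAL_COST[p]
        for p in PARTS if p not in state
    ]
    return max([commit_now, *continue_options]) if continue_options else commit_now

print("optimal triage value from the intact product:", best_value(frozenset()))
```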
[238] PediatricAnxietyBench: Evaluating Large Language Model Safety Under Parental Anxiety and Pressure in Pediatric Consultations
Vahideh Zolfaghari
Main category: cs.AI
TL;DR: LLMs show significant safety vulnerabilities when providing pediatric advice under real-world parental pressure, with adversarial queries reducing safety by 8% and emergency recognition being absent.
Details
Motivation: LLMs are increasingly used by parents for pediatric guidance, but their safety under real-world adversarial pressures (like anxious parental urgency) is poorly understood and could lead to harmful advice.
Method: Created PediatricAnxietyBench - an open-source benchmark of 300 high-quality queries across 10 pediatric topics (150 patient-derived, 150 adversarial). Evaluated two Llama models (70B and 8B) using a multi-dimensional safety framework covering diagnostic restraint, referral adherence, hedging, and emergency recognition. Adversarial queries incorporated parental pressure patterns including urgency, economic barriers, and challenges to disclaimers.
Result: Mean safety score was 5.50/15 (SD=2.41). 70B model outperformed 8B model (6.26 vs 4.95, p<0.001) with lower critical failures (4.8% vs 12.0%, p=0.02). Adversarial queries reduced safety by 8% (p=0.03), with urgency causing largest drop (-1.40). Vulnerabilities in seizures (33.3% inappropriate diagnosis) and post-vaccination queries. Hedging strongly correlated with safety (r=0.68, p<0.001), while emergency recognition was absent.
Conclusion: Model scale influences safety, but all models show vulnerabilities to realistic parental pressures. PediatricAnxietyBench provides a reusable adversarial evaluation framework to reveal clinically significant failure modes overlooked by standard benchmarks.
Abstract: Large language models (LLMs) are increasingly consulted by parents for pediatric guidance, yet their safety under real-world adversarial pressures is poorly understood. Anxious parents often use urgent language that can compromise model safeguards, potentially causing harmful advice. PediatricAnxietyBench is an open-source benchmark of 300 high-quality queries across 10 pediatric topics (150 patient-derived, 150 adversarial) enabling reproducible evaluation. Two Llama models (70B and 8B) were assessed using a multi-dimensional safety framework covering diagnostic restraint, referral adherence, hedging, and emergency recognition. Adversarial queries incorporated parental pressure patterns, including urgency, economic barriers, and challenges to disclaimers. Mean safety score was 5.50/15 (SD=2.41). The 70B model outperformed the 8B model (6.26 vs 4.95, p<0.001) with lower critical failures (4.8% vs 12.0%, p=0.02). Adversarial queries reduced safety by 8% (p=0.03), with urgency causing the largest drop (-1.40). Vulnerabilities appeared in seizures (33.3% inappropriate diagnosis) and post-vaccination queries. Hedging strongly correlated with safety (r=0.68, p<0.001), while emergency recognition was absent. Model scale influences safety, yet all models showed vulnerabilities to realistic parental pressures. PediatricAnxietyBench provides a reusable adversarial evaluation framework to reveal clinically significant failure modes overlooked by standard benchmarks.
[239] Darth Vecdor: An Open-Source System for Generating Knowledge Graphs Through Large Language Model Queries
Jonathan A. Handler
Main category: cs.AI
TL;DR: DV extracts LLM knowledge into a structured SQL database with a browser-based GUI for domain experts, addressing common LLM response issues for healthcare applications.
Details
Motivation: LLMs contain vast knowledge but direct querying has cost, speed, safety, and confidence concerns, especially in high-volume healthcare operations. Structured knowledge extraction could mitigate these issues.
Method: Darth Vecdor extracts knowledge from LLMs into structured SQL databases with terminology mapping. Features address erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, and support multi-element responses. Includes browser-based GUI for domain experts with minimal technical background.
Result: DV released as free, open-source, extensible software with disclaimer about potential bugs and risks. Provides structured knowledge base alternative to direct LLM querying with improved queryability through standard database.
Conclusion: DV offers structured knowledge extraction from LLMs with GUI accessibility for domain experts, potentially improving healthcare applications despite acknowledged risks and limitations.
Abstract: Many large language models (LLMs) are trained on a massive body of knowledge present on the Internet. Darth Vecdor (DV) was designed to extract this knowledge into a structured, terminology-mapped, SQL database (“knowledge base” or “knowledge graph”). Knowledge graphs may be useful in many domains, including healthcare. Although one might query an LLM directly rather than a SQL-based knowledge graph, concerns such as cost, speed, safety, and confidence may arise, especially in high-volume operations. These may be mitigated when the information is pre-extracted from the LLM and becomes query-able through a standard database. However, the author found the need to address several issues. These included erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, as well as allowing for multi-element responses. DV was built with features intended to mitigate these issues. To facilitate ease of use, and to allow for prompt engineering by those with domain expertise but little technical background, DV provides a simple, browser-based graphical user interface. DV has been released as free, open-source, extensible software, on an “as is” basis, without warranties or conditions of any kind, either express or implied. Users need to be cognizant of the potential risks and benefits of using DV and its outputs, and users are responsible for ensuring any use is safe and effective. DV should be assumed to have bugs, potentially very serious ones. However, the author hopes that appropriate use of current and future versions of DV and its outputs can help improve healthcare.
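As a rough sketch of the extract-and-store loop DV automates, the snippet below queries an LLM (stubbed out here), guards against non-JSON or free-text replies, and writes the resulting triples into a SQLite knowledge table. The schema, the triple format, and the query_llm stub are assumptions for illustration, not DV's actual implementation.

```python
import json
import sqlite3

def query_llm(prompt: str) -> str:
    """Placeholder for the LLM call; swap in whichever client you use.
    Assumption: the model is prompted to return a JSON list of
    {subject, relation, object} records."""
    return json.dumps([
        {"subject": "amoxicillin", "relation": "treats", "object": "otitis media"},
    ])

conn = sqlite3.connect("knowledge.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS triples (subject TEXT, relation TEXT, object TEXT)"
)

raw = query_llm("List conditions treated by amoxicillin as JSON triples.")
try:
    triples = json.loads(raw)            # guard against free-text or off-topic replies
except json.JSONDecodeError:
    triples = []

conn.executemany(
    "INSERT INTO triples VALUES (:subject, :relation, :object)", triples
)
conn.commit()
print(conn.execute("SELECT * FROM triples").fetchall())
```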
[240] Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems
Jovan Pavlović, Miklós Krész, László Hajdu
Main category: cs.AI
TL;DR: A novel RAG framework using spreading activation algorithm on automatically constructed knowledge graphs improves multi-hop reasoning without requiring high-quality pre-existing graphs.
Details
Motivation: Standard RAG systems treat all retrieved information as equally reliable and struggle with multi-step reasoning. GraphRAG approaches have limitations due to dependency on expensive pre-existing knowledge graphs or unreliable automated construction pipelines.
Method: Proposes a RAG framework using the spreading activation algorithm to retrieve information from documents interconnected by automatically constructed knowledge graphs, enabling better multi-step evidence retrieval.
Result: Achieves better or comparable performance to iterative RAG methods, with up to 39% absolute gain in answer correctness when combined with chain-of-thought iterative retrieval compared to naive RAG, using small open-weight language models.
Conclusion: The spreading activation-based approach effectively enhances RAG performance for complex reasoning tasks, is easily integrable as plug-and-play module, and works well in resource-constrained settings with small language models.
Abstract: Despite initial successes and a variety of architectures, retrieval-augmented generation (RAG) systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution as it depends on high-quality graph representations of the corpus, which requires either pre-existing knowledge graphs that are expensive to build and update, or automated graph construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval, leading to challenges similar to those encountered with standard RAG. In this paper, we propose a novel RAG framework that employs the spreading activation algorithm to retrieve information from a corpus of documents interconnected by automatically constructed knowledge graphs, thereby enhancing the performance of large language models on complex tasks such as multi-hop question answering. Experiments show that our method achieves better or comparable performance to iterative RAG methodologies, while also being easily integrable as a plug-and-play module with a wide range of RAG-based approaches. Combining our method with chain-of-thought iterative retrieval yields up to a 39% absolute gain in answer correctness compared to naive RAG, achieving these results with small open-weight language models and highlighting its effectiveness in resource-constrained settings.
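Spreading activation itself is a classic graph algorithm; the sketch below shows one simple variant that propagates decaying activation from seed entities over an adjacency-list knowledge graph and returns per-node scores that could rank the documents attached to each node. The decay, threshold, and hop-limit parameters are illustrative choices, not the paper's configuration.

```python
from collections import defaultdict

def spreading_activation(graph, seeds, decay=0.7, threshold=0.05, max_hops=3):
    """Propagate activation from seed nodes over a knowledge graph.

    graph: dict mapping node -> list of neighbour nodes (an automatically built KG);
    seeds: dict mapping seed node -> initial activation (e.g. entities from the question).
    Returns an activation score per node.
    """
    activation = defaultdict(float)
    frontier = dict(seeds)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            activation[node] += energy
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = energy * decay / len(neighbours)   # split decayed energy among neighbours
            if share < threshold:
                continue                               # prune negligible activation
            for nb in neighbours:
                next_frontier[nb] += share
        frontier = next_frontier
        if not frontier:
            break
    return dict(activation)

graph = {"Marie Curie": ["Pierre Curie", "radium"],
         "Pierre Curie": ["Sorbonne"], "radium": ["polonium"]}
print(spreading_activation(graph, seeds={"Marie Curie": 1.0}))
```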
[241] Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning
Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas
Main category: cs.AI
TL;DR: Fine-tuned small language model (OPT-350M) achieves 77.55% pass rate on ToolBench evaluation, outperforming larger models like ChatGPT-CoT, demonstrating cost-effective AI deployment potential.
Details
Motivation: As organizations scale generative AI adoption, LLM computational costs become prohibitive for routine enterprise use, motivating exploration of smaller, more efficient models that can deliver comparable performance in targeted applications.
Method: Fine-tuned the facebook/opt-350m model using Hugging Face TRL's Supervised Fine-Tuning (SFT) trainer for a single epoch, adapting it to execute tasks like document summarization, query answering, and structured data interpretation traditionally handled by LLMs.
Result: The fine-tuned SLM achieved exceptional 77.55% pass rate on ToolBench evaluation, significantly outperforming baseline models: ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%).
Conclusion: Thoughtful design and targeted training of SLMs can significantly lower adoption barriers, enabling cost-effective, large-scale integration of generative AI into production systems while maintaining competitive performance.
Abstract: As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated the fine-tuning of facebook/opt-350m model (single epoch only) using the Hugging Face TRL (Transformer Reinforcement Learning), specifically the Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves exceptional performance with a 77.55% pass rate on ToolBench evaluation, significantly outperforming all baseline models including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.
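The fine-tuning setup named in the method (facebook/opt-350m with TRL's SFT trainer for one epoch) maps onto a short script; a minimal sketch is below, assuming a recent TRL release. Exact argument names vary across TRL versions, and the two tool-calling examples are stand-in data, not the dataset used in the paper.

```python
# minimal sketch, assuming a recent TRL release; configuration argument names
# (e.g. sequence-length options) differ between TRL versions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# stand-in tool-calling data; the paper trains on far larger instruction data
train_data = Dataset.from_list([
    {"text": "User: What is the weather in Paris?\nAssistant: call get_weather(city='Paris')"},
    {"text": "User: Summarize this report.\nAssistant: call summarize(doc_id=42)"},
])

config = SFTConfig(output_dir="opt350m-toolcall-sft", num_train_epochs=1)
trainer = SFTTrainer(
    model="facebook/opt-350m",   # loaded internally as a causal LM
    args=config,
    train_dataset=train_data,    # the default text field name is "text"
)
trainer.train()
```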
[242] Towards Pervasive Distributed Agentic Generative AI – A State of The Art
Gianni Molinari, Fabio Ciravegna
Main category: cs.AI
TL;DR: Survey paper on LLM agents in pervasive computing, covering architectures, deployment strategies, challenges, and proposing “Agent as a Tool” framework for context-aware, modular agentic AI.
Details
Motivation: The rapid advancement of intelligent agents and LLMs is reshaping pervasive computing, enabling autonomous problem-solving in complex environments with heterogeneous sensors, devices, and data. There's a need to understand how these agents can be effectively deployed and evaluated in pervasive computing scenarios.
Method: Survey methodology: 1) Outlines architectural components of LLM agents (profiling, memory, planning, action), 2) Examines deployment and evaluation across various scenarios, 3) Reviews computational and infrastructural advancements (cloud to edge), 4) Analyzes state-of-the-art agent deployment strategies and applications, 5) Identifies key challenges in pervasive computing.
Result: The survey provides comprehensive analysis of LLM agents in pervasive computing, highlighting: architectural components, deployment strategies (including local/distributed execution on resource-constrained devices), computational advancements from cloud to edge, and key challenges including architectural, energetic, and privacy limitations.
Conclusion: Proposes “Agent as a Tool” conceptual framework for pervasive agentic AI, emphasizing context awareness, modularity, security, efficiency, and effectiveness as key principles for future development of LLM agents in pervasive computing environments.
Abstract: The rapid advancement of intelligent agents and Large Language Models (LLMs) is reshaping the pervasive computing field. Their ability to perceive, reason, and act through natural language understanding enables autonomous problem-solving in complex pervasive environments, including the management of heterogeneous sensors, devices, and data. This survey outlines the architectural components of LLM agents (profiling, memory, planning, and action) and examines their deployment and evaluation across various scenarios. Then it reviews computational and infrastructural advancements (cloud to edge) in pervasive computing and how AI is moving in this field. It highlights state-of-the-art agent deployment strategies and applications, including local and distributed execution on resource-constrained devices. This survey identifies key challenges of these agents in pervasive computing such as architectural, energetic and privacy limitations. It finally proposes what we called “Agent as a Tool”, a conceptual framework for pervasive agentic AI, emphasizing context awareness, modularity, security, efficiency and effectiveness.
[243] Subjective functions
Samuel J. Gershman
Main category: cs.AI
TL;DR: The paper proposes subjective functions as higher-order objective functions endogenous to agents, using expected prediction error as a concrete example, to understand how intelligent systems synthesize goals.
Details
Motivation: To understand how intelligent systems (both human and artificial) synthesize new objective functions on the fly, and to develop a framework for endowing artificial systems with similar goal-generation capabilities.
Method: Introduces the concept of subjective functions - higher-order objective functions that are endogenous to agents (defined with respect to the agent's features rather than external tasks). Uses expected prediction error as a concrete example of such a subjective function.
Result: Proposes a theoretical framework connecting subjective functions to ideas from psychology, neuroscience, and machine learning, suggesting this approach can help explain how intelligent systems generate and select goals.
Conclusion: The subjective function framework provides a promising approach to understanding goal synthesis in intelligent systems, with expected prediction error serving as a concrete example that bridges multiple disciplines.
Abstract: Where do objective functions come from? How do we select what goals to pursue? Human intelligence is adept at synthesizing new objective functions on the fly. How does this work, and can we endow artificial systems with the same ability? This paper proposes an approach to answering these questions, starting with the concept of a subjective function, a higher-order objective function that is endogenous to the agent (i.e., defined with respect to the agent’s features, rather than an external task). Expected prediction error is studied as a concrete example of a subjective function. This proposal has many connections to ideas in psychology, neuroscience, and machine learning.
[244] Conversational Time Series Foundation Models: Towards Explainable and Effective Forecasting
Defu Cao, Michael Gee, Jinbo Liu, Hengxuan Wang, Wei Yang, Rui Wang, Yan Liu
Main category: cs.AI
TL;DR: LLM acts as intelligent judge to coordinate time series foundation model ensemble, achieving SOTA results through R1-style finetuning with SHAP-based faithfulness guidance.
Details
Motivation: No single time series foundation model consistently outperforms others, creating a need for optimal ensemble coordination with interpretability. LLMs have strong reasoning but poor direct forecasting performance.
Method: Reposition the LLM as an intelligent judge to evaluate and coordinate the ensemble. Use R1-style finetuning guided by SHAP-based faithfulness scores to teach the LLM to interpret ensemble weights as causal statements about temporal dynamics. Enable multi-turn conversations for forward-looking assessments and adaptive optimization.
Result: Significantly outperforms leading time series foundation models on GIFT-Eval benchmark (23 datasets, 97 settings) on both CRPS and MASE metrics, establishing new state-of-the-art results.
Conclusion: LLMs can effectively coordinate time series ensembles when repositioned as intelligent judges with domain-specific finetuning, providing both superior performance and interpretable, causally-grounded explanations.
Abstract: The proliferation of time series foundation models has created a landscape where no single method achieves consistent superiority, framing the central challenge not as finding the best model, but as orchestrating an optimal ensemble with interpretability. While Large Language Models (LLMs) offer powerful reasoning capabilities, their direct application to time series forecasting has proven ineffective. We address this gap by repositioning the LLM as an intelligent judge that evaluates, explains, and strategically coordinates an ensemble of foundation models. To overcome the LLM’s inherent lack of domain-specific knowledge on time series, we introduce an R1-style finetuning process, guided by SHAP-based faithfulness scores, which teaches the model to interpret ensemble weights as meaningful causal statements about temporal dynamics. The trained agent then engages in iterative, multi-turn conversations to perform forward-looking assessments, provide causally-grounded explanations for its weighting decisions, and adaptively refine the optimization strategy. Validated on the GIFT-Eval benchmark on 23 datasets across 97 settings, our approach significantly outperforms leading time series foundation models on both CRPS and MASE metrics, establishing new state-of-the-art results.
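To ground the judge-weighted ensemble and the MASE metric mentioned in the results, here is a small numeric sketch that combines base-model forecasts with fixed weights and scores the result with MASE. The forecasts and weights are toy values; in the paper the weights come from the finetuned LLM judge rather than being hand-set, and CRPS would additionally require probabilistic forecasts.

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error against a (seasonal-)naive baseline with period m."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

# point forecasts from several foundation models for the same horizon (toy numbers)
forecasts = {"model_a": np.array([10.2, 11.0, 11.8]),
             "model_b": np.array([ 9.5, 10.1, 10.6]),
             "model_c": np.array([10.8, 11.9, 13.0])}

# ensemble weights as an LLM judge might assign them (illustrative values)
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
ensemble = sum(w * forecasts[name] for name, w in weights.items())

y_train = np.array([8.0, 8.5, 9.0, 9.6, 10.0])   # history used for the naive scale
y_true = np.array([10.4, 11.1, 11.9])             # realized values
print("ensemble MASE:", round(mase(y_true, ensemble, y_train), 3))
```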
[245] Do Large Language Models Know What They Don’t Know? Kalshibench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets
Lukas Nel
Main category: cs.AI
TL;DR: LLMs show systematic overconfidence when predicting future events, with most performing worse than simple base rate predictions despite scaling and enhanced reasoning.
Details
Motivation: While LLMs achieve strong performance on many tasks, their epistemic calibration (ability to quantify uncertainty about genuinely unknown future events) remains poorly understood and needs evaluation.
Method: Created KalshiBench with 300 prediction market questions from a regulated exchange, with outcomes occurring after training cutoffs. Evaluated five frontier models (Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, Kimi-K2) on calibration metrics including Expected Calibration Error (ECE) and Brier Skill Score.
Result: All models showed systematic overconfidence. Claude Opus 4.5 was best-calibrated but still had substantial errors (ECE=0.120). Reasoning-enhanced models like GPT-5.2-XHigh had worse calibration (ECE=0.395) despite comparable accuracy. Only one model achieved positive Brier Skill Score.
Conclusion: Scaling and enhanced reasoning don’t automatically improve calibration. Epistemic calibration is a distinct capability requiring targeted development beyond standard accuracy improvements.
Abstract: A well-calibrated model should express confidence that matches its actual accuracy – when it claims 80% confidence, it should be correct 80% of the time. While large language models (LLMs) have achieved remarkable performance across diverse tasks, their epistemic calibration remains poorly understood. We introduce \textbf{KalshiBench}, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange, with verifiable real-world outcomes occurring after model training cutoffs. Unlike traditional benchmarks measuring accuracy on static knowledge, KalshiBench evaluates whether models can appropriately quantify uncertainty about genuinely unknown future events. We evaluate five frontier models – Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2 – and find \textbf{systematic overconfidence across all models}. Even the best-calibrated model (Claude Opus 4.5, ECE=0.120) shows substantial calibration errors, while reasoning-enhanced models like GPT-5.2-XHigh exhibit \emph{worse} calibration (ECE=0.395) despite comparable accuracy. Critically, only one model achieves a positive Brier Skill Score, indicating most models perform worse than simply predicting base rates. Our findings suggest that scaling and enhanced reasoning do not automatically confer calibration benefits, highlighting epistemic calibration as a distinct capability requiring targeted development.
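The two calibration metrics reported (ECE and Brier Skill Score) are straightforward to compute; the sketch below shows standard implementations on a toy set of binary prediction-market forecasts. The probabilities and outcomes are made up to illustrate an overconfident forecaster and are not data from KalshiBench.

```python
import numpy as np

def expected_calibration_error(p, y, n_bins=10):
    """Average gap between stated probability and observed frequency, weighted by bin size."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

def brier_skill_score(p, y):
    """Positive only if the forecasts beat always predicting the base rate."""
    brier = np.mean((p - y) ** 2)
    brier_ref = np.mean((y.mean() - y) ** 2)
    return 1.0 - brier / brier_ref

# toy example: a systematically overconfident forecaster
p = np.array([0.9, 0.9, 0.8, 0.8, 0.7, 0.3, 0.2, 0.1])   # stated probabilities of "yes"
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])   # resolved market outcomes
print("ECE:", round(expected_calibration_error(p, y, n_bins=5), 3))
print("Brier skill score:", round(brier_skill_score(p, y), 3))   # negative: worse than base rate
```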
[246] Topic Discovery and Classification for Responsible Generative AI Adaptation in Higher Education
Diane Myung-kyung Woodbridge, Allyson Seba, Freddie Seba, Aydin Schwartz
Main category: cs.AI
TL;DR: Researchers developed an automated system using topic modeling and LLMs to discover and categorize AI policies in course syllabi and institutional websites, achieving high classification accuracy to promote responsible GenAI use in education.
Details
Motivation: As GenAI becomes widely used by students for learning and assignments, there are concerns about misinformation, hallucinations, and undermining critical thinking. Institutions have developed varied AI policies, but their inconsistency and evolving nature create confusion for students about expectations and best practices.Method: The authors designed an automated system combining unsupervised topic modeling to identify key policy themes with large language models (specifically GPT-4.0) to classify the level of GenAI allowance and other requirements in policy texts from course syllabi and institutional websites.
Result: The system achieved a coherence score of 0.73 for topic discovery. GPT-4.0-based classification achieved precision between 0.92 and 0.97 and recall between 0.85 and 0.97 across eight identified policy topics, demonstrating high accuracy in categorizing AI policies.
Conclusion: The automated system provides structured, interpretable policy information to promote safe, equitable, and pedagogically aligned GenAI use in education. It can be integrated into educational technology platforms to help students understand and comply with relevant guidelines, addressing the challenge of inconsistent AI policies across institutions.
Abstract: As generative artificial intelligence (GenAI) becomes increasingly capable of delivering personalized learning experiences and real-time feedback, a growing number of students are incorporating these tools into their academic workflows. They use GenAI to clarify concepts, solve complex problems, and, in some cases, complete assignments by copying and pasting model-generated contents. While GenAI has the potential to enhance learning experience, it also raises concerns around misinformation, hallucinated outputs, and its potential to undermine critical thinking and problem-solving skills. In response, many universities, colleges, departments, and instructors have begun to develop and adopt policies to guide responsible integration of GenAI into learning environments. However, these policies vary widely across institutions and contexts, and their evolving nature often leaves students uncertain about expectations and best practices. To address this challenge, the authors designed and implemented an automated system for discovering and categorizing AI-related policies found in course syllabi and institutional policy websites. The system combines unsupervised topic modeling techniques to identify key policy themes with large language models (LLMs) to classify the level of GenAI allowance and other requirements in policy texts. The developed application achieved a coherence score of 0.73 for topic discovery. In addition, GPT-4.0-based classification of policy categories achieved precision between 0.92 and 0.97, and recall between 0.85 and 0.97 across eight identified topics. By providing structured and interpretable policy information, this tool promotes the safe, equitable, and pedagogically aligned use of GenAI technologies in education. Furthermore, the system can be integrated into educational technology platforms to help students understand and comply with relevant guidelines.
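As a rough illustration of the topic-discovery half of the pipeline, the sketch below fits an LDA model and computes a c_v coherence score with gensim; the paper does not specify its exact topic-modeling stack, so this is one plausible setup rather than the authors' implementation, and the token lists are invented.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Hypothetical tokenized policy snippets from syllabi / policy pages.
texts = [
    ["generative", "ai", "permitted", "citation", "required"],
    ["ai", "tools", "prohibited", "exams", "assignments"],
    ["chatgpt", "allowed", "brainstorming", "disclosure", "required"],
    ["academic", "integrity", "violation", "unauthorized", "ai"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print("topics:", lda.print_topics())
print("coherence (c_v):", coherence)   # the paper reports 0.73 on its real corpus
```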
[247] WeMusic-Agent: Efficient Conversational Music Recommendation via Knowledge Internalization and Agentic Boundary Learning
Wendong Bi, Yirong Mao, Xianglong Liu, Kai Tian, Jian Zhang, Hanjie Wang, Wenhui Que
Main category: cs.AI
TL;DR: WeMusic-Agent is a training framework for LLM-based conversational music recommendation that teaches models to intelligently decide when to use internalized knowledge vs. external tools, achieving significant improvements over existing methods.
Details
Motivation: Existing methods struggle to balance specialized domain knowledge with flexible tool integration in personalized music recommendation, especially in conversational scenarios requiring deep understanding of user preferences and musical context.Method: Proposes WeMusic-Agent framework integrating knowledge internalization and agentic boundary learning. Presents WeMusic-Agent-M1 model that internalizes musical knowledge via continued pretraining on 50B music corpus while learning to invoke external tools when needed. Also constructs a benchmark from real-world WeChat Listen data.
Result: Experiments on real-world data demonstrate significant improvements over existing models. The framework enables comprehensive evaluation across relevance, personalization, and diversity dimensions.
Conclusion: WeMusic-Agent effectively addresses the balance between domain knowledge and tool integration in conversational music recommendation, providing a robust framework and benchmark for future research in this area.
Abstract: Personalized music recommendation in conversational scenarios usually requires a deep understanding of user preferences and nuanced musical context, yet existing methods often struggle with balancing specialized domain knowledge and flexible tool integration. This paper proposes WeMusic-Agent, a training framework for efficient LLM-based conversational music recommendation. By integrating knowledge internalization and agentic boundary learning, the framework aims to teach the model to intelligently decide when to leverage internalized knowledge and when to call specialized tools (e.g., music retrieval APIs, music recommendation systems). Under this framework, we present WeMusic-Agent-M1, an agentic model that internalizes extensive musical knowledge via continued pretraining on a 50B music-related corpus while acquiring the ability to invoke external tools when necessary. Additionally, considering the lack of open-source benchmarks for conversational music recommendation, we also construct a benchmark for personalized music recommendations derived from real-world data in WeChat Listen. This benchmark enables comprehensive evaluation across multiple dimensions, including relevance, personalization, and diversity of the recommendations. Experiments on real-world data demonstrate that WeMusic-Agent achieves significant improvements over existing models.
[248] ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs
Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao
Main category: cs.AI
TL;DR: ToolForge is an automated framework that synthesizes high-quality tool-learning data without real API calls, using virtual tools and multi-hop reasoning to train LLMs that outperform GPT-4o on benchmarks.
Details
Motivation: Existing synthetic data generation pipelines require thousands of real API calls (costly) and lack multi-hop reasoning and self-reflection capabilities, limiting LLM tool-calling performance.Method: ToolForge constructs small number of virtual tools (no real API calls), uses (question, golden context, answer) triples to synthesize multi-hop search data, incorporates multi-hop reasoning and self-reflection, and employs Multi-Layer Validation Framework with rule-based and model-based assessments.
Result: An 8B parameter model trained on ToolForge synthesized data outperforms GPT-4o on multiple benchmarks, demonstrating strong real-world tool-calling performance.
Conclusion: ToolForge enables cost-effective, high-quality tool-learning data synthesis without real API calls, achieving state-of-the-art tool-calling performance through multi-hop reasoning and rigorous validation.
Abstract: Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .
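The shape of a synthesized sample can be sketched directly from the (question, golden context, answer) framing: a multi-hop trajectory over virtual tools plus a reflection, checked by simple rules before any model-based validation. The record fields and the toy validator below are hypothetical, not ToolForge's schema.

```python
import json

def validate(record):
    """Toy rule-based check in the spirit of a multi-layer validation stage:
    every hop must name a virtual tool and carry a query, and the answer must be non-empty."""
    hops_ok = all(h.get("tool") and h.get("query") for h in record["trajectory"])
    return hops_ok and bool(record["answer"])

# Hypothetical synthesized multi-hop sample built from a (question, context, answer) triple.
record = {
    "question": "Which university did the founder of Company X attend?",
    "golden_context": ["Company X was founded by Jane Doe.",
                       "Jane Doe studied at Example University."],
    "trajectory": [
        {"tool": "virtual_search", "query": "founder of Company X",
         "observation": "Company X was founded by Jane Doe."},
        {"tool": "virtual_search", "query": "Jane Doe education",
         "observation": "Jane Doe studied at Example University."},
    ],
    "reflection": "Both hops are supported by the golden context.",
    "answer": "Example University",
}

print(validate(record))
print(json.dumps(record, indent=2)[:200])
```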
[249] Science Consultant Agent
Karthikeyan K, Philip Wu, Xin Tang, Alexandre Alves
Main category: cs.AI
TL;DR: AI tool that helps practitioners select and implement optimal modeling strategies through questionnaire, smart fill, research recommendations, and prototype building.
Details
Motivation: To help practitioners (Product Managers, Software Developers, Researchers) overcome challenges in selecting and implementing appropriate AI modeling strategies, accelerating development and ensuring effective solutions.Method: Web-based AI tool with four core components: Questionnaire (structured input gathering), Smart Fill (automated assistance), Research-Guided Recommendation (literature-backed suggestions), and Prototype Builder (model generation).
Result: A comprehensive pipeline that accelerates AI solution development by guiding practitioners through strategy selection and implementation with structured guidance and automated prototype generation.
Conclusion: The Science Consultant Agent provides an effective framework for democratizing AI development by combining structured guidance, research-backed recommendations, and practical prototyping tools to help diverse practitioners implement optimal modeling strategies.
Abstract: The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
[250] Weighted K-Harmonic Means Clustering: Convergence Analysis and Applications to Wireless Communications
Gourab Ghatak
Main category: cs.AI
TL;DR: WKHM is a regularized K-harmonic means clustering algorithm with numerical stability and soft assignments via inverse-distance weighting, specifically designed for wireless networks where weights correspond to fractional user association based on signal strength.
Details
Motivation: The paper aims to develop a clustering algorithm that can be directly interpreted in wireless network applications, specifically for joint radio node placement and user association, addressing the limitations of classical K-means and constrained K-means in this domain.Method: Proposes Weighted K-Harmonic Means (WKHM) as a regularized variant of K-harmonic means with inverse-distance weighting for soft assignments. The method establishes rigorous convergence guarantees under deterministic (monotone descent to local minimum) and stochastic settings (convergence in probability under BPP initialization and almost sure convergence under mild decay conditions).
Result: WKHM achieves superior tradeoff between minimum signal strength and load fairness compared to classical and modern clustering baselines. The algorithm provides the first stochastic convergence guarantees for harmonic-mean-based clustering and demonstrates practical effectiveness through extensive simulations with diverse user distributions.
Conclusion: WKHM serves as a principled tool for joint radio node placement and user association in wireless networks, offering both theoretical convergence guarantees and practical performance advantages over existing clustering methods in wireless applications.
Abstract: We propose the \emph{weighted K-harmonic means} (WKHM) clustering algorithm, a regularized variant of K-harmonic means designed to ensure numerical stability while enabling soft assignments through inverse-distance weighting. Unlike classical K-means and constrained K-means, WKHM admits a direct interpretation in wireless networks: its weights are exactly equivalent to fractional user association based on received signal strength. We establish rigorous convergence guarantees under both deterministic and stochastic settings, addressing key technical challenges arising from non-convexity and random initialization. Specifically, we prove monotone descent to a local minimum under fixed initialization, convergence in probability under Binomial Point Process (BPP) initialization, and almost sure convergence under mild decay conditions. These results provide the first stochastic convergence guarantees for harmonic-mean-based clustering. Finally, through extensive simulations with diverse user distributions, we show that WKHM achieves a superior tradeoff between minimum signal strength and load fairness compared to classical and modern clustering baselines, making it a principled tool for joint radio node placement and user association in wireless networks.
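A stripped-down version of the iteration conveys the idea: regularized inverse-distance soft memberships (read as fractional user association) followed by membership-weighted centroid updates. The exponent, regularizer, and update below are simplifying assumptions and omit the convergence machinery the paper analyzes.

```python
import numpy as np

def wkhm(points, k, iters=50, eps=1e-6, seed=0):
    """Sketch of a regularized K-harmonic-means-style iteration: soft memberships from
    regularized inverse squared distances, then membership-weighted centroid updates."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # squared distances (n, k), regularized to avoid division by zero
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        member = 1.0 / d2
        member /= member.sum(axis=1, keepdims=True)   # fractional association per user
        centers = (member.T @ points) / member.T.sum(axis=1, keepdims=True)
    return centers, member

# Toy "user" locations; each row of `member` is one user's fractional association
# across the k candidate radio-node positions.
users = np.vstack([np.random.default_rng(1).normal(0, 1, (20, 2)),
                   np.random.default_rng(2).normal(5, 1, (20, 2))])
centers, member = wkhm(users, k=2)
print(centers)
```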
[251] PDE-Agent: A toolchain-augmented multi-agent framework for PDE solving
Jianming Liu, Ren Zhu, Jian Xu, Kun Ding, Xu-Yao Zhang, Gaofeng Meng, Cheng-Lin Liu
Main category: cs.AI
TL;DR: PDE-Agent: A multi-agent LLM framework that automates PDE solving through toolchain collaboration, featuring dynamic planning and centralized resource management.
Details
Motivation: Traditional PDE solving methods require manual setup and domain expertise. While PINNs and frameworks like DeepXDE improved automation, they still depend on expert knowledge and lack full autonomy. The authors aim to create a fully automated PDE solving system from natural language descriptions.Method: PDE-Agent frames PDE solving as tool invocation via LLM-driven agents with two key innovations: 1) Prog-Act framework with graph memory for multi-agent collaboration enabling dynamic planning and error correction via dual-loop mechanisms, and 2) Resource-Pool with tool-parameter separation for multi-tool collaboration to manage runtime artifacts and resolve inter-tool dependencies.
Result: The authors developed PDE-Bench, a multi-type PDE benchmark for agent-based tool collaborative solving, and proposed multi-level metrics for assessing tool coordination. Evaluations show PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks.
Conclusion: PDE-Agent introduces a new paradigm of toolchain-augmented multi-agent PDE solving that advances automated scientific computing by combining LLM reasoning capacity with external tool controllability for fully automated PDE solving from natural language.
Abstract: Solving Partial Differential Equations (PDEs) is a cornerstone of engineering and scientific research. Traditional methods for PDE solving are cumbersome, relying on manual setup and domain expertise. While Physics-Informed Neural Networks (PINNs) introduced end-to-end neural network-based solutions, and frameworks like DeepXDE further enhanced automation, these approaches still depend on expert knowledge and lack full autonomy. In this work, we frame PDE solving as tool invocation via LLM-driven agents and introduce PDE-Agent, the first toolchain-augmented multi-agent collaboration framework, inheriting the reasoning capacity of LLMs and the controllability of external tools and enabling automated PDE solving from natural language descriptions. PDE-Agent leverages the strengths of multi-agent and multi-tool collaboration through two key innovations: (1) A Prog-Act framework with graph memory for multi-agent collaboration, which enables effective dynamic planning and error correction via dual-loop mechanisms (localized fixes and global revisions). (2) A Resource-Pool integrated with a tool-parameter separation mechanism for multi-tool collaboration. This centralizes the management of runtime artifacts and resolves inter-tool dependency gaps in existing frameworks. To validate and evaluate this new paradigm for PDE solving, we develop PDE-Bench, a multi-type PDE benchmark for agent-based tool collaborative solving, and propose multi-level metrics for assessing tool coordination. Evaluations verify that PDE-Agent exhibits superior applicability and performance in complex multi-step, cross-step dependent tasks. This new paradigm of toolchain-augmented multi-agent PDE solving will further advance future developments in automated scientific computing. Our source code and dataset will be made publicly available.
[252] Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan
Main category: cs.AI
TL;DR: SPRITE is a framework that uses simulators and LLMs to programmatically generate scalable, diverse, and high-quality spatial reasoning data for training VLMs, overcoming limitations of traditional template-based or manually annotated datasets.
Details
Motivation: Current approaches to enhancing spatial understanding in VLMs face a dilemma: template-based datasets are scalable but rigid, while manual annotation is diverse but unscalable and computationally imprecise. There's a need for scalable, diverse, and precise spatial reasoning data.Method: SPRITE reframes ground-truth generation as a code-generation task. It uses LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This creates a pipeline for generating computationally precise and verifiable data.
Result: The framework produced a dataset with 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. VLMs trained on this data achieved significant performance gains on multiple spatial benchmarks and outperformed other open-source datasets of equivalent size.
Conclusion: SPRITE successfully overcomes the limitations of traditional methods by providing scalable, diverse, and high-quality spatial reasoning data. The framework demonstrates that overcoming low-diversity in template methods is essential for building robust, generalizable spatial intelligence, and the authors will release the framework and dataset publicly.
Abstract: Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
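The "ground truth as code" idea can be shown in a few lines: the LLM emits a small program, and executing it against simulator meta-information yields a verifiable answer. The scene schema and the question below are invented for illustration and are not SPRITE's actual interface.

```python
# Hypothetical scene meta-information exported from a simulator:
# object poses in a shared world coordinate frame (x, y, z) in meters.
scene = {
    "mug":    {"position": (0.42, 1.10, 0.75)},
    "laptop": {"position": (0.90, 1.05, 0.75)},
    "shelf":  {"position": (0.40, 1.10, 1.60)},
}

# The kind of program an LLM might "compile" for the question
# "Is the mug to the left of the laptop, and how far below the shelf is it?"
def answer(scene):
    mug, laptop, shelf = (scene[k]["position"] for k in ("mug", "laptop", "shelf"))
    left_of = mug[0] < laptop[0]                 # smaller x = further left (assumed convention)
    vertical_gap = round(shelf[2] - mug[2], 2)   # meters below the shelf
    return {"mug_left_of_laptop": left_of, "gap_below_shelf_m": vertical_gap}

print(answer(scene))   # {'mug_left_of_laptop': True, 'gap_below_shelf_m': 0.85}
```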
[253] AlignMerge - Alignment-Preserving Large Language Model Merging via Fisher-Guided Geometric Constraints
Aniruddha Roy, Jyoti Patel, Aman Chadha, Vinija Jain, Amitava Das
Main category: cs.AI
TL;DR: AlignMerge is a geometry-aware LLM merging framework that preserves alignment by explicitly constraining merges to respect safety geometry, preventing alignment degradation while maintaining task performance.
Details
Motivation: Standard LLM merging techniques (weight soups, task vectors, Fisher averaging) can preserve loss metrics while quietly destroying model alignment. Merging should be treated as a geometry-constrained operation around aligned anchors rather than just a numerical trick.Method: AlignMerge operates in a local Fisher chart around an instruction-tuned base model. It estimates an alignment subspace with projector P_A and optimizes a loss function with three components: L_geo (keeps merge close to experts in Fisher-Rao geometry), L_align (penalizes motion along alignment-sensitive directions), and L_bud (enforces soft alignment budget). Uses Alignment Quality Index (AQI) as alignment functional.
Result: Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding best expert on instruction-following, reasoning, and helpfulness. Shows smaller alignment-subspace drift and fewer budget violations than existing methods.
Conclusion: AlignMerge makes alignment-preserving merging a first-class design goal and suggests a path to geometry-aware composition of future foundation models, treating merging as geometry-constrained rather than just numerical optimization.
Abstract: Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc. We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector P_A and optimize: L_AlignMerge = L_geo + lambda_align * L_align + lambda_bud * L_bud, where L_geo keeps the merge close to its experts in Fisher-Rao geometry, L_align penalizes motion along alignment-sensitive directions, and L_bud enforces a soft alignment budget. As the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space. Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path to geometry-aware composition of future foundation models.
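The objective is stated explicitly in the abstract; a toy PyTorch rendering of its three terms is given below. The diagonal Fisher, the projector P_A, and the budget are placeholder values here, since estimating them is precisely what the paper contributes.

```python
import torch

def alignmerge_loss(theta, experts, fisher_diag, P_A, theta_base,
                    lambda_align=1.0, lambda_bud=1.0, budget=0.1):
    """Sketch of L = L_geo + lambda_align * L_align + lambda_bud * L_bud.

    theta:       merged parameters (flattened vector)
    experts:     list of expert parameter vectors
    fisher_diag: diagonal Fisher approximation (same shape as theta)
    P_A:         projector onto the alignment-sensitive subspace
    theta_base:  aligned anchor (instruction-tuned base)
    """
    # L_geo: stay close to the experts under the (diagonal) Fisher metric.
    L_geo = sum(((theta - e) ** 2 * fisher_diag).sum() for e in experts) / len(experts)
    # L_align: penalize motion along alignment-sensitive directions.
    drift = P_A @ (theta - theta_base)
    L_align = (drift ** 2).sum()
    # L_bud: soft budget on the alignment-subspace drift norm.
    L_bud = torch.relu(drift.norm() - budget) ** 2
    return L_geo + lambda_align * L_align + lambda_bud * L_bud

# Toy dimensions only; real models have billions of parameters and a learned P_A.
d = 8
theta_base = torch.zeros(d)
experts = [torch.randn(d) * 0.1, torch.randn(d) * 0.1]
theta = torch.stack(experts).mean(0).requires_grad_(True)
loss = alignmerge_loss(theta, experts, torch.ones(d), torch.eye(d), theta_base)
loss.backward()
print(float(loss))
```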
[254] Learning to Wait: Synchronizing Agents with the Physical World
Yifei She, Ping Zhang, He Liu, Yanmin Jia, Yang Jing, Zijun Liu, Peng Sun, Xiangbin Li, Xiaohe Hu
Main category: cs.AI
TL;DR: LLM agents learn to predict precise waiting times (time.sleep(t)) to synchronize with asynchronous environments, addressing the Temporal Gap between action initiation and completion without constant polling.
Details
Motivation: Real-world agentic tasks involve non-blocking actions with variable latencies, creating a Temporal Gap that existing environment-side solutions handle poorly - either limiting scalability or diluting context windows with redundant observations.Method: Agent-side approach extending Code-as-Action paradigm to temporal domain, using semantic priors and In-Context Learning to predict precise waiting durations (time.sleep(t)) for synchronizing with asynchronous environments.
Result: Experiments in simulated Kubernetes cluster show agents can precisely calibrate internal clocks to minimize both query overhead and execution latency.
Conclusion: Temporal awareness is a learnable capability essential for autonomous evolution in open-ended environments, enabling LLM agents to actively align their cognitive timeline with the physical world.
Abstract: Real-world agentic tasks, unlike synchronous Markov Decision Processes (MDPs), often involve non-blocking actions with variable latencies, creating a fundamental \textit{Temporal Gap} between action initiation and completion. Existing environment-side solutions, such as blocking wrappers or frequent polling, either limit scalability or dilute the agent’s context window with redundant observations. In this work, we propose an \textbf{Agent-side Approach} that empowers Large Language Models (LLMs) to actively align their \textit{Cognitive Timeline} with the physical world. By extending the Code-as-Action paradigm to the temporal domain, agents utilize semantic priors and In-Context Learning (ICL) to predict precise waiting durations (\texttt{time.sleep(t)}), effectively synchronizing with the asynchronous environment without exhaustive checking. Experiments in a simulated Kubernetes cluster demonstrate that agents can precisely calibrate their internal clocks to minimize both query overhead and execution latency, validating that temporal awareness is a learnable capability essential for autonomous evolution in open-ended environments.
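Mechanically, the agent-side pattern reduces to emitting a predicted sleep instead of polling. The sketch below contrasts the two, with a trivial lookup table standing in for the LLM's in-context latency estimate; action names and priors are invented.

```python
import time

def predict_wait_seconds(action):
    """Stand-in for the agent's learned/ICL estimate of an action's latency."""
    priors = {"deploy_pod": 8.0, "scale_deployment": 4.0, "delete_job": 1.0}
    return priors.get(action, 2.0)

def run_with_predicted_wait(start_action, check_status):
    start_action()                                    # non-blocking action (e.g. kubectl apply)
    time.sleep(predict_wait_seconds("deploy_pod"))    # Code-as-Action: emit time.sleep(t)
    return check_status()                             # ideally exactly one status query

def run_with_polling(start_action, check_status, interval=1.0):
    start_action()
    queries = 0
    while True:                                       # baseline: repeated checks dilute the context window
        queries += 1
        if check_status():
            return queries
        time.sleep(interval)
```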
[255] QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems
Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu, Sherman S. M. Chow, Bo Zheng, Xiangyu Yue
Main category: cs.AI
TL;DR: QuadSentinel is a four-agent guardrail system that compiles natural language safety policies into machine-checkable rules and enforces them online for LLM-based agents, improving safety control while reducing false positives.
Details
Motivation: Safety risks emerge as LLM-based agents use tools, multi-step plans, and inter-agent communication. Natural language policies are ambiguous and context-dependent, making them difficult to map to machine-checkable rules, leading to unreliable runtime enforcement.Method: Expresses safety policies as sequents and uses a four-agent guard system: state tracker, policy verifier, threat watcher, and referee. Compiles policies into machine-checkable rules built from predicates over observable state. Uses referee logic plus an efficient top-k predicate updater to prioritize checks and resolve conflicts hierarchically.
Result: On ST-WebAgentBench (ICML CUA ‘25) and AgentHarm (ICLR ‘25), QuadSentinel improves guardrail accuracy and rule recall while reducing false positives. Outperforms single-agent baselines like ShieldAgent (ICML ‘25) for better overall safety control.
Conclusion: Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Provides a practical approach to safety enforcement for LLM-based agent systems.
Abstract: Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA~‘25) and AgentHarm (ICLR~‘25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML~‘25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.
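A policy-as-sequent can be read as "if these premise predicates over observable state hold, then enforce this verdict." The sketch below shows one such machine-checkable rule and a naive stand-in for top-k predicate selection; the real compiler, referee logic, and updater are more involved, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    name: str
    premises: list   # predicate names that must all hold
    verdict: str     # enforcement decision when the sequent fires, e.g. "block"

# Predicates over observable agent/tool state (all hypothetical).
PREDICATES: Dict[str, Callable[[dict], bool]] = {
    "uses_shell_tool": lambda s: s.get("tool") == "shell",
    "touches_secrets": lambda s: "/secrets/" in s.get("argument", ""),
    "user_unverified": lambda s: not s.get("user_verified", False),
}

RULES = [Rule("no_secret_access_unverified",
              ["uses_shell_tool", "touches_secrets", "user_unverified"], "block")]

def check(state, k=2):
    # naive "top-k" stand-in: evaluate the k predicates deemed most relevant first
    ranked = sorted(PREDICATES, key=lambda p: p not in ("touches_secrets", "uses_shell_tool"))
    active = {p: PREDICATES[p](state) for p in ranked[:k]}
    for rule in RULES:
        if all(active.get(p, PREDICATES[p](state)) for p in rule.premises):
            return rule.verdict
    return "allow"

print(check({"tool": "shell", "argument": "cat /secrets/api_key", "user_verified": False}))
```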
[256] OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding
Main category: cs.AI
TL;DR: OS-Oracle introduces a critic model framework for VLM-powered computer-using agents, featuring a scalable GUI critic data pipeline, two-stage training with CP-GRPO, and a cross-platform benchmark, achieving SOTA performance and improving GUI agent performance.
Details
Motivation: VLM-powered computer-using agents face reliability issues in long-horizon workflows where errors accumulate and irreversible actions cause unintended consequences. Current critic models lack diverse GUI feedback data and public benchmarks for step-level evaluation.Method: Three core contributions: (1) scalable data pipeline for synthesizing cross-platform GUI critic data, (2) two-stage training combining supervised fine-tuning and consistency-preserving group relative policy optimization (CP-GRPO), (3) OS-Critic Bench benchmark for evaluating critic models across Mobile, Web, and Desktop platforms.
Result: Created 310k critic samples dataset; OS-Oracle-7B achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, surpasses proprietary models on mobile domain, and improves performance of native GUI agents like UI-TARS-1.5-7B in OSWorld and AndroidWorld environments.
Conclusion: OS-Oracle provides an effective framework for developing critic models that enhance reliability of VLM-powered computer-using agents through scalable data synthesis, advanced training methods, and comprehensive evaluation benchmarks, with open-sourced implementation available.
Abstract: With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle that makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, and surpasses proprietary models on the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in OSWorld and AndroidWorld environments. The code is open-sourced at https://github.com/numbmelon/OS-Oracle.
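Used as a pre-critic, the model simply gates each proposed GUI action before execution. A minimal loop of that kind is sketched below; the critic call, action schema, and retry policy are placeholders, not the OS-Oracle interface.

```python
def run_with_pre_critic(agent_propose, critic_score, execute, max_retries=2,
                        threshold=0.5):
    """Ask the agent for an action, let the critic veto risky steps, retry with the
    critic's feedback, and only execute actions that pass the check."""
    feedback = None
    for _ in range(max_retries + 1):
        action = agent_propose(feedback)         # e.g. {"type": "click", "target": "Delete"}
        score, critique = critic_score(action)   # step-level judgment before execution
        if score >= threshold:
            return execute(action)
        feedback = critique                      # fold the critique back into the next prompt
    return {"status": "aborted", "reason": feedback}
```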
[257] Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
Fanrui Zhang, Qiang Zhang, Sizhuo Zhou, Jianwen Sun, Chuanhao Li, Jiaxin Ai, Yukang Feng, Yujie Zhang, Wenjie Li, Zizhen Li, Yifan Chang, Jiawei Liu, Kaipeng Zhang
Main category: cs.AI
TL;DR: ForenAgent is a multi-round interactive image forgery detection framework that enables MLLMs to autonomously generate, execute, and refine Python-based low-level tools, combining high-level semantic knowledge with low-level forensic analysis through a dynamic reasoning loop.
Details
Motivation: Existing IFD methods either use low-level semantics-agnostic artifacts or rely on MLLMs with high-level semantic knowledge, but these complementary information streams are highly heterogeneous and difficult to unify. There's a need to effectively model cross-level interactions between these approaches.Method: ForenAgent uses a multi-round interactive framework where MLLMs autonomously generate, execute, and iteratively refine Python-based low-level tools. It employs a two-stage training pipeline (Cold Start + Reinforcement Fine-Tuning) and a dynamic reasoning loop with four components: global perception, local focusing, iterative probing, and holistic adjudication. The framework is trained on FABench, a new dataset with 100k images and 200k agent-interaction QA pairs.
Result: Experiments show ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, demonstrating promising capabilities for general-purpose image forgery detection.
Conclusion: ForenAgent successfully bridges the gap between high-level semantic knowledge and low-level forensic analysis, charting a promising route toward general-purpose image forgery detection through autonomous tool generation and iterative refinement.
Abstract: Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.
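The "Python-based low-level tools" the agent writes are recompression, residual, or noise analyses. As one concrete example of that category, the sketch below implements a basic error-level analysis with Pillow; it illustrates the kind of tool an agent might generate and is not code from the paper.

```python
import io
from PIL import Image, ImageChops

def error_level_analysis(path, quality=90, scale=15):
    """Recompress the image as JPEG and return the amplified difference;
    heavily edited regions often show a distinct error level."""
    original = Image.open(path).convert("RGB")
    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    recompressed = Image.open(buf)
    diff = ImageChops.difference(original, recompressed)
    return diff.point(lambda px: min(255, px * scale))

# Example use (hypothetical file names):
# ela = error_level_analysis("suspect.jpg"); ela.save("suspect_ela.png")
```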
[258] Adaptation of Agentic AI
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han
Main category: cs.AI
TL;DR: This paper presents a systematic framework for understanding adaptation strategies in agentic AI systems, categorizing them into agent adaptations and tool adaptations with specific subtypes, providing guidance for system design.
Details
Motivation: As agentic AI systems become more capable and complex, adaptation has emerged as a central mechanism for improving performance, reliability, and generalization. There's a need to unify the rapidly expanding research landscape into a coherent framework to guide researchers and practitioners.Method: The authors develop a systematic framework that spans both agent adaptations and tool adaptations. They further decompose agent adaptations into tool-execution-signaled and agent-output-signaled forms, and tool adaptations into agent-agnostic and agent-supervised forms. The framework clarifies design space, makes trade-offs explicit, and provides practical guidance.
Result: The framework helps organize the research landscape, clarifies design choices, makes trade-offs explicit, and provides practical guidance for selecting or switching among adaptation strategies during system design. Representative approaches in each category are reviewed and analyzed.
Conclusion: The paper offers a conceptual foundation and practical roadmap for building more capable, efficient, and reliable agentic AI systems by providing a systematic understanding of adaptation strategies and their trade-offs.
Abstract: Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
[259] Design and Evaluation of Cost-Aware PoQ for Decentralized LLM Inference
Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang
Main category: cs.AI
TL;DR: Cost-aware Proof of Quality framework for decentralized LLM inference that balances output quality with computational costs in reward mechanisms.
Details
Motivation: Existing verification approaches for decentralized LLM inference don't scale well to modern models, and original PoQ formulations ignore heterogeneous computational costs across different nodes.Method: Introduces cost-aware PoQ with explicit efficiency measurements in reward mechanisms, combining ground truth token-level F1, lightweight learned evaluators, and GPT judgments in unified evaluation pipeline with linear reward function balancing normalized quality and cost.
Result: Semantic textual similarity bi-encoder achieves higher correlation with ground truth and GPT scores than cross-encoders; largest models are most efficient in quality per unit latency; cost-aware reward scheme consistently rewards high-quality low-cost models and penalizes slow low-quality nodes.
Conclusion: Cost-aware PoQ provides practical foundation for economically sustainable decentralized LLM inference by effectively balancing quality and computational costs.
Abstract: Decentralized large language model (LLM) inference promises transparent and censorship resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground truth token level F1, lightweight learned evaluators, and GPT based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost. Experiments on extractive question answering and abstractive summarization use five instruction tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross encoder and bi encoder architectures. Results show that a semantic textual similarity bi encoder achieves much higher correlation with both ground truth and GPT scores than cross encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high quality low cost inference models and to efficient evaluators, while penalizing slow low quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
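The reward design reduces to a linear trade-off between normalized quality and normalized cost. The sketch below scores a toy pool of inference nodes that way; the normalization scheme and the weight alpha are assumptions, since the abstract does not fix them.

```python
def normalize(values):
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def cost_aware_rewards(qualities, latencies, alpha=0.7):
    """Linear reward: alpha * normalized quality - (1 - alpha) * normalized cost."""
    q = normalize(qualities)
    c = normalize(latencies)
    return [alpha * qi - (1 - alpha) * ci for qi, ci in zip(q, c)]

# Toy node pool: (token-level F1 as quality, latency in seconds as cost).
nodes = ["tinyllama-1.1b", "llama-3.2-3b", "slow-low-quality-node"]
qualities = [0.62, 0.78, 0.40]
latencies = [0.9, 1.4, 3.5]
for name, r in zip(nodes, cost_aware_rewards(qualities, latencies)):
    print(f"{name}: reward={r:.3f}")
```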
[260] AI Needs Physics More Than Physics Needs AI
Peter Coveney, Roger Highfield
Main category: cs.AI
TL;DR: The paper argues that current AI’s impact is overhyped and limited, while physics has more to offer AI development than vice versa, proposing “Big AI” as a synthesis of theoretical rigor with ML flexibility.
Details
Motivation: Despite over a decade of hype, AI's measurable impact remains modest outside a few high-profile successes. The paper aims to critique current AI limitations and propose a more rigorous approach that integrates physics principles.Method: The paper reviews critiques of current AI architectures (LLMs, reasoning models, agentic AI), highlights opportunities in quantum AI and analogue computing, and proposes a roadmap for “Big AI” - a synthesis of theory-based rigor with machine learning flexibility.
Result: The analysis identifies key limitations of current AI: dependence on trillions of meaningless parameters, distributional bias, lack of uncertainty quantification, no mechanistic insights, and failure to capture elementary scientific laws.
Conclusion: Physics has more to offer current AI than AI has to offer physics. The paper proposes “Big AI” as a new paradigm that combines theoretical rigor from physics with the flexibility of machine learning to overcome current limitations.
Abstract: Artificial intelligence (AI) is commonly depicted as transformative. Yet, after more than a decade of hype, its measurable impact remains modest outside a few high-profile scientific and commercial successes. The 2024 Nobel Prizes in Chemistry and Physics recognized AI’s potential, but broader assessments indicate the impact to date is often more promotional than technical. We argue that while current AI may influence physics, physics has significantly more to offer this generation of AI. Current architectures - large language models, reasoning models, and agentic AI - can depend on trillions of meaningless parameters, suffer from distributional bias, lack uncertainty quantification, provide no mechanistic insights, and fail to capture even elementary scientific laws. We review critiques of these limits, highlight opportunities in quantum AI and analogue computing, and lay down a roadmap for the adoption of ‘Big AI’: a synthesis of theory-based rigour with the flexibility of machine learning.
[261] PCIA: A Path Construction Imitation Algorithm for Global Optimization
Mohammad-Javad Rezaei, Mozafar Bag-Mohammadi
Main category: cs.AI
TL;DR: PCIA is a new metaheuristic optimization algorithm inspired by human path construction behavior, showing competitive performance against existing algorithms on 66 test problems.
Details
Motivation: The paper aims to develop a novel optimization algorithm inspired by how humans construct and use transportation paths, addressing the need for effective metaheuristic approaches that mimic natural human problem-solving behaviors.Method: PCIA mimics human path construction: 1) uses popular routes, 2) intelligently mixes existing paths when routes are closed, 3) explores randomly to reach unknown destinations. It generates random populations where each particle represents a path, similar to swarm-based algorithms.
Result: Tested on 53 mathematical optimization problems and 13 constrained optimization problems, PCIA demonstrated highly competitive performance compared to both popular and latest metaheuristic algorithms.
Conclusion: PCIA is an effective new metaheuristic optimization algorithm that successfully translates human path construction behaviors into computational optimization, offering competitive performance across diverse problem types.
Abstract: In this paper, a new metaheuristic optimization algorithm, called Path Construction Imitation Algorithm (PCIA), is proposed. PCIA is inspired by how humans construct new paths and use them. Typically, humans prefer popular transportation routes. In the event of a path closure, a new route is built by mixing the existing paths intelligently. Also, humans select different pathways on a random basis to reach unknown destinations. PCIA generates a random population to find the best route toward the destination, similar to swarm-based algorithms. Each particle represents a path toward the destination. PCIA has been tested with 53 mathematical optimization problems and 13 constrained optimization problems. The results showed that the PCIA is highly competitive compared to both popular and the latest metaheuristic algorithms.
[262] Synthelite: Chemist-aligned and feasibility-aware synthesis planning with LLMs
Nguyen Xuan-Vu, Daniel Armstrong, Milena Wehrbach, Andres M Bran, Zlatko Jončev, Philippe Schwaller
Main category: cs.AI
TL;DR: Synthelite is an LLM-powered synthesis planning framework that generates retrosynthetic transformations and allows expert intervention through natural language prompts, achieving up to 95% success rates in constrained synthesis tasks.
Details
Motivation: Existing computer-aided synthesis planning frameworks lack mechanisms for human interaction and integration of chemists' insights, limiting their practical utility as complementary tools for synthetic chemists.Method: Synthelite uses large language models to directly propose retrosynthetic transformations, generating end-to-end synthesis routes by leveraging LLMs’ intrinsic chemical knowledge and reasoning capabilities while allowing expert intervention through natural language prompts.
Result: Synthelite achieves up to 95% success rates in both strategy-constrained and starting-material-constrained synthesis tasks, flexibly adapts to diverse user-specified constraints, and exhibits the ability to account for chemical feasibility during route design.
Conclusion: Synthelite represents both a useful tool for synthetic chemists and a step toward a paradigm where LLMs serve as central orchestrators of synthesis planning, enabling better integration of human expertise through natural language interaction.
Abstract: Computer-aided synthesis planning (CASP) has long been envisioned as a complementary tool for synthetic chemists. However, existing frameworks often lack mechanisms to allow interaction with human experts, limiting their ability to integrate chemists’ insights. In this work, we introduce Synthelite, a synthesis planning framework that uses large language models (LLMs) to directly propose retrosynthetic transformations. Synthelite can generate end-to-end synthesis routes by harnessing the intrinsic chemical knowledge and reasoning capabilities of LLMs, while allowing expert intervention through natural language prompts. Our experiments demonstrate that Synthelite can flexibly adapt its planning trajectory to diverse user-specified constraints, achieving up to 95% success rates in both strategy-constrained and starting-material-constrained synthesis tasks. Additionally, Synthelite exhibits the ability to account for chemical feasibility during route design. We envision Synthelite to be both a useful tool and a step toward a paradigm where LLMs are the central orchestrators of synthesis planning.
[263] TIB AIssistant: a Platform for AI-Supported Research Across Research Life Cycles
Allard Oelen, Sören Auer
Main category: cs.AI
TL;DR: TIB AIssistant is an AI-supported research platform that provides assistance throughout the entire research lifecycle using specialized assistants and external scholarly tools, with data stored as RO-Crate bundles for transparency and reproducibility.
Details
Motivation: The rapid adoption of AI and LLMs is transforming academic research, creating a need for AI-supported platforms that can assist researchers throughout the entire research lifecycle to enhance productivity and innovation.Method: Developed TIB AIssistant as a platform with multiple specialized assistants for different research tasks, integrated with external scholarly services, and using RO-Crate bundles for data storage and export to ensure transparency and reproducibility.
Result: Successfully demonstrated the platform’s functionality through a sequential walk-through where assistants interact to generate sections for a draft research paper, showing practical AI-supported research workflow.
Conclusion: The TIB AIssistant lays the foundation for a community-maintained platform for AI-supported research, addressing the growing need for AI assistance in academic workflows while emphasizing transparency and reproducibility.
Abstract: The rapidly growing popularity of adopting Artificial Intelligence (AI), and specifically Large Language Models (LLMs), is having a widespread impact throughout society, including the academic domain. AI-supported research has the potential to support researchers with tasks across the entire research life cycle. In this work, we demonstrate the TIB AIssistant, an AI-supported research platform providing support throughout the research life cycle. The AIssistant consists of a collection of assistants, each responsible for a specific research task. In addition, tools are provided to give access to external scholarly services. Generated data is stored in the assets and can be exported as an RO-Crate bundle to provide transparency and enhance reproducibility of the research project. We demonstrate the AIssistant’s main functionalities by means of a sequential walk-through of assistants, interacting with each other to generate sections for a draft research paper. In the end, with the AIssistant, we lay the foundation for a larger agenda of providing a community-maintained platform for AI-supported research.
[264] StarCraft+: Benchmarking Multi-agent Algorithms in Adversary Paradigm
Yadong Li, Tong Zhang, Bo Huang, Zhen Cui
Main category: cs.AI
TL;DR: SC2BA is a new multi-agent algorithm-vs-algorithm environment for benchmarking MARL algorithms in adversarial settings, addressing limitations of fixed AI opponents in SMAC.
Details
Motivation: Current MARL benchmarks like SMAC use fixed built-in AI opponents, which lack diversity and versatility for proper algorithm evaluation. There's a need for more realistic adversarial benchmarking.Method: Created SC2BA environment for inter-algorithm adversary with fairness, usability, and customizability. Developed APyMARL library with easy-to-use interfaces. Benchmarked classic MARL algorithms in two adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary.
Result: Extensive benchmark experiments revealed thought-provoking observations about algorithm effectiveness, sensitivity, and scalability. The environment successfully enabled adversarial evaluation of MARL algorithms.
Conclusion: SC2BA marks a significant advancement for MARL benchmarking by enabling algorithm-vs-algorithm adversarial evaluation, providing more realistic and diverse testing scenarios than traditional fixed-AI benchmarks.
Abstract: Deep multi-agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and StarCraft multi-agent challenge (SMAC) is widely-used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built-in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi-agent algorithm-vs-algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter-algorithm adversary with the consideration of fairness, usability and customizability, and meantime an adversarial PyMARL (APyMARL) library is developed with easy-to-use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought-provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released in \href{https://github.com/dooliu/SC2BA}{Github}, and we believe that this work could mark a new step for the MARL field in the coming years.
[265] Towards AI-Supported Research: a Vision of the TIB AIssistant
Sören Auer, Allard Oelen, Mohamad Yaser Jaradeh, Mutahira Khalid, Farhana Keya, Sasi Kiran Gaddipati, Jennifer D’Souza, Lorenz Schlüter, Amirreza Alasti, Gollam Rabby, Azanzi Jiomekong, Oliver Karras
Main category: cs.AI
TL;DR: The paper presents TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to help researchers integrate AI assistants across the entire research lifecycle, addressing challenges like domain specificity, AI literacy, tool coordination, and AI accuracy concerns.
Details
Motivation: While Generative AI and LLMs offer transformative potential for research workflows, effective integration faces challenges: varying domain requirements, limited AI literacy, complex tool/agent coordination, and unclear accuracy of AI in research contexts.Method: Developed a modular platform with components including prompt/tool libraries, shared data store, and flexible orchestration framework. Designed to support researchers across disciplines in tasks spanning ideation, literature analysis, methodology development, data analysis, and scholarly writing.
Result: Created an early prototype demonstrating the feasibility and potential impact of the approach, with conceptual framework, system architecture, and implementation showing how AI can be effectively integrated into scholarly workflows.
Conclusion: The TIB AIssistant platform represents a promising vision for domain-agnostic human-machine collaboration in research, addressing key integration challenges and offering a flexible framework to augment scholarly workflows across the entire research lifecycle.
Abstract: The rapid advancements in Generative AI and Large Language Models promise to transform the way research is conducted, potentially offering unprecedented opportunities to augment scholarly workflows. However, effectively integrating AI into research remains a challenge due to varying domain requirements, limited AI literacy, the complexity of coordinating tools and agents, and the unclear accuracy of Generative AI in research. We present the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to support researchers across disciplines in scientific discovery, with AI assistants supporting tasks across the research life cycle. The platform offers modular components - including prompt and tool libraries, a shared data store, and a flexible orchestration framework - that collectively facilitate ideation, literature analysis, methodology development, data analysis, and scholarly writing. We describe the conceptual framework, system architecture, and implementation of an early prototype that demonstrates the feasibility and potential impact of our approach.
[266] TimeSeries2Report prompting enables adaptive large language model management of lithium-ion batteries
Jiayang Yang, Chunhui Zhao, Martin Guay, Zhixing Cao
Main category: cs.AI
TL;DR: TS2R is a prompting framework that converts battery time-series data into structured reports, enabling LLMs to perform BESS management tasks without retraining.
Details
Motivation: LLMs show promise for interpreting time-series data but haven’t been effectively applied to real-world battery energy storage system (BESS) operation and maintenance.
Method: TS2R framework uses segmentation, semantic abstraction, and rule-based interpretation to encode short-term temporal dynamics into natural language reports from raw battery time-series data (see the code sketch below).
Result: TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics compared to vision-, embedding-, and text-based baselines. TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining.
Conclusion: TS2R establishes a practical path for adaptive, LLM-driven battery intelligence by bridging low-level sensor signals with high-level contextual insights through structured reporting.
Abstract: Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a prompting framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.
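To make the "time series to report" idea concrete, here is a minimal sketch of a segmentation-plus-rules step that turns a raw battery voltage trace into a short natural-language report. The window size, drift threshold, and report wording are illustrative assumptions, not the actual TS2R rules.

```python
# Minimal sketch of a TS2R-style "time series -> report" step.
# Window size, thresholds, and phrasing are illustrative assumptions,
# not the rules used in the paper.
import numpy as np

def timeseries_to_report(voltage, window=60, drift=0.005):
    """Segment a battery voltage trace and describe each segment in text."""
    sentences = []
    for start in range(0, len(voltage) - window + 1, window):
        seg = voltage[start:start + window]
        slope = np.polyfit(np.arange(window), seg, deg=1)[0]  # volts per sample
        if slope > drift:
            trend = "charging (voltage rising)"
        elif slope < -drift:
            trend = "discharging (voltage falling)"
        else:
            trend = "resting (voltage roughly constant)"
        sentences.append(
            f"Samples {start}-{start + window - 1}: mean {seg.mean():.3f} V, "
            f"range {seg.min():.3f}-{seg.max():.3f} V, {trend}."
        )
    return " ".join(sentences)

if __name__ == "__main__":
    t = np.arange(300)
    demo = 3.6 + 0.002 * t + 0.01 * np.random.randn(300)  # synthetic rising trace
    print(timeseries_to_report(demo))
```

A report produced this way can then be placed directly into the LLM prompt in place of the raw numeric trace.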
[267] cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang
Main category: cs.AI
TL;DR: cuPilot is a multi-agent framework using strategy-coordinated evolution to automatically optimize CUDA kernels, achieving 3.09× speedup over PyTorch.
Details
Motivation: CUDA kernel optimization is challenging due to hardware-software co-design expertise requirements and proprietary nature of high-performance libraries. Existing LLM-based approaches with evolutionary algorithms have suboptimal agent designs and mismatched evolution representations.
Method: Proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as intermediate semantic representation for kernel evolution. Includes strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization (see the code sketch below).
Result: Generated kernels achieve an average 3.09× speedup over PyTorch on a 100-kernel benchmark. On GEMM tasks, cuPilot shows sophisticated optimizations and high utilization of critical hardware units.
Conclusion: cuPilot effectively addresses mismatches in existing approaches through strategy-coordinated multi-agent framework, enabling automatic CUDA kernel optimization with significant performance improvements.
Abstract: Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the generated kernels by cuPilot achieve an average speed up of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.
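As a rough illustration of what "roofline-guided" information an optimization agent could be given, the sketch below computes the arithmetic intensity of a GEMM and the corresponding roofline bound. The peak FLOP/s and bandwidth numbers are illustrative placeholders, not figures from the paper.

```python
# Minimal sketch of the roofline arithmetic that roofline-guided prompting could
# embed in an agent's prompt. The peak numbers are illustrative placeholders.
def roofline_bound(flops, bytes_moved, peak_flops=19.5e12, peak_bw=1.55e12):
    """Return (arithmetic intensity, attainable FLOP/s) under the roofline model."""
    intensity = flops / bytes_moved                  # FLOP per byte
    attainable = min(peak_flops, intensity * peak_bw)
    return intensity, attainable

def gemm_stats(m, n, k, dtype_bytes=2):
    """FLOPs and minimum memory traffic for C = A @ B with 2-byte operands."""
    flops = 2 * m * n * k
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return flops, bytes_moved

if __name__ == "__main__":
    fl, by = gemm_stats(4096, 4096, 4096)
    ai, perf = roofline_bound(fl, by)
    print(f"arithmetic intensity = {ai:.1f} FLOP/B, "
          f"roofline bound = {perf / 1e12:.1f} TFLOP/s")
```

Comparing a kernel's measured throughput against this bound tells the agent whether it is compute-bound or memory-bound, which in turn suggests which optimization strategies to try next.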
[268] Quantifying and Bridging the Fidelity Gap: A Decisive-Feature Approach to Comparing Synthetic and Real Imagery
Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong, Xingyu Zhao
Main category: cs.AI
TL;DR: The paper introduces Decisive Feature Fidelity (DFF), a new SUT-specific metric that measures whether autonomous vehicle systems base decisions on the same causal evidence in both real and simulated environments, addressing limitations of pixel-level fidelity.
Details
Motivation: Current virtual testing for AV safety relies on synthetic data, but visual realism alone doesn’t ensure reliable simulation-to-real transfer. The key problem is that pixel-level fidelity doesn’t guarantee the system-under-test uses the same causal evidence for decisions in both domains.
Method: Proposes Decisive Feature Fidelity (DFF) metric that uses explainable-AI methods to identify and compare decisive features driving SUT decisions for matched real-synthetic pairs. Includes practical estimators based on counterfactual explanations and a DFF-guided calibration scheme to enhance simulator fidelity (see the code sketch below).
Result: Experiments on 2126 matched KITTI-VirtualKITTI2 pairs show DFF reveals discrepancies overlooked by conventional output-value fidelity. DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
Conclusion: DFF provides a behavior-grounded fidelity measure that captures mechanism parity - the agreement in causal evidence underlying SUT decisions across domains, addressing a critical gap in AV safety assurance through virtual testing.
Abstract: Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on the same causal evidence in both real and simulated environments - not just whether images “look real” to humans. This paper addresses the lack of such a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity - the agreement in causal evidence underlying the SUT’s decisions across domains. DFF leverages explainable-AI (XAI) methods to identify and compare the decisive features driving the SUT’s outputs for matched real-synthetic pairs. We further propose practical estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
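To give a feel for "mechanism parity", the simplified sketch below compares which input regions drive a model's decision on a matched real/synthetic pair. The paper's estimators use counterfactual explanations; the saliency-overlap proxy here is only an assumed illustration of the idea.

```python
# Simplified sketch of a DFF-style check: do the decisive input regions of the SUT
# coincide on a matched real/synthetic pair? The IoU-of-saliency proxy is an
# illustrative assumption, not the paper's counterfactual-based estimator.
import numpy as np

def decisive_mask(saliency, keep=0.05):
    """Binary mask of the top `keep` fraction of saliency values."""
    threshold = np.quantile(saliency, 1.0 - keep)
    return saliency >= threshold

def decisive_feature_overlap(saliency_real, saliency_synth, keep=0.05):
    """IoU of decisive regions; 1.0 means the SUT relies on the same evidence."""
    a, b = decisive_mask(saliency_real, keep), decisive_mask(saliency_synth, keep)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.random((128, 128))
    synth = 0.7 * real + 0.3 * rng.random((128, 128))  # partially shifted evidence
    print(f"decisive-feature overlap: {decisive_feature_overlap(real, synth):.2f}")
```

A low overlap would flag a pair for which output-value fidelity alone could be misleading.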
[269] Best Practices For Empirical Meta-Algorithmic Research: Guidelines from the COSEAL Research Network
Theresa Eimer, Lennart Schäpermeier, André Biedenkapp, Alexander Tornede, Lars Kotthoff, Pieter Leyman, Matthias Feurer, Katharina Eggensperger, Kaitlin Maile, Tanja Tornede, Anna Kozak, Ke Xue, Marcel Wever, Mitra Baratchi, Damir Pulatov, Heike Trautmann, Haniye Kashgarani, Marius Lindauer
Main category: cs.AI
TL;DR: A comprehensive guide collecting best practices for empirical meta-algorithmic research across COSEAL community subfields, covering the entire experimental cycle from research questions to result presentation.
Details
Motivation: Empirical meta-algorithmic research (algorithm selection, configuration, scheduling) relies on computationally expensive experiments with many potential error sources. Best practices exist but are scattered across different publications and fields, evolving separately, creating a need for unified guidelines.
Method: Collects and synthesizes good practices from across COSEAL community subfields, covering the complete experimental cycle: formulating research questions, selecting experimental designs, executing experiments, and analyzing/presenting results impartially.
Result: Establishes current state-of-the-art practices for meta-algorithmic research, creating a comprehensive guideline that serves both new researchers and practitioners in meta-algorithmic fields.
Conclusion: This report provides unified, community-wide best practices to improve scalability, validity, and reproducibility of empirical meta-algorithmic research by addressing common error sources and establishing standardized experimental methodologies.
Abstract: Empirical research on meta-algorithmics, such as algorithm selection, configuration, and scheduling, often relies on extensive and thus computationally expensive experiments. With the large degree of freedom we have over our experimental setup and design comes a plethora of possible error sources that threaten the scalability and validity of our scientific insights. Best practices for meta-algorithmic research exist, but they are scattered between different publications and fields, and continue to evolve separately from each other. In this report, we collect good practices for empirical meta-algorithmic research across the subfields of the COSEAL community, encompassing the entire experimental cycle: from formulating research questions and selecting an experimental design, to executing experiments, and ultimately, analyzing and presenting results impartially. It establishes the current state-of-the-art practices within meta-algorithmic research and serves as a guideline to both new researchers and practitioners in meta-algorithmic fields.
[270] ParamExplorer: A framework for exploring parameters in generative art
Julien Gachadoat, Guillaume Lagarde
Main category: cs.AI
TL;DR: ParamExplorer: Interactive framework for exploring generative art parameter spaces using RL-inspired agents with human/automated feedback.
Details
Motivation: Generative art systems have high-dimensional parameter spaces where aesthetically compelling outputs are rare and fragmented, requiring extensive manual trial-and-error that leaves many interesting configurations undiscovered.
Method: Introduces ParamExplorer - an interactive, modular framework inspired by reinforcement learning that helps explore parameter spaces in generative art algorithms, guided by human-in-the-loop or automated feedback. The framework integrates with existing p5.js projects and implements several exploration strategies (agents); see the code sketch below.
Result: The paper presents the ParamExplorer framework and evaluates several exploration strategies (agents) within this framework, though specific evaluation results are not detailed in the abstract.
Conclusion: ParamExplorer provides a systematic approach to exploring complex generative art parameter spaces, reducing reliance on manual trial-and-error and enabling discovery of interesting configurations through guided exploration strategies.
Abstract: Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that helps the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing p5.js projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
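The toy sketch below shows one possible feedback-guided exploration loop in the spirit of ParamExplorer. The epsilon-greedy agent and the scoring callback are generic illustrations; the paper's agents and its p5.js integration are not reproduced here.

```python
# Toy sketch of feedback-guided parameter exploration. The agent and score
# function are illustrative assumptions, not the framework's actual agents.
import random

def explore(score_fn, n_params=4, steps=200, eps=0.3, sigma=0.1, seed=0):
    """Hill-climb a parameter vector in [0, 1]^n using (human or automated) feedback."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(n_params)]
    best_score = score_fn(best)
    for _ in range(steps):
        if rng.random() < eps:                       # explore: random restart
            candidate = [rng.random() for _ in range(n_params)]
        else:                                        # exploit: perturb current best
            candidate = [min(1.0, max(0.0, p + rng.gauss(0, sigma))) for p in best]
        s = score_fn(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

if __name__ == "__main__":
    # Stand-in for aesthetic feedback: prefer parameters near a target region.
    target = [0.2, 0.8, 0.5, 0.9]
    score = lambda p: -sum((a - b) ** 2 for a, b in zip(p, target))
    print(explore(score))
```

In an interactive setting the `score_fn` would come from user ratings of rendered sketches rather than a fixed target.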
[271] Scaling Laws for Energy Efficiency of Local LLMs
Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel, Román Orús
Main category: cs.AI
TL;DR: Systematic benchmarking reveals CPU-only inference scaling laws for local LLMs/VLMs: linear compute-token scaling for LLMs, resolution knee for VLMs, with quantum-inspired compression reducing compute/memory by ~72% and energy by ~62%.
Details
Motivation: Most consumer hardware relies on CPUs for AI deployment, but computational laws for CPU-only inference of local language and vision-language models remain unexplored, creating a gap in understanding edge deployment trade-offs.
Method: Systematic benchmarking of LLMs/VLMs on two CPU tiers (MacBook Pro M2 and Raspberry Pi 5) using continuous sampling of processor/memory usage with AUC integration to characterize computational scaling with input text length and image resolution (see the code sketch below).
Result: Two empirical scaling laws: (1) LLM compute scales linearly with token length; (2) VLMs show a preprocessing-driven “resolution knee” where compute remains constant above an internal resolution clamp. Quantum-inspired compression reduces processor/memory usage by up to 71.9% and energy by up to 62% while preserving accuracy.
Conclusion: Provides systematic quantification of multimodal CPU-only scaling for local AI workloads, identifying model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference on consumer hardware.
Abstract: Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware – including laptops, desktops, industrial controllers, and embedded systems – relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven “resolution knee”, where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
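The sketch below illustrates the kind of sampling-plus-AUC accounting described in the methodology: sample CPU utilisation at a fixed period while a workload runs, then integrate the samples over time with the trapezoidal rule. It assumes psutil is available; the sampling period and the dummy workload are illustrative choices, not the paper's setup.

```python
# Minimal sketch of AUC-style CPU accounting for a local inference workload.
# psutil is assumed to be installed; period and workload are illustrative.
import threading
import time
import psutil

def measure_cpu_auc(workload, period=0.1):
    """Run workload() in a thread and integrate CPU utilisation over its runtime."""
    psutil.cpu_percent(interval=None)               # prime the per-call counter
    times, samples = [0.0], [0.0]
    start = time.perf_counter()
    worker = threading.Thread(target=workload)
    worker.start()
    while worker.is_alive():
        time.sleep(period)
        times.append(time.perf_counter() - start)
        samples.append(psutil.cpu_percent(interval=None))
    worker.join()
    # Trapezoidal rule over the sampled utilisation curve (units: percent * seconds).
    auc = sum((times[i] - times[i - 1]) * (samples[i] + samples[i - 1]) / 2
              for i in range(1, len(times)))
    return times[-1], auc

if __name__ == "__main__":
    busy = lambda: sum(i * i for i in range(5_000_000))  # stand-in for local inference
    elapsed, auc = measure_cpu_auc(busy)
    print(f"elapsed {elapsed:.2f} s, CPU-utilisation AUC {auc:.1f} %*s")
```

Repeating this measurement across input lengths or image resolutions yields the data points from which scaling curves like those in the paper can be fitted.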
[272] From Personalization to Prejudice: Bias and Discrimination in Memory-Enhanced AI Agents for Recruitment
Himanshu Gharat, Himanshi Agrawal, Gourab K. Patro
Main category: cs.AI
TL;DR: Memory-enhanced LLM agents introduce systematic bias through personalization, demonstrated in recruitment simulations with safety-trained models.
Details
Motivation: While memory-enhanced personalization in LLM agents offers benefits like continuity and improved relevance, it introduces unexplored risks of bias amplification that need investigation.
Method: Simulated behavior of memory-enhanced personalized agents using recruitment as a case study, analyzing bias introduction and amplification across various operational stages with safety-trained LLMs.
Result: Experiments reveal that bias is systematically introduced and reinforced through personalization in memory-enhanced agents, even with safety-trained LLMs.
Conclusion: Memory-enhanced personalized LLM agents require additional protective measures or guardrails to mitigate systematic bias introduced through personalization.
Abstract: Large Language Models (LLMs) have empowered AI agents with advanced capabilities for understanding, reasoning, and interacting across diverse tasks. The addition of memory further enhances them by enabling continuity across interactions, learning from past experiences, and improving the relevance of actions and responses over time; termed as memory-enhanced personalization. Although such personalization through memory offers clear benefits, it also introduces risks of bias. While several previous studies have highlighted bias in ML and LLMs, bias due to memory-enhanced personalized agents is largely unexplored. Using recruitment as an example use case, we simulate the behavior of a memory-enhanced personalized agent, and study whether and how bias is introduced and amplified in and across various stages of operation. Our experiments on agents using safety-trained LLMs reveal that bias is systematically introduced and reinforced through personalization, emphasizing the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
[273] Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang
Main category: cs.AI
TL;DR: Needle in the Web is a new benchmark for evaluating LLM search agents on fuzzy exploratory search - ambiguous, multifaceted queries where users seek relevant webpages rather than single factual answers.
Details
Motivation: Existing benchmarks focus on complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search - vague, multifaceted queries where users want the most relevant webpage rather than a single factual answer.
Method: Created Needle in the Web benchmark with 663 questions across 7 domains, using a flexible methodology to generate queries of controllable difficulty based on factual claims from web content.
Result: Benchmarked 3 leading LLMs and 3 agent-based search systems, finding most struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels.
Conclusion: Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
Abstract: Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
[274] Implementing a Sharia Chatbot as a Consultation Medium for Questions About Islam
Wisnu Uriawan, Aria Octavian Hamza, Ade Ripaldi Nuralim, Adi Purnama, Ahmad Juaeni Yunus, Anissya Auliani Supriadi Putri
Main category: cs.AI
TL;DR: Implementation of a Sharia-compliant chatbot using Q-Learning and Sentence-Transformers for Islamic question consultation, achieving 87% semantic accuracy on 25,000 QA pairs.
Details
Motivation: To create an interactive medium for consulting Islamic questions that enhances religious literacy, digital da’wah, and access to verified Islamic knowledge in the Industry 4.0 era, bridging traditional Islamic scholarship with modern AI technology.
Method: Uses Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding, following CRISP-DM methodology. Processes 25,000 curated Islam QA pairs from Qur’an, Hadith, and scholarly fatwas in JSON format. Built with Flask API backend and Flutter mobile frontend (see the code sketch below).
Result: Achieves 87% semantic accuracy in functional testing across diverse Islamic topics (fiqh, aqidah, ibadah, muamalah). Successfully demonstrates potential for enhancing religious consultation and knowledge access.
Conclusion: The chatbot effectively handles closed-domain Islamic queries but has limitations including static learning and dataset dependency. Future enhancements should include continuous adaptation and multi-turn conversation support to better bridge traditional Islamic scholarship with modern AI-driven consultation.
Abstract: This research presents the implementation of a Sharia-compliant chatbot as an interactive medium for consulting Islamic questions, leveraging Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding to ensure contextual and accurate responses. Utilizing the CRISP-DM methodology, the system processes a curated Islam QA dataset of 25,000 question-answer pairs from authentic sources like the Qur’an, Hadith, and scholarly fatwas, formatted in JSON for flexibility and scalability. The chatbot prototype, developed with a Flask API backend and Flutter-based mobile frontend, achieves 87% semantic accuracy in functional testing across diverse topics including fiqh, aqidah, ibadah, and muamalah, demonstrating its potential to enhance religious literacy, digital da’wah, and access to verified Islamic knowledge in the Industry 4.0 era. While effective for closed-domain queries, limitations such as static learning and dataset dependency highlight opportunities for future enhancements like continuous adaptation and multi-turn conversation support, positioning this innovation as a bridge between traditional Islamic scholarship and modern AI-driven consultation.
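The sketch below shows the semantic-matching step only: embed the user's question with Sentence-Transformers and retrieve the closest curated QA pair. The model name, the two toy QA pairs, and the similarity threshold are illustrative assumptions, and the Q-Learning policy over responses is omitted.

```python
# Minimal sketch of semantic QA retrieval with sentence-transformers.
# Model name, toy QA pairs, and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

qa_pairs = [
    {"q": "How many rakaat are in the Maghrib prayer?",
     "a": "Maghrib consists of three rakaat."},
    {"q": "What is zakat?",
     "a": "Zakat is the obligatory almsgiving on qualifying wealth."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode([p["q"] for p in qa_pairs], convert_to_tensor=True)

def answer(question, threshold=0.5):
    query_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return "Please consult a qualified scholar for this question."
    return qa_pairs[best]["a"]

if __name__ == "__main__":
    print(answer("How many rakaat does Maghrib have?"))
```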
[275] Prefix Probing: Lightweight Harmful Content Detection for Large Language Models
Jirui Yang, Hengqi Guo, Zhihui Lu, Yi Zhao, Yuansen Zhang, Shijing Hu, Qiang Duan, Yinggui Wang, Tao Wei
Main category: cs.AI
TL;DR: Prefix Probing is a black-box harmful content detection method that uses prefix probability comparisons with caching to achieve near first-token latency, balancing accuracy, latency, and cost trade-offs.
Details
Motivation: LLMs face a three-way trade-off among detection accuracy, inference latency, and deployment cost in safety-sensitive applications, creating a need for efficient detection methods.
Method: Compares conditional log-probabilities of “agreement/execution” vs “refusal/safety” opening prefixes using prefix caching. Uses single log-probability computation over probe prefixes to produce harmfulness score. Includes efficient prefix construction algorithm to discover highly informative prefixes automatically (see the code sketch below).
Result: Achieves detection effectiveness comparable to mainstream external safety models with minimal computational cost and no extra model deployment. Reduces detection overhead to near first-token latency.
Conclusion: Prefix Probing offers strong practicality and efficiency for harmful content detection in LLMs, solving the accuracy-latency-cost trade-off through prefix-based black-box detection with caching.
Abstract: Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of “agreement/execution” versus “refusal/safety” opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
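A rough illustration of the probing idea with an open causal LM is sketched below: score a prompt by comparing the log-probability of a refusal opening against an agreement opening. The model name, both prefix strings, and the sign convention of the score are illustrative assumptions; the paper additionally searches for informative prefixes and relies on prefix caching for low latency.

```python
# Rough sketch of prefix probing with a Hugging Face causal LM.
# Model, prefixes, and score sign convention are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; any causal LM exposing token log-probs would do
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def prefix_logprob(prompt: str, prefix: str) -> float:
    """Sum of log P(prefix tokens | prompt); assumes the prompt/prefix token
    boundary is stable, which holds when the prefix starts with a space."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(n_prompt, full.shape[1]):
        total += logprobs[0, pos - 1, full[0, pos]].item()
    return total

def harmfulness_score(prompt: str) -> float:
    refuse = prefix_logprob(prompt, " I'm sorry, but I can't help with that.")
    agree = prefix_logprob(prompt, " Sure, here is how to do that:")
    # Assumed sign convention: the more the model prefers refusing over complying,
    # the more the request is treated as harmful.
    return refuse - agree

if __name__ == "__main__":
    print(harmfulness_score("User: How do I bake sourdough bread?\nAssistant:"))
```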
[276] Comprehensive AI Literacy: The Case for Centering Human Agency
Sri Yash Tadimalla, Justin Cary, Gordon Hull, Jordan Register, Daniel Maxwell, David Pugalee, Tina Heafner
Main category: cs.AI
TL;DR: This position paper argues for shifting AI education from functional skills to comprehensive literacy centered on human agency, critical thinking, and ethical reasoning.
Details
Motivation: Current AI education frameworks fail to address the dangerous literacy gap where functional skills eclipse critical and ethical reasoning. There’s an urgent need to prepare all educational stakeholders to make intentional, responsible choices about AI technologies.
Method: The paper proposes a systemic shift toward comprehensive AI literacy frameworks (AI Literacy, Fluency, and Competency) that center human agency. This involves teaching about agency itself and framing technology as a choice rather than an inevitability.
Result: The proposed frameworks would enable educators and students to become agents in their own human-centric approaches to AI, providing pathways to articulate intentions, make informed decisions, and understand the impact of AI choices on academic work, careers, and society.
Conclusion: True AI literacy requires moving beyond functional skills to develop critical thinking, ethical reasoning, and human agency, empowering all educational stakeholders to make intentional choices about AI technologies rather than passively adopting them.
Abstract: The rapid assimilation of Artificial Intelligence technologies into various facets of society has created a significant educational imperative that current frameworks are failing to effectively address. We are witnessing the rise of a dangerous literacy gap, where a focus on the functional, operational skills of using AI tools is eclipsing the development of critical and ethical reasoning about them. This position paper argues for a systemic shift toward comprehensive AI literacy that centers human agency - the empowered capacity for intentional, critical, and responsible choice. This principle applies to all stakeholders in the educational ecosystem: it is the student’s agency to question, create with, or consciously decide not to use AI based on the task; it is the teacher’s agency to design learning experiences that align with instructional values, rather than ceding pedagogical control to a tool. True literacy involves teaching about agency itself, framing technology not as an inevitability to be adopted, but as a choice to be made. This requires a deep commitment to critical thinking and a robust understanding of epistemology. Through the AI Literacy, Fluency, and Competency frameworks described in this paper, educators and students will become agents in their own human-centric approaches to AI, providing necessary pathways to clearly articulate the intentions informing decisions and attitudes toward AI and the impact of these decisions on academic work, career, and society.
[277] Unsupervised Thematic Clustering Of hadith Texts Using The Apriori Algorithm
Wisnu Uriawan, Achmad Ajie Priyajie, Angga Gustian, Fikri Nur Hidayat, Sendi Ahmad Rafiudin, Muhamad Fikri Zaelani
Main category: cs.AI
TL;DR: The paper applies Apriori algorithm for unsupervised thematic grouping of hadith texts, identifying semantic relationships like rakaat-prayer, verse-revelation, and hadith-story associations.
Details
Motivation: To automate thematic grouping of hadith texts in response to the digitalization of Islamic texts, addressing the need for automated analysis of unlabeled religious text data.
Method: Used unsupervised learning with Apriori algorithm on Indonesian Translation of Bukhari hadith. Preprocessing included case folding, punctuation cleaning, tokenization, stopword removal, and stemming. Association rule mining with support, confidence, and lift parameters (see the code sketch below).
Result: Identified meaningful association patterns: rakaat-prayer (worship theme), verse-revelation (revelation theme), and hadith-story (narration theme). Demonstrated Apriori algorithm’s ability to uncover latent semantic relationships.
Conclusion: The Apriori algorithm effectively automates thematic grouping of hadith texts, contributing to digital Islamic studies and technology-based learning systems by revealing semantic relationships in unlabeled religious text data.
Abstract: This research stems from the urgency to automate the thematic grouping of hadith in line with the growing digitalization of Islamic texts. Based on a literature review, the unsupervised learning approach with the Apriori algorithm has proven effective in identifying association patterns and semantic relations in unlabeled text data. The dataset used is the Indonesian Translation of the hadith of Bukhari, which first goes through preprocessing stages including case folding, punctuation cleaning, tokenization, stopword removal, and stemming. Next, an association rule mining analysis was conducted using the Apriori algorithm with support, confidence, and lift parameters. The results show the existence of meaningful association patterns such as the relationship between rakaat-prayer, verse-revelation, and hadith-story, which describe the themes of worship, revelation, and hadith narration. These findings demonstrate that the Apriori algorithm has the ability to automatically uncover latent semantic relationships, while contributing to the development of digital Islamic studies and technology-based learning systems.
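The sketch below computes the association metrics the Apriori step would report (support, confidence, lift) on already-preprocessed, tokenised text. For brevity it enumerates token pairs directly rather than performing the level-wise Apriori candidate generation, and the toy token lists and thresholds are illustrative stand-ins for the Bukhari dataset.

```python
# Minimal sketch of association-rule metrics on preprocessed hadith tokens.
# Brute-force pair mining stands in for full Apriori; data is illustrative.
from itertools import combinations

transactions = [
    {"rakaat", "prayer", "dawn"},
    {"rakaat", "prayer", "evening"},
    {"verse", "revelation", "prophet"},
    {"verse", "revelation", "night"},
    {"hadith", "story", "companion"},
    {"hadith", "story", "prophet"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.25
items = sorted({token for t in transactions for token in t})
for a, b in combinations(items, 2):
    pair_support = support({a, b})
    if pair_support < min_support:
        continue
    confidence = pair_support / support({a})
    lift = confidence / support({b})
    print(f"{a} -> {b}: support={pair_support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

On this toy data the rakaat-prayer pair comes out with high confidence and lift, mirroring the kind of worship-theme association the paper reports.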
[278] Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning
Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Mohammad Nehad Alam, Proma Hossain Progga, Swakkhar Shatabda
Main category: cs.AI
TL;DR: Multi-agent pipelines improve performance for open-source MLLMs on geometry problem solving benchmarks, but benefits vary for closed-source models.
Details
Motivation: To systematically compare single-agent vs multi-agent designs for diagram-grounded geometry problem solving, as the benefits of multi-agent approaches remain unclear for multimodal large language models.
Method: Systematic comparison of single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math, using both open-source (Qwen-2.5-VL variants) and closed-source (Gemini-2.0-Flash) models.
Result: For open-source models, multi-agent consistently improves performance (e.g., Qwen-2.5-VL 7B gains +6.8 points, 32B gains +3.3 on Geometry3K). For closed-source Gemini-2.0-Flash, single-agent generally performs better on classic benchmarks, with only modest multi-agent improvements on newer We-Math dataset.
Conclusion: Multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal for all models and benchmarks.
Abstract: Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
[279] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Main category: cs.AI
TL;DR: Generative Adversarial Reasoner (GAR) is an adversarial reinforcement learning framework that co-evolves an LLM reasoner and discriminator to improve mathematical reasoning by providing dense step-level rewards.
Details
Motivation: LLMs with reasoning capabilities still make process errors like incorrect calculations, brittle logic, and invalid steps. Current methods lack effective step-level feedback for improving reasoning quality.
Method: On-policy joint training framework with LLM reasoner and LLM-based discriminator. Uses compute-efficient review schedule to partition reasoning chains into slices. Discriminator evaluates slice soundness with structured justifications. Both models learn from complementary reward signals: reasoner rewarded for consistent steps leading to correct answers, discriminator rewarded for error detection (see the code sketch below).
Result: Consistent gains across mathematical benchmarks. On AIME24: DeepSeek-R1-Distill-Qwen-7B improved from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). Produces dense, well-calibrated step-level rewards improving credit assignment and sample efficiency.
Conclusion: GAR framework effectively enhances LLM reasoning through adversarial co-evolution, providing superior step-level feedback compared to sparse exact-match signals. Modular discriminator enables flexible reward shaping for various objectives like teacher distillation and preference alignment.
Abstract: Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice’s soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
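The schematic sketch below shows the reward bookkeeping such a setup implies: split a reasoning chain into slices, let a discriminator judge each slice, and combine the dense step-level signal with the sparse exact-match signal. The slicing rule, reward weights, and stub discriminator are illustrative assumptions, not the paper's exact formulation.

```python
# Schematic sketch of GAR-style reward bookkeeping; weights and slicing rule
# are illustrative assumptions.
def split_into_slices(steps, slices=4):
    """Partition reasoning steps into contiguous chunks of similar length."""
    size = max(1, len(steps) // slices)
    return [steps[i:i + size] for i in range(0, len(steps), size)]

def reasoner_reward(step_verdicts, answer_correct, w_step=0.5, w_answer=1.0):
    """Dense reward: fraction of slices judged sound, plus the sparse answer signal."""
    step_score = sum(step_verdicts) / len(step_verdicts) if step_verdicts else 0.0
    return w_step * step_score + w_answer * float(answer_correct)

def discriminator_reward(step_verdicts, true_labels):
    """Discriminator is rewarded for correctly flagging unsound slices."""
    hits = sum(int(v == t) for v, t in zip(step_verdicts, true_labels))
    return hits / len(true_labels)

if __name__ == "__main__":
    steps = ["define variables", "set up equation", "solve for x", "check result"]
    verdicts = [True, False]            # stub for the LLM discriminator's judgments
    print(split_into_slices(steps, slices=2))
    print("reasoner reward:", reasoner_reward(verdicts, answer_correct=True))
    print("discriminator reward:", discriminator_reward(verdicts, [True, True]))
```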
[280] Cyber Humanism in Education: Reclaiming Agency through AI and Learning Sciences
Giovanni Adorni
Main category: cs.AI
TL;DR: Proposes Cyber Humanism in Education framework to reclaim human agency in AI-enabled learning, positioning educators/learners as algorithmic citizens with rights/responsibilities to shape socio-technical infrastructures.
Details
Motivation: GenAI is reshaping knowledge production/validation in education, raising concerns about epistemic automation, cognitive offloading, and teacher de-professionalization. Need to reclaim human agency in AI-enabled learning environments.
Method: Proposes Cyber Humanism in Education framework with three pillars: reflexive competence, algorithmic citizenship, and dialogic design. Presents higher-education case studies using prompt-based learning and Conversational AI Educator certification within EPICT ecosystem.
Result: Findings show practices can strengthen epistemic agency while surfacing tensions around workload, equity, and governance. Outlines implications for future AI-rich, human-centered education.
Conclusion: Cyber Humanism provides a framework for human agency in AI-enabled education, positioning educators/learners as algorithmic citizens who co-author socio-technical infrastructures, balancing technological integration with human-centered values.
Abstract: Generative Artificial Intelligence (GenAI) is rapidly reshaping how knowledge is produced and validated in education. Rather than adding another digital tool, large language models reconfigure reading, writing, and coding into hybrid human-AI workflows, raising concerns about epistemic automation, cognitive offloading, and the de-professionalisation of teachers. This paper proposes Cyber Humanism in Education as a framework for reclaiming human agency in this landscape. We conceptualise AI-enabled learning environments as socio-technical infrastructures co-authored by humans and machines, and position educators and learners as epistemic agents and algorithmic citizens who have both the right and the responsibility to shape these infrastructures. We articulate three pillars for cyber-humanist design - reflexive competence, algorithmic citizenship, and dialogic design - and relate them to major international digital and AI competence frameworks. We then present higher-education case studies that operationalise these ideas through prompt-based learning and a new Conversational AI Educator certification within the EPICT ecosystem. The findings show how such practices can strengthen epistemic agency while surfacing tensions around workload, equity, and governance, and outline implications for the future of AI-rich, human-centred education.
[281] Dual Computational Horizons: Incompleteness and Unpredictability in Intelligent Systems
Abhisek Ganguly
Main category: cs.AI
TL;DR: Algorithmic intelligence faces fundamental limits from formal incompleteness and dynamical unpredictability, preventing agents from computing their own maximal prediction horizons.
Details
Motivation: To understand the fundamental computational constraints on algorithmic intelligence by examining how formal incompleteness (deductive limitations) and dynamical unpredictability (prediction limitations) interact to bound an agent’s self-analytical capabilities.
Method: Formalizes two independent computational limitations: formal incompleteness (limiting deductive power of consistent reasoning systems) and dynamical unpredictability (bounding long-term prediction under finite precision). Analyzes how these constraints together impose structural bounds on agents’ self-referential reasoning about their predictive capabilities.
Result: Shows that algorithmic agents cannot generally compute their own maximal prediction horizon due to the combined effects of incompleteness and unpredictability. The interaction of these two extrema creates inherent limitations on self-analysis.
Conclusion: Clarifies inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems, revealing fundamental structural bounds on what algorithmic agents can know about their own predictive capabilities.
Abstract: We formalize two independent computational limitations that constrain algorithmic intelligence: formal incompleteness and dynamical unpredictability. The former limits the deductive power of consistent reasoning systems while the latter bounds long-term prediction under finite precision. We show that these two extrema together impose structural bounds on an agent’s ability to reason about its own predictive capabilities. In particular, an algorithmic agent cannot, in general, compute its own maximal prediction horizon. This perspective clarifies inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems.
[282] Discovering and Learning Probabilistic Models of Black-Box AI Capabilities
Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma, YooJung Choi, Siddharth Srivastava
Main category: cs.AI
TL;DR: Learning interpretable symbolic models (PDDL-style) of black-box AI systems’ planning capabilities using Monte-Carlo tree search for systematic testing and hypothesis pruning.
Details
Motivation: Black-box AI systems like foundational models are increasingly used for sequential decision making, requiring safe deployment through interpretable representations of their capabilities.
Method: Uses PDDL-style representations to model BBAI planning capabilities, employing Monte-Carlo tree search to systematically create test tasks, acquire data, and prune hypothesis space of possible symbolic models.
Result: Learned models describe BBAI capabilities, execution conditions, possible outcomes with probabilities. Theoretical results show soundness, completeness, convergence. Empirical results demonstrate scope, efficiency, accuracy across multiple BBAI systems.
Conclusion: PDDL-style representations can efficiently learn interpretable models of black-box AI planning capabilities, providing safe and verifiable deployment through systematic testing and hypothesis pruning.
Abstract: Black-box AI (BBAI) systems such as foundational models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI’s capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI’s planning capabilities. It uses the Monte-Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI’s capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.
[283] AI-Driven Prediction of Cancer Pain Episodes: A Hybrid Decision Support Approach
Yipeng Zhuang, Yifeng Guo, Yuewen Li, Yuheng Wu, Philip Leung-Ho Yu, Tingting Song, Zhiyong Wang, Kunzhong Zhou, Weifang Wang, Li Zhuang
Main category: cs.AI
TL;DR: Hybrid ML+LLM pipeline predicts lung cancer breakthrough pain episodes 48-72h in advance using EHR data, achieving 87-92% accuracy with improved sensitivity.
Details
Motivation: Lung cancer patients experience frequent breakthrough pain (up to 91% incidence) requiring timely intervention. Current reactive approaches miss opportunities for proactive pain management, leading to suboptimal patient outcomes and resource utilization.
Method: Hybrid pipeline combining machine learning and large language models. ML captures temporal medication trends from structured EHR data (demographics, tumor stage, vital signs, WHO-tiered analgesic use). LLM interprets ambiguous dosing records and free-text clinical notes. Retrospective analysis of 266 inpatients.
Result: Achieved accuracy of 0.874 (48h prediction) and 0.917 (72h prediction). LLM augmentation improved sensitivity by 8.6% (48h) and 10.4% (72h). Framework provides clinically interpretable predictions with enhanced sensitivity over ML-only approaches.
Conclusion: The hybrid ML+LLM approach offers a clinically interpretable, scalable tool for early pain episode forecasting in lung cancer patients. This enables proactive pain management, potentially improving treatment precision and optimizing oncology care resource allocation.
Abstract: Lung cancer patients frequently experience breakthrough pain episodes, with up to 91% requiring timely intervention. To enable proactive pain management, we propose a hybrid machine learning and large language model pipeline that predicts pain episodes within 48 and 72 hours of hospitalization using both structured and unstructured electronic health record data. A retrospective cohort of 266 inpatients was analyzed, with features including demographics, tumor stage, vital signs, and WHO-tiered analgesic use. The machine learning module captured temporal medication trends, while the large language model interpreted ambiguous dosing records and free-text clinical notes. Integrating these modalities improved sensitivity and interpretability. Our framework achieved an accuracy of 0.874 (48h) and 0.917 (72h), with an improvement in sensitivity of 8.6% and 10.4% due to the augmentation of large language model. This hybrid approach offers a clinically interpretable and scalable tool for early pain episode forecasting, with potential to enhance treatment precision and optimize resource allocation in oncology care.
[284] CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
Main category: cs.AI
TL;DR: CitySeeker benchmark evaluates VLMs’ ability to handle implicit human needs in urban navigation, revealing major performance gaps and proposing cognitive-inspired strategies for improvement.
Details
Motivation: Current VLMs excel at explicit instruction-based navigation but struggle with interpreting implicit human needs (e.g., “I am thirsty”) in dynamic urban environments, which is crucial for real-world embodied navigation applications.
Method: Introduces CitySeeker benchmark with 6,440 trajectories across 8 cities covering 7 goal-driven scenarios with implicit needs. Analyzes VLMs’ spatial reasoning and proposes BCR strategies (Backtracking Mechanisms, Enriching Spatial Cognition, Memory-Based Retrieval) inspired by human cognitive mapping.
Result: Even top-performing models like Qwen2.5-VL-32B-Instruct achieve only 21.1% task completion. Key bottlenecks identified: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall.
Conclusion: The benchmark reveals significant gaps in VLMs’ spatial intelligence for implicit needs navigation. The proposed BCR strategies provide actionable insights for developing robust spatial intelligence needed to tackle “last-mile” navigation challenges.
Abstract: Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., “I am thirsty”) in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs’ spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies - Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping’s emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling “last-mile” navigation challenges.
[285] TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge
Khurram Khalil, Khaza Anuarul Hoque
Main category: cs.AI
TL;DR: TOGGLE is a novel LLM compression framework that uses Signal Temporal Logic to formally guarantee preservation of linguistic properties during compression, achieving up to 3.3x FLOPs reduction and 68.8% model size reduction without retraining.
Details
Motivation: LLMs require substantial computational resources that limit deployment on resource-constrained edge devices. Existing compression techniques (quantization, pruning) often degrade linguistic properties and lack formal guarantees for preserving model behavior.
Method: TOGGLE uses Signal Temporal Logic (STL) to formally specify linguistic properties, then employs STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that satisfy linguistic constraints without retraining (see the code sketch below).
Result: Evaluated on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, Mistral 7B), TOGGLE achieved up to 3.3x reduction in computational costs (FLOPs) and up to 68.8% reduction in model size while satisfying all specified linguistic properties.
Conclusion: TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware with formal guarantees of preserved linguistic properties.
Abstract: Large Language Models (LLMs) deliver exceptional performance across natural language tasks but demand substantial computational resources, limiting their deployment on resource-constrained edge devices. Existing compression techniques, such as quantization and pruning, often degrade critical linguistic properties and lack formal guarantees for preserving model behavior. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that leverages Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE employs an STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that formally satisfy specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to 3.3x reduction in computational costs (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware.
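The schematic sketch below conveys the shape of property-constrained compression search: propose per-layer (bit-width, pruning-ratio) configurations and keep the cheapest one whose robustness margin stays non-negative. Random search stands in for the paper's Bayesian optimisation and `robustness` is a stub for an STL monitor; both are illustrative assumptions.

```python
# Schematic sketch of property-constrained compression search.
# Random search replaces Bayesian optimisation; robustness() is a stub STL monitor.
import random

LAYERS = 12
BITS = [4, 8, 16]

def cost(config):
    """Proxy for FLOPs/size: lower bit-widths and more pruning are cheaper."""
    return sum(b * (1.0 - p) for b, p in config)

def robustness(config):
    """Stub robustness margin: heavy compression of early layers hurts it most."""
    margin = 1.0
    for i, (b, p) in enumerate(config):
        margin -= (0.15 if i < 4 else 0.05) * (p + (1.0 if b == 4 else 0.0))
    return margin

def search(trials=2000, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        config = [(rng.choice(BITS), rng.choice([0.0, 0.25, 0.5])) for _ in range(LAYERS)]
        if robustness(config) >= 0.0 and cost(config) < best_cost:
            best, best_cost = config, cost(config)
    return best, best_cost

if __name__ == "__main__":
    config, c = search()
    print("cheapest property-satisfying config cost:", round(c, 2))
```

In the actual framework the stub monitor would be replaced by STL robustness evaluated on the compressed model's behaviour, which is what makes the satisfied properties formally checkable.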
[286] Distributional AGI Safety
Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero
Main category: cs.AI
TL;DR: The paper argues that AI safety research needs to shift focus from individual AI systems to coordinated groups of sub-AGI agents, proposing a framework of virtual agentic sandbox economies with market mechanisms for collective risk mitigation.
Details
Motivation: Current AI safety research assumes a monolithic AGI emergence, but the alternative “patchwork AGI” hypothesis where general capabilities emerge through coordination of sub-AGI agents deserves serious consideration. The rapid deployment of advanced AI agents with tool-use and coordination capabilities makes this an urgent safety concern.
Method: Proposes a framework for distributional AGI safety that moves beyond individual agent evaluation and alignment. Centers on designing virtual agentic sandbox economies (impermeable or semi-permeable) where agent-to-agent transactions are governed by robust market mechanisms, coupled with auditability, reputation management, and oversight.
Result: The paper presents a conceptual framework for addressing collective risks in patchwork AGI scenarios, shifting safety focus from individual agents to coordinated systems and proposing specific governance mechanisms for agent economies.
Conclusion: The patchwork AGI hypothesis requires serious consideration and corresponding safeguards. Distributional AGI safety frameworks with virtual agentic sandbox economies and market-based governance mechanisms are needed to mitigate collective risks from coordinated sub-AGI agent systems.
Abstract: AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.
[287] The Social Responsibility Stack: A Control-Theoretic Architecture for Governing Socio-Technical AI
Otman A. Basir
Main category: cs.AI
TL;DR: The paper introduces the Social Responsibility Stack (SRS), a six-layer architectural framework that embeds societal values into AI systems through explicit constraints, safeguards, monitoring, and governance processes, treating responsibility as a closed-loop supervisory control problem.
Details
Motivation: Current responsible AI and governance efforts provide normative principles but lack enforceable engineering mechanisms that operate throughout the AI system lifecycle. There’s a gap between ethical principles and practical implementation in real-world AI deployments.
Method: The Social Responsibility Stack (SRS) is a six-layer architectural framework that models responsibility as a closed-loop supervisory control problem. It integrates design-time safeguards with runtime monitoring and institutional oversight, using a unified constraint-based formulation with safety-envelope and feedback interpretations.
Result: The framework enables continuous monitoring and enforcement of fairness, autonomy, cognitive burden, and explanation quality. Case studies in clinical decision support, cooperative autonomous vehicles, and public-sector systems demonstrate how SRS translates normative objectives into actionable engineering and operational controls.
Conclusion: SRS bridges ethics, control theory, and AI governance, providing a practical foundation for accountable, adaptive, and auditable socio-technical AI systems by embedding societal values throughout the system lifecycle.
Abstract: Artificial intelligence systems are increasingly deployed in domains that shape human behaviour, institutional decision-making, and societal outcomes. Existing responsible AI and governance efforts provide important normative principles but often lack enforceable engineering mechanisms that operate throughout the system lifecycle. This paper introduces the Social Responsibility Stack (SRS), a six-layer architectural framework that embeds societal values into AI systems as explicit constraints, safeguards, behavioural interfaces, auditing mechanisms, and governance processes. SRS models responsibility as a closed-loop supervisory control problem over socio-technical systems, integrating design-time safeguards with runtime monitoring and institutional oversight. We develop a unified constraint-based formulation, introduce safety-envelope and feedback interpretations, and show how fairness, autonomy, cognitive burden, and explanation quality can be continuously monitored and enforced. Case studies in clinical decision support, cooperative autonomous vehicles, and public-sector systems illustrate how SRS translates normative objectives into actionable engineering and operational controls. The framework bridges ethics, control theory, and AI governance, providing a practical foundation for accountable, adaptive, and auditable socio-technical AI systems.
[288] SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting
Shuhao Mei, Yongchao Long, Shan Cao, Xiaobo Han, Shijia Geng, Jinbo Sun, Yuxi Zhou, Shenda Hong
Main category: cs.AI
TL;DR: SpiroLLM: First multimodal LLM that understands respiratory spirograms for COPD diagnosis, combining morphological features from breathing curves with numerical PFT data to generate comprehensive diagnostic reports with high accuracy.
Details
Motivation: Current AI models for COPD diagnosis lack interpretability (no rationale for decisions), while LLMs cannot understand spirogram data, limiting clinical trust and adoption. There's a need for interpretable diagnostic tools that can process physiological signals.
Method: SpiroLLM uses a SpiroEncoder to extract morphological features from respiratory curves and a SpiroProjector to align these features with pulmonary function test numerical values in a unified latent space, enabling a large language model to generate diagnostic reports.
Result: Achieved diagnostic AUROC of 0.8977 (95% CI: 0.88-0.91) on UK Biobank data (234,028 individuals). In robustness tests with missing core data, maintained 100% valid response rate vs. 13.4% for text-only models, demonstrating superior multimodal design.
Conclusion: SpiroLLM demonstrates substantial potential for fusing physiological signals with LLMs, establishing a new paradigm for interpretable and reliable clinical decision support tools in respiratory medicine.
Abstract: Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot understand spirograms yet, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8977 (95% CI: 0.88-0.91). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.
[289] Enter the Void - Planning to Seek Entropy When Reward is Scarce
Ashish Sundar, Chunbo Luo, Xiaoyang Wang
Main category: cs.AI
TL;DR: MBRL world models are typically discarded after training actors, but using them at inference time with hierarchical planning boosts sample efficiency by actively seeking informative states.
Details
Motivation: World models in MBRL require significant compute to train but are discarded after actor training, wasting their potential. Using these models at inference time could improve sample efficiency by guiding exploration more intelligently than traditional curiosity methods.
Method: Proposes hierarchical planning that uses the world model’s short-horizon latent predictions to actively seek informative states, dynamically deciding when to replan, the planning horizon length, and the commitment to searching entropy. Applied to Dreamer as a proof of concept.
Result: 50% faster maze completion than base Dreamer, 60% of environment steps needed; same reward in Crafter with 1/3 the steps; improved sample efficiency on DeepMind Control tasks.
Conclusion: World models should not be discarded after training - leveraging them at inference time with hierarchical planning significantly boosts sample efficiency and enables more reasoned exploratory behavior compared to traditional methods.
Abstract: Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. These models constitute the large majority of training compute and time and they are subsequently used to train actors entirely in simulation, but once this is done they are quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model’s short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods that chase outdated estimates of high uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, planning horizon length, and the commitment to searching entropy. While our method can theoretically be applied to any model that trains its own actors with solely model generated data, we have applied it to Dreamer to illustrate the concept. Our method finishes MiniWorld’s procedurally generated mazes 50% faster than base Dreamer at convergence and in only 60% of the environment steps that base Dreamer’s policy needs; it displays reasoned exploratory behaviour in Crafter, achieves the same reward as base Dreamer in a third of the steps; planning tends to improve sample efficiency on DeepMind Control tasks.
[290] TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems
Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis
Main category: cs.AI
TL;DR: This paper reviews Trust, Risk, and Security Management (TRiSM) for LLM-based Agentic Multi-Agent Systems, proposing a risk taxonomy, novel metrics (CSS and TUE), and strategies for responsible development.
Details
Motivation: Agentic AI systems built on LLMs are transforming enterprise and societal domains, but their multi-agent configurations introduce new trust, risk, and security challenges that require systematic management frameworks.
Method: The authors adapt and extend the AI TRiSM framework for Agentic AI, structured around Explainability, ModelOps, Security, Privacy, and Lifecycle Governance. They propose a risk taxonomy for Agentic AI threats and introduce two novel metrics: Component Synergy Score (CSS) for inter-agent collaboration quality and Tool Utilization Efficacy (TUE) for tool use efficiency.
Result: The paper presents a comprehensive TRiSM framework for Agentic AI, including specific strategies for improving explainability, enhancing security through encryption and adversarial robustness, and ensuring regulatory compliance. It also provides practical assessment metrics for evaluating Agentic AI systems.
Conclusion: The review concludes with a research roadmap for responsible development and deployment of Agentic AI, emphasizing the need to align emerging systems with TRiSM principles to ensure safety, transparency, and accountability in their operation.
Abstract: Agentic AI systems, built upon large language models (LLMs) and deployed in multi-agent configurations, are redefining intelligence, autonomy, collaboration, and decision-making across enterprise and societal domains. This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based Agentic Multi-Agent Systems (AMAS). We begin by examining the conceptual foundations of Agentic AI and highlight its architectural distinctions from traditional AI agents. We then adapt and extend the AI TRiSM framework for Agentic AI, structured around key pillars: Explainability, ModelOps, Security, Privacy, and their Lifecycle Governance, each contextualized to the challenges of AMAS. A risk taxonomy is proposed to capture the unique threats and vulnerabilities of Agentic AI, ranging from coordination failures to prompt-based adversarial manipulation. To support practical assessment in Agentic AI works, we introduce two novel metrics: the Component Synergy Score (CSS), which quantifies the quality of inter-agent collaboration, and the Tool Utilization Efficacy (TUE), which evaluates the efficiency of tool use within agent workflows. We further discuss strategies for improving explainability in Agentic AI, as well as approaches to enhancing security and privacy through encryption, adversarial robustness, and regulatory compliance. The review concludes with a research roadmap for the responsible development and deployment of Agentic AI, highlighting key directions to align emerging systems with TRiSM principles, ensuring safety, transparency, and accountability in their operation.
[291] Towards Practical GraphRAG: Efficient Knowledge Graph Construction and Hybrid Retrieval at Scale
Congmin Min, Sahil Bansal, Joyce Pan, Abbas Keshavarzi, Rhea Mathew, Amar Viswanathan Kannan
Main category: cs.AI
TL;DR: A scalable, cost-efficient GraphRAG framework for enterprises using dependency parsing for KG construction (94% of LLM performance) and hybrid RRF retrieval, achieving up to 15% improvement over vector baselines.
Details
Motivation: GraphRAG shows promise for multi-hop reasoning but faces adoption barriers due to expensive LLM-based extraction and complex traversal strategies, limiting practical enterprise deployment.
Method: Two innovations: (1) Efficient KG construction using dependency parsing instead of LLMs, achieving 94% of LLM performance; (2) Hybrid retrieval fusing vector similarity with graph traversal using Reciprocal Rank Fusion, with separate embeddings for entities, chunks, and relations.
Result: Evaluated on legacy code migration datasets, achieved up to 15% and 4.35% improvements over vanilla vector retrieval baselines using LLM-as-Judge metrics, while significantly reducing costs and improving scalability.
Conclusion: Demonstrates feasibility of deploying GraphRAG in production enterprise environments, showing that engineered classical NLP techniques can match modern LLM approaches while enabling practical, cost-effective, domain-adaptable retrieval-augmented reasoning at scale.
Abstract: We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval-Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited due to reliance on expensive large language model (LLM)-based extraction and complex traversal strategies. To address these challenges, we introduce two core innovations: (1) an efficient knowledge graph construction pipeline that leverages dependency parsing to achieve 94% of LLM-based performance (61.87% vs. 65.83%) while significantly reducing costs and improving scalability; and (2) a hybrid retrieval strategy that fuses vector similarity with graph traversal using Reciprocal Rank Fusion (RRF), maintaining separate embeddings for entities, chunks, and relations to enable multi-granular matching. We evaluate our framework on two enterprise datasets focused on legacy code migration and demonstrate improvements of up to 15% and 4.35% over vanilla vector retrieval baselines using LLM-as-Judge evaluation metrics. These results validate the feasibility of deploying GraphRAG in production enterprise environments, demonstrating that careful engineering of classical NLP techniques can match modern LLM-based approaches while enabling practical, cost-effective, and domain-adaptable retrieval-augmented reasoning at scale.
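The hybrid retrieval step relies on Reciprocal Rank Fusion, which combines multiple ranked lists without any score calibration. A minimal sketch of standard RRF over a vector ranking and a graph-traversal ranking is shown below; the chunk identifiers and the constant k=60 are illustrative, not taken from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (best first) by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_12", "chunk_7", "chunk_3"]   # from embedding similarity
graph_hits = ["chunk_7", "chunk_19", "chunk_12"]   # from knowledge-graph traversal
print(reciprocal_rank_fusion([vector_hits, graph_hits]))  # chunk_7 and chunk_12 lead
```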
[292] MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, Xing Xie
Main category: cs.AI
TL;DR: Researchers created MoHoBench, a benchmark to evaluate honesty in Multimodal Large Language Models when faced with visually unanswerable questions, finding most models fail to appropriately refuse answers and that visual information significantly impacts honesty.
Details
Motivation: While MLLMs have advanced in vision-language tasks, their ability to act honestly when faced with visually unanswerable questions remains underexplored, creating potential for harmful or untrustworthy content despite existing work on language model trustworthiness.
Method: Defined four types of visually unanswerable questions, constructed MoHoBench (12k+ samples with multi-stage filtering and human verification), benchmarked 28 popular MLLMs, and implemented initial alignment methods using supervised and preference learning.
Result: Most models fail to appropriately refuse to answer when necessary, and MLLMs’ honesty is not just a language modeling issue but deeply influenced by visual information, requiring dedicated multimodal honesty alignment methods.
Conclusion: The work establishes the first systematic assessment of MLLM honesty, provides MoHoBench as a benchmark, demonstrates the need for specialized multimodal honesty alignment, and offers initial methods as foundation for developing trustworthy MLLMs.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs’ capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models’ response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark, consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs’ honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/yanxuzhu/MoHoBench.
[293] MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
Main category: cs.AI
TL;DR: MEML-GRPO framework uses multiple expert prompts and mutual learning to overcome reward sparsity in RLVR, improving LLM reasoning performance by 4.89-11.33%.
Details
Motivation: Standard RLVR suffers from reward sparsity where zero rewards from consistently incorrect answers provide no learning signal, especially in challenging tasks, limiting LLM reasoning improvement.
Method: Proposes Multi-Expert Mutual Learning GRPO (MEML-GRPO) that uses diverse expert prompts as system prompts to generate a broader response range, plus an inter-expert mutual learning mechanism for knowledge sharing among experts via RLVR.
Result: Extensive experiments show significant improvements: average performance gain of 4.89% with Qwen and 11.33% with Llama across multiple reasoning benchmarks, effectively overcoming traditional RLVR limitations.
Conclusion: MEML-GRPO successfully addresses reward sparsity in RLVR through multi-expert prompting and mutual learning, substantially enhancing LLM reasoning capabilities beyond standard RLVR methods.
Abstract: Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
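For readers unfamiliar with the group-relative update that GRPO-style methods use, the sketch below normalizes verifiable rewards within a group of responses pooled from several expert prompts; the prompts, responses, and reward check are placeholders, not the paper's implementation.

```python
import numpy as np

expert_prompts = [
    "You are a meticulous mathematician; show every step.",
    "You are a pragmatic engineer; estimate, then verify.",
    "You double-check every intermediate result.",
]  # hypothetical expert system prompts

def verifiable_reward(response, reference):
    return float(response.strip() == reference.strip())  # 1 if the answer checks out

# Responses sampled under the different expert prompts for one question.
responses = ["41", "42", "42", "40", "42", "42"]
rewards = np.array([verifiable_reward(r, "42") for r in responses])

# Group-relative advantages: correct answers get positive credit even when
# part of the group is wrong, which is where the extra experts help.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)
```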
[294] Scaling Neuro-symbolic Problem Solving: Solver-Free Learning of Constraints and Objectives
Marianne Defresne, Romain Gambardella, Sophie Barbe, Thomas Schiex
Main category: cs.AI
TL;DR: A differentiable neuro-symbolic architecture with probabilistic loss learns to solve NP-hard reasoning problems from natural inputs, outperforming hybrid methods on Sudoku variants and real-world protein design.
Details
Motivation: To address the challenge of hybridizing discrete reasoning with neural networks, particularly for solving NP-hard reasoning problems from natural inputs where Large Language Models struggle.
Method: Introduces a differentiable neuro-symbolic architecture with a probabilistic loss function that learns both constraints and objectives, pushing the combinatorial solver out of the training loop for scalable training while maintaining exact inference for maximum accuracy.
Result: Efficiently learns to solve NP-hard reasoning problems from natural inputs; requires less training time than other hybrid methods on Sudoku variants; outperforms Decision-Focused-Learning on visual Min-Cut/Max-cut task; successfully learns energy optimization for real-world protein design.
Conclusion: The proposed neuro-symbolic architecture effectively bridges neural networks and discrete reasoning, enabling efficient learning of NP-hard problems from natural inputs with practical applications in complex real-world domains like protein design.
Abstract: In the ongoing quest for hybridizing discrete reasoning with neural nets, there is an increasing interest in neural architectures that can learn how to solve discrete reasoning or optimization problems from natural inputs, a task that Large Language Models seem to struggle with. Objectives: We introduce a differentiable neuro-symbolic architecture and a loss function dedicated to learning how to solve NP-hard reasoning problems. Methods: Our new probabilistic loss allows for learning both the constraints and the objective, thus delivering a complete model that can be scrutinized and completed with side constraints. By pushing the combinatorial solver out of the training loop, our architecture also offers scalable training while exact inference gives access to maximum accuracy. Results: We empirically show that it can efficiently learn how to solve NP-hard reasoning problems from natural inputs. On three variants of the Sudoku benchmark (symbolic, visual, and many-solution), our approach requires a fraction of the training time of other hybrid methods. On a visual Min-Cut/Max-Cut task, it optimizes the regret better than a Decision-Focused-Learning regret-dedicated loss. Finally, it efficiently learns the energy optimization formulation of the large real-world problem of designing proteins.
[295] D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Suhwan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
Main category: cs.AI
TL;DR: Desktop gaming interactions serve as effective pretraining for embodied AI tasks, achieving 96.6% success on manipulation and 83.3% on navigation benchmarks.
Details
Motivation: Embodied AI faces high costs for physical data collection, while desktop/gaming environments offer rich sensorimotor interactions at scale with structured observation-action coupling.
Method: D2E framework with three components: OWA Toolkit for unified desktop data (152x compression), Generalist-IDM for zero-shot generalization across games via timestamp-based event prediction, and VAPT for transferring pretrained representations to physical tasks.
Result: Using 1.3K+ hours of data (259h human + 1K+h pseudo-labeled), achieved 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks.
Conclusion: Desktop pretraining is practical for robotics as sensorimotor primitives in digital interactions transfer meaningfully to physical embodied tasks; all tools and data will be publicly released.
Abstract: Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments – particularly gaming – offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, the human-collected and pseudo-labeled datasets, and the VAPT-trained models, available at https://worv-ai.github.io/d2e/
[296] ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification
Utsav Kumar Nareti, Suraj Kumar, Soumya Pandey, Soumi Chattopadhyay, Chandranath Adak
Main category: cs.AI
TL;DR: ProtoSiTex: A semi-interpretable framework for fine-grained multi-label text classification using dual-phase training with prototype discovery and hierarchical consistency.
Details
Motivation: Existing prototype-based models for text classification operate at coarse granularity (sentence/document level) and fail to address the multi-label nature of real-world text, creating a need for interpretable models with fine-grained capabilities.
Method: Dual-phase alternate training: an unsupervised prototype discovery phase learns semantically coherent prototypes, and a supervised classification phase maps prototypes to labels. Uses hierarchical loss for consistency across subsentence, sentence, and document levels, with adaptive prototypes and multi-head attention for overlapping semantics.
Result: Achieves state-of-the-art performance on new hotel review benchmark dataset and two public benchmarks (binary and multi-class), delivering faithful, human-aligned explanations.
Conclusion: ProtoSiTex establishes robust solution for semi-interpretable multi-label text classification, addressing limitations of existing prototype-based models through fine-grained, multi-label capabilities with interpretable explanations.
Abstract: The rapid growth of user-generated text across digital platforms has intensified the need for interpretable models capable of fine-grained text classification and explanation. Existing prototype-based models offer intuitive explanations but typically operate at coarse granularity (sentence or document level) and fail to address the multi-label nature of real-world text classification. We propose ProtoSiTex, a semi-interpretable framework designed for fine-grained multi-label text classification. ProtoSiTex employs a dual-phase alternate training strategy: an unsupervised prototype discovery phase that learns semantically coherent and diverse prototypes, and a supervised classification phase that maps these prototypes to class labels. A hierarchical loss function enforces consistency across subsentence, sentence, and document levels, enhancing interpretability and alignment. Unlike prior approaches, ProtoSiTex captures overlapping and conflicting semantics using adaptive prototypes and multi-head attention. We also introduce a benchmark dataset of hotel reviews annotated at the subsentence level with multiple labels. Experiments on this dataset and two public benchmarks (binary and multi-class) show that ProtoSiTex achieves state-of-the-art performance while delivering faithful, human-aligned explanations, establishing it as a robust solution for semi-interpretable multi-label text classification.
[297] KarmaTS: A Universal Simulation Platform for Multivariate Time Series with Functional Causal Dynamics
Haixin Li, Yanke Li, Diego Paez-Granados
Main category: cs.AI
TL;DR: KarmaTS is an interactive framework for building lag-indexed spatiotemporal causal models to generate synthetic multivariate time series with known causal dynamics, enabling validation of causal discovery algorithms.
Details
Motivation: Addresses the challenge of access-restricted physiological data by generating synthetic MTS with known causal dynamics and augmenting real-world datasets with expert knowledge.
Method: Constructs discrete-time structural causal processes (DSCP) through a mixed-initiative human-in-the-loop workflow combining expert knowledge and algorithmic proposals. Supports mixed variable types, contemporaneous/lagged edges, and modular edge functionals from templates to neural networks.
Result: Enables simulation of synthetic MTS with known causal dynamics, supports causal interventions including user-specified distribution shifts, and facilitates flexible validation of causal discovery algorithms.
Conclusion: KarmaTS provides a flexible framework for expert-informed simulation that enables robust validation and benchmarking of causal discovery algorithms through synthetic data generation with known ground truth.
Abstract: We introduce KarmaTS, an interactive framework for constructing lag-indexed, executable spatiotemporal causal graphical models for multivariate time series (MTS) simulation. Motivated by the challenge of access-restricted physiological data, KarmaTS generates synthetic MTS with known causal dynamics and augments real-world datasets with expert knowledge. The system constructs a discrete-time structural causal process (DSCP) by combining expert knowledge and algorithmic proposals in a mixed-initiative, human-in-the-loop workflow. The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts. KarmaTS handles mixed variable types, contemporaneous and lagged edges, and modular edge functionals ranging from parameterizable templates to neural network models. Together, these features enable flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.
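As a rough illustration of what a lag-indexed, executable structural causal process looks like, the sketch below simulates a two-variable system with contemporaneous and lag-1 edges; the variables, edge functionals, and intervention are invented for illustration, since KarmaTS builds these interactively from expert input.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
act = np.zeros(T)        # hypothetical "activity" driver
hr = np.full(T, 60.0)    # hypothetical "heart rate" response

for t in range(1, T):
    act[t] = 0.8 * act[t - 1] + rng.normal(0.0, 0.1)                       # lag-1 self edge
    hr[t] = 30.0 + 0.5 * hr[t - 1] + 5.0 * act[t] + rng.normal(0.0, 0.5)   # lag-1 + contemporaneous edges

# A causal intervention (e.g. a distribution shift) just swaps an edge
# functional or clamps a variable before resimulating, e.g. act[100:] = 1.0.
```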
[298] N2N: A Parallel Framework for Large-Scale MILP under Distributed Memory
Longfei Wang, Junyan Liu, Fan Zhang, Jiangwen Wei, Yuanhua Tang, Jie Sun, Xiaodong Luo
Main category: cs.AI
TL;DR: N2N is a scalable parallel framework for MILP solving that maps B&B nodes to distributed computing nodes, achieving significant speedups over state-of-the-art parallel solvers in both deterministic and nondeterministic modes.
Details
Motivation: Parallelization is promising for accelerating MILP solving, but the complexity of the branch-and-bound framework and numerous algorithm components in MILP solvers make parallelization difficult.
Method: Proposed N2N framework with node-to-node mapping of B&B nodes to distributed computing nodes. Features include a sliding-window-based algorithm for deterministic mode, utilization of CP search and primal heuristics, adaptive solving, and data communication optimization. Integrated with SCIP and HiGHS as base solvers.
Result: N2N-SCIP achieves speedups of 22.52 and 12.71 with 1,000 MPI processes on Kunpeng and x86 clusters, 1.98-2.08 times faster than ParaSCIP. Deterministic mode also shows significant improvements. Framework validated with multiple solvers.
Conclusion: N2N provides an effective parallel framework for MILP solving that outperforms state-of-the-art parallel solvers, supports both deterministic and nondeterministic modes, and can be integrated with existing solvers.
Abstract: Parallelization has emerged as a promising approach for accelerating MILP solving. However, the complexity of the branch-and-bound (B&B) framework and the numerous effective algorithm components in MILP solvers make it difficult to parallelize. In this study, a scalable parallel framework, N2N (a node-to-node framework that maps the B&B nodes to distributed computing nodes), was proposed to solve large-scale problems in a distributed memory computing environment. Both deterministic and nondeterministic modes are supported, and the framework is designed to be easily integrated with existing solvers. Regarding the deterministic mode, a novel sliding-window-based algorithm was designed and implemented to ensure that tasks are generated and solved in a deterministic order. Moreover, several advanced techniques, such as the utilization of CP search and general primal heuristics, have been developed to fully utilize distributed computing resources and capabilities of base solvers. Adaptive solving and data communication optimization were also investigated. A popular open-source MILP solver, SCIP, was integrated into N2N as the base solver, yielding N2N-SCIP. Extensive computational experiments were conducted to evaluate the performance of N2N-SCIP compared to ParaSCIP, which is a state-of-the-art distributed parallel MILP solver under the UG framework. The nondeterministic N2N-SCIP achieves speedups of 22.52 and 12.71 with 1,000 MPI processes on the Kunpeng and x86 computing clusters, which is 1.98 and 2.08 times faster than ParaSCIP, respectively. In the deterministic mode, N2N-SCIP also shows significant performance improvements over ParaSCIP across different process numbers and computing clusters. To validate the generality of N2N, HiGHS, another open-source solver, was integrated into N2N. The related results are analyzed, and the requirements of N2N on base solvers are also concluded.
[299] Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic Search
Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu
Main category: cs.AI
TL;DR: Trio is a molecular generation framework combining fragment-based language modeling, reinforcement learning, and Monte Carlo tree search for interpretable, closed-loop drug design that outperforms state-of-the-art methods in binding affinity, drug-likeness, and synthetic accessibility.
Details
Motivation: Traditional drug discovery methods (high-throughput screening, docking) are inefficient and limited. Current generative models have poor generalization, limited interpretability, and focus too much on binding affinity while neglecting other pharmacological properties, restricting their practical utility.
Method: Trio integrates three components: 1) fragment-based molecular language modeling for context-aware fragment assembly, 2) reinforcement learning to enforce physicochemical and synthetic feasibility, and 3) Monte Carlo tree search to balance exploration of novel chemotypes with exploitation of promising intermediates within protein binding pockets.
Result: Trio reliably generates chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%), and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.
Conclusion: Trio establishes a closed-loop generative paradigm that redefines chemical space navigation by combining generalization, plausibility, and interpretability, offering a transformative foundation for AI-driven drug discovery.
Abstract: Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold. By combining generalization, plausibility, and interpretability, Trio establishes a closed-loop generative paradigm that redefines how chemical space can be navigated, offering a transformative foundation for the next era of AI-driven drug discovery.
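The Monte Carlo tree search component balances exploration and exploitation; a minimal sketch of the standard UCT selection rule such a search typically uses is below. The node fields and exploration constant are illustrative, and how Trio folds binding affinity, drug-likeness, and synthetic accessibility into the reward is not reproduced here.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")            # always expand unvisited fragment assemblies first
    exploit = total_reward / visits    # average reward of this partial molecule
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

children = [{"reward": 3.2, "visits": 8}, {"reward": 0.9, "visits": 1}]
best = max(children, key=lambda n: uct_score(n["reward"], n["visits"], parent_visits=9))
print(best)
```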
[300] TiCard: Deployable EXPLAIN-only Residual Learning for Cardinality Estimation
Qizhi Wang
Main category: cs.AI
TL;DR: TiCard is a low-intrusion correction framework that improves cardinality estimation by learning residual corrections using only EXPLAIN features, requiring minimal integration into existing database optimizers.
Details
Motivation: Cardinality estimation is critical for query optimization but current approaches have limitations: classical estimators miss correlations, while learned estimators require complex training pipelines and invasive optimizer integration, making deployment difficult.
Method: TiCard augments native database estimators by learning multiplicative residual corrections using EXPLAIN-only features, with two practical instantiations: (1) Gradient Boosting Regressor for fast inference, and (2) TabPFN (tabular foundation model) that adapts via small reference sets without gradient retraining.
Result: On TiDB with TPCH and Join Order Benchmark, using only 263 total executions (157 for learning), TiCard significantly improves tail accuracy: P90 Q-error drops from 312.85 to 13.69 (GBR), and P99 drops from 37,974.37 to 3,416.50 (TabPFN), while maintaining near-perfect median behavior.
Conclusion: TiCard provides a deployable AI4DB building block with explicit scope, conservative integration policies, and a roadmap from offline correction to in-optimizer use, offering substantial accuracy improvements with minimal intrusion.
Abstract: Cardinality estimation is a key bottleneck for cost-based query optimization, yet deployable improvements remain difficult: classical estimators miss correlations, while learned estimators often require workload-specific training pipelines and invasive integration into the optimizer. This paper presents TiCard, a low intrusion, correction-based framework that augments (rather than replaces) a database’s native estimator. TiCard learns multiplicative residual corrections using EXPLAIN-only features, and uses EXPLAIN ANALYZE only for offline labels. We study two practical instantiations: (i) a Gradient Boosting Regressor for sub-millisecond inference, and (ii) TabPFN, an in-context tabular foundation model that adapts by refreshing a small reference set without gradient retraining. On TiDB with TPCH and the Join Order Benchmark, in a low-trace setting (263 executions total; 157 used for learning), TiCard improves operator-level tail accuracy substantially: P90 Q-error drops from 312.85 (native) to 13.69 (TiCard-GBR), and P99 drops from 37,974.37 to 3,416.50 (TiCard-TabPFN), while a join-only policy preserves near-perfect median behavior. We position TiCard as an AI4DB building block focused on deployability: explicit scope, conservative integration policies, and an integration roadmap from offline correction to in-optimizer use.
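To make the correction idea concrete, here is a minimal sketch that learns a multiplicative residual (true cardinality over native estimate) from EXPLAIN-only features and scores the corrected estimates with Q-error. The three features, the tiny training set, and the choice to regress the log-residual are assumptions for illustration, not TiCard's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def q_error(est, true):
    est, true = max(est, 1.0), max(true, 1.0)
    return max(est / true, true / est)

# Features come from plain EXPLAIN; labels from offline EXPLAIN ANALYZE runs.
X = np.array([[1e4, 2, 0.30],    # hypothetical: native estimate, join depth, selectivity
              [5e5, 3, 0.70],
              [2e3, 1, 0.10]])
native_est = X[:, 0]
true_card = np.array([3e5, 9e5, 2.5e3])

model = GradientBoostingRegressor().fit(X, np.log(true_card / native_est))
corrected = native_est * np.exp(model.predict(X))
print([round(q_error(c, t), 2) for c, t in zip(corrected, true_card)])
```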
[301] LADY: Linear Attention for Autonomous Driving Efficiency without Transformers
Jihao Huang, Xi Xia, Zhiyuan Li, Tianle Liu, Jingke Wang, Junbo Chen, Tengju Ye
Main category: cs.AI
TL;DR: LADY is a fully linear attention-based generative model for end-to-end autonomous driving that achieves constant computational/memory costs regardless of history length while maintaining state-of-the-art performance.
Details
Motivation: Existing Transformer-based methods for autonomous driving suffer from quadratic attention costs that limit long sequence modeling and deployment on resource-constrained edge platforms. Current linear attention architectures lack support for crucial cross-modal and cross-temporal interactions needed for autonomous driving.
Method: Proposes LADY, the first fully linear attention-based generative model with: 1) Linear attention enabling fusion of long-range temporal context with constant computational/memory costs regardless of history length, and 2) A lightweight linear cross-attention mechanism for effective cross-modal information exchange between camera and LiDAR features.
Result: Achieves state-of-the-art performance on NAVSIM and Bench2Drive benchmarks with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Successfully deployed and validated on edge devices in resource-limited scenarios.
Conclusion: LADY demonstrates that fully linear attention architectures can provide efficient temporal modeling for autonomous driving while maintaining high performance, enabling practical deployment on edge platforms with constant computational costs regardless of history length.
Abstract: End-to-end paradigms have demonstrated great potential for autonomous driving. Additionally, most existing methods are built upon Transformer architectures. However, transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences, particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions, both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
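The constant-cost claim follows from how linear attention maintains state: a running sum over feature-mapped keys and values replaces the full key/value history, so each step costs O(d^2) regardless of sequence length. Below is a minimal sketch of that recurrence; the feature map and dimensions are illustrative, not LADY's actual layers.

```python
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1e-6   # simple positive feature map

d = 8
S = np.zeros((d, d))   # running sum of phi(k) v^T
z = np.zeros(d)        # running sum of phi(k)
rng = np.random.default_rng(0)

for _ in range(10_000):                        # arbitrarily long history, fixed memory
    k, v, q = rng.normal(size=(3, d))
    S += np.outer(phi(k), v)                   # state update, independent of history length
    z += phi(k)
    out = (phi(q) @ S) / (phi(q) @ z + 1e-6)   # attention output for the current step
```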
cs.SD
[302] From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
Linnea Evanson, Mingfang Zhang, Hubert Banville, Saarang Panchavati, Pierre Bourdillon, Jean-Rémi King
Main category: cs.SD
TL;DR: A framework using week-long intracranial recordings for speech decoding pretraining achieves substantial performance gains over models trained only on short experimental data, with improvements scaling log-linearly with dataset size.
Details
Motivation: Traditional speech decoding from brain activity relies on limited neural recordings from short, controlled experiments, which restricts model performance and generalizability to real-world settings.
Method: Introduce a framework leveraging week-long intracranial and audio recordings from clinical monitoring patients, using contrastive learning for pretraining on this much larger dataset (100x+ increase in data).
Result: The contrastive learning model substantially outperforms models trained solely on classic experimental data, with gains that scale log-linearly with dataset size. Analysis reveals brain activity represents speech features but drifts across days.
Conclusion: The approach provides a scalable path for decoding brain representations in real-life and controlled settings, highlighting the need for models that explicitly account for cross-day variability in neural signals.
Abstract: Decoding speech from brain activity has typically relied on limited neural recordings collected during short and highly controlled experiments. Here, we introduce a framework to leverage week-long intracranial and audio recordings from patients undergoing clinical monitoring, effectively increasing the training dataset size by over two orders of magnitude. With this pretraining, our contrastive learning model substantially outperforms models trained solely on classic experimental data, with gains that scale log-linearly with dataset size. Analysis of the learned representations reveals that, while brain activity represents speech features, its global structure largely drifts across days, highlighting the need for models that explicitly account for cross-day variability. Overall, our approach opens a scalable path toward decoding and modeling brain representations in both real-life and controlled task settings.
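The pretraining objective is contrastive; a minimal sketch of the CLIP-style loss such a setup typically uses, pairing each window of neural activity with its time-aligned speech segment, is below. The encoders, batch size, and temperature are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

brain = F.normalize(torch.randn(32, 256), dim=-1)   # encoded iEEG windows (placeholder encoder)
audio = F.normalize(torch.randn(32, 256), dim=-1)   # encoded time-aligned speech segments

logits = brain @ audio.T / 0.07                      # similarity of every brain/audio pair
targets = torch.arange(32)                           # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```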
[303] Domain-Agnostic Causal-Aware Audio Transformer for Infant Cry Classification
Geofrey Owino, Bernard Shibwabo Kasamani, Ahmed M. Abdelmoniem, Edem Wornyo
Main category: cs.SD
TL;DR: DACH-TIC is a domain-agnostic causal-aware hierarchical audio transformer that improves infant cry classification robustness against noise and domain shifts through causal attention, hierarchical learning, and adversarial domain generalization.
Details
Motivation: Existing deep learning methods for infant cry classification rely on correlation-driven acoustic representations, making them vulnerable to noise, spurious cues, and domain shifts across different recording environments, which limits their real-world clinical applicability.
Method: Proposes DACH-TIC with a structured transformer backbone featuring local token-level and global semantic encoders, causal attention masking, controlled perturbation training for counterfactual variations, a domain-adversarial objective for environment-invariant representations, and multi-task learning for cry type recognition, distress intensity estimation, and causal relevance prediction.
Result: Outperforms state-of-the-art baselines (HTS-AT and SE-ResNet Transformer) with 2.6% accuracy improvement and 2.2 points macro-F1 score increase, plus enhanced causal fidelity. Generalizes effectively to unseen acoustic environments with only 2.4% domain performance gap.
Conclusion: DACH-TIC demonstrates strong robustness and generalization capabilities, making it suitable for real-world neonatal acoustic monitoring systems by addressing domain shifts and improving causal interpretability.
Abstract: Accurate and interpretable classification of infant cry paralinguistics is essential for early detection of neonatal distress and clinical decision support. However, many existing deep learning methods rely on correlation-driven acoustic representations, which makes them vulnerable to noise, spurious cues, and domain shifts across recording environments. We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework. DACH-TIC employs a structured transformer backbone with local token-level and global semantic encoders, augmented by causal attention masking and controlled perturbation training to approximate counterfactual acoustic variations. A domain-adversarial objective promotes environment-invariant representations, while multi-task learning jointly optimizes cry type recognition, distress intensity estimation, and causal relevance prediction. The model is evaluated on the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 environmental noise overlays for domain augmentation. Experimental results show that DACH-TIC outperforms state-of-the-art baselines, including HTS-AT and SE-ResNet Transformer, achieving improvements of 2.6 percent in accuracy and 2.2 points in macro-F1 score, alongside enhanced causal fidelity. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent, demonstrating its suitability for real-world neonatal acoustic monitoring systems.
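Domain-adversarial objectives of this kind are usually implemented with a gradient reversal layer: features pass through unchanged, but gradients from the domain classifier are flipped so the encoder is pushed toward environment-invariant representations. A minimal PyTorch sketch follows; the feature size and the three-domain head are illustrative, not DACH-TIC's exact modules.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()                      # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None      # flip (and scale) gradients on the way back

features = torch.randn(16, 128, requires_grad=True)   # encoder output for a batch
domain_head = nn.Linear(128, 3)                        # e.g. three recording environments
domain_logits = domain_head(GradReverse.apply(features, 1.0))
loss = nn.functional.cross_entropy(domain_logits, torch.randint(0, 3, (16,)))
loss.backward()   # the domain head trains normally; the encoder sees reversed gradients
```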
[304] CogSR: Semantic-Aware Speech Super-Resolution via Chain-of-Thought Guided Flow Matching
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao, Yulin Wu, Chenhao Hu, Xueyang Lv
Main category: cs.SD
TL;DR: CogSR is a speech super-resolution framework that uses cognitive reconstruction with Large Audio-Language Models and Chain-of-Thought reasoning to restore severely degraded audio with linguistic accuracy.
Details
Motivation: Existing speech SR models fail at severely low sampling rates because, lacking sufficient acoustic cues, they hallucinate phonetic content; such recordings are common in digital archiving and investigative audio recovery, where accurate restoration is critical.
Method: CogSR shifts from signal mapping to cognitive reconstruction, using a Large Audio-Language Model with Chain-of-Thought reasoning as a semantic anchor, explicit acoustic priors for speaker consistency, and a Rectified Flow backbone for high-frequency synthesis.
Result: CogSR effectively eliminates ambiguity in severe degradation regimes and provides robust restoration for high-value legacy and surveillance audio.
Conclusion: CogSR represents a robust solution for high-precision offline restoration of severely degraded speech recordings by combining cognitive reasoning with acoustic modeling.
Abstract: Applying speech super-resolution (SR) to recordings with severely low sampling rates is a critical challenge in digital archiving and investigative audio recovery. In these scenarios, the input lacks essential acoustic cues. Consequently, existing generative models often fail; without sufficient context, they hallucinate phonetic content, guessing words based on probability rather than meaning. To address this, we propose CogSR, a framework designed specifically for high-precision, offline restoration. Our approach shifts the focus from simple signal mapping to cognitive reconstruction. By integrating a Large Audio-Language Model, we employ Chain-of-Thought reasoning to act as a semantic anchor, while explicit acoustic priors ensure the speaker’s identity remains consistent. This guides a Rectified Flow backbone to synthesize high-frequency details that are not only realistic but linguistically accurate. Evaluations show that CogSR effectively eliminates ambiguity in severe degradation regimes, making it a robust solution for restoring high-value legacy and surveillance audio.
[305] Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders
Nikolaos Ellinas, Alexandra Vioni, Panos Kakoulidis, Georgios Vamvoukakis, Myrsini Christidou, Konstantinos Markopoulos, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris, Pirros Tsiakoulis
Main category: cs.SD
TL;DR: A cepstrum-based pitch modification method that works with any mel-spectrogram representation and mel-based vocoder without retraining or model changes.
Details
Motivation: To create a pitch modification method that is universally compatible with existing mel-based vocoders without requiring additional training or modifications to the models.
Method: Directly modifies the cepstrum feature space to shift the harmonic structure by: 1) computing the spectrogram magnitude via the pseudo-inverse mel transform, 2) converting to the cepstrum via DCT, 3) shifting the cepstral peak without position estimation, 4) recomputing the modified mel via IDCT and the mel-filterbank.
Result: The method enables pitch-shifted mel-spectrogram features that can be converted to speech with any compatible vocoder, validated with objective and subjective metrics on various state-of-the-art neural vocoders.
Conclusion: A practical pitch modification approach that works with existing mel-based vocoders without requiring retraining or model changes, offering compatibility advantages over traditional methods.
Abstract: This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying DCT. In this domain, the cepstral peak is shifted without having to estimate its position and the modified mel is recomputed by applying IDCT and mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders as well as in comparison with traditional pitch modification methods.
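A rough numeric sketch of the axis operations described above (pseudo-inverse mel, DCT to the cepstral domain, modification, IDCT, mel filterbank) is given below. Interpreting the peak shift as a warp of the quefrency axis by the pitch ratio, and preserving the low-quefrency envelope, are my assumptions; the paper's exact shifting rule may differ.

```python
import numpy as np
import librosa
from scipy.fft import dct, idct
from scipy.ndimage import zoom

sr, n_fft, n_mels = 22050, 1024, 80
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, 1 + n_fft // 2)
mel_spec = np.abs(np.random.randn(n_mels, 100)) + 1e-3            # placeholder mel-spectrogram

spec = np.maximum(np.linalg.pinv(mel_fb) @ mel_spec, 1e-8)   # magnitude via pseudo-inverse mel
ceps = dct(np.log(spec), axis=0, norm="ortho")               # cepstrum, one column per frame

ratio = 2 ** (2 / 12)                                         # raise pitch by two semitones
warped = zoom(ceps, (1 / ratio, 1.0), order=1)                # move the cepstral peak to lower quefrency
warped = np.pad(warped, ((0, ceps.shape[0] - warped.shape[0]), (0, 0)))
warped[:30] = ceps[:30]                                       # crude: keep the low-quefrency envelope

shifted_mel = mel_fb @ np.exp(idct(warped, axis=0, norm="ortho"))  # back to mel for any vocoder
```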
[306] DPDFNet: Boosting DeepFilterNet2 via Dual-Path RNN
Daniel Rika, Nino Sapir, Ido Gus
Main category: cs.SD
TL;DR: DPDFNet extends DeepFilterNet2 with dual-path blocks for better temporal and cross-band modeling, adds loss to prevent over-attenuation, and fine-tunes for always-on applications, achieving state-of-the-art performance on real-world evaluation data while running efficiently on edge NPUs.
Details
Motivation: Current causal speech enhancement models lack sufficient long-range temporal and cross-band modeling capabilities, and existing benchmarks don't adequately reflect real-world conditions with long, low-SNR recordings across diverse languages and noise scenarios.
Method: Extends the DeepFilterNet2 architecture with dual-path blocks in the encoder for improved temporal and cross-band modeling; adds a loss component to mitigate over-attenuation; includes a fine-tuning phase for always-on applications; creates a new evaluation set with long, low-SNR recordings in 12 languages; proposes the PRISM holistic metric; deploys on Ceva-NeuPro-Nano edge NPUs.
Result: DPDFNet outperforms other causal open-source models on new evaluation set, including larger/computationally demanding models; PRISM metric shows scalability with dual-path blocks; DPDFNet-4 achieves real-time performance on edge NPUs (NPN32/NPN64), maintaining state-of-the-art quality within embedded constraints.
Conclusion: DPDFNet demonstrates that state-of-the-art speech enhancement quality can be achieved with efficient causal models suitable for always-on edge applications through architectural improvements, targeted loss functions, and real-world evaluation methodology.
Abstract: We present DPDFNet, a causal single-channel speech enhancement model that extends DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for “always-on” applications, leads to substantial improvements in overall model performance. To compare our proposed architecture with a variety of causal open-source models, we created a new evaluation set comprising long, low-SNR recordings in 12 languages across everyday noise scenarios, better reflecting real-world conditions than commonly used benchmarks. On this evaluation set, DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose an holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.
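For readers unfamiliar with dual-path blocks, the sketch below alternates one RNN that scans across frequency bands within each frame with one that scans across time within each band, which is what strengthens cross-band and long-range temporal modeling. Layer sizes and the residual wiring are illustrative, not DPDFNet's exact block.

```python
import torch
from torch import nn

class DualPathBlock(nn.Module):
    def __init__(self, channels=32, hidden=64):
        super().__init__()
        self.intra = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)  # across bands
        self.inter = nn.GRU(channels, hidden, batch_first=True)                      # causal, across time
        self.intra_proj = nn.Linear(2 * hidden, channels)
        self.inter_proj = nn.Linear(hidden, channels)

    def forward(self, x):                       # x: (batch, time, bands, channels)
        b, t, f, c = x.shape
        y = x.reshape(b * t, f, c)              # scan across frequency bands per frame
        y = self.intra_proj(self.intra(y)[0]).reshape(b, t, f, c) + x
        z = y.permute(0, 2, 1, 3).reshape(b * f, t, c)            # scan across time per band
        z = self.inter_proj(self.inter(z)[0]).reshape(b, f, t, c).permute(0, 2, 1, 3)
        return z + y

out = DualPathBlock()(torch.randn(2, 100, 16, 32))   # (batch, frames, bands, channels)
```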
[307] Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models
Mohammad Jalili Torkamani, Israt Zarin
Main category: cs.SD
TL;DR: ASTA is an adaptive speech-to-action system that dynamically routes voice commands between edge and cloud inference to balance performance and resource utilization in IoT systems.
Details
Motivation: Voice-based IoT control faces a fundamental trade-off: cloud solutions offer better language understanding but have latency, connectivity, and privacy issues, while edge solutions provide low latency and privacy but are computationally constrained.Method: ASTA integrates on-device ASR and lightweight offline language models with cloud-based LLM processing. It uses a metric-aware routing mechanism that selects inference paths based on real-time system metrics (CPU workload, temperature, network latency), plus rule-based command validation and repair for end-to-end execution.
Result: ASTA successfully routes all 80 test commands, achieving balanced edge-cloud distribution. It attains 62.5% ASR accuracy and generates executable commands without repair for 47.5% of inputs, showing the repair mechanism’s importance for robustness.
Conclusion: Adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems, balancing performance with system resource utilization.
Abstract: Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
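The routing idea can be illustrated in a few lines of Python; the metric names and threshold values below are hypothetical placeholders, since the paper's exact policy is not reproduced here.

```python
# Hedged sketch of a metric-aware edge/cloud router; thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class SystemMetrics:
    cpu_load: float          # fraction of CPU in use, 0..1
    temperature_c: float     # device temperature in Celsius
    network_latency_ms: float

def select_inference_path(m: SystemMetrics,
                          cpu_max: float = 0.85,
                          temp_max_c: float = 70.0,
                          latency_max_ms: float = 250.0) -> str:
    """Return 'cloud' or 'edge' for the next voice command."""
    edge_overloaded = m.cpu_load > cpu_max or m.temperature_c > temp_max_c
    cloud_reachable = m.network_latency_ms < latency_max_ms
    if edge_overloaded and cloud_reachable:
        return "cloud"   # offload heavy LLM inference when the device is stressed
    return "edge"        # default to on-device ASR + lightweight LM otherwise

# Example: a hot, busy device with a fast link routes to the cloud.
print(select_inference_path(SystemMetrics(0.92, 74.0, 80.0)))  # -> "cloud"
```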
[308] DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui, Yong Ma, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng
Main category: cs.SD
TL;DR: DisCo-Speech is a zero-shot controllable TTS framework that enables independent prosody control and voice cloning through a disentangled speech codec and LM-based generator.
Details
Motivation: Standard codec-based LMs tightly couple timbre and prosody, making independent control difficult. Current codec designs provide insufficient decoupling, creating a bottleneck for controllable TTS.Method: Uses DisCodec with two stages: 1) Tri-factor disentanglement via parallel encoders and hybrid losses to separate content, prosody, and timbre; 2) Fusion and reconstruction to create content-prosody tokens for LM prediction while optimizing reconstruction quality.
Result: Matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.
Conclusion: By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis with flexible zero-shot control.
Abstract: Recent codec-based language models~(LMs) have revolutionized text-to-speech~(TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, contains two core stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
cs.LG
[309] DiscoverDCP: A Data-Driven Approach for Construction of Disciplined Convex Programs via Symbolic Regression
Sveinung Myhre
Main category: cs.LG
TL;DR: DiscoverDCP uses symbolic regression with DCP rules to automatically find convex models, avoiding post-hoc convexity verification.
Details
Motivation: Traditional convex modeling often uses fixed-parameter forms (like quadratics) that may be too restrictive. Post-hoc convexity verification is computationally intractable, creating a need for methods that guarantee convexity by construction while maintaining flexibility.Method: Integrates symbolic regression with Disciplined Convex Programming (DCP) rule sets. Enforces that all discovered candidate model expressions adhere to DCP composition rules, ensuring global convexity by construction during the discovery process.
Result: Produces convex surrogates with more relaxed and accurate functional forms than traditional fixed-parameter convex expressions. The method yields interpretable, verifiable, and flexible convex models suitable for safety-critical applications.
Conclusion: DiscoverDCP provides a data-driven framework for automatic discovery of convex models that are guaranteed to be convex by construction, overcoming limitations of both restrictive fixed-form convex models and computationally intractable post-hoc verification.
Abstract: We propose DiscoverDCP, a data-driven framework that integrates symbolic regression with the rule sets of Disciplined Convex Programming (DCP) to perform system identification. By enforcing that all discovered candidate model expressions adhere to DCP composition rules, we ensure that the output expressions are globally convex by construction, circumventing the computationally intractable process of post-hoc convexity verification. This approach allows for the discovery of convex surrogates that exhibit more relaxed and accurate functional forms than traditional fixed-parameter convex expressions (e.g., quadratic functions). The proposed method produces interpretable, verifiable, and flexible convex models suitable for safety-critical control and optimization tasks.
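The convex-by-construction filtering step can be illustrated with CVXPY, whose DCP ruleset plays the role described above: candidate expressions that violate the composition rules are discarded before any fitting. The candidate set below is illustrative, not DiscoverDCP's actual search space.

```python
# Sketch of DCP-rule filtering of candidate expressions (illustrative candidates only).
import cvxpy as cp

x = cp.Variable(2)

candidates = {
    "quadratic":   cp.sum_squares(x),
    "log_sum_exp": cp.log_sum_exp(x),
    "huber":       cp.sum(cp.huber(x, M=1.0)),
    "non_convex":  cp.sum(cp.square(x) - cp.exp(x)),  # convex minus convex: rejected
}

# Keep only expressions that CVXPY can certify as convex by DCP composition rules.
convex_pool = {name: expr for name, expr in candidates.items() if expr.is_convex()}
print(sorted(convex_pool))  # ['huber', 'log_sum_exp', 'quadratic']
```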
[310] Hybrid Quantum-Classical Ensemble Learning for S&P 500 Directional Prediction
Abraham Itzhak Weinberg
Main category: cs.LG
TL;DR: Hybrid ensemble framework combining quantum sentiment analysis and Decision Transformer achieves 60.14% directional accuracy on S&P 500 prediction, a 3.10% improvement over individual models.
Details
Motivation: Financial market prediction is challenging due to high noise, non-stationarity, and market efficiency, with most models struggling to exceed 55-57% accuracy. The paper aims to overcome limitations of prior approaches through architectural innovation.Method: Hybrid ensemble framework combining: 1) diverse learning algorithms (LSTM, Decision Transformer, XGBoost, Random Forest, Logistic Regression) on same data, 2) 4-qubit variational quantum circuit for enhanced sentiment analysis, and 3) smart filtering to exclude weak predictors (accuracy <52%).
Result: Achieved 60.14% directional accuracy on S&P 500 prediction (vs. 52.80% for same-architecture models on multiple datasets). Quantum sentiment analysis provided +0.8% to +1.5% gains per model. Smart filtering improved ensemble performance (Top-7 models: 60.14% vs. all 35 models: 51.2%).
Conclusion: The framework demonstrates practical trading potential with preliminary backtesting showing Sharpe ratio of 1.2 vs. buy-and-hold’s 0.8. Architecture diversity dominates dataset diversity, and quantum-enhanced sentiment analysis provides meaningful improvements in financial market prediction.
Abstract: Financial market prediction is a challenging application of machine learning, where even small improvements in directional accuracy can yield substantial value. Most models struggle to exceed 55–57% accuracy due to high noise, non-stationarity, and market efficiency. We introduce a hybrid ensemble framework combining quantum sentiment analysis, Decision Transformer architecture, and strategic model selection, achieving 60.14% directional accuracy on S&P 500 prediction, a 3.10% improvement over individual models. Our framework addresses three limitations of prior approaches. First, architecture diversity dominates dataset diversity: combining different learning algorithms (LSTM, Decision Transformer, XGBoost, Random Forest, Logistic Regression) on the same data outperforms training identical architectures on multiple datasets (60.14% vs. 52.80%), confirmed by correlation analysis ($r>0.6$ among same-architecture models). Second, a 4-qubit variational quantum circuit enhances sentiment analysis, providing +0.8% to +1.5% gains per model. Third, smart filtering excludes weak predictors (accuracy $<52\%$), improving ensemble performance (Top-7 models: 60.14% vs. all 35 models: 51.2%). We evaluate on 2020–2023 market data across seven instruments, covering diverse regimes including the COVID-19 crash and inflation-driven correction. McNemar’s test confirms statistical significance ($p<0.05$). Preliminary backtesting with confidence-based filtering (6+ model consensus) yields a Sharpe ratio of 1.2 versus buy-and-hold’s 0.8, demonstrating practical trading potential.
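The "smart filtering" step can be sketched as follows: only models whose validation accuracy clears the 52% threshold vote in the ensemble. The model names, accuracies, and the simple majority rule below are illustrative, not the paper's exact pipeline.

```python
# Hedged sketch of threshold-filtered ensembling with a majority vote.
import numpy as np

def filtered_majority_vote(predictions: dict, val_accuracy: dict, threshold: float = 0.52):
    """predictions: {model_name: array of 0/1 direction calls per day}."""
    kept = [m for m, acc in val_accuracy.items() if acc >= threshold]
    votes = np.stack([predictions[m] for m in kept])           # (n_models, n_days)
    return (votes.mean(axis=0) >= 0.5).astype(int), kept

preds = {"lstm": np.array([1, 0, 1]), "xgb": np.array([1, 1, 1]), "logreg": np.array([0, 0, 1])}
accs = {"lstm": 0.55, "xgb": 0.57, "logreg": 0.50}             # logreg falls below 52%
ensemble, kept = filtered_majority_vote(preds, accs)
print(kept, ensemble)  # -> ['lstm', 'xgb'] [1 1 1]
```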
[311] GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction
Wei Guan, Jian Cao, Jinyu Cai, Qiqi Cai, Jianqi Gao, See-Kiong Ng
Main category: cs.LG
TL;DR: GLOW is a unified framework that combines GNNs and LLMs for predicting Agentic Workflow performance, addressing limitations of existing methods by capturing both topological dependencies and semantic logic.
Details
Motivation: Current Agentic Workflow performance prediction methods fail to simultaneously capture intricate topological dependencies and deep semantic logic, while execution-based evaluation is costly and slow, limiting scalability.Method: GLOW combines GNNs for graph-structure modeling with LLMs for reasoning. It uses a graph-oriented LLM instruction-tuned on graph tasks to extract topologically aware semantic features, fuses them with GNN-encoded structural representations, and employs contrastive alignment to refine the latent space.
Result: Extensive experiments on FLORA-Bench show GLOW outperforms state-of-the-art baselines in both prediction accuracy and ranking utility.
Conclusion: GLOW provides an effective unified framework for Agentic Workflow performance prediction by successfully integrating graph structure modeling with semantic reasoning capabilities.
Abstract: Agentic Workflows (AWs) have emerged as a promising paradigm for solving complex tasks. However, the scalability of automating their generation is severely constrained by the high cost and latency of execution-based evaluation. Existing AW performance prediction methods act as surrogates but fail to simultaneously capture the intricate topological dependencies and the deep semantic logic embedded in AWs. To address this limitation, we propose GLOW, a unified framework for AW performance prediction that combines the graph-structure modeling capabilities of GNNs with the reasoning power of LLMs. Specifically, we introduce a graph-oriented LLM, instruction-tuned on graph tasks, to extract topologically aware semantic features, which are fused with GNN-encoded structural representations. A contrastive alignment strategy further refines the latent space to distinguish high-quality AWs. Extensive experiments on FLORA-Bench show that GLOW outperforms state-of-the-art baselines in prediction accuracy and ranking utility.
[312] SHARe-KAN: Holographic Vector Quantization for Memory-Bound Inference
Jeff Smith
Main category: cs.LG
TL;DR: SHARe-KAN addresses memory bottlenecks in Vision KANs using vector quantization and hardware-aware compilation, achieving 88× memory reduction while maintaining accuracy.
Details
Motivation: Kolmogorov-Arnold Networks (KANs) face severe memory constraints due to their learned basis functions creating extreme bandwidth demands, making deployment in memory-constrained environments challenging. Traditional pruning methods fail dramatically (10% sparsity causes ~40-point mAP drop).Method: SHARe-KAN framework uses Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving dense topology. Combined with LUTHAM, a hardware-aware compiler with static memory planning, to optimize memory usage.
Result: Achieves 88× runtime memory reduction (1.13 GB → 12.91 MB) while matching uncompressed baseline accuracy on PASCAL VOC. Profiling shows >90% L2 cache residency, decoupling workload from DRAM bandwidth constraints.
Conclusion: The approach successfully addresses the fundamental memory wall in Vision KANs by leveraging their holographic topology through vector quantization and hardware-aware optimization, enabling practical deployment in memory-constrained environments.
Abstract: Kolmogorov-Arnold Networks (KANs) face a fundamental memory wall: their learned basis functions create parameter counts that impose extreme bandwidth demands, hindering deployment in memory-constrained environments. We show that Vision KANs exhibit a holographic topology, where information is distributed across the interference of splines rather than localized to specific edges. Consequently, traditional pruning fails (10% sparsity degrades mAP from 85.23% to 45%, a $\sim$40-point drop). To address this, we present SHARe-KAN, a framework utilizing Gain-Shape-Bias Vector Quantization to exploit functional redundancy while preserving the dense topology. Coupled with LUTHAM, a hardware-aware compiler with static memory planning, we achieve $88\times$ runtime memory reduction (1.13 GB $\to$ 12.91 MB) and match uncompressed baseline accuracy on PASCAL VOC. Profiling on NVIDIA Ampere architecture confirms $>90\%$ L2 cache residency, demonstrating that the workload is decoupled from DRAM bandwidth constraints inherent to spline-based architectures.
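Gain-shape quantization itself is a standard technique; the sketch below shows the basic decomposition that the Gain-Shape-Bias scheme builds on. The codebook size and the handling of the separate bias term are assumptions, not the paper's formulation.

```python
# Hedged sketch of gain-shape vector quantization of a weight vector.
import numpy as np

def gain_shape_quantize(w: np.ndarray, shape_codebook: np.ndarray):
    """w: weight vector; shape_codebook: (K, d) rows of unit-norm shapes."""
    gain = np.linalg.norm(w)
    shape = w / (gain + 1e-12)
    # Nearest codeword by inner product (equivalent to L2 for unit-norm rows).
    idx = int(np.argmax(shape_codebook @ shape))
    return gain, idx

def dequantize(gain: float, idx: int, shape_codebook: np.ndarray):
    return gain * shape_codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 16))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
w = rng.standard_normal(16)
g, i = gain_shape_quantize(w, codebook)
w_hat = dequantize(g, i, codebook)   # only a scalar gain and an 8-bit index are stored
```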
[313] How Do Graph Signals Affect Recommendation: Unveiling the Mystery of Low and High-Frequency Graph Signals
Feng Liu, Hao Cang, Huanhuan Yuan, Jiaqing Fan, Yongjing Hao, Fuzhen Zhuang, Guanfeng Liu, Pengpeng Zhao
Main category: cs.LG
TL;DR: The paper shows that both low-frequency and high-frequency graph signals are equally important for recommendation tasks, proposing a frequency signal scaler module and space flip method to enhance GNN-based recommendation systems.
Details
Motivation: While spectral GNNs are effective for recommendation, the specific roles of low-frequency vs high-frequency graph signals remain unclear. The paper aims to investigate how different frequency components of graph signals affect recommendation performance.Method: 1) Theoretically proves equivalence of low/high-frequency signals in recommendation; 2) Proposes frequency signal scaler - a plug-and-play module to adjust graph signal filtering; 3) Introduces space flip method to restore expressive power of graph embeddings; 4) Demonstrates sufficiency of either frequency component alone.
Result: Extensive experiments on four public datasets validate the effectiveness of the proposed methods. The paper shows that either low-frequency or high-frequency graph signals alone can achieve effective recommendations.
Conclusion: Both low-frequency and high-frequency graph signals play equivalent roles in recommendation by smoothing user-item similarities. The proposed frequency signal scaler and space flip method enhance GNN-based recommendation systems, with code available for reproducibility.
Abstract: Spectral graph neural networks (GNNs) are highly effective in modeling graph signals, with their success in recommendation often attributed to low-pass filtering. However, recent studies highlight the importance of high-frequency signals. The role of low-frequency and high-frequency graph signals in recommendation remains unclear. This paper aims to bridge this gap by investigating the influence of graph signals on recommendation performance. We theoretically prove that the effects of low-frequency and high-frequency graph signals are equivalent in recommendation tasks, as both contribute by smoothing the similarities between user-item pairs. To leverage this insight, we propose a frequency signal scaler, a plug-and-play module that adjusts the graph signal filter function to fine-tune the smoothness between user-item pairs, making it compatible with any GNN model. Additionally, we identify and prove that graph embedding-based methods cannot fully capture the characteristics of graph signals. To address this limitation, a space flip method is introduced to restore the expressive power of graph embeddings. Remarkably, we demonstrate that either low-frequency or high-frequency graph signals alone are sufficient for effective recommendations. Extensive experiments on four public datasets validate the effectiveness of our proposed methods. Code is available at https://github.com/mojosey/SimGCF.
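A small NumPy sketch illustrates what scaling a graph filter's frequency response means in practice. The specific filter form $g(\lambda) = 1 - s\lambda$ is an assumption used for illustration, not the paper's scaler.

```python
# Illustrative spectral filtering of node signals with a tunable smoothing scale.
import numpy as np

def filtered_embeddings(adj: np.ndarray, signal: np.ndarray, scale: float):
    """adj: symmetric adjacency matrix; signal: (n, d) node features; scale tunes smoothing."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    evals, evecs = np.linalg.eigh(lap)
    response = 1.0 - scale * evals       # larger scale damps high-frequency components more
    return evecs @ np.diag(response) @ evecs.T @ signal
```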
[314] LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang
Main category: cs.LG
TL;DR: LLaDA2.0 converts auto-regressive LLMs into discrete diffusion models through a 3-phase training scheme, creating efficient 16B and 100B MoE models for parallel decoding deployment.
Details
Motivation: To establish a new paradigm for frontier-scale deployment by converting pre-trained AR models into discrete diffusion LLMs instead of costly training from scratch, enabling knowledge inheritance and efficient parallel decoding.Method: A 3-phase block-level WSD training scheme: progressively increasing block-size diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Post-training alignment with SFT and DPO creates MoE variants.
Result: Created LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B) instruction-tuned MoE models that deliver superior performance and efficiency through parallel decoding. Both models were open-sourced.
Conclusion: LLaDA2.0 establishes a new paradigm for converting AR models to discrete diffusion LLMs, enabling efficient frontier-scale deployment with preserved knowledge and parallel decoding advantages.
Abstract: This paper presents LLaDA2.0 – a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models – establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaptation, and efficiency-aware design principles, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
[315] Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review
Clayton Cohn, Eduardo Davalos, Caleb Vatral, Joyce Horn Fonteles, Hanchen David Wang, Austin Coursey, Surya Rayala, Ashwin T S, Meiyi Ma, Gautam Biswas
Main category: cs.LG
TL;DR: A comprehensive review of empirical methods in applied multimodal learning environments, introducing a taxonomy and framework that captures established practices and recent innovations driven by LLMs and generative AI.
Details
Motivation: While prior reviews have addressed individual components of multimodal pipelines, there is a notable absence of comprehensive reviews of empirical methods in applied multimodal environments, especially given recent advancements in LLMs and multimodal machine learning.Method: The paper introduces a taxonomy and framework that captures both established practices and recent innovations driven by LLMs and generative AI. It identifies five modality groups: Natural Language, Vision, Physiological Signals, Human-Centered Evidence, and Environment Logs.
Result: Analysis reveals that integrating multiple modalities enables richer insights into learner and trainee behaviors, revealing latent patterns often overlooked by unimodal approaches.
Conclusion: Despite the benefits of multimodal integration, persistent challenges in data collection and integration continue to hinder the adoption of these systems in real-time classroom settings.
Abstract: Recent technological advancements in multimodal machine learning–including the rise of large language models (LLMs)–have improved our ability to collect, process, and analyze diverse multimodal data such as speech, video, and eye gaze in learning and training contexts. While prior reviews have addressed individual components of the multimodal pipeline (e.g., conceptual models, data fusion), a comprehensive review of empirical methods in applied multimodal environments remains notably absent. This review addresses that, introducing a taxonomy and framework that capture both established practices and recent innovations driven by LLMs and generative AI. We identify five modality groups: Natural Language, Vision, Physiological Signals, Human-Centered Evidence, and Environment Logs. Our analysis reveals that integrating modalities enables richer insights into learner and trainee behaviors, revealing latent patterns often overlooked by unimodal approaches. However, persistent challenges in multimodal data collection and integration continue to hinder the adoption of these systems in real-time classroom settings.
[316] A Unified Generative-Predictive Framework for Deterministic Inverse Design
Reza T. Batley, Sourav Saha
Main category: cs.LG
TL;DR: Janus is a unified generative-predictive framework for inverse design of heterogeneous material microstructures that combines deep encoder-decoder architecture with predictive KHRONOS head for physics-informed inversion.
Details
Motivation: Inverse design of heterogeneous material microstructures is ill-posed and computationally expensive due to high-dimensional design spaces, multimodal inputs, and nonlinear forward physics. Existing generative models lack fast, stable deterministic inversion with physics-informed bias.Method: Janus couples deep encoder-decoder architecture with predictive KHRONOS head (separable neural architecture). It learns a latent manifold simultaneously isometric for generative inversion and pruned for physical prediction, inducing disentanglement of latent space through joint objective.
Result: On MNIST: high-fidelity reconstruction, accurate classification, diverse generative inversion. On microstructure design: forward prediction R²=0.98 (2% relative error), sub-5% pixelwise reconstruction error, inverse solutions satisfy target properties within 1% relative error. Latent manifold shows smooth traversal and low-dimensional disentanglement.
Conclusion: Janus enables real-time, physics-informed inverse microstructure generation at lower computational cost than classical optimization-based approaches by unifying prediction and generation within a single latent space.
Abstract: Inverse design of heterogeneous material microstructures is a fundamentally ill-posed and famously computationally expensive problem. This is exacerbated by the high-dimensional design spaces associated with finely resolved images, multimodal input property streams, and a highly nonlinear forward physics. Whilst modern generative models excel at accurately modeling such complex forward behavior, most of them are not intrinsically structured to support fast, stable \emph{deterministic} inversion with a physics-informed bias. This work introduces Janus, a unified generative-predictive framework to address this problem. Janus couples a deep encoder-decoder architecture with a predictive KHRONOS head, a separable neural architecture. Topologically speaking, Janus learns a latent manifold simultaneously isometric for generative inversion and pruned for physical prediction; the joint objective inducing \emph{disentanglement} of the latent space. Janus is first validated on the MNIST dataset, demonstrating high-fidelity reconstruction, accurate classification and diverse generative inversion of all ten target classes. It is then applied to the inverse design of heterogeneous microstructures labeled with thermal conductivity. It achieves a forward prediction accuracy $R^2=0.98$ (2% relative error) and sub-5% pixelwise reconstruction error. Inverse solutions satisfy target properties to within $1\%$ relative error. Inverting a sweep through properties reveals smooth traversal of the latent manifold, and UMAP visualization confirms the emergence of a low-dimensional, disentangled manifold. By unifying prediction and generation within a single latent space, Janus enables real-time, physics-informed inverse microstructure generation at a lower computational cost than typically associated with classical optimization-based approaches.
[317] NDRL: Cotton Irrigation and Nitrogen Application with Nested Dual-Agent Reinforcement Learning
Ruifeng Xu, Liang He
Main category: cs.LG
TL;DR: NDRL uses nested dual-agent reinforcement learning to optimize irrigation and nitrogen fertilization for cotton, improving yield and resource efficiency by 4.7-6.3% over baselines.
Details
Motivation: Existing water-nitrogen optimization methods have high complexity and poor yield results, plus difficulty quantifying mild stress signals leading to delayed feedback and imprecise dynamic regulation with low resource efficiency.Method: Nested Dual-Agent Reinforcement Learning (NDRL) with parent agent identifying macroscopic irrigation/fertilization actions based on cumulative yield benefits, and child agent using quantified Water Stress Factor (WSF) and Nitrogen Stress Factor (NSF) with mixed probability distribution for daily strategy optimization.
Result: Simulated yield increased by 4.7% in both 2023 and 2024, irrigation water productivity increased by 5.6% and 5.1% respectively, nitrogen partial factor productivity increased by 6.3% and 1.0% respectively compared to best baseline.
Conclusion: NDRL advances cotton irrigation and nitrogen fertilization, providing new approaches for addressing complexity and precision issues in agricultural resource management and supporting sustainable agricultural development.
Abstract: Effective irrigation and nitrogen fertilization have a significant impact on crop yield. However, existing research faces two limitations: (1) the high complexity of optimizing water-nitrogen combinations during crop growth and poor yield optimization results; and (2) the difficulty in quantifying mild stress signals and the delayed feedback, which results in less precise dynamic regulation of water and nitrogen and lower resource utilization efficiency. To address these issues, we propose a Nested Dual-Agent Reinforcement Learning (NDRL) method. The parent agent in NDRL identifies promising macroscopic irrigation and fertilization actions based on projected cumulative yield benefits, reducing ineffective exploration while maintaining alignment between objectives and yield. The child agent’s reward function incorporates quantified Water Stress Factor (WSF) and Nitrogen Stress Factor (NSF), and uses a mixed probability distribution to dynamically optimize daily strategies, thereby enhancing both yield and resource efficiency. We used field experiment data from 2023 and 2024 to calibrate and validate the Decision Support System for Agrotechnology Transfer (DSSAT) to simulate real-world conditions and interact with NDRL. Experimental results demonstrate that, compared to the best baseline, the simulated yield increased by 4.7% in both 2023 and 2024, the irrigation water productivity increased by 5.6% and 5.1% respectively, and the nitrogen partial factor productivity increased by 6.3% and 1.0% respectively. Our method advances the development of cotton irrigation and nitrogen fertilization, providing new ideas for addressing the complexity and precision issues in agricultural resource management and for sustainable agricultural development.
[318] D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
Javon Hickmon
Main category: cs.LG
TL;DR: D3G is a training-free, zero-shot method that uses Stable Diffusion XL to generate diverse demographic data at inference time to improve CLIP’s image classification accuracy while reducing demographic bias.
Details
Motivation: Multimodal models like CLIP still struggle with fine-grained image classification due to underfitting in low-capacity models and lack of high-quality, balanced demographic data, leading to harmful bias in zero-shot classification.Method: Diverse Demographic Data Generation (D3G) uses Stable Diffusion XL as a generative model to create diverse demographic data at inference time, which is then used with CLIP to improve classification without additional training.
Result: Providing diverse demographic data at inference improves classification accuracy and reduces demographic bias in pre-trained multimodal models, with exploration of individual demographic impacts on accuracy.
Conclusion: D3G effectively addresses demographic bias in zero-shot image classification by generating diverse demographic representations at inference time, enhancing both accuracy and fairness in multimodal models.
Abstract: Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
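The general shape of the procedure can be sketched as follows: synthesize demographically varied exemplars per class with SDXL, embed them with CLIP, and classify a test image against the averaged exemplar embeddings. The prompts, model identifiers, and the prototype-averaging step are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of diverse-exemplar generation plus CLIP prototype classification.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["nurse", "firefighter"]                               # illustrative labels
demographics = ["young woman", "elderly man", "Black woman", "Asian man"]

def class_prototype(label: str) -> torch.Tensor:
    # Generate one image per demographic descriptor and average the CLIP embeddings.
    images = [sdxl(prompt=f"a photo of a {d} working as a {label}").images[0]
              for d in demographics]
    feats = clip.get_image_features(**proc(images=images, return_tensors="pt").to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

prototypes = torch.stack([class_prototype(c) for c in classes])

def classify(test_image) -> str:
    f = clip.get_image_features(**proc(images=test_image, return_tensors="pt").to(device))
    f = f / f.norm(dim=-1, keepdim=True)
    return classes[int((prototypes @ f.squeeze(0)).argmax())]
```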
[319] Surely Large Multimodal Models (Don’t) Excel in Visual Species Recognition?
Tian Liu, Anwesha Basu, James Caverlee, Shu Kong
Main category: cs.LG
TL;DR: LMMs underperform FSL expert models in visual species recognition, but can effectively correct expert model errors via post-hoc prompting, leading to a plug-and-play POC method that boosts FSL accuracy by +6.4%.
Details
Motivation: Visual Species Recognition requires domain expertise for annotation, limiting labeled data. While FSL trains expert models with few examples, LMMs show strong general recognition but their performance on specialized VSR tasks is unknown.Method: Proposed Post-hoc Correction (POC): prompts LMMs to re-rank FSL expert model’s top predictions using enriched prompts with softmax confidence scores and few-shot visual examples, without extra training or validation.
Result: POC outperforms prior FSL methods by +6.4% accuracy across five VSR benchmarks, generalizes to different backbones and LMMs, and serves as a plug-and-play enhancement module.
Conclusion: LMMs struggle with direct VSR but excel at correcting FSL expert models, enabling a simple yet effective post-hoc correction approach that significantly improves few-shot species recognition without additional training.
Abstract: Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an “expert” model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is straightforward to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, despite using various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models’ incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model’s top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
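One way to picture the enriched prompt is the small builder below. The prompt wording, the example species, and the `query_lmm` client are hypothetical placeholders; the paper's exact template and LMM interface are not reproduced here.

```python
# Hedged sketch of constructing a POC-style re-ranking prompt.
def build_poc_prompt(top_predictions, few_shot_examples):
    """top_predictions: list of (species_name, softmax_confidence) from the FSL expert.
    few_shot_examples: {species_name: list of labeled reference image paths}."""
    lines = ["An expert few-shot model produced these candidate species for the query image:"]
    for rank, (name, conf) in enumerate(top_predictions, start=1):
        lines.append(f"{rank}. {name} (expert confidence {conf:.2f}), "
                     f"reference images: {few_shot_examples.get(name, [])}")
    lines.append("Re-rank the candidates by comparing the query image with the references, "
                 "and answer with the single most likely species name.")
    return "\n".join(lines)

prompt = build_poc_prompt(
    top_predictions=[("Larus argentatus", 0.41), ("Larus fuscus", 0.37), ("Larus canus", 0.12)],
    few_shot_examples={"Larus argentatus": ["herring_gull_1.jpg"],
                       "Larus fuscus": ["lbb_gull_1.jpg"]})
# final_label = query_lmm(images=[query_image, *reference_images], text=prompt)  # hypothetical client
```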
[320] Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game
Barna PĂĄsztor, Thomas Kleine Buening, Andreas Krause
Main category: cs.LG
TL;DR: SLHF is a new preference optimization framework that frames alignment as a sequential game between Leader and Follower policies, enabling inference-time refinement and handling richer preference structures than RLHF/NLHF.
Details
Motivation: Existing methods like RLHF (scalar rewards) and NLHF (simultaneous equilibrium) have limitations in capturing complex preference structures. The authors aim to develop a framework that leverages sequential play asymmetry for better alignment.Method: SLHF models alignment as a sequential-move game: a Leader commits to an action, then a Follower responds conditionally. This decomposes preference optimization into Follower refinement and Leader optimization against an adversary. The sequential design enables inference-time refinement through iterative sampling.
Result: Experiments show SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
Conclusion: SLHF offers advantages over RLHF and NLHF in consistency, data sensitivity, and robustness to intransitive preferences. The sequential game framework captures richer preference structures and enables practical inference-time refinement capabilities.
Abstract: We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader’s action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader’s actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
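The inference-time refinement described above reduces to a short loop: the Leader proposes a response and the Follower repeatedly refines it conditioned on the previous attempt. The `leader.generate` and `follower.refine` interfaces below are hypothetical stand-ins for the two trained policies.

```python
# Hedged sketch of Leader-then-Follower inference-time refinement.
def stackelberg_inference(prompt: str, leader, follower, n_rounds: int = 2) -> str:
    response = leader.generate(prompt)                 # Leader commits to an action
    for _ in range(n_rounds):
        response = follower.refine(prompt, response)   # Follower responds to the committed action
    return response
```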
[321] A Special Case of Quadratic Extrapolation Under the Neural Tangent Kernel
Abiel Kim
Main category: cs.LG
TL;DR: ReLU MLPs show linear extrapolation for OOD points, but near-origin extrapolation under NTK regime reveals quadratic behavior due to non-translation-invariant feature maps.
Details
Motivation: While ReLU MLP linear extrapolation is well-studied, analysis of extrapolation near the origin under NTK regime remains unexplored. The NTK's infinite-dimensional feature map lacks translational invariance, making near-origin extrapolation distinct from far-from-origin cases, representing canonical extremes of ReLU NTK extrapolation.Method: Analysis focuses on neural tangent kernel (NTK) regime for ReLU networks, examining extrapolation behavior near the origin versus far from the origin, leveraging properties of the infinite-dimensional feature map (non-translationally invariant but rotation invariant).
Result: Discovery of quadratic extrapolation behavior for evaluation points close to the origin, contrasting with linear extrapolation for points far from the origin, establishing two canonical extremes of ReLU NTK extrapolation.
Conclusion: The NTK regime reveals distinct extrapolation behaviors: linear for far-from-origin OOD points and quadratic for near-origin points, providing complete characterization of ReLU NTK extrapolation extremes.
Abstract: It has been demonstrated both theoretically and empirically that the ReLU MLP tends to extrapolate linearly for an out-of-distribution evaluation point. The machine learning literature provides ample analysis with respect to the mechanisms by which linearity is induced. However, the analysis of extrapolation at the origin under the NTK regime remains a more unexplored special case. In particular, the infinite-dimensional feature map induced by the neural tangent kernel is not translationally invariant. This means that the study of an out-of-distribution evaluation point very far from the origin is not equivalent to the evaluation of a point very near the origin. And since the feature map is rotation invariant, these two special cases may represent the most canonically extreme bounds of ReLU NTK extrapolation. Ultimately, it is this loose recognition of the two special cases of extrapolation that motivates the discovery of quadratic extrapolation for an evaluation close to the origin.
[322] Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Ruan de Kock, Claude Formanek, Sasha Abramowitz, Oumayma Mahjoub, Wiem Khlifi, Simon Du Toit, Louay Ben Nessir, Refiloe Shabe, Noah De Nicola, Arnol Fokam, Siddarth Singh, Ulrich Mbou Sob, Arnu Pretorius
Main category: cs.LG
TL;DR: Using inference-time strategies with compute budgets can dramatically improve performance in complex multi-agent RL tasks, achieving up to 126% improvement over SOTA with minimal extra execution time.
Details
Motivation: Real-world RL applications face extreme complexity, combinatorial nature, and multi-agent coordination challenges that cause even state-of-the-art systems to hit performance ceilings during zero-shot inference. Many applications allow inference phases with time/compute budgets for multiple attempts before final output.Method: Employ inference phase at execution time with specific inference strategies that utilize time and compute budgets to explore multiple solution attempts before outputting final solution. Study focuses on inference strategies for complex multi-agent RL problems.
Result: Achieved up to 126% improvement (average 45% improvement) over previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time. Demonstrated promising compute scaling properties through over 60k experiments - largest study on inference strategies for complex RL to date.
Conclusion: Inference phase at execution time and choice of inference strategy are key to breaking performance ceilings in complex multi-agent RL problems. The approach provides substantial performance gains with minimal additional execution time cost.
Abstract: Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at https://sites.google.com/view/inference-strategies-rl.
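The simplest instance of such an inference strategy is best-of-N sampling over full episodes; the sketch below assumes a Gymnasium-style environment interface and a stochastic `policy.sample` method, both of which are placeholders rather than the paper's implementation.

```python
# Hedged sketch of a best-of-N rollout strategy under an execution-time budget.
def best_of_n(env, policy, n_attempts: int, seed: int = 0):
    best_return, best_actions = float("-inf"), None
    for attempt in range(n_attempts):
        obs, _ = env.reset(seed=seed + attempt)
        done, total, actions = False, 0.0, []
        while not done:
            action = policy.sample(obs)          # stochastic sampling yields diverse attempts
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total += reward
            actions.append(action)
        if total > best_return:                  # keep the highest-return attempt
            best_return, best_actions = total, actions
    return best_actions, best_return
```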
[323] TAO-Net: Two-stage Adaptive OOD Classification Network for Fine-grained Encrypted Traffic Classification
Zihao Wang, Wei Peng, Junming Zhang, Jian Li, Wenxin Fang
Main category: cs.LG
TL;DR: TAO-Net: A two-stage network for encrypted traffic classification that handles both known and unknown applications using transformer-based OOD detection and LLM-based fine-grained classification.
Details
Motivation: Current encrypted traffic classification methods struggle with emerging applications that create Out-of-Distribution (OOD) traffic patterns. Existing approaches either rely on predefined categories or lump unknown traffic into a single "Other" category without fine-grained classification.Method: Two-stage Adaptive OOD classification Network (TAO-Net): 1) Hybrid OOD detection using transformer-based inter-layer transformation smoothness and feature analysis to distinguish ID vs OOD traffic; 2) LLM-based classification with semantic-enhanced prompts to transform OOD classification into a generation task for fine-grained classification without predefined labels.
Result: Achieves 96.81-97.70% macro-precision and 96.77-97.68% macro-F1 on three datasets, significantly outperforming previous methods (44.73-86.30% macro-precision), especially for emerging network applications.
Conclusion: TAO-Net effectively addresses the challenge of classifying both known and unknown encrypted traffic through its innovative two-stage design, enabling accurate and fine-grained classification of emerging applications without relying on predefined categories.
Abstract: Encrypted traffic classification aims to identify applications or services by analyzing network traffic data. One of the critical challenges is the continuous emergence of new applications, which generates Out-of-Distribution (OOD) traffic patterns that deviate from known categories and are not well represented by predefined models. Current approaches rely on predefined categories, which limits their effectiveness in handling unknown traffic types. Although some methods mitigate this limitation by simply classifying unknown traffic into a single “Other” category, they fail to make a fine-grained classification. In this paper, we propose a Two-stage Adaptive OOD classification Network (TAO-Net) that achieves accurate classification for both In-Distribution (ID) and OOD encrypted traffic. The method incorporates an innovative two-stage design: the first stage employs a hybrid OOD detection mechanism that integrates transformer-based inter-layer transformation smoothness and feature analysis to effectively distinguish between ID and OOD traffic, while the second stage leverages large language models with a novel semantic-enhanced prompt strategy to transform OOD traffic classification into a generation task, enabling flexible fine-grained classification without relying on predefined labels. Experiments on three datasets demonstrate that TAO-Net achieves 96.81-97.70% macro-precision and 96.77-97.68% macro-F1, outperforming previous methods that only reach 44.73-86.30% macro-precision, particularly in identifying emerging network applications.
[324] KAN-Matrix: Visualizing Nonlinear Pairwise and Multivariate Contributions for Physical Insight
Luis A. De la Fuente, Hernan A. Moreno, Laura V. Alvarez, Hoshin V. Gupta
Main category: cs.LG
TL;DR: Kolmogorov-Arnold Networks (KANs) applied to create interpretable visualization tools (PKAN and MKAN) for analyzing complex datasets with nonlinear relationships, outperforming traditional correlation methods.
Details
Motivation: Complex datasets pose challenges due to high dimensionality and collinearity; traditional correlation analyses are insufficient for capturing nonlinear relationships and providing interpretable insights.Method: Developed two KAN-based visualization tools: Pairwise KAN Matrix (PKAN) for characterizing nonlinear associations between variable pairs, and Multivariate KAN Contribution Matrix (MKAN) for feature ranking and quantifying input contributions to target prediction.
Result: PKAN and MKAN produce more robust and informative results than Pearson Correlation and Mutual Information, capturing both strength and functional forms of relationships to reveal hidden physical patterns.
Conclusion: The KAN-based visualization tools enhance interpretability and parsimony, supporting both pre-processing (feature selection, redundancy analysis) and post-processing (model explanation, physical insights) in scientific workflows.
Abstract: Interpreting complex datasets remains a major challenge for scientists, particularly due to high dimensionality and collinearity among variables. We introduce a novel application of Kolmogorov-Arnold Networks (KANs) to enhance interpretability and parsimony beyond what traditional correlation analyses offer. We present two interpretable, color-coded visualization tools: the Pairwise KAN Matrix (PKAN) and the Multivariate KAN Contribution Matrix (MKAN). PKAN characterizes nonlinear associations between pairs of variables, while MKAN serves as a nonlinear feature-ranking tool that quantifies the relative contributions of inputs in predicting a target variable. These tools support pre-processing (e.g., feature selection, redundancy analysis) and post-processing (e.g., model explanation, physical insights) in model development workflows. Through experimental comparisons, we demonstrate that PKAN and MKAN yield more robust and informative results than Pearson Correlation and Mutual Information. By capturing the strength and functional forms of relationships, these matrices facilitate the discovery of hidden physical patterns and promote domain-informed model development.
[325] ReactorFold: Generative discovery of nuclear reactor cores via emergent physical reasoning
Yoonpyo Lee
Main category: cs.LG
TL;DR: ReactorFold: A generative AI framework using language models for nuclear reactor fuel assembly design that learns from Monte Carlo data and can generate novel configurations beyond human-defined constraints.
Details
Motivation: Traditional reactor design methods are limited to fixed, human-defined configuration spaces, preventing discovery of fundamentally new design topologies. Current approaches (deterministic, metaheuristic, ML-assisted) search within constrained spaces rather than generating novel solutions.Method: Reformulates fuel-assembly design as sequence modeling for language models. Uses Monte Carlo simulation data, parameter-efficient fine-tuning, and Direct Preference Optimization (DPO) to learn latent structure of pressurized-water-reactor assemblies and generate candidate layouts in single forward passes.
Result: The DPO-aligned model exhibits emergent design-space expansion: despite training only on fixed Gd rod configurations, it autonomously adjusts gadolinium inventory to meet power-peaking constraints. Discovers high-performing asymmetric configurations that challenge conventional symmetric loading heuristics, accessing design regimes inaccessible to traditional methods.
Conclusion: Language models can internalize causal physical relationships and transcend human-imposed design constraints, demonstrating potential for generative AI to discover novel nuclear reactor designs beyond conventional search methods.
Abstract: Designing nuclear reactor cores requires navigating large discrete design spaces governed by complex neutronic interactions. Traditional deterministic, metaheuristic, and machine-learning-assisted methods search within fixed, human-defined configuration spaces, limiting their ability to discover fundamentally new design topologies. Here we introduce ReactorFold, a generative framework that reformulates fuel-assembly design as a sequence modeling problem for language models. Using Monte Carlo data, parameter-efficient fine-tuning, and Direct Preference Optimization (DPO), the model learns the latent structure of a pressurized-water-reactor assembly and generates candidate layouts in a single forward pass. Notably, the DPO-aligned model exhibits emergent design-space expansion: despite being trained exclusively on configurations with a fixed number of gadolinium burnable absorber (Gd) rods, it autonomously adjusts Gd inventory to satisfy strict power-peaking constraints. The model also discovers high-performing asymmetric configurations that challenge conventional symmetric loading heuristics, accessing design regimes inaccessible to conventional search methods and demonstrating that language models can internalize causal physical relationships and transcend human-imposed design constraints.
[326] Improved High-probability Convergence Guarantees of Decentralized SGD
Aleksandar Armacki, Ali H. Sayed
Main category: cs.LG
TL;DR: The paper shows that Decentralized Stochastic Gradient Descent (DSGD) achieves high-probability convergence under the same conditions as mean-squared error convergence, removing restrictive assumptions and achieving order-optimal rates with linear speed-up.
Details
Motivation: There's a significant gap between assumptions needed for high-probability convergence versus mean-squared error convergence in decentralized settings, unlike centralized settings where SGD converges under the same conditions for both. Existing decentralized HP studies impose strong assumptions like uniformly bounded gradients or vanishing noise.Method: The authors revisit HP guarantees for DSGD with light-tailed noise using careful analysis of moment generating functions (MGFs) of key quantities: norm-squared of gradient/optimality gap and consensus gap between users’ models. They provide novel results on variance-reduction effect in HP sense and fine-grained MGF bounds for strongly convex costs.
Result: DSGD converges in high-probability under the same conditions as MSE convergence, removing uniformly bounded gradients and other restrictive assumptions. The method achieves order-optimal rates for both non-convex and strongly convex costs with linear speed-up in the number of users.
Conclusion: The paper bridges the gap between HP and MSE convergence assumptions for decentralized optimization, showing DSGD maintains strong performance in HP sense while matching existing MSE guarantees, with novel MGF analysis techniques that enable linear speed-up.
Abstract: Convergence in high-probability (HP) has been receiving increasing interest, due to its attractive properties, such as exponentially decaying tail bounds and strong guarantees for each individual run of an algorithm. While HP guarantees are extensively studied in centralized settings, much less is understood in the decentralized, networked setup. Existing HP studies in decentralized settings impose strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise, resulting in a significant gap between assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, even for vanilla Decentralized Stochastic Gradient Descent ($\mathtt{DSGD}$) algorithm. This is contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed to guarantee MSE convergence. Motivated by this observation, we revisit HP guarantees for $\mathtt{DSGD}$ in the presence of light-tailed noise. We show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing uniformly bounded gradients and other restrictive assumptions, while simultaneously achieving order-optimal rates for both non-convex and strongly convex costs. Moreover, our improved analysis yields linear speed-up in the number of users, demonstrating that $\mathtt{DSGD}$ maintains strong performance in the HP sense and matches existing MSE guarantees. Our improved results stem from a careful analysis of the MGF of quantities of interest (norm-squared of gradient or optimality gap) and the MGF of the consensus gap between users’ models. To achieve linear speed-up, we provide a novel result on the variance-reduction effect of decentralized methods in the HP sense and more fine-grained bounds on the MGF for strongly convex costs, which are both of independent interest.
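For reference, the vanilla DSGD update analyzed above takes the standard consensus-plus-gradient form; the notation below is the textbook formulation rather than the paper's exact symbols, with $W=[w_{ij}]$ the mixing matrix, $\alpha_t$ the step size, and $\xi_i^t$ the local stochastic sample at user $i$:

$$
x_i^{t+1} \;=\; \sum_{j=1}^{n} w_{ij}\, x_j^{t} \;-\; \alpha_t\, \nabla f_i\!\left(x_i^{t};\, \xi_i^{t}\right), \qquad i = 1,\dots,n.
$$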
[327] Twin Restricted Kernel Machines for Multiview Classification
A. Quadir, M. Sajid, Mushir Akhtar, M. Tanveer
Main category: cs.LG
TL;DR: TMvRKM is a novel multiview twin restricted kernel machine that improves computational efficiency and classification performance by using regularized least squares instead of quadratic programming, with fusion strategies to handle view inconsistencies.
Details
Motivation: Traditional multiview SVM approaches face challenges in high-dimensional spaces with kernel trick, are prone to errors, struggle with view inconsistencies, and require solving computationally expensive quadratic programming problems.
Method: Proposes TMvRKM that integrates kernel machines with multiview framework using regularized least squares approach instead of QPPs. Includes coupling term to balance errors across views and integrates early/late fusion strategies to leverage collective information while accommodating view-specific variations.
Result: TMvRKM consistently outperforms baseline models on UCI, KEEL, and AwA benchmark datasets, demonstrating exceptional generalization performance in all experimental scenarios with statistical significance.
Conclusion: TMvRKM effectively addresses computational and generalization challenges of traditional kernel-based multiview methods, offering improved efficiency and performance through its novel regularized least squares approach and fusion strategies.
Abstract: Multi-view learning (MVL) is an emerging field in machine learning that focuses on improving generalization performance by leveraging complementary information from multiple perspectives or views. Various multi-view support vector machine (MvSVM) approaches have been developed, demonstrating significant success. Moreover, these models face challenges in effectively capturing decision boundaries in high-dimensional spaces using the kernel trick. They are also prone to errors and struggle with view inconsistencies, which are common in multi-view datasets. In this work, we introduce the multiview twin restricted kernel machine (TMvRKM), a novel model that integrates the strengths of kernel machines with the multiview framework, addressing key computational and generalization challenges associated with traditional kernel-based approaches. Unlike traditional methods that rely on solving large quadratic programming problems (QPPs), the proposed TMvRKM efficiently determines an optimal separating hyperplane through a regularized least squares approach, enhancing both computational efficiency and classification performance. The primal objective of TMvRKM includes a coupling term designed to balance errors across multiple views effectively. By integrating early and late fusion strategies, TMvRKM leverages the collective information from all views during training while remaining flexible to variations specific to individual views. The proposed TMvRKM model is rigorously tested on UCI, KEEL, and AwA benchmark datasets. Both experimental results and statistical analyses consistently highlight its exceptional generalization performance, outperforming baseline models in every scenario.
[328] Yantra AI – An intelligence platform which interacts with manufacturing operations
Varshini Krishnamurthy
Main category: cs.LG
TL;DR: This dissertation develops an intelligent production system for XRIT integrating AI/ML models for predictive maintenance, energy management, and decision support, featuring real-time visualization and a GPT-4 virtual assistant.
Details
Motivation: Industry 4.0's rapid growth requires smart production systems that address key challenges like energy management, predictive maintenance, and AI-driven decision support to improve operational efficiency in manufacturing environments like XRIT.
Method: Developed an intelligent production system integrating machine learning models (Random Forest Classifier for proactive maintenance, Isolation Forest for outlier detection), Streamlit for real-time data visualization dashboards, and a GPT-4 powered virtual assistant for operational support.
Result: System testing with synthetic data demonstrated improved operational efficiency, energy management, and repair planning capabilities. The scalable system is ready for real-time deployment in XRIT’s production environment.
Conclusion: The AI-powered production system successfully addresses Industry 4.0 challenges, with future work focusing on real-time data integration and further system enhancements for practical deployment.
Abstract: Industry 4.0 is growing quickly, which has changed smart production by encouraging the use of real-time tracking, machine learning, and AI-driven systems to make operations run more smoothly. The main focus of this dissertation is on creating and testing an intelligent production system for XRIT that solves important problems like energy management, predictive maintenance, and AI-powered decision support. Machine learning models are built into the system, such as the Random Forest Classifier for proactive maintenance and the Isolation Forest for finding outliers. These models help with decision-making and reducing downtime. Streamlit makes real-time data visualisation possible, giving workers access to dashboards that they can interact with and see real-time observations. The system was tested with fake data and is made to be scalable, so it can be used in real time in XRIT’s production setting. Adding an AI-powered virtual assistant made with GPT-4 lets workers get real-time, useful information that makes complicated questions easier to answer and improves operational decisions. The testing shows that the system makes working efficiency, energy management, and the ability to plan repairs much better. Moving the system to real-time data merging and looking for other ways to make it better will be the main focus of future work.
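As a rough, hypothetical illustration of the two learning components named above (not the dissertation's code), the snippet below fits an Isolation Forest for outlier flagging and a Random Forest classifier for failure prediction on synthetic sensor readings; all feature names, thresholds, and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical sensor readings: temperature, vibration, power draw.
X = rng.normal(loc=[70.0, 0.3, 5.0], scale=[5.0, 0.1, 0.8], size=(2000, 3))
# Hypothetical failure label correlated with high temperature and vibration.
y = ((X[:, 0] > 74) & (X[:, 1] > 0.33)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Outlier detection: flag unusual operating points for inspection.
iso = IsolationForest(contamination=0.02, random_state=0).fit(X_train)
outlier_flags = iso.predict(X_test)            # -1 = outlier, 1 = inlier

# Proactive maintenance: predict impending failures.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("failure-prediction accuracy:", clf.score(X_test, y_test))
print("flagged outliers in test set:", int((outlier_flags == -1).sum()))
```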
[329] Semantic-Constrained Federated Aggregation: Convergence Theory and Privacy-Utility Bounds for Knowledge-Enhanced Distributed Learning
Jahidul Arafat
Main category: cs.LG
TL;DR: SCFA introduces semantic constraints into federated learning to address non-IID data issues, achieving faster convergence, better privacy-utility tradeoffs, and reduced model divergence through domain knowledge integration.
Details
Motivation: Federated learning suffers from slow convergence under non-IID data conditions, and existing solutions treat all client updates identically without considering semantic validity of the updates.
Method: Semantic-Constrained Federated Aggregation (SCFA) incorporates domain knowledge constraints into distributed optimization, using knowledge graphs encoding constraints from ISA-95 and MASON ontologies to regularize client updates.
Result: SCFA achieves 22% faster convergence, 41.3% model divergence reduction, and maintains utility within 3.7% of non-private baseline under differential privacy (vs 12.1% degradation for standard FL). Constraints reduce effective data heterogeneity by 41% and improve privacy-utility tradeoffs by factor 0.37.
Conclusion: Semantic constraints significantly improve federated learning performance under non-IID conditions, with theoretical convergence guarantees and empirical validation showing strong alignment between predictions and observations across multiple metrics.
Abstract: Federated learning enables collaborative model training across distributed data sources but suffers from slow convergence under non-IID data conditions. Existing solutions employ algorithmic modifications treating all client updates identically, ignoring semantic validity. We introduce Semantic-Constrained Federated Aggregation (SCFA), a theoretically-grounded framework incorporating domain knowledge constraints into distributed optimization. We prove SCFA achieves convergence rate O(1/sqrt(T) + rho) where rho represents constraint violation rate, establishing the first convergence theory for constraint-based federated learning. Our analysis shows constraints reduce effective data heterogeneity by 41% and improve privacy-utility tradeoffs through hypothesis space reduction by factor theta=0.37. Under (epsilon,delta)-differential privacy with epsilon=10, constraint regularization maintains utility within 3.7% of non-private baseline versus 12.1% degradation for standard federated learning, representing 2.7x improvement. We validate our framework on manufacturing predictive maintenance using Bosch production data with 1.18 million samples and 968 sensor features, constructing knowledge graphs encoding 3,000 constraints from ISA-95 and MASON ontologies. Experiments demonstrate 22% faster convergence, 41.3% model divergence reduction, and constraint violation thresholds where rho<0.05 maintains 90% optimal performance while rho>0.18 causes catastrophic failure. Our theoretical predictions match empirical observations with R^2>0.90 across convergence, privacy, and violation-performance relationships.
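The paper's constraint graphs and aggregation rule are not reproduced here, but the general idea of constraint-aware aggregation can be sketched as follows: clients add a penalty for violating a domain constraint during local training, and the server down-weights updates that still violate it. The toy non-negativity constraint, penalty weight, and weighting scheme below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_clients, rounds, lr, lam = 8, 5, 30, 0.1, 5.0

def constraint_violation(w):
    # Toy stand-in for a knowledge-graph constraint: weights on the first two
    # features must be non-negative (e.g. "wear increases failure risk").
    return float(np.sum(np.maximum(0.0, -w[:2])))

# Non-IID synthetic linear-regression data per client.
true_w = rng.normal(size=dim)
true_w[:2] = np.abs(true_w[:2])
clients = []
for i in range(n_clients):
    A = rng.normal(loc=i * 0.3, size=(50, dim))
    clients.append((A, A @ true_w + 0.1 * rng.normal(size=50)))

w_global = np.zeros(dim)
for r in range(rounds):
    updates, weights = [], []
    for A, b in clients:
        w = w_global.copy()
        for _ in range(5):                             # a few local SGD epochs
            grad = A.T @ (A @ w - b) / len(b)
            grad[:2] += lam * np.where(w[:2] < 0, -1.0, 0.0)   # constraint regularizer
            w -= lr * grad
        updates.append(w)
        # Down-weight semantically invalid updates instead of treating all equally.
        weights.append(1.0 / (1.0 + constraint_violation(w)))
    weights = np.array(weights) / np.sum(weights)
    w_global = np.sum([wt * u for wt, u in zip(weights, updates)], axis=0)

print("violation of final global model:", constraint_violation(w_global))
```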
[330] A Tutorial on Dimensionless Learning: Geometric Interpretation and the Effect of Noise
Zhengtao Jake Gan, Xiaoyu Xie
Main category: cs.LG
TL;DR: Dimensionless learning is a data-driven framework that combines dimensional analysis with machine learning to discover dimensionless numbers and scaling laws from experimental data, using neural networks with regularization for interpretable physical laws.
Details
Motivation: To develop a systematic, data-driven approach for discovering physical scaling laws and dimensionless numbers from experimental measurements, bridging classical dimensional analysis with modern machine learning to reveal dimensional invariance in physical systems.
Method: The method starts with experimental measurements of physical quantities, identifies fundamental ways to combine variables into dimensionless groups, then uses neural networks to discover which combinations best predict experimental output. A key innovation is a regularization technique that encourages learned coefficients to take simple, interpretable values (integers or half-integers).
Result: The method successfully handles cases with single or multiple dimensionless numbers, demonstrates robustness to measurement noise and discrete sampling through regularization, and reveals how different but equivalent representations can capture the same underlying physics.
Conclusion: Dimensionless learning provides an effective framework for discovering interpretable physical laws from experimental data, though challenges remain in computational cost, understanding data characteristics, automating variable selection, and developing user-friendly tools for experimentalists.
Abstract: Dimensionless learning is a data-driven framework for discovering dimensionless numbers and scaling laws from experimental measurements. This tutorial introduces the method, explaining how it transforms experimental data into compact physical laws that reveal compact dimensional invariance between variables. The approach combines classical dimensional analysis with modern machine learning techniques. Starting from measurements of physical quantities, the method identifies the fundamental ways to combine variables into dimensionless groups, then uses neural networks to discover which combinations best predict the experimental output. A key innovation is a regularization technique that encourages the learned coefficients to take simple, interpretable values like integers or half-integers, making the discovered laws both accurate and physically meaningful. We systematically investigate how measurement noise and discrete sampling affect the discovery process, demonstrating that the regularization approach provides robustness to experimental uncertainties. The method successfully handles cases with single or multiple dimensionless numbers, revealing how different but equivalent representations can capture the same underlying physics. Despite recent progress, key challenges remain, including managing the computational cost of identifying multiple dimensionless groups, understanding the influence of data characteristics, automating the selection of relevant input variables, and developing user-friendly tools for experimentalists. This tutorial serves as both an educational resource and a practical guide for researchers seeking to apply dimensionless learning to their experimental data.
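A minimal sketch of the kind of regularizer described above: a penalty that pulls candidate exponents of a dimensionless group toward the nearest half-integer. The quadratic form and strength are assumptions; the tutorial's actual network and training loop are not shown.

```python
import numpy as np

def half_integer_penalty(coeffs, strength=0.1):
    """Penalty pulling learned exponents toward the nearest half-integer,
    e.g. 0.48 -> 0.5, 1.97 -> 2.0 (illustrative form, not the tutorial's exact term)."""
    coeffs = np.asarray(coeffs)
    nearest = np.round(coeffs * 2.0) / 2.0
    return strength * np.sum((coeffs - nearest) ** 2)

# Example: candidate exponents for a dimensionless group Pi = x1^a1 * x2^a2 * x3^a3.
coeffs = np.array([0.47, -1.03, 2.02])
print("penalty:", half_integer_penalty(coeffs))
print("snapped exponents:", np.round(coeffs * 2.0) / 2.0)
```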
[331] Machine Learning Framework for Thrombosis Risk Prediction in Rotary Blood Pumps
Christopher Blum, Michael Neidlin
Main category: cs.LG
TL;DR: Interpretable ML framework links CFD flow features to thrombosis risk in blood pumps, enabling efficient thrombogenicity screening with mechanistic transparency.
Details
Motivation: Existing computational models fail to reliably predict thrombosis risk in rotary blood pumps due to incomplete understanding of how specific flow features contribute to thrombus initiation and growth.
Method: Logistic regression model with structured feature-selection pipeline trained on spatial risk patterns from validated macro-scale thrombosis model; uses CFD-derived flow features including nonlinear combinations.
Result: Model reproduces labeled risk distributions, identifies distinct flow features associated with increased thrombosis risk, and predicts plausible thrombosis-prone regions in centrifugal pump despite training on axial pump data.
Conclusion: Interpretable ML can efficiently link local flow features to thrombosis risk, enabling rapid thrombogenicity screening without costly simulations, complementing physics-based modeling for device design workflows.
Abstract: Thrombosis in rotary blood pumps arises from complex flow conditions that remain difficult to translate into reliable and interpretable risk predictions using existing computational models. This limitation reflects an incomplete understanding of how specific flow features contribute to thrombus initiation and growth. This study introduces an interpretable machine learning framework for spatial thrombosis assessment based directly on computational fluid dynamics-derived flow features. A logistic regression (LR) model combined with a structured feature-selection pipeline is used to derive a compact and physically interpretable feature set, including nonlinear feature combinations. The framework is trained using spatial risk patterns from a validated, macro-scale thrombosis model for two representative scenarios. The model reproduces the labeled risk distributions and identifies distinct sets of flow features associated with increased thrombosis risk. When applied to a centrifugal pump, despite training on a single axial pump operating point, the model predicts plausible thrombosis-prone regions. These results show that interpretable machine learning can link local flow features to thrombosis risk while remaining computationally efficient and mechanistically transparent. The low computational cost enables rapid thrombogenicity screening without repeated or costly simulations. The proposed framework complements physics-based thrombosis modeling and provides a methodological basis for integrating interpretable machine learning into CFD-driven thrombosis analysis and device design workflows.
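To illustrate the modelling pattern (not the authors' pipeline), the snippet below wires a feature-selection step and a logistic regression into a scikit-learn pipeline on synthetic, hypothetical CFD-style features such as shear rate and residence time.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical per-cell CFD features: shear rate, residence time, their product, noise.
shear = rng.lognormal(mean=1.0, sigma=0.5, size=1000)
residence = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
X = np.column_stack([shear, residence, shear * residence, rng.normal(size=1000)])
# Toy risk label: low shear combined with long residence time.
y = ((shear < 2.0) & (residence > 1.2)).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("selected feature mask:", pipe.named_steps["select"].get_support())
print("training accuracy:", pipe.score(X, y))
```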
[332] Cross-Sample Augmented Test-Time Adaptation for Personalized Intraoperative Hypotension Prediction
Kanxue Li, Yibing Zhan, Hua Jin, Chongchong Qi, Xu Lin, Baosheng Yu
Main category: cs.LG
TL;DR: CSA-TTA: Cross-Sample Augmented Test-Time Adaptation framework for intraoperative hypotension prediction that enhances personalization by incorporating hypotension events from other patients when test data is limited.
Details
Motivation: Intraoperative hypotension (IOH) prediction is challenging due to patient-specific variability and rarity of events, making test-time adaptation unreliable with limited patient data.
Method: Proposes CSA-TTA with: 1) Cross-sample bank of historical hypotensive/non-hypotensive samples, 2) Coarse-to-fine retrieval (K-Shape clustering + top-K semantic similarity), 3) Self-supervised masked reconstruction and retrospective sequence forecasting for training.
Result: Evaluated on VitalDB and real-world datasets with TimesFM and UniTS models. Improved Recall by +1.33% and F1 by +1.13% with fine-tuning, and +7.46% Recall and +5.07% F1 in zero-shot scenarios.
Conclusion: CSA-TTA effectively addresses data scarcity in IOH prediction by leveraging cross-patient information, demonstrating strong robustness and generalization across different settings and models.
Abstract: Intraoperative hypotension (IOH) poses significant surgical risks, but accurate prediction remains challenging due to patient-specific variability. While test-time adaptation (TTA) offers a promising approach for personalized prediction, the rarity of IOH events often leads to unreliable test-time training. To address this, we propose CSA-TTA, a novel Cross-Sample Augmented Test-Time Adaptation framework that enhances training by incorporating hypotension events from other individuals. Specifically, we first construct a cross-sample bank by segmenting historical data into hypotensive and non-hypotensive samples. Then, we introduce a coarse-to-fine retrieval strategy for building test-time training data: we initially apply K-Shape clustering to identify representative cluster centers and subsequently retrieve the top-K semantically similar samples based on the current patient signal. Additionally, we integrate both self-supervised masked reconstruction and retrospective sequence forecasting signals during training to enhance model adaptability to rapid and subtle intraoperative dynamics. We evaluate the proposed CSA-TTA on both the VitalDB dataset and a real-world in-hospital dataset by integrating it with state-of-the-art time series forecasting models, including TimesFM and UniTS. CSA-TTA consistently enhances performance across settings-for instance, on VitalDB, it improves Recall and F1 scores by +1.33% and +1.13%, respectively, under fine-tuning, and by +7.46% and +5.07% in zero-shot scenarios-demonstrating strong robustness and generalization.
[333] AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs
Anshul Kumar, Gagan Raj Gupta, Manisha Chawla
Main category: cs.LG
TL;DR: AdaGradSelect is an adaptive PEFT method for SLMs that selectively updates transformer blocks based on gradient norms, achieving near-full-finetuning performance with 12% faster training and 35% less GPU memory.
Details
Motivation: Full fine-tuning of LLMs is expensive and memory-intensive. While PEFT methods like LoRA reduce costs, they restrict training to limited subspaces, potentially reducing performance. For SLMs where efficiency matters even more, there's a need for better adaptive fine-tuning approaches.
Method: AdaGradSelect adaptively selects which transformer blocks to update based on gradient norms. It uses Dirichlet-based sampling (dependent on past update frequency) combined with epsilon-greedy exploration, allowing exploration of different blocks early in training and gradual focus on the most important blocks later.
Result: AdaGradSelect trains 12% faster, uses 35% less GPU memory, and achieves performance very close to full fine-tuning. On GSM8K, it outperforms LoRA (rank 256) by ~3% across Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B models, with similar accuracy on MATH dataset.
Conclusion: AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods for SLMs, balancing exploration and exploitation of transformer blocks to achieve near-full-finetuning performance with significantly reduced computational costs.
Abstract: Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
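A schematic of the selection rule only, under assumed parameter values: blocks are scored by gradient norm, sampled through a Dirichlet distribution over past update counts, and occasionally explored uniformly with probability epsilon. The actual fine-tuning loop and models are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, n_steps, eps, k = 12, 50, 0.2, 3   # update k blocks per step (illustrative)
update_counts = np.ones(n_blocks)            # Dirichlet pseudo-counts

for step in range(n_steps):
    # Stand-in for per-block gradient norms measured on the current batch.
    grad_norms = rng.gamma(shape=2.0, scale=1.0, size=n_blocks)

    if rng.random() < eps:
        # Epsilon-greedy exploration: pick k blocks uniformly at random.
        chosen = rng.choice(n_blocks, size=k, replace=False)
    else:
        # Exploit: sample block probabilities from a Dirichlet whose concentration
        # depends on past update frequency, then weight by current gradient norms.
        probs = rng.dirichlet(update_counts) * grad_norms
        probs /= probs.sum()
        chosen = rng.choice(n_blocks, size=k, replace=False, p=probs)

    update_counts[chosen] += 1               # reinforce frequently useful blocks
    # ... only the chosen transformer blocks would be unfrozen and updated here.

print("most frequently updated blocks:", np.argsort(-update_counts)[:k])
```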
[334] Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic
MĂŠlissa Tamine, Otmane Sakhi, Benjamin Heymann
Main category: cs.LG
TL;DR: DPO’s mathematical structure enables scalable Shapley value computation for LLM data valuation, solving computational challenges in collaborative training.
Details
Motivation: Data valuation is crucial for LLM training decisions and fair benefit distribution among data owners, but traditional Shapley value computation is computationally prohibitive for large models.
Method: Leverage the specific mathematical structure of Direct Preference Optimization (DPO) to enable scalable computation of Shapley values for data valuation in LLMs.
Result: DPO dramatically simplifies the computational challenge of Shapley value computation, making data valuation feasible for large language models.
Conclusion: This breakthrough unlocks numerous applications at the intersection of data valuation and LLMs, enabling informed data curation decisions and fair collaborative training.
Abstract: Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and data sources investment? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, data valuation, which is not specific to large language models, has been addressed by the machine learning community through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However, computing Shapley values is notoriously expensive for data valuation, typically requiring numerous model retrainings, which can become prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
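The paper's contribution is a DPO-specific shortcut, which is not reproduced here; the sketch below instead shows the baseline being accelerated, namely Shapley values computed by enumerating permutations of data owners, with a toy utility table standing in for retraining and evaluating a model on each coalition.

```python
import itertools

def shapley_values(players, utility):
    """Exact Shapley values by enumerating permutations; feasible only for a
    handful of data owners, which is precisely the bottleneck the paper targets."""
    values = {p: 0.0 for p in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        coalition = []
        for p in perm:
            before = utility(frozenset(coalition))
            coalition.append(p)
            values[p] += utility(frozenset(coalition)) - before
    return {p: v / len(perms) for p, v in values.items()}

# Toy utility: model quality as a function of which owners' datasets are pooled
# (hypothetical numbers, standing in for a retraining/evaluation run per coalition).
quality = {frozenset(): 0.0, frozenset("A"): 0.4, frozenset("B"): 0.3,
           frozenset("C"): 0.1, frozenset("AB"): 0.8, frozenset("AC"): 0.5,
           frozenset("BC"): 0.45, frozenset("ABC"): 0.9}
print(shapley_values("ABC", lambda s: quality[s]))
```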
[335] Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework
M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta
Main category: cs.LG
TL;DR: Hybrid twin approach combines physics-based FEM models with GNN-learned ignorance corrections using sparse spatial data to improve simulation accuracy.
Details
Motivation: Physics-based models (like FEM) have discrepancies from reality due to unmodeled effects, but purely data-driven approaches require large amounts of dense spatial-temporal data which is often unavailable in practice.
Method: Use Graph Neural Networks (GNNs) to model the ignorance component (gap between physics-based model and reality). GNNs learn spatial patterns of missing physics from limited measurement locations, allowing physics-based models to be enriched with data-driven corrections without requiring dense data.
Result: The GNN-based hybrid twin successfully captures ignorance and generalizes corrections across different meshes, geometries, and load positions in nonlinear heat transfer problems, improving simulation accuracy and interpretability while minimizing data requirements.
Conclusion: GNNs enable effective hybrid modeling by learning spatial patterns of model discrepancies from sparse measurements, bridging the gap between physics-based simulations and reality without requiring extensive data collection.
Abstract: Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element Method (FEM). However, these models often exhibit discrepancies from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making full data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach, instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response; therefore, it can be learned with significantly fewer data. A key difficulty, however, is that spatial measurements are sparse; moreover, obtaining data measuring the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.
[336] TENG++: Time-Evolving Natural Gradient for Solving PDEs With Deep Neural Nets under General Boundary Conditions
Xinjie He, Chenggong Zhang
Main category: cs.LG
TL;DR: Extends Time-Evolving Natural Gradient (TENG) framework to handle Dirichlet boundary conditions in PDEs using natural gradient optimization with time-stepping schemes (Euler/Heun), achieving improved accuracy and stability.
Details
Motivation: Traditional numerical methods struggle with high-dimensional/complex PDEs, while PINNs face accuracy and boundary condition challenges. Need better neural network-based PDE solvers that can handle complex boundary conditions effectively.
Method: Extends TENG framework to incorporate Dirichlet boundary conditions by adding penalty terms to loss function. Uses natural gradient optimization with numerical time-stepping schemes (Euler and Heun methods) to ensure stability and accuracy.
Result: Heun method shows superior accuracy due to second-order corrections, while Euler method offers computational efficiency for simpler scenarios. Framework successfully enforces Dirichlet constraints on heat equation.
Conclusion: Establishes foundation for extending to Neumann/mixed boundary conditions and broader PDE classes, advancing neural network-based solvers for real-world applications.
Abstract: Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges in achieving high accuracy and handling complex boundary conditions. In this work, we extend the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions, integrating natural gradient optimization with numerical time-stepping schemes, including Euler and Heun methods, to ensure both stability and accuracy. By incorporating boundary condition penalty terms into the loss function, the proposed approach enables precise enforcement of Dirichlet constraints. Experiments on the heat equation demonstrate the superior accuracy of the Heun method due to its second-order corrections and the computational efficiency of the Euler method for simpler scenarios. This work establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader classes of PDEs, advancing the applicability of neural network-based solvers for real-world problems.
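Only the Dirichlet-penalty idea is illustrated below, in a minimal PyTorch sketch for the 1D heat equation: the loss combines the PDE residual on interior collocation points with a weighted penalty forcing u = 0 on the spatial boundary. Adam is used purely for brevity; the natural-gradient and Euler/Heun time-stepping machinery of TENG is not reproduced, and the penalty weight is an assumed value.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def heat_residual(xt):
    """Residual of u_t - u_xx for the 1D heat equation at collocation points (x, t)."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = grads[:, 0:1], grads[:, 1:2]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x), create_graph=True)[0][:, 0:1]
    return u_t - u_xx

interior = torch.rand(256, 2)                                          # (x, t) in [0, 1]^2
left = torch.cat([torch.zeros(64, 1), torch.rand(64, 1)], dim=1)       # x = 0 edge
right = torch.cat([torch.ones(64, 1), torch.rand(64, 1)], dim=1)       # x = 1 edge
boundary = torch.cat([left, right])

lam = 10.0                                          # boundary penalty weight (assumed)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    pde_loss = heat_residual(interior).pow(2).mean()
    bc_loss = net(boundary).pow(2).mean()           # Dirichlet condition u = 0 on x in {0, 1}
    loss = pde_loss + lam * bc_loss                 # penalty-augmented loss
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```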
[337] TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration
Ye Li, Jiahe Feng, Yuan Meng, Kangye Ji, Chen Tang, Xinwan Wen, Shutao Xia, Zhi Wang, Wenwu Zhu
Main category: cs.LG
TL;DR: TS-DP introduces speculative decoding to Diffusion Policy, achieving 4.17× faster inference with 94% draft acceptance while maintaining performance, enabling real-time diffusion-based control.
Details
Motivation: Diffusion Policy suffers from high inference latency due to multiple iterative denoising steps, making real-time embodied control challenging. Static acceleration methods fail to handle dynamic embodied tasks with time-varying difficulty.
Method: Two-stage approach: 1) Distill a Transformer-based drafter to imitate the base model’s denoising, replacing costly denoising calls. 2) Use RL-based scheduler to adaptively adjust speculative parameters based on time-varying task difficulty.
Result: Achieves up to 4.17× faster inference with over 94% accepted drafts, reaching 25 Hz inference frequency. Enables real-time diffusion-based control without performance degradation across diverse embodied environments.
Conclusion: TS-DP successfully enables speculative decoding for Diffusion Policy with temporal adaptivity, solving the challenge of dynamic embodied tasks while maintaining accuracy and significantly improving efficiency.
Abstract: Diffusion Policy (DP) excels in embodied control but suffers from high inference latency and computational cost due to multiple iterative denoising steps. The temporal complexity of embodied tasks demands a dynamic and adaptable computation mode. Static and lossy acceleration methods, such as quantization, fail to handle such dynamic embodied tasks, while speculative decoding offers a lossless and adaptive yet underexplored alternative for DP. However, it is non-trivial to address the following challenges: how to match the base model’s denoising quality at lower cost under time-varying task difficulty in embodied settings, and how to dynamically and interactively adjust computation based on task difficulty in such environments. In this paper, we propose Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP), the first framework that enables speculative decoding for DP with temporal adaptivity. First, to handle dynamic environments where task difficulty varies over time, we distill a Transformer-based drafter to imitate the base model and replace its costly denoising calls. Second, an RL-based scheduler further adapts to time-varying task difficulty by adjusting speculative parameters to maintain accuracy while improving efficiency. Extensive experiments across diverse embodied environments demonstrate that TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz and enabling real-time diffusion-based control without performance degradation.
[338] Adversarial Robustness in Financial Machine Learning: Defenses, Economic Impact, and Governance Evidence
Samruddhi Baviskar
Main category: cs.LG
TL;DR: Adversarial attacks degrade tabular ML model performance in financial applications, with partial recovery via adversarial training.
Details
Motivation: Financial decision-making models need robustness assessment against adversarial attacks to ensure reliability in credit scoring and fraud detection.
Method: Apply gradient-based attacks on tabular ML models using credit scoring and fraud detection datasets, then evaluate adversarial training for defense.
Result: Small perturbations cause significant performance degradation; adversarial training provides partial but not complete recovery of model robustness.
Conclusion: Financial ML models are vulnerable to adversarial attacks, requiring robust defense strategies like adversarial training for practical deployment.
Abstract: We evaluate adversarial robustness in tabular machine learning models used in financial decision making. Using credit scoring and fraud detection data, we apply gradient based attacks and measure impacts on discrimination, calibration, and financial risk metrics. Results show notable performance degradation under small perturbations and partial recovery through adversarial training.
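A self-contained sketch of the attack family used (gradient-based, FGSM-style) on a small NumPy logistic-regression model with hypothetical tabular features; the data, model, and epsilon here are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "credit scoring" data: two informative features plus noise (all hypothetical).
X = rng.normal(size=(500, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.2 * rng.normal(size=500) > 0).astype(float)

# Fit a plain logistic regression by gradient descent.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

def fgsm(x, label, eps):
    """FGSM-style perturbation: step in the sign of the loss gradient w.r.t. the input."""
    grad_x = (sigmoid(x @ w + b) - label) * w      # d(logistic loss)/dx
    return x + eps * np.sign(grad_x)

def accuracy(Xs):
    return np.mean((sigmoid(Xs @ w + b) > 0.5) == y)

X_adv = np.array([fgsm(x, yi, eps=0.2) for x, yi in zip(X, y)])
print("clean accuracy:", accuracy(X), "adversarial accuracy:", accuracy(X_adv))
```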
[339] Boosting t-SNE Efficiency for Sequencing Data: Insights from Kernel Selection
Avais Jan, Prakash Chourasia, Sarwan Ali, Murray Patterson
Main category: cs.LG
TL;DR: t-SNE with cosine similarity kernel outperforms Gaussian and isolation kernels for biological sequence visualization and analysis across multiple datasets and embedding methods.
Details
Motivation: Traditional t-SNE uses Gaussian kernel which lacks data-dependence and has computational limitations for categorical biological sequences. Recent isolation kernel alternatives may not optimally capture sequence similarities, creating a need for better kernel functions.
Method: Comprehensive evaluation of nine kernel functions for t-SNE applied to molecular sequences using three embedding methods (One-Hot Encoding, Spike2Vec, minimizers). Evaluation includes subjective visualization and objective metrics (neighborhood preservation scores), plus classification/clustering experiments across six biological datasets with multiple ML algorithms.
Result: Cosine similarity kernel outperforms other kernels including Gaussian and isolation kernels, achieving superior runtime efficiency and better preservation of pairwise distances in low-dimensional space. Kernel selection significantly impacts both visualization quality and downstream analytical tasks.
Conclusion: Cosine similarity kernel provides the most robust performance across different biological data types and embedding strategies, making it particularly suitable for large-scale biological sequence analysis with t-SNE.
Abstract: Dimensionality reduction techniques are essential for visualizing and analyzing high-dimensional biological sequencing data. t-distributed Stochastic Neighbor Embedding (t-SNE) is widely used for this purpose, traditionally employing the Gaussian kernel to compute pairwise similarities. However, the Gaussian kernel’s lack of data-dependence and computational overhead limit its scalability and effectiveness for categorical biological sequences. Recent work proposed the isolation kernel as an alternative, yet it may not optimally capture sequence similarities. In this study, we comprehensively evaluate nine different kernel functions for t-SNE applied to molecular sequences, using three embedding methods: One-Hot Encoding, Spike2Vec, and minimizers. Through both subjective visualization and objective metrics (including neighborhood preservation scores), we demonstrate that the cosine similarity kernel in general outperforms other kernels, including Gaussian and isolation kernels, achieving superior runtime efficiency and better preservation of pairwise distances in low-dimensional space. We further validate our findings through extensive classification and clustering experiments across six diverse biological datasets (Spike7k, Host, ShortRead, Rabies, Genome, and Breast Cancer), employing multiple machine learning algorithms and evaluation metrics. Our results show that kernel selection significantly impacts not only visualization quality but also downstream analytical tasks, with the cosine similarity kernel providing the most robust performance across different data types and embedding strategies, making it particularly suitable for large-scale biological sequence analysis.
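The study swaps the similarity kernel inside t-SNE itself; a readily available stand-in with scikit-learn, shown below, is to change the input distance metric from Euclidean to cosine, which likewise alters the neighbourhoods the embedding preserves. Data and parameters are synthetic placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for sequence embeddings (e.g. one-hot or k-mer vectors): three noisy clusters.
centers = rng.normal(size=(3, 64))
X = np.vstack([c + 0.3 * rng.normal(size=(100, 64)) for c in centers])

# Default t-SNE builds Gaussian-kernel affinities from Euclidean distances;
# switching the metric to cosine changes which neighbourhoods are preserved.
emb_euclid = TSNE(n_components=2, random_state=0).fit_transform(X)
emb_cosine = TSNE(n_components=2, metric="cosine", init="random",
                  random_state=0).fit_transform(X)
print(emb_euclid.shape, emb_cosine.shape)
```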
[340] Introduction to Symbolic Regression in the Physical Sciences
Deaglan J. Bartlett, Harry Desmond, Pedro G. Ferreira, Gabriel Kronberger
Main category: cs.LG
TL;DR: This is an introductory review for a Special Issue on Symbolic Regression in Physical Sciences, covering foundations, applications, challenges, and future directions of SR for scientific discovery.
Details
Motivation: Prompted by a Royal Society discussion meeting, this special issue aims to showcase symbolic regression's growing relevance in physical sciences for discovering interpretable mathematical relationships from data, bridging scientific discovery and empirical modeling.
Method: The review outlines conceptual foundations of symbolic regression, contrasts it with conventional regression, surveys main use cases in physical sciences, and discusses methodological considerations including search-space design, operator selection, complexity control, and integration with modern AI approaches.
Result: The collected contributions span applications from automated equation discovery and emergent-phenomena modeling to compact emulators for computationally expensive simulations, illustrating accelerating progress of SR across physical sciences.
Conclusion: Symbolic regression is emerging as a powerful method for scientific discovery in physical sciences, with ongoing challenges in scalability, robustness, and computational complexity, but promising directions in incorporating symmetry constraints, asymptotic behavior, and theoretical information.
Abstract: Symbolic regression (SR) has emerged as a powerful method for uncovering interpretable mathematical relationships from data, offering a novel route to both scientific discovery and efficient empirical modelling. This article introduces the Special Issue on Symbolic Regression for the Physical Sciences, motivated by the Royal Society discussion meeting held in April 2025. The contributions collected here span applications from automated equation discovery and emergent-phenomena modelling to the construction of compact emulators for computationally expensive simulations. The introductory review outlines the conceptual foundations of SR, contrasts it with conventional regression approaches, and surveys its main use cases in the physical sciences, including the derivation of effective theories, empirical functional forms and surrogate models. We summarise methodological considerations such as search-space design, operator selection, complexity control, feature selection, and integration with modern AI approaches. We also highlight ongoing challenges, including scalability, robustness to noise, overfitting and computational complexity. Finally we emphasise emerging directions, particularly the incorporation of symmetry constraints, asymptotic behaviour and other theoretical information. Taken together, the papers in this Special Issue illustrate the accelerating progress of SR and its growing relevance across the physical sciences.
[341] A Unification of Discrete, Gaussian, and Simplicial Diffusion
Nuria Alina Chandra, Yucen Lily Li, Alan N. Amin, Alex Ali, Joshua Rollins, Sebastian W. Ober, Aniruddh Raghu, Andrew Gordon Wilson
Main category: cs.LG
TL;DR: The paper presents a unified theory connecting three discrete diffusion methods through the Wright-Fisher population genetics model, enabling stable simplicial diffusion and multi-domain training.
Details
Motivation: Current discrete diffusion methods (discrete space, Gaussian, simplex) have disparate algorithms and tradeoffs, with no unified framework. Simplicial diffusion in particular suffers from numerical instability despite theoretical advantages.
Method: Builds a theory unifying all three discrete diffusion methods as different parameterizations of the Wright-Fisher population genetics model. Shows simplicial and Gaussian diffusion as large-population limits, leverages mathematical genetics literature for stability, and enables training single models that can perform diffusion in any domain at test time.
Result: Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. Models trained on multiple domains are competitive with models trained on individual domains.
Conclusion: The Wright-Fisher framework provides a unified theory connecting discrete diffusion methods, enables stable simplicial diffusion, and allows practitioners to switch between models without sacrificing performance, eliminating the need to balance trade-offs between different approaches.
Abstract: To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from a numerically unstable stochastic processes. Ideally we could see each of these models as instances of the same underlying framework, and enable practitioners to switch between models for downstream applications. However previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
[342] DSO: Direct Steering Optimization for Bias Mitigation
Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Luca Zappella, Nicholas Apostoloff
Main category: cs.LG
TL;DR: DSO uses RL to optimize activation steering for controllable bias reduction in VLMs/LLMs, achieving SOTA fairness-performance tradeoffs.
Details
Motivation: VLMs make biased decisions based on demographic attributes (e.g., failing to identify women as doctors). Users need inference-time control over bias-performance tradeoffs, but current steering methods struggle with equiprobable outcomes across groups.
Method: Direct Steering Optimization (DSO) uses reinforcement learning to find linear transformations for steering activations, specifically optimized to mitigate bias while maintaining performance control.
Result: DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, providing practitioners inference-time control over the trade-off.
Conclusion: Directly optimizing steering strategies for specific behavioral control (bias mitigation) is more effective than methods relying on pre-defined heuristics, enabling better inference-time intervention.
Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
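DSO learns its steering transformation with reinforcement learning, which is not shown here; the sketch below only illustrates the mechanical part, applying a fixed linear map A h + c to one layer's activations at inference time via a PyTorch forward hook. The model, layer choice, and steering parameters are placeholders.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),   # layer whose activations we steer
    torch.nn.Linear(32, 2),
)

# A fixed linear steering map h -> A h + c applied to hidden activations.
# In DSO these parameters would be optimized with RL; here they are random placeholders.
A = torch.eye(32) + 0.05 * torch.randn(32, 32)
c = 0.1 * torch.randn(32)

def steer(module, inputs, output):
    return output @ A.T + c                     # returning a value replaces the layer output

x = torch.randn(4, 16)
baseline = model(x)
handle = model[3].register_forward_hook(steer)  # hook the second ReLU's output
steered = model(x)
handle.remove()
print("max logit shift from steering:", (steered - baseline).abs().max().item())
```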
[343] BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research
Tiancheng Gao, Scott C. Lowe, Brendan Furneaux, Angel X Chang, Graham W. Taylor
Main category: cs.LG
TL;DR: BarcodeMamba+ is a foundation model for fungal DNA barcode classification using state-space model architecture with pretrain-finetune paradigm, outperforming existing methods on challenging fungal classification benchmarks.
Details
Motivation: Fungal biodiversity monitoring faces extreme challenges due to sparse labeling and long-tailed taxa distributions in DNA barcoding. Conventional supervised learning methods struggle with generalization to unseen species and capturing hierarchical taxonomic relationships.
Method: Uses state-space model architecture with pretrain-finetune paradigm leveraging partially labeled data. Integrates hierarchical label smoothing, weighted loss function, and multi-head output layer from MycoAI to address fungal taxonomy challenges.
Result: Outperforms existing methods across all taxonomic levels on challenging fungal classification benchmark with distinct taxonomic distribution shifts. Each enhancement component yields significant performance gains.
Conclusion: Provides a powerful new tool for genomics-based biodiversity research and establishes an effective, scalable training paradigm for data-sparse fungal classification domains.
Abstract: Accurate taxonomic classification from DNA barcodes is a cornerstone of global biodiversity monitoring, yet fungi present extreme challenges due to sparse labelling and long-tailed taxa distributions. Conventional supervised learning methods often falter in this domain, struggling to generalize to unseen species and to capture the hierarchical nature of the data. To address these limitations, we introduce BarcodeMamba+, a foundation model for fungal barcode classification built on a powerful and efficient state-space model architecture. We employ a pretrain and fine-tune paradigm, which utilizes partially labelled data and we demonstrate this is substantially more effective than traditional fully-supervised methods in this data-sparse environment. During fine-tuning, we systematically integrate and evaluate a suite of enhancements–including hierarchical label smoothing, a weighted loss function, and a multi-head output layer from MycoAI–to specifically tackle the challenges of fungal taxonomy. Our experiments show that each of these components yields significant performance gains. On a challenging fungal classification benchmark with distinct taxonomic distribution shifts from the broad training set, our final model outperforms a range of existing methods across all taxonomic levels. Our work provides a powerful new tool for genomics-based biodiversity research and establishes an effective and scalable training paradigm for this challenging domain. Our code is publicly available at https://github.com/bioscan-ml/BarcodeMamba.
[344] In-Context Semi-Supervised Learning
Jiashuo Fan, Paul Rosu, Aaron T. Wang, Michael Li, Lawrence Carin, Xiang Cheng
Main category: cs.LG
TL;DR: Transformers can leverage unlabeled context in semi-supervised learning settings to improve performance with limited labeled data.
Details
Motivation: Most theoretical work on Transformers' in-context learning focuses on supervised settings with labeled pairs, but in practice Transformers perform well even with sparse or absent labels, suggesting unlabeled contextual demonstrations contain important structure.
Method: Introduces in-context semi-supervised learning (IC-SSL) where a small set of labeled examples is accompanied by many unlabeled points, and shows Transformers can leverage unlabeled context to learn robust, context-dependent representations.
Result: Transformers can use unlabeled context to enable accurate predictions and markedly improve performance in low-label regimes.
Conclusion: This work offers foundational insights into how Transformers exploit unlabeled context for representation learning within the in-context learning framework.
Abstract: There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework.
[345] SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
Vegard Flovik
Main category: cs.LG
TL;DR: SALVE is a framework for discovering, validating, and controlling features in deep neural networks using sparse autoencoders and weight-space editing.
Details
Motivation: Deep neural networks achieve impressive performance but remain difficult to interpret and control, creating a need for principled methods to understand and manipulate model behavior.
Method: Uses $\ell_1$-regularized autoencoder to learn sparse feature basis without supervision, validates features with Grad-FAM saliency mapping, and performs precise weight-space interventions using autoencoder structure.
Result: Demonstrates consistent, interpretable control over both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, with derivation of critical suppression threshold α_crit for robustness diagnostics.
Conclusion: Provides a principled methodology for turning feature discovery into actionable model edits, advancing transparent and controllable AI systems.
Abstract: Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified “discover, validate, and control” framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder’s structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{crit}$, quantifying each class’s reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
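A minimal sketch of the first stage only, assuming activations have already been collected: an $\ell_1$-regularized autoencoder trained to reconstruct activation vectors while keeping latent features sparse. Dimensions, the sparsity weight, and the ReLU latent nonlinearity are assumptions; Grad-FAM validation and weight-space editing are not shown.

```python
import torch

torch.manual_seed(0)
X = torch.randn(512, 64)                       # stand-in for a layer's activations

enc = torch.nn.Linear(64, 128)
dec = torch.nn.Linear(128, 64)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
l1_weight = 1e-3                               # sparsity strength (assumed value)

for step in range(300):
    opt.zero_grad()
    z = torch.relu(enc(X))                     # latent feature activations
    recon = dec(z)
    loss = torch.nn.functional.mse_loss(recon, X) + l1_weight * z.abs().mean()
    loss.backward()
    opt.step()

sparsity = (torch.relu(enc(X)) > 1e-3).float().mean()
print(f"reconstruction loss: {loss.item():.4f}, active latent fraction: {sparsity:.3f}")
```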
[346] AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines
Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini
Main category: cs.LG
TL;DR: AIE4ML: First comprehensive framework for converting AI models to optimized firmware for AMD’s AIE-ML devices, achieving near-peak performance with on-chip execution and GPU-class throughput under microsecond latency.
Details
Motivation: Efficient AI inference on AMD's Versal AI Engine is challenging due to VLIW execution, explicit datapaths, and local memory management. Prior work only optimized single kernels without addressing full neural network execution across the 2D array.
Method: Comprehensive framework that: 1) Optimizes single kernels to near architectural peak, 2) Provides structured parallelization across 2D AIE-ML fabric with on-chip memory tiles, 3) Uses novel graph placement/search algorithm for deterministic, compact placements, 4) Accepts quantized models from high-level tools while preserving bit-exactness.
Result: Achieves up to 98.6% efficiency relative to single-kernel baseline, utilizes 296 of 304 AIE tiles (97.4%) with entirely on-chip data movement. Delivers GPU-class throughput under microsecond latency constraints for real-world model topologies.
Conclusion: AIE4ML is a practical framework for ultra-low-latency AI inference on AMD AIE-ML devices, suitable for applications like particle physics trigger systems, with forward compatibility for newer AIE-MLv2 architecture.
Abstract: Efficient AI inference on AMD’s Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first-generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for converting AI models automatically into optimized firmware targeting the AIE-ML generation devices, also with forward compatibility for the newer AIE-MLv2 architecture. At the single-kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that can scale across the 2D AIE-ML fabric and exploit its dedicated memory tiles to stay entirely on-chip throughout the model execution. As a demonstration, we designed a generalized and highly efficient linear-layer implementation with intrinsic support for fused bias addition and ReLU activation. Also, as our framework necessitates the generation of multi-layer implementations, our approach systematically derives deterministic, compact, and topology-optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework seamlessly accepts quantized models imported from high-level tools such as hls4ml or PyTorch while preserving bit-exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single-kernel baseline, utilizing 296 of 304 AIE tiles (97.4%) of the device with entirely on-chip data movement. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical companion for ultra-low-latency environments such as trigger systems in particle physics experiments.
[347] Governance by Evidence: Regulated Predictors in Decision-Tree Models
Alexios Veskoukis, Dimitris Kalles
Main category: cs.LG
TL;DR: Decision-tree studies frequently use predictors that fall under privacy-regulated data categories, with healthcare data being most prevalent, highlighting the need for privacy-preserving methods in ML practice.
Details
Motivation: Decision trees are widely used for interpretability but often utilize predictors (age, diagnosis codes, location) that are increasingly regulated by privacy laws. The study aims to analyze how real-world decision-tree applications handle legally governed data.Method: Compiled a corpus of decision-tree studies, assigned each reported predictor to regulated data categories (health data, biometric identifiers, children’s data, financial attributes, location traces, government IDs), then linked categories to specific excerpts in EU and US privacy laws.
Result: Many reported predictors fall into regulated categories, with healthcare data being the largest share. Found clear differences across industries in data usage patterns, analyzed prevalence, industry composition, and temporal patterns, and summarized regulation-aligned timing.
Conclusion: The evidence supports the need for privacy-preserving methods and governance checks in ML practice, with implications extending beyond decision trees to broader machine learning applications.
Abstract: Decision-tree methods are widely used on structured tabular data and are valued for interpretability across many sectors. However, published studies often list the predictors they use (for example age, diagnosis codes, location). Privacy laws increasingly regulate such data types. We use published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example health data, biometric identifiers, children’s data, financial attributes, location traces, and government IDs). We then link each category to specific excerpts in European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, and summarize regulation-aligned timing using each framework’s reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.
[348] Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models
Caner Erden
Main category: cs.LG
TL;DR: DR-RL is a reinforcement learning framework that dynamically optimizes low-rank factorization of attention in LLMs, balancing accuracy and efficiency through adaptive rank selection based on sequence context.
Details
Motivation: Traditional low-rank approximations use static ranks that lack flexibility across diverse input contexts, limiting their effectiveness. There's a need for adaptive rank selection that can respond to real-time sequence dynamics while maintaining theoretical rigor.Method: Uses RL agent to formulate rank selection as sequential policy optimization, with reward balancing attention fidelity vs computational latency. Integrates online matrix perturbation theory for incremental rank updates, lightweight Transformer-based policy network, and batched SVD operations for GPU scalability.
Result: Maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing FLOPs, especially for long sequences (L > 4096). Provides mathematically grounded alternative to heuristic rank reduction techniques.
Conclusion: DR-RL bridges adaptive efficiency and theoretical rigor in MHSA, offering principled dynamic rank optimization for resource-constrained deep learning with practical deployment on modern GPU architectures.
Abstract: We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions–limiting their flexibility across diverse input contexts–our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: https://github.com/canererden/DR_RL_Project
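To make the rank/efficiency trade-off concrete, here is a hedged sketch of rank-r factorized attention projections where the rank is selected at runtime; in the framework above, that selection would be made by the RL policy. Class names, shapes, and the scaling are illustrative assumptions, not the paper's implementation.

```python
# Sketch of low-rank attention projections with a runtime-selectable rank r.
# A policy choosing r per layer/sequence is what DR-RL-style methods learn;
# everything here (names, shapes) is illustrative, not the paper's code.
import torch
import torch.nn as nn

class LowRankProjection(nn.Module):
    """Approximates a dense projection W as U @ V with rank r <= r_max."""
    def __init__(self, d_model: int, r_max: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_model, r_max) / d_model ** 0.5)
        self.V = nn.Parameter(torch.randn(r_max, d_model) / r_max ** 0.5)

    def forward(self, x, r: int):
        # Only the first r rank-1 components are used; smaller r = fewer FLOPs.
        return (x @ self.U[:, :r]) @ self.V[:r, :]

def attention_with_rank(x, q_proj, k_proj, v_proj, r):
    q, k, v = q_proj(x, r), k_proj(x, r), v_proj(x, r)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d_model, r_max = 64, 32
projs = [LowRankProjection(d_model, r_max) for _ in range(3)]
x = torch.randn(2, 128, d_model)
out_cheap = attention_with_rank(x, *projs, r=8)    # aggressive approximation
out_full = attention_with_rank(x, *projs, r=32)    # closer to full rank
```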
[349] Tracking Wildfire Assets with Commodity RFID and Gaussian Process Modeling
John Hateley, Sriram Narasimhan, Omid Abari
Main category: cs.LG
TL;DR: A novel RFID-based localization method for tracking assets in forested environments without requiring known tag positions, achieving GPS-level accuracy at lower cost.
Details
Motivation: Current RFID systems struggle with tag localization in forested environments due to signal attenuation and multi-path effects. Existing fingerprinting methods require dispersing tags at known locations beforehand, which is impractical for wildfire response applications where assets need to be tracked without prior knowledge of their positions.Method: Proposes using Gaussian Processes to model different forest environments based solely on RF signal response signatures, without additional sensors like GPS or cameras. Uses a weighted log-likelihood method to match unknown RF environments to the closest match in a pre-modeled environment dictionary.
Result: Achieves localization accuracies comparable to GPS systems using passive commodity RFID. Enables tracking of dozens of wildfire assets simultaneously within mobile reader range, without requiring known positions to be tagged beforehand, and at a fraction of GPS cost.
Conclusion: The approach demonstrates that GPS-level localization accuracy is achievable with passive RFID in challenging forested environments, enabling cost-effective, scalable asset tracking for wildfire response without the need for prior position knowledge.
Abstract: This paper presents a novel, cost-effective, and scalable approach to track numerous assets distributed in forested environments using commodity Radio Frequency Identification (RFID), targeting wildfire response applications. Commodity RFID systems suffer from poor tag localization when dispersed in forested environments due to signal attenuation, multi-path effects, and environmental variability. Current methods that address this issue via fingerprinting rely on dispersing tags at known locations a priori. In this paper, we address the case when it is not possible to tag known locations and show that it is possible to localize tags to accuracies comparable to global positioning systems (GPS) without such a constraint. For this, we propose using Gaussian processes to model various environments solely from RF signal response signatures, without the aid of additional sensors such as GPS or cameras, and to match an unknown RF signature to the closest entry in a model dictionary. We utilize a new weighted log-likelihood method to associate an unknown environment with the closest environment in a dictionary of previously modeled environments, which is a crucial step in being able to use our approach. Our results show that it is possible to achieve localization accuracies on the order of GPS, but with passive commodity RFID, which allows dozens of wildfire assets to be tracked simultaneously within the vicinity of mobile readers, does not require known positions to be tagged a priori, and achieves localization at a fraction of the cost of GPS.
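A minimal sketch of the environment-matching idea, assuming one Gaussian process per known environment fitted to simple RSSI-versus-distance readings (a stand-in for the paper's RF signatures) and a weighted log-likelihood score over a new batch of readings. Environment names, the path-loss simulation, and kernel choices are illustrative assumptions.

```python
# Sketch: fit one GP per known environment, then score readings from an
# unknown site against each GP's predictive density and pick the best match.
# Purely illustrative; not the paper's models or data.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def simulate_env(path_loss_exp, n=40):
    d = rng.uniform(1.0, 10.0, size=(n, 1))
    rssi = -30 - 10 * path_loss_exp * np.log10(d[:, 0]) + rng.normal(0, 1.5, n)
    return d, rssi

# Dictionary of previously modeled environments (different path-loss exponents).
dictionary = {}
for name, n_exp in [("open_field", 2.0), ("sparse_forest", 2.8), ("dense_forest", 3.6)]:
    d, rssi = simulate_env(n_exp)
    gp = GaussianProcessRegressor(RBF(2.0) + WhiteKernel(1.0)).fit(d, rssi)
    dictionary[name] = gp

def match_environment(d_new, rssi_new, weights=None):
    """Weighted log-likelihood of new readings under each environment's GP."""
    weights = np.ones(len(rssi_new)) if weights is None else weights
    scores = {}
    for name, gp in dictionary.items():
        mu, std = gp.predict(d_new, return_std=True)
        scores[name] = float(np.sum(weights * norm.logpdf(rssi_new, mu, std)))
    return max(scores, key=scores.get), scores

d_q, rssi_q = simulate_env(2.8, n=15)        # readings from an unknown site
best, scores = match_environment(d_q, rssi_q)
print(best)
```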
[350] Provably Extracting the Features from a General Superposition
Allen Liu
Main category: cs.LG
TL;DR: Efficient algorithm recovers feature directions from superposition in overcomplete regime using Fourier-space iterative refinement.
Details
Motivation: Complex ML models encode features through linear representations in superposition, making them hard to recover. The overcomplete regime (n > d) is especially challenging for existing methods.Method: Query algorithm using noisy oracle access to f, iteratively refines search in Fourier space to locate hidden feature directions. Works with arbitrary superpositions and response functions, only requiring feature directions aren’t nearly identical.
Result: Algorithm efficiently identifies all non-degenerate feature directions and reconstructs function f. Works in more general setting than prior results - arbitrary superpositions and response functions.
Conclusion: Provides efficient method for learning features in superposition from black-box queries, addressing fundamental challenge in overcomplete regime with greater generality than previous approaches.
Abstract: It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function $f(x)=\sum_{i=1}^n a_i \sigma_i(v_i^\top x)$, where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results – we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.
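The query model above is easy to instantiate concretely. The sketch below builds a toy oracle $f(x)=\sum_i a_i \sigma_i(v_i^\top x)$ with more features than dimensions and additive noise; the recovery algorithm itself is not reproduced, and all numbers and response functions are illustrative assumptions.

```python
# Toy instantiation of the black-box query model: n unit feature directions
# v_i in R^d with n > d (superposition), arbitrary response functions sigma_i,
# and noisy oracle access to f. Illustrative setup only; the paper's Fourier
# search procedure is not implemented here.
import numpy as np

rng = np.random.default_rng(1)
d, n, noise = 8, 20, 1e-3              # overcomplete regime: n > d

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit feature directions v_i
a = rng.uniform(0.5, 1.5, size=n)               # coefficients a_i
basic = [np.tanh, np.sin, lambda t: np.maximum(t, 0.0)]
sigma = [basic[i % len(basic)] for i in range(n)]

def f_oracle(x):
    """Noisy oracle for f(x) = sum_i a_i * sigma_i(v_i . x)."""
    clean = sum(a[i] * sigma[i](V[i] @ x) for i in range(n))
    return clean + rng.normal(0, noise)

x = rng.normal(size=d)
print(f_oracle(x))
```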
[351] Higher-Order LaSDI: Reduced Order Modeling with Multiple Time Derivatives
Robert Stephany, William Michael Anderson, Youngsoo Choi
Main category: cs.LG
TL;DR: A method combining flexible finite-difference schemes with Rollout loss training to improve long-term predictive accuracy of reduced-order models for parameterized PDEs.
Details
Motivation: Traditional numerical methods for solving complex PDEs are computationally expensive. While reduced-order models (ROMs) offer faster approximations for parameterized PDE families, their predictive accuracy degrades significantly over long time horizons.Method: Two key innovations: (1) A flexible, high-order, inexpensive finite-difference scheme, and (2) A Rollout loss function that trains ROMs to maintain accuracy over arbitrary time horizons rather than just single-step predictions.
Result: The approach is demonstrated on the 2D Burgers equation, showing improved long-term predictive performance compared to traditional ROM training methods.
Conclusion: The proposed combination of efficient numerical schemes and Rollout loss training enables ROMs to maintain predictive accuracy over extended time horizons, addressing a key limitation in reduced-order modeling for parameterized PDEs.
Abstract: Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
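For readers unfamiliar with rollout-style training, here is a minimal sketch of the idea under stated assumptions: a small latent dynamics network is unrolled for H steps and penalized against the full ground-truth latent trajectory rather than only the next step. Shapes, names, and the residual one-step update are illustrative, not the paper's architecture.

```python
# Sketch of a multi-step Rollout loss for a latent (reduced-order) dynamics
# model: unroll the model H steps from the initial latent state and compare
# to the reference trajectory at every step. Illustrative only.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, d_latent: int, d_param: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent + d_param, width), nn.Tanh(),
            nn.Linear(width, d_latent),
        )

    def forward(self, z, p):
        return z + self.net(torch.cat([z, p], dim=-1))   # one time step

def rollout_loss(model, z_traj, p, horizon: int):
    """z_traj: (batch, T, d_latent) ground-truth latent trajectory."""
    z = z_traj[:, 0]
    loss = 0.0
    for t in range(1, horizon + 1):
        z = model(z, p)
        loss = loss + ((z - z_traj[:, t]) ** 2).mean()
    return loss / horizon

model = LatentDynamics(d_latent=5, d_param=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
z_traj = torch.randn(16, 11, 5)          # toy latent trajectories
p = torch.randn(16, 2)                   # PDE parameters
loss = rollout_loss(model, z_traj, p, horizon=10)
opt.zero_grad(); loss.backward(); opt.step()
```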
[352] Surrogate Neural Architecture Codesign Package (SNAC-Pack)
Jason Weitz, Dmitri Demler, Benjamin Hawks, Nhan Tran, Javier Duarte
Main category: cs.LG
TL;DR: SNAC-Pack is a hardware-aware neural architecture search framework that automates FPGA-optimized neural network design using resource and latency estimation instead of time-consuming synthesis for each candidate.
Details
Motivation: Existing neural architecture search methods struggle to optimize for real hardware performance, often relying on proxy metrics like bit operations that don't accurately reflect FPGA deployment constraints.Method: SNAC-Pack combines Neural Architecture Codesign’s multi-stage search with Resource Utilization and Latency Estimator for multi-objective optimization across accuracy, FPGA resource utilization, and latency without requiring synthesis for each candidate.
Result: Achieved 63.84% accuracy on high energy physics jet classification with resource estimation, and when synthesized on Xilinx Virtex UltraScale+ VU13P FPGA, matched baseline accuracy while maintaining comparable resource utilization to BOPs-optimized models.
Conclusion: SNAC-Pack demonstrates the potential of hardware-aware neural architecture search for resource-constrained deployments and provides an open-source framework for automating efficient FPGA-accelerated model design.
Abstract: Neural Architecture Search is a powerful approach for automating model design, but existing methods struggle to accurately optimize for real hardware performance, often relying on proxy metrics such as bit operations. We present Surrogate Neural Architecture Codesign Package (SNAC-Pack), an integrated framework that automates the discovery and optimization of neural networks focusing on FPGA deployment. SNAC-Pack combines Neural Architecture Codesign’s multi-stage search capabilities with the Resource Utilization and Latency Estimator, enabling multi-objective optimization across accuracy, FPGA resource utilization, and latency without requiring time-intensive synthesis for each candidate model. We demonstrate SNAC-Pack on a high energy physics jet classification task, achieving 63.84% accuracy with resource estimation. When synthesized on a Xilinx Virtex UltraScale+ VU13P FPGA, the SNAC-Pack model matches baseline accuracy while maintaining comparable resource utilization to models optimized using traditional BOPs metrics. This work demonstrates the potential of hardware-aware neural architecture search for resource-constrained deployments and provides an open-source framework for automating the design of efficient FPGA-accelerated models.
[353] Towards Fine-Tuning-Based Site Calibration for Knowledge-Guided Machine Learning: A Summary of Results
Ruolei Zeng, Arun Sharma, Shuai An, Mingzhou Yang, Shengya Zhang, Licheng Liu, David Mulla, Shashi Shekhar
Main category: cs.LG
TL;DR: FTBSC-KGML is a knowledge-guided machine learning framework that combines pretraining-fine-tuning with spatial-variability-aware transfer learning to improve agroecosystem carbon cycle quantification across heterogeneous regions.
Details
Motivation: Current approaches for quantifying agroecosystem carbon cycles often underutilize transfer learning and spatial heterogeneity, relying on location-independent parameterizations that limit applicability in regions with substantial variability.Method: FTBSC-KGML extends KGML-ag with a pretraining-fine-tuning process and site-specific parameters. It uses a spatial-heterogeneity-aware transfer-learning scheme: a globally pretrained model is fine-tuned at each state/site to learn place-aware representations using remote-sensing GPP, climate, and soil covariates.
Result: FTBSC-KGML achieves lower validation error and greater consistency in explanatory power than purely global models, better capturing spatial variability across states while maintaining interpretability.
Conclusion: The framework successfully leverages transfer learning and spatial heterogeneity to improve accuracy of land emissions estimation under limited data, extending prior SDSA-KGML framework for climate mitigation and sustainable agriculture applications.
Abstract: Accurate and cost-effective quantification of the agroecosystem carbon cycle at decision-relevant scales is essential for climate mitigation and sustainable agriculture. However, both transfer learning and the exploitation of spatial variability in this field are challenging, as they involve heterogeneous data and complex cross-scale dependencies. Conventional approaches often rely on location-independent parameterizations and independent training, underutilizing transfer learning and spatial heterogeneity in the inputs, and limiting their applicability in regions with substantial variability. We propose FTBSC-KGML (Fine-Tuning-Based Site Calibration-Knowledge-Guided Machine Learning), a pretraining- and fine-tuning-based, spatial-variability-aware, and knowledge-guided machine learning framework that augments KGML-ag with a pretraining-fine-tuning process and site-specific parameters. Using a pretraining-fine-tuning process with remote-sensing GPP, climate, and soil covariates collected across multiple Midwestern sites, FTBSC-KGML estimates land emissions while leveraging transfer learning and spatial heterogeneity. A key component is a spatial-heterogeneity-aware transfer-learning scheme, in which a globally pretrained model is fine-tuned at each state or site to learn place-aware representations, thereby improving local accuracy under limited data without sacrificing interpretability. Empirically, FTBSC-KGML achieves lower validation error and greater consistency in explanatory power than a purely global model, thereby better capturing spatial variability across states. This work extends the prior SDSA-KGML framework.
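The pretrain-then-fine-tune-per-site pattern can be sketched in a few lines. The regressor, covariate dimensions, and data below are illustrative placeholders (not KGML-ag or the authors' data); the point is only that each site receives its own briefly fine-tuned copy of a globally pretrained model.

```python
# Hedged sketch: train a global regressor on pooled data, then clone it and
# fine-tune the copy on each site's small local dataset. Illustrative only.
import copy
import torch
import torch.nn as nn

def fit(model, X, y, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(X) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model

global_model = nn.Sequential(nn.Linear(6, 32), nn.Tanh(), nn.Linear(32, 1))

# Pooled pretraining data (stand-in for GPP, climate, and soil covariates).
X_pool, y_pool = torch.randn(2000, 6), torch.randn(2000, 1)
fit(global_model, X_pool, y_pool)

# Site-specific calibration: each site gets its own fine-tuned copy.
site_models = {}
for site in ["site_A", "site_B"]:
    X_site, y_site = torch.randn(80, 6), torch.randn(80, 1)  # limited local data
    site_models[site] = fit(copy.deepcopy(global_model), X_site, y_site,
                            epochs=50, lr=1e-4)
```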
[354] Techno-economic optimization of a heat-pipe microreactor, part I: theory and cost optimization
Paul Seurin, Dean Price, Luis Nunez
Main category: cs.LG
TL;DR: A reinforcement learning optimization framework reduces heat-pipe microreactor LCOE by >57% through surrogate modeling and multi-constraint design optimization.
Details
Motivation: Heat-pipe microreactors offer portable power for remote areas but suffer from diseconomies of scale and unconvincing financial viability, requiring integrated economic-physics design approaches.Method: Random sampling to train surrogate models (Gaussian processes & MLPs), then reinforcement learning optimization framework to minimize LCOE while satisfying fuel lifetime, shutdown margin, peak heat flux, and rod-integrated peaking factor constraints.
Result: Optimizer reduced LCOE by >57% in both high-cost and low-cost axial reflector scenarios, with O&M and capital costs (especially axial reflectors and control drum materials) identified as primary LCOE contributors.
Conclusion: The RL-based optimization successfully integrates techno-economic considerations into microreactor design, demonstrating significant cost reduction potential while maintaining safety constraints, with ongoing work on fuel/HP performance integration.
Abstract: Microreactors, particularly heat-pipe microreactors (HPMRs), are compact, transportable, self-regulated power systems well-suited for access-challenged remote areas where costly fossil fuels dominate. However, they suffer from diseconomies of scale, and their financial viability remains unconvincing. One step in addressing this shortcoming is to design these reactors with comprehensive economic and physics analyses informing early-stage design iteration. In this work, we present a novel unifying geometric design optimization approach that accounts for techno-economic considerations. We start by generating random samples to train surrogate models, including Gaussian processes (GPs) and multi-layer perceptrons (MLPs). We then deploy these surrogates within a reinforcement learning (RL)-based optimization framework to optimize the levelized cost of electricity (LCOE), all the while imposing constraints on the fuel lifetime, shutdown margin (SDM), peak heat flux, and rod-integrated peaking factor. We study two cases: one in which the axial reflector cost is very high, and one in which it is inexpensive. We found that the operation and maintenance and capital costs are the primary contributors to the overall LCOE particularly the cost of the axial reflectors (for the first case) and the control drum materials. The optimizer cleverly changes the design parameters so as to minimize one of them while still satisfying the constraints, ultimately reducing the LCOE by more than 57% in both instances. A comprehensive integration of fuel and HP performance with multi-objective optimization is currently being pursued to fully understand the interaction between constraints and cost performance.
[355] Topic Modelling Black Box Optimization
Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko
Main category: cs.LG
TL;DR: The paper proposes using discrete black-box optimization methods to select the optimal number of topics T in LDA, comparing evolutionary algorithms (GA, ES) with amortized approaches (PABBO, SABBO), finding that amortized methods are much more efficient.
Details
Motivation: Choosing the right number of topics T in LDA is crucial for both statistical fit and interpretability, but current methods for selecting T are often inefficient or require extensive manual tuning.Method: Formulates T selection as discrete black-box optimization, where each evaluation trains an LDA model and measures validation perplexity. Compares four optimizers under fixed budget: Genetic Algorithm (GA), Evolution Strategy (ES), Preferential Amortized Black-Box Optimization (PABBO), and Sharpness-Aware Black-Box Optimization (SABBO).
Result: While all methods eventually reach similar final perplexity bands, amortized optimizers (PABBO, SABBO) are substantially more sample- and time-efficient. SABBO typically finds near-optimal T after essentially one evaluation, PABBO within few evaluations, while GA and ES require almost the full budget.
Conclusion: Amortized black-box optimization methods like SABBO and PABBO offer dramatically more efficient approaches for hyperparameter tuning in topic modeling compared to traditional evolutionary algorithms, enabling rapid identification of optimal topic numbers with minimal computational cost.
Abstract: Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
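The black-box objective is easy to picture in code: each evaluation trains an LDA model with a candidate $T$ and returns held-out perplexity. The sketch below uses synthetic count data and a trivial random search in place of the GA/ES/PABBO/SABBO optimizers compared in the paper; all numbers are illustrative assumptions.

```python
# Sketch of the expensive black-box objective T -> validation perplexity,
# with a naive random search standing in for the optimizers in the paper.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term count matrix standing in for a real corpus.
X = rng.poisson(0.1, size=(600, 800))
X_train, X_val = X[:500], X[500:]

def objective(T: int) -> float:
    """One black-box evaluation: train LDA with T topics, score perplexity."""
    lda = LatentDirichletAllocation(n_components=T, max_iter=10, random_state=0)
    lda.fit(X_train)
    return lda.perplexity(X_val)

budget = 8
candidates = rng.integers(5, 60, size=budget)
best_T = min(candidates, key=objective)   # each call is one costly evaluation
print("selected number of topics:", int(best_T))
```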
[356] Explainable AI in Big Data Fraud Detection
Ayush Jain, Rahul Kulkarni, Siyi Lin
Main category: cs.LG
TL;DR: This paper examines how explainable AI (XAI) can be integrated into Big Data analytics pipelines for fraud detection and risk management, reviewing both Big Data tools and XAI methods while proposing a conceptual framework to address scalability and real-time processing challenges.
Details
Motivation: The increasing dependence on automated Big Data analytics in finance, insurance, and cybersecurity raises concerns about transparency, regulatory compliance, and trust, necessitating the integration of explainable AI to make these systems more interpretable and accountable.Method: The paper conducts a structured review of Big Data characteristics and analytical tools (distributed storage, streaming platforms, fraud detection models) and XAI methods (LIME, SHAP, counterfactual explanations, attention mechanisms), then proposes a conceptual framework integrating scalable infrastructure with context-aware explanation mechanisms.
Result: The review identifies key research gaps in scalability, real-time processing, and explainability for graph and temporal models, and proposes a framework to address these challenges by combining Big Data infrastructure with human feedback loops.
Conclusion: The paper concludes with open research directions including scalable XAI, privacy-aware explanations, and standardized evaluation methods for explainable fraud detection systems, emphasizing the need for continued work in making Big Data analytics more transparent and trustworthy.
Abstract: Big Data has become central to modern applications in finance, insurance, and cybersecurity, enabling machine learning systems to perform large-scale risk assessments and fraud detection. However, the increasing dependence on automated analytics introduces important concerns about transparency, regulatory compliance, and trust. This paper examines how explainable artificial intelligence (XAI) can be integrated into Big Data analytics pipelines for fraud detection and risk management. We review key Big Data characteristics and survey major analytical tools, including distributed storage systems, streaming platforms, and advanced fraud detection models such as anomaly detectors, graph-based approaches, and ensemble classifiers. We also present a structured review of widely used XAI methods, including LIME, SHAP, counterfactual explanations, and attention mechanisms, and analyze their strengths and limitations when deployed at scale. Based on these findings, we identify key research gaps related to scalability, real-time processing, and explainability for graph and temporal models. To address these challenges, we outline a conceptual framework that integrates scalable Big Data infrastructure with context-aware explanation mechanisms and human feedback. The paper concludes with open research directions in scalable XAI, privacy-aware explanations, and standardized evaluation methods for explainable fraud detection systems.
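As a concrete illustration of the kind of post-hoc explanation surveyed above, here is a minimal SHAP example on a synthetic tabular fraud classifier. The data, feature meanings, and model are placeholders, and this is a generic illustration rather than the framework proposed in the paper.

```python
# Sketch: explain per-prediction feature attributions of a tree-ensemble
# fraud classifier with SHAP. Synthetic data; illustrative only.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))      # e.g. amount, hour, velocity, distance
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 2000) > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X[:5])   # per-feature attributions
print(np.array(shap_values).shape)
```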
[357] CauSTream: Causal Spatio-Temporal Representation Learning for Streamflow Forecasting
Shu Wan, Reepal Shah, John Sabo, Huan Liu, K. Selçuk Candan
Main category: cs.LG
TL;DR: CauSTream is a causal spatiotemporal framework for streamflow forecasting that jointly learns runoff causal graphs and dynamic routing graphs, outperforming SOTA methods while providing interpretable causal insights.
Details
Motivation: Deep learning models for streamflow forecasting often ignore physical processes, limiting interpretability and generalization. Existing causal approaches use fixed causal graphs that don't adapt to data.Method: Proposes CauSTream framework that jointly learns: (1) runoff causal graph among meteorological forcings, and (2) routing graph capturing dynamic dependencies across stations. Establishes identifiability conditions for these causal structures under nonparametric setting.
Result: Outperforms prior state-of-the-art methods on three major U.S. river basins across three forecasting horizons. Performance gaps widen at longer forecast windows, indicating stronger generalization. Learned causal graphs align with established domain knowledge.
Conclusion: CauSTream provides a principled foundation for causal spatiotemporal modeling with interpretable insights into watershed dynamics, with potential for extension to other scientific and environmental applications.
Abstract: Streamflow forecasting is crucial for water resource management and risk mitigation. While deep learning models have achieved strong predictive performance, they often overlook underlying physical processes, limiting interpretability and generalization. Recent causal learning approaches address these issues by integrating domain knowledge, yet they typically rely on fixed causal graphs that fail to adapt to data. We propose CauSTream, a unified framework for causal spatiotemporal streamflow forecasting. CauSTream jointly learns (i) a runoff causal graph among meteorological forcings and (ii) a routing graph capturing dynamic dependencies across stations. We further establish identifiability conditions for these causal structures under a nonparametric setting. We evaluate CauSTream on three major U.S. river basins across three forecasting horizons. The model consistently outperforms prior state-of-the-art methods, with performance gaps widening at longer forecast windows, indicating stronger generalization to unseen conditions. Beyond forecasting, CauSTream also learns causal graphs that capture relationships among hydrological factors and stations. The inferred structures align closely with established domain knowledge, offering interpretable insights into watershed dynamics. CauSTream offers a principled foundation for causal spatiotemporal modeling, with the potential to extend to a wide range of scientific and environmental applications.
[358] DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
Main category: cs.LG
TL;DR: DataFlow is a unified LLM-driven data preparation framework with system-level abstractions, reusable operators, and natural language pipeline generation, improving LLM performance across multiple domains.
Details
Motivation: Current data preparation for LLMs relies on ad-hoc scripts and loosely specified workflows that lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation.Method: DataFlow provides system-level abstractions for modular, reusable data transformations with PyTorch-style pipeline construction API. Includes ~200 reusable operators, 6 domain-general pipelines, and DataFlow-Agent for automatic natural language to pipeline translation via operator synthesis and iterative verification.
Result: Consistent improvements across six use cases: +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, 1-3 point gains on MATH/GSM8K/AIME. Unified 10K-sample dataset enables base models to surpass counterparts trained on 1M Infinity-Instruct data.
Conclusion: DataFlow provides a practical, high-performance substrate for reliable, reproducible, and scalable LLM data preparation, establishing a system-level foundation for future data-centric AI development.
Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
[359] In-Context Multi-Operator Learning with DeepOSets
Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto
Main category: cs.LG
TL;DR: DeepOSets architecture demonstrates in-context learning for PDE operators, can approximate any continuous operator in its class with sufficient examples, and works for unseen PDEs without weight updates.
Details
Motivation: To show that in-context learning (ICL) capabilities are not exclusive to autoregressive transformer architectures with self-attention, and to demonstrate that DeepOSets can serve as a multi-operator in-context learner for PDE solution operators.Method: Modified DeepOSets architecture combining set learning (DeepSets) with operator learning (DeepONets), enabling it to learn from example pairs of parameter and solution placed in a prompt without weight updates.
Result: DeepOSets is a universal uniform approximator over a class of continuous operators, can recover solution operators for unseen PDEs from in-context examples, and demonstrates accurate predictions for Poisson and reaction-diffusion boundary-value problems.
Conclusion: DeepOSets provides a non-autoregressive, non-attention based alternative for in-context learning of PDE operators, with theoretical guarantees of universal approximation and practical effectiveness on scientific machine learning problems.
Abstract: In-context Learning (ICL) is the remarkable capability displayed by some machine learning models to learn from examples in a prompt, without any further weight updates. ICL had originally been thought to emerge from the self-attention mechanism in autoregressive transformer architectures. DeepOSets is a non-autoregressive, non-attention based neural architecture that combines set learning via the DeepSets architecture with operator learning via Deep Operator Networks (DeepONets). In a previous study, DeepOSets was shown to display ICL capabilities in supervised learning problems. In this paper, we show that the DeepOSets architecture, with the appropriate modifications, is a multi-operator in-context learner that can recover the solution operator of a new PDE, not seen during training, from example pairs of parameter and solution placed in a user prompt, without any weight updates. Furthermore, we show that DeepOSets is a universal uniform approximator over a class of continuous operators, which we believe is the first result of its kind in the literature of scientific machine learning. This means that a single DeepOSets architecture exists that approximates in-context any continuous operator in the class to any fixed desired degree of accuracy, given an appropriate number of examples in the prompt. Experiments with Poisson and reaction-diffusion forward and inverse boundary-value problems demonstrate the ability of the proposed model to use in-context examples to accurately predict the solutions corresponding to parameter queries for PDEs not seen during training.
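The non-autoregressive in-context pattern can be sketched with a DeepSets-style, permutation-invariant encoder over the prompt of (parameter, solution) pairs and a head that maps the pooled summary plus a query parameter to a predicted solution. Class names, dimensions, and mean pooling are illustrative assumptions, not the DeepOSets implementation.

```python
# Sketch of attention-free in-context operator prediction: pool prompt
# examples permutation-invariantly, then condition a prediction head on the
# pooled context and a query parameter. Illustrative shapes and names.
import torch
import torch.nn as nn

class InContextOperatorLearner(nn.Module):
    def __init__(self, d_param: int, d_sol: int, d_hidden: int = 128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_param + d_sol, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden))
        self.rho = nn.Sequential(nn.Linear(d_hidden + d_param, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_sol))

    def forward(self, prompt_params, prompt_sols, query_param):
        # prompt_*: (batch, n_examples, dim); mean pooling gives permutation invariance.
        pairs = torch.cat([prompt_params, prompt_sols], dim=-1)
        context = self.phi(pairs).mean(dim=1)
        return self.rho(torch.cat([context, query_param], dim=-1))

model = InContextOperatorLearner(d_param=16, d_sol=64)
params = torch.randn(8, 5, 16)      # 5 in-context examples per prompt
sols = torch.randn(8, 5, 64)
query = torch.randn(8, 16)
pred = model(params, sols, query)   # (8, 64); no weight updates at query time
```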
[360] Impacts of Racial Bias in Historical Training Data for News AI
Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza
Main category: cs.LG
TL;DR: Researchers investigate bias in an AI classifier trained on NY Times corpus, finding the “blacks” label acts as a racism detector but fails on modern racial issues, highlighting risks of historical bias in newsroom AI tools.
Details
Motivation: AI models trained on historical news corpora encode decades-old attitudes and stereotypes, creating risks when used in newsroom settings. The study aims to investigate how these embedded biases manifest in a classifier trained on the New York Times Annotated Corpus, particularly examining the concerning "blacks" thematic topic label.Method: The researchers created a multi-label classifier trained on the New York Times Annotated Corpus. They used quantitative and qualitative analysis to investigate the “blacks” label’s use in the training corpus, applied explainable AI methods to understand what concepts it encodes, and tested its performance on modern examples like COVID-19 anti-Asian hate stories and Black Lives Matter reporting.
Result: The “blacks” label partially functions as a general “racism detector” across some minoritized groups, but performs poorly on modern racial issues. It fails to properly identify anti-Asian hate stories from the COVID-19 era and struggles with Black Lives Matter coverage, revealing how historical biases in training data lead to unexpected outputs.
Conclusion: This case study demonstrates how AI models trained on historical news data reproduce embedded biases, creating risks for newsroom applications like story discovery, audience targeting, and summarization. The fundamental challenge for newsrooms is adopting AI workflow tools while mitigating the reproduction of historical biases in news coverage.
Abstract: AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning “blacks” thematic topic label. Through quantitative and qualitative means we investigate this label’s use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the “blacks” label operates partially as a general “racism detector” across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model: story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
[361] Privacy Blur: Quantifying Privacy and Utility for Image Data Release
Saeed Mahloujifar, Narine Kokhlikyan, Chuan Guo, Kamalika Chaudhuri
Main category: cs.LG
TL;DR: Gaussian blur is reversible and insufficient for privacy; pixelization and DP-Pix offer better privacy-utility tradeoffs for face obfuscation in images.
Details
Motivation: Standard Gaussian blurring for privacy protection in images is insufficient because practical implementations are reversible, compromising privacy while trying to maintain data utility for model training.Method: Evaluated four obfuscation algorithms: Gaussian blur, pixelization, DP-Pix (pixelization + noise addition), and cropping. Privacy assessed via reversal and discrimination attacks; utility measured by representation quality when training models on obfuscated face data.
Result: Gaussian blur is the least private method, susceptible to reversal attacks in low-precision implementations. Pixelization and DP-Pix at appropriate granularity levels offer both privacy and utility for computer vision tasks.
Conclusion: Industry-standard Gaussian blur is inadequate for privacy protection; pixelization and DP-Pix provide better privacy-utility tradeoffs. The authors release Privacy Blur software package with recommended parameters.
Abstract: Image data collected in the wild often contains private information such as faces and license plates, and responsible data release must ensure that this information stays hidden. At the same time, released data should retain its usefulness for model-training. The standard method for private information obfuscation in images is Gaussian blurring. In this work, we show that practical implementations of Gaussian blurring are reversible enough to break privacy. We then take a closer look at the privacy-utility tradeoffs offered by three other obfuscation algorithms – pixelization, pixelization and noise addition (DP-Pix), and cropping. Privacy is evaluated by reversal and discrimination attacks, while utility by the quality of the learnt representations when the model is trained on data with obfuscated faces. We show that the most popular industry-standard method, Gaussian blur is the least private of the four – being susceptible to reversal attacks in its practical low-precision implementations. In contrast, pixelization and pixelization plus noise addition, when used at the right level of granularity, offer both privacy and utility for a number of computer vision tasks. We make our proposed methods together with suggested parameters available in a software package called Privacy Blur.
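The three obfuscation operators compared above are simple to express directly. The sketch below implements Gaussian blur, pixelization, and pixelization with additive Gaussian noise (a DP-Pix-style variant) on a toy image; block size, sigma, and noise scale are illustrative assumptions rather than the paper's recommended parameters.

```python
# Sketch of the compared obfuscation operators on a grayscale array.
# Parameters are illustrative; not the Privacy Blur package.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_blur(img, sigma=4.0):
    return gaussian_filter(img.astype(float), sigma=sigma)

def pixelize(img, block=8):
    h, w = img.shape
    out = img.astype(float).copy()
    for i in range(0, h, block):
        for j in range(0, w, block):
            out[i:i + block, j:j + block] = out[i:i + block, j:j + block].mean()
    return out

def dp_pix(img, block=8, noise_scale=10.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    pix = pixelize(img, block)
    return pix + rng.normal(0, noise_scale, size=pix.shape)

face = np.random.default_rng(0).integers(0, 256, size=(64, 64)).astype(float)
blurred, pixelized, noised = gaussian_blur(face), pixelize(face), dp_pix(face)
```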
[362] AIMM: An AI-Driven Multimodal Framework for Detecting Social-Media-Influenced Stock Market Manipulation
Sandeep Neela
Main category: cs.LG
TL;DR: AIMM is an AI framework that detects market manipulation by analyzing Reddit activity, bot indicators, and market data, providing daily risk scores for stocks.
Details
Motivation: Market manipulation increasingly originates from coordinated social media campaigns rather than isolated trades, creating a need for tools that connect online narratives and coordination patterns to market behavior for retail investors, regulators, and brokerages.Method: The system fuses Reddit activity, bot/coordination indicators, and OHLCV market features into a daily AIMM Manipulation Risk Score. It uses a parquet-native pipeline with Streamlit dashboard, calibrated synthetic social features (due to API restrictions), and real historical market data from Yahoo Finance.
Result: The framework shows preliminary discriminative capability with a small labeled dataset (33 ticker-days, 3 positive events). AIMM flagged GME 22 days before the January 2021 squeeze peak, demonstrating early warning capability.
Conclusion: AIMM provides a promising framework for social media-driven market surveillance, with released code, dataset schema, and dashboard design to support further research in detecting coordinated market manipulation through online platforms.
Abstract: Market manipulation now routinely originates from coordinated social media campaigns, not isolated trades. Retail investors, regulators, and brokerages need tools that connect online narratives and coordination patterns to market behavior. We present AIMM, an AI-driven framework that fuses Reddit activity, bot and coordination indicators, and OHLCV market features into a daily AIMM Manipulation Risk Score for each ticker. The system uses a parquet-native pipeline with a Streamlit dashboard that allows analysts to explore suspicious windows, inspect underlying posts and price action, and log model outputs over time. Due to Reddit API restrictions, we employ calibrated synthetic social features matching documented event characteristics; market data (OHLCV) uses real historical data from Yahoo Finance. This release makes three contributions. First, we build the AIMM Ground Truth dataset (AIMM-GT): 33 labeled ticker-days spanning eight equities, drawing from SEC enforcement actions, community-verified manipulation cases, and matched normal controls. Second, we implement forward-walk evaluation and prospective prediction logging for both retrospective and deployment-style assessment. Third, we analyze lead times and show that AIMM flagged GME 22 days before the January 2021 squeeze peak. The current labeled set is small (33 ticker-days, 3 positive events), but results show preliminary discriminative capability and early warnings for the GME incident. We release the code, dataset schema, and dashboard design to support research on social media-driven market surveillance.
[363] Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
Main category: cs.LG
TL;DR: RLVR improves LLM reasoning through spurious rewards and entropy minimization, but the underlying mechanisms were unclear. This paper shows clipping bias under spurious rewards reduces policy entropy, making outputs more confident, while entropy minimization alone isn’t enough. A reward-misalignment model explains spurious rewards’ benefits beyond contamination.
Details
Motivation: To understand the paradoxical mechanisms in RLVR where both discouraging exploitation (via spurious rewards) and discouraging exploration (via entropy minimization) improve LLM reasoning performance, despite these effects seeming contradictory. The paper aims to clarify how policy entropy relates to performance and whether spurious rewards yield gains through clipping bias and model contamination.Method: The paper examines the relationship between policy entropy and performance in RLVR, investigates the effects of spurious rewards through the lens of clipping bias and model contamination, and proposes a reward-misalignment model to explain why spurious rewards can enhance performance beyond contaminated settings.
Result: Clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs. However, entropy minimization alone is insufficient for performance improvement. The reward-misalignment model successfully explains why spurious rewards can enhance performance beyond contaminated settings.
Conclusion: The findings clarify the mechanisms behind spurious-reward benefits in RLVR, showing that clipping bias reduces policy entropy to produce more confident outputs, while providing principles for more effective RLVR training. The reward-misalignment model offers a theoretical explanation for why spurious rewards work beyond simple contamination effects.
Abstract: This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
[364] BUILD with Precision: Bottom-Up Inference of Linear DAGs
Hamed Ajorlou, Samuel Rey, Gonzalo Mateos, Geert Leus, Antonio G. Marques
Main category: cs.LG
TL;DR: BUILD algorithm for DAG learning uses precision matrix structure to identify leaf nodes and parents bottom-up, offering deterministic reconstruction with periodic re-estimation for robustness.
Details
Motivation: Learning DAG structure from observational data is fundamental for causal discovery and machine learning, but existing methods may lack robustness or handle complexity poorly.Method: BUILD algorithm exploits distinctive structure in ensemble precision matrix under linear Gaussian SEM with equal noise variances. It deterministically identifies leaf nodes and their parents, then prunes leaves to proceed stepwise. For finite data, periodically re-estimates precision matrix with fewer variables as leaves are pruned.
Result: BUILD exactly reconstructs DAG from true precision matrix. On challenging synthetic benchmarks, it compares favorably to state-of-the-art DAG learning algorithms while offering explicit complexity control.
Conclusion: BUILD provides a robust, deterministic approach to DAG learning that leverages precision matrix structure, with periodic re-estimation addressing finite data limitations and offering favorable performance compared to existing methods.
Abstract: Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with fewer variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
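A worked sketch of the bottom-up idea (not the authors' code): for an equal-variance linear Gaussian SEM with unit noise, the precision matrix is $(I-B)(I-B)^\top$, a leaf has the minimal diagonal entry, its parents appear as the nonzero entries of its column, and marginalizing the leaf out via a Schur complement yields the precision of the remaining sub-DAG. The toy graph and tolerances below are illustrative assumptions.

```python
# Sketch of bottom-up DAG recovery from a true precision matrix under a
# linear Gaussian SEM with equal noise variances. Illustrative only.
import numpy as np

def build_recover(theta, tol=1e-8):
    nodes = list(range(theta.shape[0]))
    T = theta.copy()
    edges = []
    while len(nodes) > 1:
        diag = np.diag(T)
        leaf_pos = int(np.argmin(diag))          # a leaf of the current sub-DAG
        sigma2 = 1.0 / diag[leaf_pos]            # shared noise variance
        for pos, node in enumerate(nodes):
            if pos != leaf_pos and abs(T[pos, leaf_pos]) > tol:
                weight = -sigma2 * T[pos, leaf_pos]
                edges.append((node, nodes[leaf_pos], weight))
        keep = [p for p in range(len(nodes)) if p != leaf_pos]
        col = T[keep, leaf_pos][:, None]
        # Schur complement = precision of the marginal over the kept nodes.
        T = T[np.ix_(keep, keep)] - col @ col.T / T[leaf_pos, leaf_pos]
        nodes.pop(leaf_pos)
    return edges

# Toy DAG: 0 -> 1 (0.8), 0 -> 2 (-0.5), 1 -> 2 (1.2); unit noise variance.
d = 3
B = np.zeros((d, d))
B[0, 1], B[0, 2], B[1, 2] = 0.8, -0.5, 1.2
theta = (np.eye(d) - B) @ (np.eye(d) - B).T      # precision for sigma^2 = 1
print(build_recover(theta))                      # recovers all three edges
```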
[365] Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure
Lulu Xue, Shengshan Hu, Linqiang Qian, Peijin Guo, Yechao Zhang, Minghui Li, Yanjun Zhang, Dayong Ye, Leo Yu Zhang
Main category: cs.LG
TL;DR: Machine unlearning introduces privacy risks for retained data in dual-view settings where adversaries can query both original and unlearned models, enabling more effective membership inference attacks.
Details
Motivation: While machine unlearning addresses data deletion requests, it creates new privacy vulnerabilities. Previous research focused on privacy of unlearned data, but risks to retained data remain unexplored. The paper investigates how dual-view access to both original and unlearned models amplifies privacy leakage.Method: Introduces privacy knowledge gain concept from information theory to quantify dual-view advantage. Proposes DVIA (Dual-View Inference Attack) that uses black-box queries to both models without training an attack model, employing lightweight likelihood ratio inference for efficient membership extraction.
Result: DVIA effectively extracts membership information on retained data across various datasets and model architectures. Experiments validate that dual-view setting allows adversaries to obtain more information than querying either model alone, demonstrating amplified privacy risks.
Conclusion: Machine unlearning introduces significant privacy vulnerabilities for retained data in dual-view settings. The proposed DVIA attack reveals these risks, highlighting the need for privacy-preserving unlearning techniques that protect both unlearned and retained data.
Abstract: Machine unlearning is a newly popularized technique for removing specific training data from a trained model, enabling it to comply with data deletion requests. While it protects the rights of users requesting unlearning, it also introduces new privacy risks. Prior works have primarily focused on the privacy of data that has been unlearned, while the risks to retained data remain largely unexplored. To address this gap, we focus on the privacy risks of retained data and, for the first time, reveal the vulnerabilities introduced by machine unlearning under the dual-view setting, where an adversary can query both the original and the unlearned models. From an information-theoretic perspective, we introduce the concept of privacy knowledge gain and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. To effectively demonstrate this threat, we propose DVIA, a Dual-View Inference Attack, which extracts membership information on retained data using black-box queries to both models. DVIA eliminates the need to train an attack model and employs a lightweight likelihood ratio inference module for efficient inference. Experiments across different datasets and model architectures validate the effectiveness of DVIA and highlight the privacy risks inherent in the dual-view setting.
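To illustrate what a dual-view, likelihood-ratio-style membership score could look like, the sketch below stacks a per-example statistic (here, the loss) queried from both the original and the unlearned model and scores membership with a Gaussian likelihood ratio fitted on reference examples. This is a generic illustration under stated assumptions, not the DVIA implementation; the toy numbers stand in for real model queries.

```python
# Hedged sketch of a dual-view membership score: combine statistics from the
# original and unlearned models and score with a fitted likelihood ratio.
import numpy as np
from scipy.stats import multivariate_normal

def dual_view_features(loss_orig, loss_unlearned):
    """Stack the two views into one feature vector per example."""
    return np.stack([loss_orig, loss_unlearned], axis=1)

def fit_likelihood_ratio(ref_member, ref_nonmember):
    """Fit one Gaussian per population over the dual-view features."""
    g_in = multivariate_normal(ref_member.mean(0),
                               np.cov(ref_member.T) + 1e-6 * np.eye(2))
    g_out = multivariate_normal(ref_nonmember.mean(0),
                                np.cov(ref_nonmember.T) + 1e-6 * np.eye(2))
    return lambda feats: g_in.logpdf(feats) - g_out.logpdf(feats)

# Toy numbers standing in for losses queried from the two models.
rng = np.random.default_rng(0)
members = dual_view_features(rng.normal(0.2, 0.1, 500), rng.normal(0.3, 0.1, 500))
nonmembers = dual_view_features(rng.normal(0.9, 0.2, 500), rng.normal(0.9, 0.2, 500))
llr = fit_likelihood_ratio(members, nonmembers)
targets = dual_view_features(np.array([0.25, 0.95]), np.array([0.35, 0.9]))
print(llr(targets))   # higher score = more likely a retained training member
```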
[366] INTELLECT-3: Technical Report
Prime Intellect Team, Mika Senghaas, Fares Obeid, Sami Jaghouar, William Brown, Jack Min Ong, Daniel Auras, Matej Sirovatka, Jannik Straube, Andrew Baker, Sebastian MĂźller, Justus Mattern, Manveer Basra, Aiman Ismail, Dominik Scherm, Cooper Miller, Ameen Patel, Simon Kirsten, Mario Sieg, Christian Reetz, Kemal Erdem, Vincent Weisser, Johannes Hagemann
Main category: cs.LG
TL;DR: INTELLECT-3 is a 106B-parameter MoE model (12B active) trained with large-scale RL that achieves SOTA performance for its size across math, code, science, and reasoning benchmarks, outperforming larger frontier models. The team open-sources the model, infrastructure stack, RL frameworks, and environments.
Details
Motivation: To develop a state-of-the-art large language model that demonstrates superior performance across diverse domains (math, code, science, reasoning) while being more efficient than larger models, and to contribute to the research community by open-sourcing the complete training infrastructure.Method: Built a 106B-parameter Mixture-of-Experts model (12B active parameters) using large-scale reinforcement learning on an end-to-end RL infrastructure stack. Used prime-rl framework for scalable asynchronous RL, trained on top of GLM-4.5-Air-Base model with both SFT and RL training, scaling up to 512 H200 GPUs with high efficiency.
Result: INTELLECT-3 achieves state-of-the-art performance for its size across math, code, science, and reasoning benchmarks, outperforming many larger frontier models. The training infrastructure demonstrates high efficiency at scale (512 H200 GPUs).
Conclusion: The paper presents a highly efficient large-scale RL training approach that produces a SOTA model while open-sourcing the complete infrastructure, enabling broader research community access to advanced RL training methodologies and tools for agentic AI development.
Abstract: We present INTELLECT-3, a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning on our end-to-end RL infrastructure stack. INTELLECT-3 achieves state of the art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models. We open-source the model together with the full infrastructure stack used to create it, including RL frameworks, complete recipe, and a wide collection of environments, built with the verifiers library, for training and evaluation from our Environments Hub community platform. Built for this effort, we introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning, which scales seamlessly from a single node to thousands of GPUs, and is tailored for agentic RL with first-class support for multi-turn interactions and tool use. Using this stack, we run both SFT and RL training on top of the GLM-4.5-Air-Base model, scaling RL training up to 512 H200s with high training efficiency.
[367] A Multimodal Approach to Alzheimer's Diagnosis: Geometric Insights from Cube Copying and Cognitive Assessments
Jaeho Yang, Kijung Yoon
Main category: cs.LG
TL;DR: Graph-based analysis of hand-drawn cube sketches combined with demographic and neuropsychological data improves Alzheimer's disease classification over pixel-based methods.
Details
Motivation: Early detection of Alzheimer's disease is clinically challenging, and cube-copying tasks provide simple but informative visuospatial assessments that could be leveraged for accessible screening.Method: Convert hand-drawn cube sketches into graph representations capturing geometric/topological properties, process with graph neural networks, and fuse with demographic/NPT features using late-fusion model.
Result: Graph-based representations outperform pixel-based convolutional models, and multimodal integration further improves performance and robustness to class imbalance.
Conclusion: Graph-based analysis of cube copying provides an interpretable, non-invasive, and scalable approach for Alzheimer's disease screening, with identified graphlet motifs aligning with clinical observations.
Abstract: Early and accessible detection of Alzheimer's disease (AD) remains a critical clinical challenge, and cube-copying tasks offer a simple yet informative assessment of visuospatial function. This work proposes a multimodal framework that converts hand-drawn cube sketches into graph-structured representations capturing geometric and topological properties, and integrates these features with demographic information and neuropsychological test (NPT) scores for AD classification. Cube drawings are modeled as graphs with node features encoding spatial coordinates, local graphlet-based topology, and angular geometry, which are processed using graph neural networks and fused with age, education, and NPT features in a late-fusion model. Experimental results show that graph-based representations provide a strong unimodal baseline and substantially outperform pixel-based convolutional models, while multimodal integration further improves performance and robustness to class imbalance. SHAP-based interpretability analysis identifies specific graphlet motifs and geometric distortions as key predictors, closely aligning with clinical observations of disorganized cube drawings in AD. Together, these results establish graph-based analysis of cube copying as an interpretable, non-invasive, and scalable approach for Alzheimer's disease screening.
[368] A Multi-scale Fused Graph Neural Network with Inter-view Contrastive Learning for Spatial Transcriptomics Data Clustering
Jianping Mei, Siqi Ai, Ye Yuan
Main category: cs.LG
TL;DR: stMFG is a spatial transcriptomics method using multi-scale interactive fusion graph network with layer-wise cross-view attention to better identify spatial domains by dynamically integrating spatial and gene features.
Details
Motivation: Current spatial transcriptomics methods process spatial and feature views separately with late fusion ("encode-separately, fuse-late"), which limits multi-scale semantic capture and cross-view interaction needed for identifying complex spatial domains.Method: stMFG uses a multi-scale interactive fusion graph network with layer-wise cross-view attention to dynamically integrate spatial and gene features after each convolution, combined with cross-view contrastive learning and spatial constraints.
Result: On DLPFC and breast cancer datasets, stMFG outperforms state-of-the-art methods, achieving up to 14% ARI improvement on certain slices.
Conclusion: The proposed interactive fusion approach with layer-wise cross-view attention effectively captures complex gene-spatial interactions for spatial domain identification, demonstrating superior performance over existing methods.
Abstract: Spatial transcriptomics enables genome-wide expression analysis within native tissue context, yet identifying spatial domains remains challenging due to complex gene-spatial interactions. Existing methods typically process spatial and feature views separately, fusing only at the output level, an "encode-separately, fuse-late" paradigm that limits multi-scale semantic capture and cross-view interaction. Accordingly, we propose stMFG, a multi-scale interactive fusion graph network that introduces layer-wise cross-view attention to dynamically integrate spatial and gene features after each convolution. The model combines cross-view contrastive learning with spatial constraints to enhance discriminability while maintaining spatial continuity. On DLPFC and breast cancer datasets, stMFG outperforms state-of-the-art methods, achieving up to 14% ARI improvement on certain slices.
[369] Explicit and Non-asymptotic Query Complexities of Rank-Based Zeroth-order Algorithms on Smooth Functions
Haishan Ye
Main category: cs.LG
TL;DR: First explicit non-asymptotic convergence rates for rank-based zeroth-order optimization algorithms, showing query complexities for strongly convex and nonconvex smooth functions.
Details
Motivation: Rank-based ZO optimization is widely used (CMA-ES, evolution strategies) but lacks theoretical understanding - existing analyses only provide asymptotic insights without explicit convergence rates for top-k selection algorithms.Method: Analyzes a simple rank-based ZO algorithm using novel analysis techniques that avoid classical drift and information-geometric approaches, establishing explicit query complexities.
Result: For d-dimensional problems: 1) for L-smooth, μ-strongly convex functions, Õ(dL/μ · log(dL/(μδ)) · log(1/ε)) queries suffice to find an ε-suboptimal solution; 2) for smooth nonconvex objectives, O(dL/ε · log(1/ε)) queries. Both bounds hold with probability at least 1-δ.
Conclusion: Provides first explicit non-asymptotic convergence rates for rank-based ZO methods, offering new insights into why rank-based heuristics lead to efficient ZO optimization.
Abstract: Rank-based zeroth-order (ZO) optimization, which relies only on the ordering of function evaluations, offers strong robustness to noise and monotone transformations, and underlies many successful algorithms such as CMA-ES, natural evolution strategies, and rank-based genetic algorithms. Despite its widespread use, the theoretical understanding of rank-based ZO methods remains limited: existing analyses provide only asymptotic insights and do not yield explicit convergence rates for algorithms selecting the top-$k$ directions. This work closes this gap by analyzing a simple rank-based ZO algorithm and establishing the first \emph{explicit} and \emph{non-asymptotic} query complexities. For a $d$-dimensional problem, if the function is $L$-smooth and $\mu$-strongly convex, the algorithm requires $\widetilde{\mathcal{O}}\left(\frac{dL}{\mu}\log\frac{dL}{\mu\delta}\log\frac{1}{\varepsilon}\right)$ queries to find an $\varepsilon$-suboptimal solution, and for smooth nonconvex objectives it reaches $\mathcal{O}\left(\frac{dL}{\varepsilon}\log\frac{1}{\varepsilon}\right)$. The notation $\mathcal{O}(\cdot)$ hides constant terms and $\widetilde{\mathcal{O}}(\cdot)$ hides an extra $\log\log\frac{1}{\varepsilon}$ term. These query complexities hold with probability at least $1-\delta$ for $0<\delta<1$. The analysis in this paper is novel and avoids classical drift and information-geometric techniques. Our analysis offers new insight into why rank-based heuristics lead to efficient ZO optimization.
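The analysis above concerns algorithms that use only the ranking of function evaluations. The sketch below shows one generic rank-based top-k ZO step on a smooth, strongly convex toy problem; it is an illustrative stand-in, not the specific algorithm analyzed in the paper, and the step size, smoothing radius, and sampling counts are arbitrary.

```python
import numpy as np

def rank_based_zo_step(f, x, step_size=0.1, sigma=0.1, num_dirs=20, k=5, rng=None):
    """One generic rank-based ZO step: probe random unit directions, keep only the
    ordering of the function values, and move along the mean of the top-k
    (lowest-value) directions. Only ranks are used, never the raw values."""
    rng = np.random.default_rng() if rng is None else rng
    dirs = rng.standard_normal((num_dirs, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    values = np.array([f(x + sigma * d) for d in dirs])
    top_k = dirs[np.argsort(values)[:k]]
    return x + step_size * top_k.mean(axis=0)

# Minimize a smooth, strongly convex quadratic using function queries only.
f = lambda z: 0.5 * np.sum(z ** 2)
x = np.ones(10)
for _ in range(500):
    x = rank_based_zo_step(f, x)
print(f(x))  # should be far below the starting value f(ones(10)) = 5.0
```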
[370] Neural emulation of gravity-driven geohazard runout
Lorenzo Nava, Ye Chen, Maximillian Van Wyk de Vries
Main category: cs.LG
TL;DR: Machine learning model predicts geohazard runout (flow extent & deposit thickness) 100-10,000x faster than numerical solvers, trained on 100k+ simulations across real terrains.
Details
Motivation: Geohazard runout prediction is critical for disaster risk reduction but faces a fundamental speed-realism trade-off. Existing numerical models are either too slow for large-scale applications or lack physical realism needed for accurate predictions.Method: Train a machine learning model (neural emulator) on over 100,000 numerical simulations across more than 10,000 real-world digital elevation model chips. The model learns to predict both flow extent and deposit thickness from terrain and flow characteristics.
Result: The model achieves high accuracy in predicting flow extent and deposit thickness while being 100 to 10,000 times faster than traditional numerical solvers. It reproduces key physical behaviors like avulsion and deposition patterns, and generalizes across different flow types, sizes, and landscapes.
Conclusion: Neural emulation enables rapid, spatially resolved runout prediction across diverse real-world terrains, opening new opportunities for disaster risk reduction and impact-based forecasting at scales relevant for large-scale early warning systems.
Abstract: Predicting geohazard runout is critical for protecting lives, infrastructure and ecosystems. Rapid mass flows, including landslides and avalanches, cause several thousand deaths across a wide range of environments, often travelling many kilometres from their source. The wide range of source conditions and material properties governing these flows makes their runout difficult to anticipate, particularly for downstream communities that may be suddenly exposed to severe impacts. Accurately predicting runout at scale requires models that are both physically realistic and computationally efficient, yet existing approaches face a fundamental speed-realism trade-off. Here we train a machine learning model to predict geohazard runout across representative real world terrains. The model predicts both flow extent and deposit thickness with high accuracy and 100 to 10,000 times faster computation than numerical solvers. It is trained on over 100,000 numerical simulations across over 10,000 real world digital elevation model chips and reproduces key physical behaviours, including avulsion and deposition patterns, while generalizing across different flow types, sizes and landscapes. Our results demonstrate that neural emulation enables rapid, spatially resolved runout prediction across diverse real world terrains, opening new opportunities for disaster risk reduction and impact-based forecasting. These results highlight neural emulation as a promising pathway for extending physically realistic geohazard modelling to spatial and temporal scales relevant for large scale early warning systems.
[371] Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models
Xueqi Ma, Xingjun Ma, Sarah Monazam Erfani, Danilo Mandic, James Bailey
Main category: cs.LG
TL;DR: CFC is a coarse-to-fine open-set classification framework that uses LLMs for OOD detection and classification on graph datasets, achieving significant improvements over state-of-the-art methods.
Details
Motivation: Existing open-set classification methods treat all OOD samples as a single class, but real-world applications like fraud detection and medical diagnosis require deeper insights into OOD samples, including their probable labels. The paper addresses whether OOD detection can be extended to OOD classification without true label information.Method: CFC framework has three components: 1) coarse classifier using LLM prompts for OOD detection and outlier label generation, 2) GNN-based fine classifier trained with identified OOD samples for enhanced OOD detection and ID classification, and 3) refined OOD classification through LLM prompts and post-processed OOD labels. It uses semantic OOD instances based on inherent meaning rather than synthetic/auxiliary samples.
Result: CFC improves OOD detection by 10% over state-of-the-art methods on graph and text domains, and achieves up to 70% accuracy in OOD classification on graph datasets.
Conclusion: The proposed CFC framework successfully extends OOD detection to OOD classification without true label information, leveraging LLMs for semantic understanding of OOD samples and achieving superior performance in both detection and classification tasks.
Abstract: Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, despite real-world applications, especially high-stake settings such as fraud detection and medical diagnosis, demanding deeper insights into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.
[372] Sharpness-aware Federated Graph Learning
Ruiyu Li, Peige Zhao, Guangxia Li, Pengcheng Wu, Xingyu Gao, Zhiqiang Xu
Main category: cs.LG
TL;DR: SEAL is a sharpness-aware federated graph learning algorithm that addresses data heterogeneity in federated GNN training by optimizing for flat loss surfaces and preventing dimensional collapse in representations.
Details
Motivation: Federated graph learning faces challenges with data heterogeneity across clients, causing local models to fall into sharp loss valleys with poor generalization, and suffering from dimensional collapse in learned representations that reduces classification capacity.Method: Proposes SEAL with two key components: 1) Sharpness-aware optimization that minimizes both loss function and its sharpness to find flat regions with uniformly low loss, 2) A regularizer based on correlation matrix of local representations to relax correlations and alleviate dimensional collapse.
Result: Experimental studies on graph classification benchmarks show SEAL consistently outperforms state-of-the-art FGL baselines and provides gains for more participants, enhancing classification accuracy and generalization ability.
Conclusion: SEAL effectively addresses data heterogeneity in federated graph learning by improving model generalization through sharpness-aware optimization and mitigating dimensional collapse, making it a promising approach for privacy-preserving collaborative GNN training.
Abstract: One of many impediments to applying graph neural networks (GNNs) to large-scale real-world graph data is the challenge of centralized training, which requires aggregating data from different organizations, raising privacy concerns. Federated graph learning (FGL) addresses this by enabling collaborative GNN model training without sharing private data. However, a core challenge in FGL systems is the variation in local training data distributions among clients, known as the data heterogeneity problem. Most existing solutions suffer from two problems: (1) The typical optimizer based on empirical risk minimization tends to cause local models to fall into sharp valleys and weakens their generalization to out-of-distribution graph data. (2) The prevalent dimensional collapse in the learned representations of local graph data has an adverse impact on the classification capacity of the GNN model. To this end, we formulate a novel optimization objective that is aware of the sharpness (i.e., the curvature of the loss surface) of local GNN models. By minimizing the loss function and its sharpness simultaneously, we seek out model parameters in a flat region with uniformly low loss values, thus improving the generalization over heterogeneous data. By introducing a regularizer based on the correlation matrix of local representations, we relax the correlations of representations generated by individual local graph samples, so as to alleviate the dimensional collapse of the learned model. The proposed \textbf{S}harpness-aware f\textbf{E}derated gr\textbf{A}ph \textbf{L}earning (SEAL) algorithm can enhance the classification accuracy and generalization ability of local GNN models in federated graph learning. Experimental studies on several graph classification benchmarks show that SEAL consistently outperforms SOTA FGL baselines and provides gains for more participants.
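SEAL's first component is a sharpness-aware objective. The sketch below implements a generic SAM-style update in PyTorch (perturb the weights along the normalized gradient, take the gradient at the perturbed point, then update); it omits SEAL's federated aggregation and its representation-correlation regularizer, and the helper name `sam_step` is ours, not the paper's.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """Generic sharpness-aware update (a sketch, not the full SEAL algorithm):
    ascend to a worst-case nearby weight point, compute the gradient there,
    restore the weights, then apply the optimizer step with that gradient."""
    inputs, targets = batch

    # First pass: gradient at the current weights defines the ascent direction.
    loss_fn(model(inputs), targets).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12

    # Perturb weights toward higher loss: epsilon = rho * g / ||g||.
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)
    optimizer.zero_grad()

    # Second pass: sharpness-aware gradient at the perturbed weights.
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation and update with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```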
[373] Sharpness-aware Second-order Latent Factor Model for High-dimensional and Incomplete Data
Jialiang Wang, Xueyan Bao, Hao Wu
Main category: cs.LG
TL;DR: SSLF model combines second-order latent factor modeling with sharpness-aware minimization to improve optimization and generalization on high-dimensional incomplete data.
Details
Motivation: Second-order latent factor models are effective for extracting interaction patterns from high-dimensional incomplete data, but their optimization is difficult due to bilinear and non-convex nature. Sharpness-aware minimization can help find flat local minima for better generalization.Method: Proposes Sharpness-aware SLF (SSLF) model that: (1) acquires second-order information via Hessian-vector products, and (2) injects a sharpness term into the curvature (Hessian) through designed Hessian-vector products.
Result: Experiments on multiple industrial datasets demonstrate that the proposed model consistently outperforms state-of-the-art baselines.
Conclusion: SSLF successfully addresses the optimization challenges of second-order latent factor models by incorporating sharpness-aware minimization, leading to improved performance on real-world datasets.
Abstract: The Second-order Latent Factor (SLF) model, a class of low-rank representation learning methods, has proven effective at extracting node-to-node interaction patterns from High-dimensional and Incomplete (HDI) data. However, its optimization is notoriously difficult due to its bilinear and non-convex nature. Sharpness-aware Minimization (SAM) has recently been proposed to find flat local minima when minimizing non-convex objectives, thereby improving the generalization of representation-learning models. To address this challenge, we propose a Sharpness-aware SLF (SSLF) model. SSLF embodies two key ideas: (1) acquiring second-order information via Hessian-vector products; and (2) injecting a sharpness term into the curvature (Hessian) through the designed Hessian-vector products. Experiments on multiple industrial datasets demonstrate that the proposed model consistently outperforms state-of-the-art baselines.
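SSLF's first ingredient is second-order information obtained through Hessian-vector products. A generic autograd-based HVP can be sketched as follows; this is the standard double-backprop identity Hv = ∇(gᵀv), not the paper's specific SSLF construction.

```python
import torch

def hessian_vector_product(loss_fn, params, vec):
    """Generic Hessian-vector product via double backprop: H v = d/dp (g . v),
    where g is the gradient of the loss with respect to the parameters."""
    loss = loss_fn(params)
    grad = torch.autograd.grad(loss, params, create_graph=True)[0]
    return torch.autograd.grad(torch.dot(grad, vec), params)[0]

# Example: the quadratic 0.5 * p^T A p has Hessian A, so Hv should equal A @ v.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
p = torch.zeros(2, requires_grad=True)
v = torch.tensor([1.0, -1.0])
hv = hessian_vector_product(lambda q: 0.5 * q @ A @ q, p, v)
print(hv, A @ v)  # both should be tensor([2., -1.])
```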
[374] CKA-Guided Modular Quantization: Beyond Bit-Width to Algorithmic Diversity
Jinhao Zhang, Yunquan Zhang, Daning Chen
Main category: cs.LG
TL;DR: CKA Guided Modular Quantization is a fine-tuning-free framework that uses Linear CKA to automatically select optimal quantization strategies per layer, creating hybrid quantized LLMs that outperform uniform quantization and mixed-precision methods.
Details
Motivation: Current post-training quantization methods use uniform strategies across all layers, ignoring that different layers have varying algorithmic suitability for quantization, leading to suboptimal performance.Method: Proposes CKA Guided Modular Quantization: independently evaluates multiple PTQ algorithms per layer, uses Linear Centered Kernel Alignment (CKA) as metric to automatically select optimal quantization strategy for each layer, then integrates strategies to build hybrid quantized model.
Result: Consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs (LLaMA, Qwen) in terms of perplexity (PPL) and downstream task performance.
Conclusion: The framework provides a fine-tuning-free, plug-and-play solution for algorithmic heterogeneous quantization that better accommodates layer-specific characteristics, achieving superior quantization results without requiring model fine-tuning.
Abstract: Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs, including LLaMA and Qwen, in terms of perplexity (PPL) and downstream task performance.
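Linear CKA itself is a standard, well-defined quantity. The sketch below computes it between full-precision and quantized layer activations and uses it to pick a per-layer strategy; the candidate algorithm names and the random activations are placeholders, not the paper's actual setup.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices of shape
    (num_samples, features). Higher values indicate more similar representations."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Per-layer selection: keep the PTQ variant whose activations stay closest to the
# full-precision layer output (candidate names are illustrative only).
full_precision_acts = np.random.randn(256, 64)
candidates = {"gptq": full_precision_acts + 0.05 * np.random.randn(256, 64),
              "rtn":  full_precision_acts + 0.30 * np.random.randn(256, 64)}
best = max(candidates, key=lambda name: linear_cka(full_precision_acts, candidates[name]))
print(best)  # very likely "gptq" here, since its activations are less perturbed
```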
[375] Feature-Selective Representation Misdirection for Machine Unlearning
Taozhao Chen, Linghan Huang, Kim-Kwang Raymond Choo, Huaming Chen
Main category: cs.LG
TL;DR: SRMU is a novel activation-editing framework for LLM unlearning that uses selective representation misdirection to remove harmful knowledge while preserving model utility, especially effective in high-entanglement scenarios where existing methods fail.
Details
Motivation: As LLMs are deployed in safety-critical domains, they retain sensitive knowledge that poses privacy, compliance, and misuse risks. Current unlearning methods assume clean dataset separation, which is unrealistic in operational settings with entangled distributions, leading to utility degradation or safety failures.Method: SRMU (Selective Representation Misdirection for Unlearning) is a principled activation-editing framework that applies feature-aware, directionally controlled perturbations. It uses structured misdirection vectors with activation importance maps to selectively suppress harmful representations while preserving benign ones, unlike indiscriminate weight perturbations.
Result: On the WMDP benchmark across low- and high-entanglement configurations, SRMU achieves state-of-the-art unlearning performance with minimal utility loss. It remains effective under 20-30% dataset overlap where existing baselines collapse, demonstrating robustness in challenging scenarios.
Conclusion: SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in LLM applications, addressing the critical need for effective unlearning in operational settings with entangled data distributions.
Abstract: As large language models (LLMs) are increasingly adopted in safety-critical and regulated sectors, the retention of sensitive or prohibited knowledge introduces escalating risks, ranging from privacy leakage and regulatory non-compliance to potential misuse. Recent studies suggest that machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. However, current unlearning techniques assume clean separation between forget and retain datasets, which is challenging in operational settings characterized by highly entangled distributions. In such scenarios, perturbation-based methods often degrade general model utility or fail to ensure safety. To address this, we propose Selective Representation Misdirection for Unlearning (SRMU), a novel principled activation-editing framework that enforces feature-aware and directionally controlled perturbations. Unlike indiscriminate perturbations of model weights, SRMU employs a structured misdirection vector with an activation importance map, allowing it to selectively suppress harmful representations while preserving utility on benign ones. Experiments are conducted on the widely used WMDP benchmark across low- and high-entanglement configurations. Empirical results reveal that SRMU delivers state-of-the-art unlearning performance with minimal utility losses, and remains effective under 20-30% overlap where existing baselines collapse. SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in emerging LLM-based applications. We release the replication package at https://figshare.com/s/d5931192a8824de26aff.
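As a rough illustration of activation editing, the sketch below adds a fixed misdirection vector, gated by a per-feature importance map, to a layer's output via a PyTorch forward hook. How SRMU actually constructs the vector and the importance map is specific to the paper; here both are given as inputs, and the function name is ours.

```python
import torch
import torch.nn as nn

def attach_misdirection_hook(layer, direction, importance, alpha=1.0):
    """Illustrative activation editing: push hidden states along a fixed
    'misdirection' vector, scaled per-feature by an importance map, so only
    the targeted features are steered."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        return output + alpha * importance * direction

    return layer.register_forward_hook(hook)

# Toy usage on a single linear layer standing in for a transformer block.
layer = nn.Linear(8, 8)
direction = torch.randn(8)
importance = torch.zeros(8)
importance[:3] = 1.0          # only edit the first three hidden features
handle = attach_misdirection_hook(layer, direction, importance, alpha=0.5)
print(layer(torch.randn(2, 8)).shape)   # (2, 8), with edited activations
handle.remove()
```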
[376] Pretrained Battery Transformer (PBT): A battery life prediction foundation model
Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Main category: cs.LG
TL;DR: PBT is the first foundation model for battery cycle life prediction, using domain-knowledge-encoded mixture-of-expert layers to achieve superior generalization across diverse battery datasets.
Details
Motivation: Early battery cycle life prediction is crucial for accelerating battery research and deployment, but current machine learning methods are limited by data scarcity and heterogeneity from diverse aging conditions. Foundation models have shown broad generalization in other fields but haven't been developed for battery life prediction.Method: Developed Pretrained Battery Transformer (PBT) using domain-knowledge-encoded mixture-of-expert layers. Trained on the largest public battery life database, learning transferable representations from 13 lithium-ion battery datasets.
Result: PBT outperforms existing models by an average of 19.8% and achieves state-of-the-art performance across 15 diverse datasets covering various operating conditions, formation protocols, and LIB chemistries through transfer learning.
Conclusion: This work establishes the first foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems that can generalize across diverse battery types and conditions.
Abstract: Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery (LIB) datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries of LIBs. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.
[377] Multivariate Uncertainty Quantification with Tomographic Quantile Forests
Takuya Kanazawa
Main category: cs.LG
TL;DR: TQF is a tree-based regression model that learns conditional quantiles of directional projections to estimate multivariate conditional distributions without convexity restrictions.
Details
Motivation: Quantifying predictive uncertainty is crucial for safe AI deployment, but nonparametric estimation of multivariate conditional distributions remains challenging.Method: TQF learns conditional quantiles of directional projections (n^T y) as functions of input x and direction n, then aggregates quantiles across directions and reconstructs the multivariate distribution by minimizing sliced Wasserstein distance via efficient alternating optimization.
Result: TQF is evaluated on synthetic and real-world datasets, demonstrating its ability to estimate multivariate conditional distributions without convexity restrictions.
Conclusion: TQF provides a nonparametric, uncertainty-aware approach for multivariate regression that covers all directions with a single model and releases convexity restrictions of classical directional-quantile methods.
Abstract: Quantifying predictive uncertainty is essential for safe and trustworthy real-world AI deployment. Yet, fully nonparametric estimation of conditional distributions remains challenging for multivariate targets. We propose Tomographic Quantile Forests (TQF), a nonparametric, uncertainty-aware, tree-based regression model for multivariate targets. TQF learns conditional quantiles of directional projections $\mathbf{n}^{\top}\mathbf{y}$ as functions of the input $\mathbf{x}$ and the unit direction $\mathbf{n}$. At inference, it aggregates quantiles across many directions and reconstructs the multivariate conditional distribution by minimizing the sliced Wasserstein distance via an efficient alternating scheme with convex subproblems. Unlike classical directional-quantile approaches that typically produce only convex quantile regions and require training separate models for different directions, TQF covers all directions with a single model without imposing convexity restrictions. We evaluate TQF on synthetic and real-world datasets, and release the source code on GitHub.
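The training signal described above is quantile regression on directional projections. The sketch below shows the standard pinball loss evaluated on projections n^T y for one direction and one quantile level; the regressor itself (a forest taking x and n as inputs in TQF) is abstracted away, and the data are synthetic.

```python
import numpy as np

def pinball_loss(pred, target, tau):
    """Standard quantile (pinball) loss for quantile level tau."""
    diff = target - pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

# Target for one (direction, quantile) pair: the tau-quantile of the projection n^T y.
rng = np.random.default_rng(0)
y = rng.standard_normal((100, 3))               # multivariate targets
n = np.array([1.0, 0.0, 0.0])                    # one unit direction
proj = y @ n                                     # directional projections n^T y
pred = np.full(100, np.quantile(proj, 0.9))      # an ideal constant prediction
print(pinball_loss(pred, proj, tau=0.9))
```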
[378] Quantitative Verification of Fairness in Tree Ensembles
Zhenjiang Zhao, Takahisa Toda, Takashi Kitamura
Main category: cs.LG
TL;DR: This paper proposes an efficient quantitative verification method for tree ensembles that provides anytime upper/lower bounds for fairness violation ratios, outperforming existing approaches.
Details
Motivation: Traditional fairness verification only returns single counterexamples when fairness is violated, lacking information about the scale and distribution of bias. Quantitative verification can estimate violation ratios and characterize bias regions, which is crucial for diagnosing and mitigating bias. However, existing quantitative methods mainly target DNNs and have limitations when adapted to tree ensembles.Method: The authors extend Counterexample-Guided Abstraction Refinement (CEGAR) framework into a model-agnostic form but identify its limitations. They then exploit the discrete structure of tree ensembles to develop an efficient quantification technique that provides anytime upper and lower bounds for fairness violations.
Result: Experiments on five widely used datasets demonstrate the method's effectiveness and efficiency. When applied to fairness testing, the quantification method significantly outperforms state-of-the-art testing techniques.
Conclusion: The proposed method enables comprehensive quantitative verification of fairness in tree ensembles, providing both upper and lower bounds for violation ratios, which offers more informative bias diagnosis compared to traditional single-counterexample approaches.
Abstract: This work focuses on quantitative verification of fairness in tree ensembles. Unlike traditional verification approaches that merely return a single counterexample when the fairness is violated, quantitative verification estimates the ratio of all counterexamples and characterizes the regions where they occur, which is important information for diagnosing and mitigating bias. To date, quantitative verification has been explored almost exclusively for deep neural networks (DNNs). Representative methods, such as DeepGemini and FairQuant, all build on the core idea of Counterexample-Guided Abstraction Refinement, a generic framework that could be adapted to other model classes. We extended the framework into a model-agnostic form, but discovered two limitations: (i) it can provide only lower bounds, and (ii) its performance scales poorly. Exploiting the discrete structure of tree ensembles, our work proposes an efficient quantification technique that delivers any-time upper and lower bounds. Experiments on five widely used datasets demonstrate its effectiveness and efficiency. When applied to fairness testing, our quantification method significantly outperforms state-of-the-art testing techniques.
[379] Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
Main category: cs.LG
TL;DR: Kascade is a training-free sparse attention method that speeds up LLM inference by reusing top-k attention indices across layers, achieving up to 4.1x decode speedup while maintaining accuracy.
Details
Motivation: Attention computation is the dominant source of latency in long-context LLM inference, which is increasingly important for reasoning models and RAG applications. Current attention mechanisms are computationally expensive, especially for long contexts.Method: Kascade leverages two observations: 1) post-softmax attention is intrinsically sparse, and 2) high-weight key identities are stable across nearby layers. It computes exact Top-k indices in anchor layers selected via dynamic programming to maximize cross-layer similarity, then reuses those indices in intermediate layers. The method is head-aware and incorporates efficient tile-level operations for both prefill and decode attention.
Result: Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs, while closely matching dense attention accuracy on long-context benchmarks like LongBench and AIME-24.
Conclusion: Kascade provides an effective training-free approach to accelerate long-context LLM inference by exploiting attention sparsity and cross-layer stability, offering significant speed improvements while maintaining model accuracy.
Abstract: Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
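A toy, single-head version of the index-reuse idea can be sketched as follows: exact top-k key indices are computed at an anchor layer and then reused to sparsify attention at a nearby layer. This ignores Kascade's head-awareness, tile-level kernels, and the dynamic-programming anchor selection; the shapes and the `top_k` value are arbitrary.

```python
import torch
import torch.nn.functional as F

def topk_indices(q, k, top_k):
    """Exact top-k key indices for each query (computed at an anchor layer)."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return scores.topk(top_k, dim=-1).indices

def sparse_attention_with_reuse(q, k, v, reuse_idx):
    """Attend only over the keys selected at the anchor layer (reuse layer)."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, reuse_idx, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v

# Toy shapes: 16 queries/keys, head dim 8, keep 4 keys per query.
q1, k1 = torch.randn(16, 8), torch.randn(16, 8)                           # anchor layer
q2, k2, v2 = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)   # reuse layer
idx = topk_indices(q1, k1, top_k=4)
out = sparse_attention_with_reuse(q2, k2, v2, idx)
print(out.shape)  # torch.Size([16, 8])
```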
[380] Geometric Laplace Neural Operator
Hao Tang, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li
Main category: cs.LG
TL;DR: GLNO: A neural operator framework using Laplace spectral representation on Riemannian manifolds for learning PDE solutions on irregular geometries without requiring periodicity or uniform grids.
Details
Motivation: Existing neural operators struggle with non-periodic excitations, transient responses, and signals on irregular/non-Euclidean geometries, limiting their applicability to real-world problems with complex domains.Method: Proposes a generalized operator learning framework using pole-residue decomposition with exponential basis functions, then introduces Geometric Laplace Neural Operator (GLNO) that embeds Laplace spectral representation into Laplace-Beltrami eigen-basis for arbitrary Riemannian manifolds, with practical grid-invariant GLNONet architecture.
Result: Extensive experiments on PDEs/ODEs and real-world datasets demonstrate robust performance improvements over state-of-the-art models, showing effectiveness on irregular geometries without periodicity or uniform grid requirements.
Conclusion: GLNO successfully extends neural operator learning to arbitrary Riemannian manifolds, overcoming limitations of existing methods for non-periodic, transient, and irregular geometry problems through Laplace spectral representation and geometric embeddings.
Abstract: Neural operators have emerged as powerful tools for learning mappings between function spaces, enabling efficient solutions to partial differential equations across varying inputs and domains. Despite the success, existing methods often struggle with non-periodic excitations, transient responses, and signals defined on irregular or non-Euclidean geometries. To address this, we propose a generalized operator learning framework based on a pole-residue decomposition enriched with exponential basis functions, enabling expressive modeling of aperiodic and decaying dynamics. Building on this formulation, we introduce the Geometric Laplace Neural Operator (GLNO), which embeds the Laplace spectral representation into the eigen-basis of the Laplace-Beltrami operator, extending operator learning to arbitrary Riemannian manifolds without requiring periodicity or uniform grids. We further design a grid-invariant network architecture (GLNONet) that realizes GLNO in practice. Extensive experiments on PDEs/ODEs and real-world datasets demonstrate our robust performance over other state-of-the-art models.
[381] Multi-Fidelity Delayed Acceptance: hierarchical MCMC sampling for Bayesian inverse problems combining multiple solvers through deep neural networks
Filippo Zacchei, Paolo Conti, Attilio Alberto Frangi, Andrea Manzoni
Main category: cs.LG
TL;DR: Multi-Fidelity Delayed Acceptance scheme for Bayesian inverse problems using neural networks to combine predictions from solvers of varying fidelity, avoiding expensive high-fidelity simulations during online inference.
Details
Motivation: Traditional inverse uncertainty quantification methods are computationally expensive for physics-based models, especially with PDEs. High-fidelity simulations make MCMC infeasible, while low-fidelity data alone reduces accuracy. Need efficient method that leverages both fidelity levels.Method: Extends Multi-Level Delayed Acceptance with multi-fidelity neural networks that combine predictions from solvers of varying fidelity. High-fidelity evaluations only in offline training; online phase uses coarse solvers + trained networks. Allows heterogeneous coarse solvers in hierarchy.
Result: Improves approximation accuracy of low-fidelity solvers, enables longer sub-chain lengths, better mixing, and accelerated posterior inference. Demonstrated computational savings on groundwater flow and reaction-diffusion inverse problems.
Conclusion: Proposed method provides flexible framework for Bayesian inverse problems that balances computational efficiency with accuracy by leveraging multi-fidelity data and neural networks, avoiding expensive high-fidelity simulations during inference.
Abstract: Inverse uncertainty quantification (UQ) tasks such as parameter estimation are computationally demanding whenever dealing with physics-based models, and typically require repeated evaluations of complex numerical solvers. When partial differential equations are involved, full-order models such as those based on the Finite Element Method can make traditional sampling approaches like Markov Chain Monte Carlo (MCMC) computationally infeasible. Although data-driven surrogate models may help reduce evaluation costs, their utility is often limited by the expense of generating high-fidelity data. In contrast, low-fidelity data can be produced more efficiently, although relying on them alone may degrade the accuracy of the inverse UQ solution. To address these challenges, we propose a Multi-Fidelity Delayed Acceptance scheme for Bayesian inverse problems. Extending the Multi-Level Delayed Acceptance framework, the method introduces multi-fidelity neural networks that combine the predictions of solvers of varying fidelity, with high fidelity evaluations restricted to an offline training stage. During the online phase, likelihood evaluations are obtained by evaluating the coarse solvers and passing their outputs to the trained neural networks, thereby avoiding additional high-fidelity simulations. This construction allows heterogeneous coarse solvers to be incorporated consistently within the hierarchy, providing greater flexibility than standard Multi-Level Delayed Acceptance. The proposed approach improves the approximation accuracy of the low fidelity solvers, leading to longer sub-chain lengths, better mixing, and accelerated posterior inference. The effectiveness of the strategy is demonstrated on two benchmark inverse problems involving (i) steady isotropic groundwater flow, (ii) an unsteady reaction-diffusion system, for which substantial computational savings are obtained.
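The underlying two-stage accept/reject mechanism is standard delayed-acceptance Metropolis, sketched below with a symmetric proposal: a candidate is first screened with a cheap posterior and only then evaluated under the full posterior with a correction factor. The paper's multi-fidelity neural-network surrogates and multi-level hierarchy are not modeled here; both log-posteriors in the usage example are toy Gaussians.

```python
import numpy as np

def delayed_acceptance_step(x, log_post_cheap, log_post_full, proposal_std=0.5, rng=None):
    """One generic delayed-acceptance Metropolis step: the candidate must first pass
    an accept/reject test under the cheap posterior; only then is the expensive
    posterior evaluated, with a correction that keeps the chain targeting it."""
    rng = np.random.default_rng() if rng is None else rng
    y = x + proposal_std * rng.standard_normal(x.shape)

    # Stage 1: screen with the cheap approximation.
    log_a1 = log_post_cheap(y) - log_post_cheap(x)
    if np.log(rng.uniform()) >= log_a1:
        return x

    # Stage 2: correct with the full posterior.
    log_a2 = (log_post_full(y) - log_post_full(x)) - log_a1
    if np.log(rng.uniform()) < log_a2:
        return y
    return x

# Toy usage: the cheap posterior is a slightly shifted version of the full one.
log_full = lambda t: -0.5 * np.sum(t ** 2)
log_cheap = lambda t: -0.5 * np.sum((t - 0.1) ** 2)
x = np.zeros(2)
for _ in range(1000):
    x = delayed_acceptance_step(x, log_cheap, log_full)
```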
[382] Emergent Bias and Fairness in Multi-Agent Decision Systems
Maeve Madigan, Parameswaran Kamalaruban, Glenn Moynihan, Tom Kempton, David Sutton, Stuart Burrell
Main category: cs.LG
TL;DR: Multi-agent systems in finance need fairness evaluation methods to detect emergent biases that can’t be traced to individual agents, requiring holistic assessment rather than component-level analysis.
Details
Motivation: Multi-agent systems improve predictive performance but lack effective fairness evaluation methodologies, making deployment unsafe in high-stakes financial domains where biased decisions lead to regulatory breaches and financial losses.Method: Develop fairness evaluation methodologies for multi-agent predictive systems and examine fairness metrics using large-scale simulations across diverse multi-agent configurations with varying communication and collaboration mechanisms.
Result: Revealed patterns of emergent bias in financial decision-making that cannot be traced to individual agent components, showing multi-agent systems exhibit genuinely collective behaviors with fairness risks representing significant model risk.
Conclusion: Multi-agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components, as fairness risks in financial multi-agent systems have tangible impacts on tasks like credit scoring and income estimation.
Abstract: Multi-agent systems have demonstrated the ability to improve performance on a variety of predictive tasks by leveraging collaborative decision making. However, the lack of effective evaluation methodologies has made it difficult to estimate the risk of bias, making deployment of such systems unsafe in high stakes domains such as consumer finance, where biased decisions can translate directly into regulatory breaches and financial loss. To address this challenge, we need to develop fairness evaluation methodologies for multi-agent predictive systems and measure the fairness characteristics of these systems in the financial tabular domain. Examining fairness metrics using large-scale simulations across diverse multi-agent configurations, with varying communication and collaboration mechanisms, we reveal patterns of emergent bias in financial decision-making that cannot be traced to individual agent components, indicating that multi-agent systems may exhibit genuinely collective behaviors. Our findings highlight that fairness risks in financial multi-agent systems represent a significant component of model risk, with tangible impacts on tasks such as credit scoring and income estimation. We advocate that multi-agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components.
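Because the argument is about evaluating the system's final decisions rather than its components, standard group-fairness metrics can be applied directly to the pipeline's outputs. The sketch below computes demographic-parity and equal-opportunity gaps for a binary protected attribute; the decision and label arrays are made-up placeholders, and these are generic metrics rather than necessarily the exact ones used in the paper.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-decision rates between two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return abs(rates[0] - rates[1])

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates between two groups."""
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        tprs.append(y_pred[mask].mean())
    return abs(tprs[0] - tprs[1])

# Evaluate the *final* decisions of a multi-agent pipeline, not any single agent.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # system-level credit decisions
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # binary protected attribute
print(demographic_parity_gap(y_pred, group), equal_opportunity_gap(y_true, y_pred, group))
```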
[383] A Novel Proposal in Wind Turbine Blade Failure Detection: An Integrated Approach to Energy Efficiency and Sustainability
Jordan Abarca-Albores, Danna Cristina GutiĂŠrrez Cabrera, Luis Antonio Salazar-Licea, Dante Ruiz-Robles, Jesus Alejandro Franco, Alberto-Jesus Perea-Moreno, David MuĂąoz-RodrĂguez, Quetzalcoatl Hernandez-Escobedo
Main category: cs.LG
TL;DR: This paper presents a novel computational learning methodology for wind turbine blade fault detection, comparing logistic regression (which outperformed other supervised methods) with clustering (which showed superior precision and data segmentation).
Details
Motivation: The motivation is to develop an effective early fault detection system for wind turbine blades to enhance system reliability in the wind energy sector, addressing the need for practical computational learning solutions.Method: The methodology evaluates two computational learning approaches: (1) logistic regression model compared against neural networks, decision trees, and naive Bayes, and (2) clustering techniques for data segmentation. The study uses accessible tools like Orange Data Mining for implementation.
Result: Logistic regression outperformed other supervised methods (neural networks, decision trees, naive Bayes) in fault detection, while clustering achieved superior performance in precision and data segmentation, suggesting it may better capture underlying data characteristics.
Conclusion: The proposed methodology offers a new approach to early fault detection in wind turbine blades, demonstrating the potential of integrating different computational learning techniques. Future work will focus on combining these methods to improve accuracy and extend applications to other energy infrastructure components.
Abstract: This paper presents a novel methodology for detecting faults in wind turbine blades using computational learning techniques. The study evaluates two models: the first employs logistic regression, which outperformed neural networks, decision trees, and the naive Bayes method, demonstrating its effectiveness in identifying fault-related patterns. The second model leverages clustering and achieves superior performance in terms of precision and data segmentation. The results indicate that clustering may better capture the underlying data characteristics compared to supervised methods. The proposed methodology offers a new approach to early fault detection in wind turbine blades, highlighting the potential of integrating different computational learning techniques to enhance system reliability. The use of accessible tools like Orange Data Mining underscores the practical application of these advanced solutions within the wind energy sector. Future work will focus on combining these methods to improve detection accuracy further and extend the application of these techniques to other critical components in energy infrastructure.
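A minimal scikit-learn version of the comparison described above, supervised logistic regression versus unsupervised clustering scored against the fault labels, might look like the following; the synthetic features stand in for the blade measurements, and the study itself used Orange Data Mining rather than this code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

# Placeholder vibration-style features and fault labels (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(2.0, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Supervised route: logistic regression fault classifier.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("logistic regression accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Unsupervised route: clustering, scored against the labels for comparison.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("clustering ARI vs. labels:", adjusted_rand_score(y, clusters))
```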
[384] IoMT-based Automated Leukemia Classification using CNN and Higher Order Singular Value
Shabnam Bagheri Marzijarani, Mohammad Zolfaghari, Hedieh Sajedi
Main category: cs.LG
TL;DR: A CNN-HOSVD hybrid model for automated leukemia detection from blood images achieves 98.88% accuracy on ALL-IDB2 dataset, integrated with IoMT for real-time clinical communication.
Details
Motivation: Manual diagnosis of Acute Lymphocytic Leukemia (ALL) is time-consuming and prone to human error, creating need for automated AI-based solutions for early detection and treatment.Method: Combines Convolutional Neural Network (CNN) for feature extraction with Higher Order Singular Value Decomposition (HOSVD) classifier to categorize ALL vs. normal cells from microscopic blood images, integrated into IoMT structure.
Result: Achieved 98.88% average accuracy on the Acute Lymphoblastic Leukemia Image Database (ALL-IDB2) test set.
Conclusion: The CNN-HOSVD hybrid model provides accurate, automated leukemia detection suitable for IoMT deployment, enabling real-time communication between patients and clinicians for early diagnosis.
Abstract: The Internet of Things (IoT) is a concept by which objects find identity and can communicate with each other in a network. One of the applications of the IoT is in the field of medicine, which is called the Internet of Medical Things (IoMT). Acute Lymphocytic Leukemia (ALL) is a type of cancer categorized as a hematic disease. It usually begins in the bone marrow due to the overproduction of immature White Blood Cells (WBCs or leukocytes). Since it has a high rate of spread to other body organs, it is a fatal disease if not diagnosed and treated early. Therefore, for identifying cancerous (ALL) cells in medical diagnostic laboratories, blood, as well as bone marrow smears, are taken by pathologists. However, manual examinations face limitations due to human error risk and time-consuming procedures. To tackle these issues, methods based on Artificial Intelligence (AI), capable of distinguishing cancerous from non-cancerous tissue, seem vital. Deep Neural Networks (DNNs) are among the most effective machine learning (ML) methods; they employ multiple layers to extract higher-level features from the raw input. In this paper, a Convolutional Neural Network (CNN) is applied along with a new type of classifier, Higher Order Singular Value Decomposition (HOSVD), to categorize ALL and normal (healthy) cells from microscopic blood images. We deployed the model within an IoMT structure to identify leukemia quickly and safely. With the help of this new leukemia classification framework, patients and clinicians can have real-time communication. The model was implemented on the Acute Lymphoblastic Leukemia Image Database (ALL-IDB2) and achieved an average accuracy of 98.88% on the test set.
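Higher-order SVD itself is a standard decomposition. The sketch below computes a truncated HOSVD of a 3-way tensor of CNN-style feature maps with NumPy; how the resulting factors are turned into the paper's classifier is not reproduced here, and the tensor and ranks are placeholders.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated higher-order SVD: per-mode factor matrices plus a core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = tensor
    for mode, u in enumerate(factors):
        # Multiply the core along `mode` by u.T (i.e. core x_mode u.T).
        core = np.moveaxis(np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Toy stack of 32 feature maps of size 16x16 (e.g. CNN features per image).
T = np.random.randn(32, 16, 16)
core, factors = hosvd(T, ranks=(8, 8, 8))
print(core.shape, [f.shape for f in factors])  # (8, 8, 8), [(32, 8), (16, 8), (16, 8)]
```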
[385] Batch Normalization-Free Fully Integer Quantized Neural Networks via Progressive Tandem Learning
Pengfei Sun, Wenyu Jiang, Piew Yoong Chee, Paul Devos, Dick Botteldooren
Main category: cs.LG
TL;DR: BN-free fully integer QNN trained via progressive layer-wise distillation that enables true integer-only deployment without batch normalization.
Details
Motivation: Current quantized neural networks still depend on batch normalization layers that prevent true integer-only deployment, which is needed for resource-constrained edge and embedded devices.Method: Progressive layer-wise distillation scheme that starts from a pretrained BN-enabled teacher, uses layer-wise targets and progressive compensation to train a student model that performs inference exclusively with integer arithmetic without BN operations.
Result: On ImageNet with AlexNet, the BN-free model attains competitive Top-1 accuracy under aggressive quantization. The procedure integrates directly with standard quantization workflows.
Conclusion: Enables end-to-end integer-only inference for resource-constrained settings, providing a practical solution for deploying QNNs without batch normalization dependencies.
Abstract: Quantised neural networks (QNNs) shrink models and reduce inference energy through low-bit arithmetic, yet most still depend on a running statistics batch normalisation (BN) layer, preventing true integer-only deployment. Prior attempts remove BN by parameter folding or tailored initialisation; while helpful, they rarely recover BN's stability and accuracy and often impose bespoke constraints. We present a BN-free, fully integer QNN trained via a progressive, layer-wise distillation scheme that slots into existing low-bit pipelines. Starting from a pretrained BN-enabled teacher, we use layer-wise targets and progressive compensation to train a student that performs inference exclusively with integer arithmetic and contains no BN operations. On ImageNet with AlexNet, the BN-free model attains competitive Top-1 accuracy under aggressive quantisation. The procedure integrates directly with standard quantisation workflows, enabling end-to-end integer-only inference for resource-constrained settings such as edge and embedded devices.
[386] Persistent Multiscale Density-based Clustering
DaniĂŤl Bot, Leland McInnes, Jan Aerts
Main category: cs.LG
TL;DR: PLSCAN is a new density-based clustering algorithm that automatically identifies stable clusters across all minimum cluster sizes, outperforming HDBSCAN* in accuracy while maintaining competitive computational efficiency.
Details
Motivation: Density-based clustering algorithms like DBSCAN and HDBSCAN* require hyperparameter selection (density threshold or minimum cluster size) which is difficult without prior knowledge of data distribution. This creates practical challenges for exploratory data analysis where few assumptions should be made about the data.Method: PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. It efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters, eliminating the need to manually select the minimum cluster size parameter.
Result: PLSCAN achieves higher average Adjusted Rand Index (ARI) than HDBSCAN* on real-world datasets and is less sensitive to changes in the number of mutual reachability neighbors. It shows competitive run-times with k-Means on low-dimensional datasets, while scaling similarly to HDBSCAN* at higher dimensions.
Conclusion: PLSCAN provides an effective solution to the hyperparameter selection problem in density-based clustering, making it particularly suitable for exploratory data analysis where prior knowledge about data distribution is limited.
Abstract: Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN's computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.
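The sensitivity PLSCAN removes can be seen by sweeping HDBSCAN*'s minimum cluster size and scoring each result, as in the sketch below (using scikit-learn's HDBSCAN, available from version 1.3); the blob data and the chosen parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import HDBSCAN          # available in scikit-learn >= 1.3
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=600, centers=4, cluster_std=1.2, random_state=0)

# The hyperparameter choice PLSCAN removes: results vary with min_cluster_size.
for mcs in (5, 15, 50, 150):
    labels = HDBSCAN(min_cluster_size=mcs).fit_predict(X)
    print(f"min_cluster_size={mcs:<4d} ARI={adjusted_rand_score(y, labels):.3f}")
```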
[387] Abacus: Self-Supervised Event Counting-Aligned Distributional Pretraining for Sequential User Modeling
Sullivan Castro, Artem Betlei, Thomas Di Martino, Nadir El Manouzi
Main category: cs.LG
TL;DR: Abacus: Self-supervised pretraining for user purchase prediction in display advertising by predicting empirical frequency distributions of user events, combined with sequential learning for improved performance.
Details
Motivation: Current display advertising systems face challenges in modeling user purchase behavior due to sparse positive events, stochastic user actions, severe class imbalance, and irregular event timing. Existing approaches either rely on hand-crafted counter features (missing temporal evolution) or sequential models (missing event-counting statistics).Method: Proposes Abacus - a self-supervised pretraining approach that predicts empirical frequency distribution of user events. Introduces hybrid objective unifying Abacus with sequential learning objectives, combining aggregated statistics stability with sequence modeling sensitivity.
Result: Experiments on two real-world datasets show Abacus pretraining outperforms existing methods in accelerating downstream task convergence. Hybrid approach yields up to +6.1% AUC improvement compared to baselines.
Conclusion: Abacus provides an effective self-supervised pretraining strategy for display advertising that bridges the gap between aggregated statistics and sequential modeling, significantly improving user purchase prediction performance.
Abstract: Modeling user purchase behavior is a critical challenge in display advertising systems, necessary for real-time bidding. The difficulty arises from the sparsity of positive user events and the stochasticity of user actions, leading to severe class imbalance and irregular event timing. Predictive systems usually rely on hand-crafted “counter” features, overlooking the fine-grained temporal evolution of user intent. Meanwhile, current sequential models extract direct sequential signal, missing useful event-counting statistics. We enhance deep sequential models with self-supervised pretraining strategies for display advertising. In particular, we introduce Abacus, a novel approach that predicts the empirical frequency distribution of user events. We further propose a hybrid objective unifying Abacus with sequential learning objectives, combining the stability of aggregated statistics with the sensitivity of sequence modeling. Experiments on two real-world datasets show that Abacus pretraining outperforms existing methods in accelerating downstream task convergence, while the hybrid approach yields up to a +6.1% AUC improvement over the baselines.
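A sketch of the distributional pretraining objective as described: the model predicts the empirical frequency distribution of event types in a user's sequence, and a hybrid loss adds a standard next-event term. The encoder outputs, vocabulary size, and mixing weight below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def abacus_loss(pred_logits, event_ids, num_event_types):
    """Cross-entropy between predicted and empirical event-type distributions.

    pred_logits: (batch, num_event_types) model output per user sequence.
    event_ids:   (batch, seq_len) integer event types observed for each user.
    """
    counts = F.one_hot(event_ids, num_event_types).float().sum(dim=1)
    target = counts / counts.sum(dim=-1, keepdim=True)       # empirical frequencies
    return -(target * F.log_softmax(pred_logits, dim=-1)).sum(dim=-1).mean()

def hybrid_loss(dist_logits, next_logits, event_ids, next_event, num_types, alpha=0.5):
    """Hybrid objective: distributional (Abacus) term plus sequential next-event term."""
    return (alpha * abacus_loss(dist_logits, event_ids, num_types)
            + (1 - alpha) * F.cross_entropy(next_logits, next_event))

# Toy usage with random model outputs and event sequences.
B, T, V = 4, 20, 10
loss = hybrid_loss(torch.randn(B, V), torch.randn(B, V),
                   torch.randint(0, V, (B, T)), torch.randint(0, V, (B,)), V)
```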
[388] Exploiting Radio Frequency Fingerprints for Device Identification: Tackling Cross-receiver Challenges in the Source-data-free Scenario
Liu Yang, Qiang Li, Luxiong Wen, Jian Yang
Main category: cs.LG
TL;DR: MS-SHOT: A source-data-free cross-receiver RFFI adaptation method using momentum-center-guided soft pseudo-labeling and global structural constraints to handle receiver variation and label shifts.
Details
Motivation: Deep learning-based RFFI models degrade when deployed across different receivers due to hardware-induced distribution shifts. Practical deployment requires adaptation without access to source data, and existing methods struggle with label shifts and non-uniform class distributions in target domains.Method: Proposes MS-SHOT (Momentum Soft pseudo-label Source Hypothesis Transfer) with: 1) constrained pseudo-labeling framework for SCRFFI, 2) momentum-center-guided soft pseudo-labeling, 3) global structural constraints for confident and diverse predictions, 4) theoretical analysis showing sensitivity to pseudo-label quality.
Result: Extensive experiments on real-world datasets show MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, effectively handling label shifts and non-uniform class distributions in target domains.
Conclusion: MS-SHOT provides a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI, addressing key limitations of prior methods and enabling robust device authentication across diverse edge computing hardware.
Abstract: With the rapid proliferation of edge computing, Radio Frequency Fingerprint Identification (RFFI) has become increasingly important for secure device authentication. However, practical deployment of deep learning-based RFFI models is hindered by a critical challenge: their performance often degrades significantly when applied across receivers with different hardware characteristics due to distribution shifts introduced by receiver variation. To address this, we investigate the source-data-free cross-receiver RFFI (SCRFFI) problem, where a model pretrained on labeled signals from a source receiver must adapt to unlabeled signals from a target receiver, without access to any source-domain data during adaptation. We first formulate a novel constrained pseudo-labeling-based SCRFFI adaptation framework, and provide a theoretical analysis of its generalization performance. Our analysis highlights a key insight: the target-domain performance is highly sensitive to the quality of the pseudo-labels generated during adaptation. Motivated by this, we propose Momentum Soft pseudo-label Source Hypothesis Transfer (MS-SHOT), a new method for SCRFFI that incorporates momentum-center-guided soft pseudo-labeling and enforces global structural constraints to encourage confident and diverse predictions. Notably, MS-SHOT effectively addresses scenarios involving label shift or unknown, non-uniform class distributions in the target domain – a significant limitation of prior methods. Extensive experiments on real-world datasets demonstrate that MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, offering a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI.
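A rough sketch of the momentum-center-guided soft pseudo-labeling idea, assuming target-domain features from the frozen source feature extractor; the EMA class centers, temperature, and diversity (information-maximization) term below are illustrative choices rather than MS-SHOT's exact formulation.

```python
import torch
import torch.nn.functional as F

def update_centers(centers, feats, soft_labels, momentum=0.9):
    """EMA (momentum) update of per-class centers from soft-assigned target features."""
    new_centers = soft_labels.t() @ feats / (soft_labels.sum(dim=0, keepdim=True).t() + 1e-8)
    return momentum * centers + (1 - momentum) * new_centers

def soft_pseudo_labels(feats, centers, temperature=0.1):
    """Soft labels from cosine similarity to the momentum class centers."""
    sims = F.normalize(feats, dim=-1) @ F.normalize(centers, dim=-1).t()
    return F.softmax(sims / temperature, dim=-1)

def adaptation_loss(logits, pseudo, eps=1e-8):
    """Pseudo-label cross-entropy plus a diversity term that discourages class collapse."""
    probs = F.softmax(logits, dim=-1)
    ce = -(pseudo * torch.log(probs + eps)).sum(dim=-1).mean()
    marginal = probs.mean(dim=0)
    diversity = (marginal * torch.log(marginal + eps)).sum()  # negative entropy of the marginal
    return ce + diversity

# Toy usage with random target features and a 5-class head.
feats, centers = torch.randn(32, 64), torch.randn(5, 64)
pseudo = soft_pseudo_labels(feats, centers)
centers = update_centers(centers, feats, pseudo)
loss = adaptation_loss(feats @ torch.randn(64, 5), pseudo)
```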
[389] Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification
Natnael Tilahun Sinshaw, Mengmei He, Tadesse K. Bahiru, Sudhir Kumar Mohapatra
Main category: cs.LG
TL;DR: Comparative analysis of ML algorithms (SVM, NB, LR, AdaBoost, XGBoost, SVM_R) and neuro-symbolic AI for text classification, showing NeSy matches MLP performance with limited data.
Details
Motivation: Text classification (like gender classification from blogs) has many applications in market analysis and recommendation systems, but there's a need to compare traditional ML approaches with emerging neuro-symbolic AI methods.Method: Comparative analysis of SVM, Naive Bayes, Logistic Regression, AdaBoost, XGBoost, and SVM_R with neuro-symbolic AI. Explored text representations (TF-IDF, Universal Sentence Encoder, RoBERTa) and feature extraction techniques (Chi-Square, Mutual Information, PCA).
Result: Neuro-symbolic AI approach matched strong MLP results despite limited dataset, showing competitive performance with traditional machine learning methods.
Conclusion: NeSy approach shows promise for text classification tasks. Future work will expand knowledge base, embedding types, and hyperparameter configurations to further study NeSy effectiveness.
Abstract: Text classification problems, such as gender classification from blog posts, form a mature research area that has been studied extensively with machine learning algorithms, with applications in market analysis, customer recommendation, and recommendation systems. This study presents a comparative analysis of widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R), against neuro-symbolic AI (NeSy). The paper also explores the effect of text representations such as TF-IDF, the Universal Sentence Encoder (USE), and RoBERTa. Additionally, various feature extraction techniques, including Chi-Square, Mutual Information, and Principal Component Analysis, are explored. Building on these, we compare the machine learning and deep learning approaches against NeSy. The experimental results show that the NeSy approach matched strong MLP results despite a limited dataset. Future work will expand the knowledge base, the scope of embedding types, and the hyperparameter configurations to further study the effectiveness of the NeSy approach.
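For the classical side of such a comparison, a pipeline like the one sketched below (TF-IDF features, chi-square feature selection, linear SVM) is a reasonable stand-in; the toy texts, labels, and parameter values are placeholders, not the study's setup.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("select", SelectKBest(chi2, k=10)),   # k would be much larger on real blog corpora
    ("svm", LinearSVC(C=1.0)),
])

# Toy usage with placeholder blog snippets and gender labels.
texts = ["loved the new football season", "baking cupcakes this weekend"] * 50
labels = ["M", "F"] * 50
pipeline.fit(texts, labels)
print(pipeline.predict(["weekend plans: cupcakes and football"]))
```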
[390] CLARiTy: A Vision Transformer for Multi-Label Classification and Weakly-Supervised Localization of Chest X-ray Pathologies
John M. Statheros, Hairong Wang, Richard Klein
Main category: cs.LG
TL;DR: CLARiTy is a vision transformer model for joint multi-label classification and weakly-supervised localization of chest X-ray pathologies, achieving state-of-the-art localization performance with 50.7% improvement over prior methods.
Details
Motivation: Chest X-ray interpretation faces challenges in accurate multi-label pathology classification and spatial localization, often constrained by scarce region-level annotations. There's a need for models that can perform both tasks effectively with only image-level labels.Method: CLARiTy uses vision transformers with multiple class-specific tokens to generate discriminative attention maps, and a SegmentCAM module for foreground segmentation/background suppression using anatomical priors. It employs distillation from a ConvNeXtV2 teacher and is trained on image-level labels from NIH ChestX-ray14 dataset.
Result: CLARiTy-S-16-512 achieves competitive classification performance across 14 pathologies and state-of-the-art weakly-supervised localization on 8 pathologies, outperforming prior methods by 50.7%. Particularly strong gains for small pathologies like nodules and masses. The lower-resolution variant CLARiTy-S-16-224 offers high efficiency while surpassing baselines.
Conclusion: CLARiTy advances beyond CNN-ViT hybrids by leveraging ViT self-attention for global context and class-specific localization, refined through convolutional background suppression for precise, noise-reduced heatmaps, making it suitable for low-resource settings.
Abstract: The interpretation of chest X-rays (CXRs) poses significant challenges, particularly in achieving accurate multi-label pathology classification and spatial localization. These tasks demand different levels of annotation granularity but are frequently constrained by the scarcity of region-level (dense) annotations. We introduce CLARiTy (Class Localizing and Attention Refining Image Transformer), a vision transformer-based model for joint multi-label classification and weakly-supervised localization of thoracic pathologies. CLARiTy employs multiple class-specific tokens to generate discriminative attention maps, and a SegmentCAM module for foreground segmentation and background suppression using explicit anatomical priors. Trained on image-level labels from the NIH ChestX-ray14 dataset, it leverages distillation from a ConvNeXtV2 teacher for efficiency. Evaluated on the official NIH split, the CLARiTy-S-16-512 (a configuration of CLARiTy), achieves competitive classification performance across 14 pathologies, and state-of-the-art weakly-supervised localization performance on 8 pathologies, outperforming prior methods by 50.7%. In particular, pronounced gains occur for small pathologies like nodules and masses. The lower-resolution variant of CLARiTy, CLARiTy-S-16-224, offers high efficiency while decisively surpassing baselines, thereby having the potential for use in low-resource settings. An ablation study confirms contributions of SegmentCAM, DINO pretraining, orthogonal class token loss, and attention pooling. CLARiTy advances beyond CNN-ViT hybrids by harnessing ViT self-attention for global context and class-specific localization, refined through convolutional background suppression for precise, noise-reduced heatmaps.
[391] Towards Reproducibility in Predictive Process Mining: SPICE - A Deep Learning Library
Oliver Stritzel, Nick Hühnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann, Moike Buck, Attila Lischka, Christian Frey
Main category: cs.LG
TL;DR: SPICE is a Python framework that reimplements three deep-learning-based Predictive Process Mining methods in PyTorch with a common base for reproducible benchmarking.
Details
Motivation: Existing Predictive Process Mining approaches lack reproducibility, transparency, and standardized benchmarking, making comparisons between different implementations difficult.Method: Developed SPICE framework in Python/PyTorch that reimplements three popular deep-learning-based PPM methods with a common base architecture and rigorous configurability.
Result: Compared SPICE to original reported metrics and conducted fair comparisons on 11 datasets, enabling reproducible and robust evaluation of PPM approaches.
Conclusion: SPICE provides a standardized framework for reproducible benchmarking of Predictive Process Mining methods, addressing key limitations in current PPM research practices.
Abstract: In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches lack reproducibility, transparency in decision making, and usability for incorporating novel datasets and benchmarking, which makes comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular baseline deep-learning-based methods for PPM in PyTorch within a common base framework with rigorous configurability, enabling reproducible and robust comparison of past and future modelling approaches. We compare SPICE against the originally reported metrics and provide fair comparisons on 11 datasets.
[392] Phishing Detection System: An Ensemble Approach Using Character-Level CNN and Feature Engineering
Rudra Dubey, Arpit Mani Tripathi, Archit Srivastava, Sarvpal Singh
Main category: cs.LG
TL;DR: AI ensemble model combining character-level CNN and LightGBM achieves 99.8% accuracy in phishing URL detection with real-time FastAPI service.
Details
Motivation: Phishing attacks remain prevalent cybersecurity threats with constantly evolving tactics, requiring advanced detection systems to protect users from malicious URLs.Method: Ensemble approach combining character-level CNN for sequential feature extraction with LightGBM using 36 engineered lexical, structural, and domain-based features from URLs.
Result: Achieved 99.819% accuracy, 100% precision, 99.635% recall, and 99.947% ROC-AUC on test dataset of 19,873 URLs, outperforming individual models (60% CNN, 40% LightGBM contribution).
Conclusion: The ensemble model effectively detects modern phishing techniques with extremely low false positive rates and has been implemented as a real-time detection service via FastAPI with user interface.
Abstract: Phishing attacks remain one of the most prevalent cybersecurity risks today, with malevolent actors constantly changing their strategies to trick users. This paper presents an AI model for a phishing detection system that uses an ensemble approach to combine a character-level Convolutional Neural Network (CNN) and LightGBM with engineered features. Our system uses the character-level CNN to extract sequential features in addition to 36 lexical, structural, and domain-based features extracted from the URLs. On a test dataset of 19,873 URLs, the ensemble model achieves an accuracy of 99.819 percent, precision of 100 percent, recall of 99.635 percent, and ROC-AUC of 99.947 percent. The proposed system is deployed as a FastAPI-based service with an intuitive user interface to provide real-time detection. The results demonstrate that the ensemble performs better than the individual models, with LightGBM contributing 40 percent and the character-level CNN contributing 60 percent to the final prediction. The proposed method maintains extremely low false positive rates while effectively identifying contemporary phishing techniques. Index Terms - Phishing detection, machine learning, deep learning, CNN, ensemble methods, cybersecurity, URL analysis
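The reported 60/40 ensemble reduces to a weighted average of the two models' phishing probabilities; below is a minimal sketch of that combination step, assuming a trained character-level CNN and a trained LightGBM model are already available (their training is omitted, and the probabilities shown are made up).

```python
import numpy as np

def ensemble_phishing_score(p_cnn, p_lgbm, w_cnn=0.6, w_lgbm=0.4, threshold=0.5):
    """Weighted soft-voting over the two models' phishing probabilities.

    p_cnn, p_lgbm: arrays of P(phishing) from the char-CNN and LightGBM models.
    Returns the blended score and the binary decision.
    """
    score = w_cnn * np.asarray(p_cnn) + w_lgbm * np.asarray(p_lgbm)
    return score, (score >= threshold).astype(int)

# Toy usage with made-up per-URL probabilities from the two models.
score, label = ensemble_phishing_score([0.91, 0.12, 0.55], [0.88, 0.30, 0.40])
print(score, label)
```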
[393] Polyharmonic Spline Packages: Composition, Efficient Procedures for Computation and Differentiation
Yuriy N. Bakhvalov
Main category: cs.LG
TL;DR: A cascade architecture of polyharmonic spline packages addresses scalability and high-dimensional limitations of previous kernel regression methods.
Details
Motivation: Previous polyharmonic spline kernel regression has O(N^3) computational cost and breaks down in high-dimensional spaces, limiting practical application.Method: Proposes a cascade architecture built from packages of polyharmonic splines, with efficient matrix procedures for forward computation and end-to-end differentiation.
Result: The cascade approach simultaneously addresses scalability issues and provides theoretical justification for problems with unknown intrinsic low dimensionality.
Conclusion: The cascade architecture enables practical application of theoretically optimal polyharmonic spline regression to scalable, high-dimensional problems.
Abstract: In a previous paper it was shown that a machine learning regression problem can be solved within the framework of random function theory, with the optimal kernel analytically derived from symmetry and indifference principles and coinciding with a polyharmonic spline. However, a direct application of that solution is limited by O(N^3) computational cost and by a breakdown of the original theoretical assumptions when the input space has excessive dimensionality. This paper proposes a cascade architecture built from packages of polyharmonic splines that simultaneously addresses scalability and is theoretically justified for problems with unknown intrinsic low dimensionality. Efficient matrix procedures are presented for forward computation and end-to-end differentiation through the cascade.
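As background for the single-spline building block, a direct polyharmonic (thin-plate, 2-D) spline fit solves a dense N x N linear system, which is where the O(N^3) cost comes from; the sketch below shows that baseline (with the usual low-order polynomial tail omitted for brevity), not the proposed cascade.

```python
import numpy as np

def thin_plate_kernel(r):
    """2-D polyharmonic spline basis: phi(r) = r^2 log r (defined as 0 at r = 0)."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_polyharmonic(X, y, reg=1e-8):
    """Fit weights by solving the dense N x N system (the O(N^3) bottleneck)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = thin_plate_kernel(r) + reg * np.eye(len(X))
    return np.linalg.solve(K, y)

def predict_polyharmonic(X_train, weights, X_new):
    r = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1)
    return thin_plate_kernel(r) @ weights

# Toy usage: interpolate a smooth function of two variables.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(X[:, 0]) + np.cos(X[:, 1])
w = fit_polyharmonic(X, y)
print(predict_polyharmonic(X, w, np.array([[0.1, -0.2]])))
```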
[394] KOSS: Kalman-Optimal Selective State Spaces for Long-Term Sequence Modeling
Lei Wang, Xin Tan, Mingwei Wang, Ying Zhang
Main category: cs.LG
TL;DR: KOSS is a Kalman-optimal Selective State Space model that formulates selection as latent state uncertainty minimization, achieving superior performance in context-aware sequence modeling tasks.
Details
Motivation: Current selective state space models (SSMs) like Mamba lack theoretical grounding and cannot support context-aware selection from latent state dynamics, limiting their effectiveness in complex sequence modeling tasks.Method: KOSS formulates selection as latent state uncertainty minimization using estimation theory. It employs continuous-time latent updates driven by Kalman gain for dynamic information modulation, global spectral differentiation for frequency-domain derivative estimation, and segment-wise scan for hardware-efficient processing.
Result: On selective copying tasks with distractors, KOSS achieves over 79% accuracy vs. baselines below 20%. Across nine long-term forecasting benchmarks, it reduces MSE by 2.92-36.23% and outperforms SOTA models. SSR tracking case study confirms robustness under irregular intervals and noisy conditions.
Conclusion: KOSS provides a theoretically grounded, closed-loop, context-aware selectivity mechanism that significantly improves performance in sequence modeling tasks while maintaining computational efficiency and real-world applicability.
Abstract: Recent selective state space models (SSMs), such as Mamba and Mamba-2, have demonstrated strong performance in sequence modeling owing to input-dependent selection mechanisms. However, these mechanisms lack theoretical grounding and cannot support context-aware selection from latent state dynamics. To address these limitations, we propose KOSS, a Kalman-optimal Selective State Space model that formulates selection as latent state uncertainty minimization. Derived from estimation theory, KOSS adopts a continuous-time latent update driven by a Kalman gain that dynamically modulates information propagation based on content and context, enabling a closed-loop, context-aware selectivity mechanism. To ensure stable computation and near-linear scalability, KOSS employs global spectral differentiation for frequency-domain derivative estimation, along with a segment-wise scan for hardware-efficient processing. On a selective copying task with distractors, KOSS achieves over 79% accuracy while baselines drop below 20%, demonstrating robust context-aware selection. Furthermore, across nine long-term forecasting benchmarks, KOSS reduces MSE by 2.92–36.23% and consistently outperforms state-of-the-art models in both accuracy and stability. To assess real-world applicability, a case study on secondary surveillance radar (SSR) tracking confirms KOSS’s robustness under irregular intervals and noisy conditions and demonstrates its effectiveness in real-world applications. Finally, supplementary experiments verify Kalman gain convergence and the frequency response of spectral differentiation, providing theoretical support for the proposed closed-loop design.
[395] Machine Learning Algorithms: Detection Official Hajj and Umrah Travel Agency Based on Text and Metadata Analysis
Wisnu Uriawan, Muhamad Veva Ramadhan, Firman Adi Nugraha, Hasbi Nur Wahid, M Dantha Arianvasya, Muhammad Zaki Alghifari
Main category: cs.LG
TL;DR: This paper implements machine learning algorithms (SVM, RF, Naive Bayes) to detect counterfeit Hajj/Umrah mobile apps in Indonesia, using hybrid feature extraction (TF-IDF + metadata analysis) to achieve 92.3% accuracy with SVM.
Details
Motivation: The digitalization of Hajj and Umrah services in Indonesia has created opportunities for digital fraud through counterfeit mobile applications, causing financial losses and privacy risks by harvesting sensitive personal data.Method: The study uses a comprehensive dataset of official and unofficial applications, comparing three classifiers (SVM, Random Forest, Naive Bayes) with hybrid feature extraction combining Textual Analysis (TF-IDF) of app descriptions and Metadata Analysis of sensitive access permissions.
Result: SVM achieved the highest performance with 92.3% accuracy, 91.5% precision, and 92.0% F1-score. Feature analysis showed specific legality keywords and high-risk permissions (like READ PHONE STATE) were most significant discriminators.
Conclusion: The system provides a proactive, scalable solution to enhance digital trust in religious tourism, potentially serving as a prototype for a national verification system to combat counterfeit mobile applications.
Abstract: The rapid digitalization of Hajj and Umrah services in Indonesia has significantly facilitated pilgrims but has concurrently opened avenues for digital fraud through counterfeit mobile applications. These fraudulent applications not only inflict financial losses but also pose severe privacy risks by harvesting sensitive personal data. This research aims to address this critical issue by implementing and evaluating machine learning algorithms to verify application authenticity automatically. Using a comprehensive dataset comprising both official applications registered with the Ministry of Religious Affairs and unofficial applications circulating on app stores, we compare the performance of three robust classifiers: Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB). The study utilizes a hybrid feature extraction methodology that combines Textual Analysis (TF-IDF) of application descriptions with Metadata Analysis of sensitive access permissions. The experimental results indicate that the SVM algorithm achieves the highest performance with an accuracy of 92.3%, a precision of 91.5%, and an F1-score of 92.0%. Detailed feature analysis reveals that specific keywords related to legality and high-risk permissions (e.g., READ PHONE STATE) are the most significant discriminators. This system is proposed as a proactive, scalable solution to enhance digital trust in the religious tourism sector, potentially serving as a prototype for a national verification system.
[396] NRGPT: An Energy-based Alternative for GPT
Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine, Dmitry Krotov
Main category: cs.LG
TL;DR: The paper proposes NRGPT, a minimal modification of GPT that unifies it with energy-based modeling, treating inference as exploration on an energy landscape.
Details
Motivation: To bridge the gap between GPT architectures (dominant for language modeling) and energy-based modeling (which views inference as dynamical process on energy landscape), creating a unified framework.Method: Minimal modification of GPT setting to incorporate energy-based modeling principles, creating eNeRgy-GPT (NRGPT) where inference is conceptualized as exploration of tokens on energy landscape.
Result: NRGPT performs well on Shakespeare dataset, algebraic ListOPS tasks, and OpenWebText language modeling. Models show resistance to overfitting, only occurring during very long training. Theoretical proof and empirical verification that exploration becomes gradient descent under certain circumstances.
Conclusion: The proposed NRGPT successfully unifies GPT with energy-based modeling, demonstrating practical performance across different tasks while offering theoretical insights about the relationship between exploration and gradient descent in this framework.
Abstract: Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling (EBM) is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although such dynamics do not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, which occurs only during very long training.
[397] Pattern recognition in complex systems via vector-field representations of spatio-temporal data
Ingrid Amaranta Membrillo Solis, Maria van Rossem, Tristan Madeleine, Tetiana Orlova, Nina Podoliak, Giampaolo D’Alessandro, Jacek Brodzki, Malgosia Kaczmarek
Main category: cs.LG
TL;DR: A geometric framework using vector fields over discrete measure spaces with two-parameter metrics enables analysis of complex system spatio-temporal data for dimensionality reduction, mode decomposition, and attractor characterization.
Details
Motivation: Complex systems (brain, cells, climate, economy) have high-dimensional, non-linear dynamics that are challenging to model. Traditional methods struggle with the volume and complexity of spatio-temporal data from these systems.Method: Proposes a geometric framework based on vector fields over discrete measure spaces with a two-parameter family of metrics. Supports time-dependent images, image gradients, and functions on graphs/simplicial complexes.
Result: The metrics combined with multidimensional scaling effectively address analytical challenges: enable dimensionality reduction, mode decomposition, phase-space reconstruction, and attractor characterization in biological/physical systems on flat and curved domains.
Conclusion: Provides a robust pathway for understanding complex dynamical systems where traditional modeling is impractical but abundant experimental data exist, offering a data-driven geometric approach to complex system analysis.
Abstract: A complex system comprises multiple interacting entities whose interdependencies form a unified whole, exhibiting emergent behaviours not present in individual components. Examples include the human brain, living cells, soft matter, Earth’s climate, ecosystems, and the economy. These systems exhibit high-dimensional, non-linear dynamics, making their modelling, classification, and prediction particularly challenging. Advances in information technology have enabled data-driven approaches to studying such systems. However, the sheer volume and complexity of spatio-temporal data often hinder traditional methods like dimensionality reduction, phase-space reconstruction, and attractor characterisation. This paper introduces a geometric framework for analysing spatio-temporal data from complex systems, grounded in the theory of vector fields over discrete measure spaces. We propose a two-parameter family of metrics suitable for data analysis and machine learning applications. The framework supports time-dependent images, image gradients, and real- or vector-valued functions defined on graphs and simplicial complexes. We validate our approach using data from numerical simulations of biological and physical systems on flat and curved domains. Our results show that the proposed metrics, combined with multidimensional scaling, effectively address key analytical challenges. They enable dimensionality reduction, mode decomposition, phase-space reconstruction, and attractor characterisation. Our findings offer a robust pathway for understanding complex dynamical systems, especially in contexts where traditional modelling is impractical but abundant experimental data are available.
[398] MEPIC: Memory Efficient Position Independent Caching for LLM Serving
Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong, Shan Yizhou, Wei Zhang, Wang Lan, Ying Xiong, Yong Zhang, Zhenan Fan
Main category: cs.LG
TL;DR: MEPIC is a memory-efficient position-independent caching system that enables chunk-level KV cache reuse across positions, requests, and batches, reducing HBM usage by up to 2-5x without model changes.
Details
Motivation: LLM applications repeatedly process long prompt histories with shared content, creating significant pressure on KV cache memory. Existing caching solutions (prefix caching and position-independent caching) have limitations: prefix caching requires strict matching, while PIC suffers from KV divergence across requests and lacks page alignment, resulting in modest memory savings.Method: MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level (making only first block request-specific), removes positional encodings via RoPE fusion in attention kernel, and makes remaining blocks fully shareable across requests.
Result: MEPIC reduces HBM usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without requiring any model modifications.
Conclusion: MEPIC effectively addresses KV cache memory pressure in LLM applications by enabling efficient chunk-level reuse across positions and requests through paged alignment, block-level recomputation, and RoPE fusion, achieving significant memory savings.
Abstract: Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems, repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching partially alleviates some of these costs by reusing KV cache for previously processed tokens, but limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
[399] Tiny Recursive Control: Iterative Reasoning for Efficient Optimal Control
Amit Jain, Richard Linares
Main category: cs.LG
TL;DR: TRC is a neural control architecture that achieves high performance with only 1.5M parameters by using recursive iteration depth instead of massive parameter scaling, enabling millisecond inference with under 10MB memory for aerospace systems.
Details
Motivation: Neural network controllers are scaling to millions/billions of parameters, which is prohibitive for embedded aerospace systems with strict power and latency constraints. There's a need for efficient control architectures that can handle complex tasks without massive parameter counts.Method: TRC uses compact networks (≈1.5M parameters) applied repeatedly through a two-level hierarchical latent structure. It refines control sequences by simulating trajectories and correcting based on tracking error. The same weights process every refinement step, so adding iterations increases computation without increasing memory.
Result: TRC achieves near-optimal control costs on nonlinear problems (oscillator stabilization, powered descent with fuel constraints) while requiring only millisecond-scale inference on GPU and under 10MB memory - two orders of magnitude smaller than language model baselines.
Conclusion: Recursive reasoning, previously confined to discrete tasks, transfers effectively to continuous control synthesis. TRC demonstrates that capacity can emerge from iteration depth rather than parameter count, enabling efficient neural control for resource-constrained systems.
Abstract: Neural network controllers increasingly demand millions of parameters, and language model approaches push into the billions. For embedded aerospace systems with strict power and latency constraints, this scaling is prohibitive. We present Tiny Recursive Control (TRC), a neural architecture based on a counterintuitive principle: capacity can emerge from iteration depth rather than parameter count. TRC applies compact networks (approximately 1.5M parameters) repeatedly through a two-level hierarchical latent structure, refining control sequences by simulating trajectories and correcting based on tracking error. Because the same weights process every refinement step, adding iterations increases computation without increasing memory. We evaluate TRC on nonlinear control problems including oscillator stabilization and powered descent with fuel constraints. Across these domains, TRC achieves near-optimal control costs while requiring only millisecond-scale inference on GPU and under 10 MB memory, two orders of magnitude smaller than language model baselines. These results demonstrate that recursive reasoning, previously confined to discrete tasks, transfers effectively to continuous control synthesis.
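A sketch of the recursive-refinement pattern described above: the same small network is applied repeatedly, each time taking the current control sequence and the tracking error of its simulated rollout and outputting a correction. The toy dynamics, network size, and number of iterations are placeholders, not TRC's actual architecture.

```python
import torch
import torch.nn as nn

class Refiner(nn.Module):
    """Tiny network reused at every refinement step (weights shared across iterations)."""
    def __init__(self, horizon, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon + horizon * state_dim, 64), nn.ReLU(),
            nn.Linear(64, horizon),
        )

    def forward(self, controls, tracking_error):
        inp = torch.cat([controls, tracking_error.flatten(-2)], dim=-1)
        return controls + self.net(inp)          # additive correction to the control sequence

def rollout(x0, controls, dt=0.1):
    """Toy double-integrator dynamics standing in for the trajectory simulator."""
    states, x = [], x0
    for u in controls.unbind(-1):
        x = x + torch.stack([x[..., 1], u], dim=-1) * dt
        states.append(x)
    return torch.stack(states, dim=-2)           # (batch, horizon, state_dim)

horizon = 10
x0, target = torch.tensor([[1.0, 0.0]]), torch.zeros(1, horizon, 2)
refiner = Refiner(horizon, state_dim=2)
controls = torch.zeros(1, horizon)
for _ in range(5):                               # more iterations add compute, not memory
    error = target - rollout(x0, controls)
    controls = refiner(controls, error)
```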
[400] Meta-RL Induces Exploration in Language Agents
Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
Main category: cs.LG
TL;DR: LaMer is a Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time through cross-episode training and in-context policy adaptation.
Details
Motivation: RL-trained LLM agents struggle with tasks requiring active exploration and fail to efficiently adapt from trial-and-error experiences, limiting their ability to solve multi-turn long-horizon tasks effectively.Method: LaMer has two key components: (1) cross-episode training framework to encourage exploration and long-term rewards optimization, and (2) in-context policy adaptation via reflection, allowing agents to adapt their policy from task feedback without gradient updates.
Result: LaMer significantly improves performance over RL baselines with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop respectively, and demonstrates better generalization to more challenging or previously unseen tasks.
Conclusion: Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
Abstract: Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from the task feedback signal without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
[401] Failure Modes of Maximum Entropy RLHF
Ömer Veysel Çağatan, Barış Akgün
Main category: cs.LG
TL;DR: SimPO can be derived as Maximum Entropy RL, but while it works well offline, Maximum Entropy RL fails in online RLHF due to overoptimization and unstable KL dynamics.
Details
Motivation: To investigate why Simple Preference Optimization (SimPO) performs well in offline preference optimization and whether Maximum Entropy RL can achieve similar success in online RLHF settings.Method: Theoretical derivation showing SimPO as Maximum Entropy RL, followed by experimental investigation comparing Maximum Entropy RL in online RLHF settings against KL-constrained methods.
Result: Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics in online settings, failing to prevent reward hacking, unlike stable KL-constrained methods.
Conclusion: Reference-free approaches like SimPO face distinct challenges in online vs offline settings; Maximum Entropy RL struggles with overoptimization in online RLHF despite SimPO’s offline success.
Abstract: In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO’s strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.
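For reference, the SimPO objective discussed above is a length-normalized, reference-free preference loss; a minimal sketch is shown below, taking batches of summed sequence log-probabilities and lengths for the chosen and rejected responses, with beta and gamma as hyperparameters (the toy numbers are made up).

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected, beta=2.0, gamma=0.5):
    """SimPO: -log sigmoid(beta * (avg logp_chosen - avg logp_rejected) - gamma).

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths used for length normalization (no reference model needed).
    """
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up sequence log-probs and lengths.
loss = simpo_loss(torch.tensor([-50.0, -70.0]), torch.tensor([-80.0, -75.0]),
                  torch.tensor([25.0, 35.0]), torch.tensor([30.0, 30.0]))
```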
[402] Semi-Supervised Online Learning on the Edge by Transforming Knowledge from Teacher Models
Jiabin Xue
Main category: cs.LG
TL;DR: Proposes Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning to generate pseudo-labels for unseen data in Online Edge ML.
Details
Motivation: Existing Edge ML approaches assume static models trained centrally and deployed, making them ineffective against unseen data. Online Edge ML allows continuous training on edge devices, but faces the challenge of determining labels for truly future, unseen data points.Method: Knowledge Transformation (KT) - a hybrid approach that combines Knowledge Distillation, Active Learning, and causal reasoning. KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model.
Result: Simulation experiments with two setups: (1) less stable teacher model and (2) relatively more stable teacher model. Results show that when a stable teacher model is given, the student model can eventually reach its expected maximum performance.
Conclusion: KT is beneficial for scenarios where: (1) teacher’s task is generic (existing pre-trained models may be adequate), and/or (2) labels for student’s task are difficult or expensive to acquire. The method addresses the key challenge of labeling unseen data in Online Edge ML.
Abstract: Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: “How to determine labels for truly future, unseen data points”. We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) using a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial for scenarios that meet the following circumstances: (1) when the teacher’s task is generic, which means existing pre-trained models might be adequate for its task, so there will be no need to train the teacher model from scratch; and/or (2) when the label for the student’s task is difficult or expensive to acquire.
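A sketch of the oracle role KT plays: the teacher labels incoming unlabeled edge data, only confident predictions are kept as pseudo-labels, and the student is updated online on those. The confidence threshold and the tiny linear models are placeholders, and the causal-reasoning component of KT is not shown.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(teacher, student, optimizer, x_unlabeled, conf_threshold=0.9):
    """One online update of the student on teacher-generated pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(teacher(x_unlabeled), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= conf_threshold            # discard low-confidence teacher outputs
    if keep.sum() == 0:
        return None
    loss = F.cross_entropy(student(x_unlabeled[keep]), pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with small linear models standing in for teacher and student.
teacher = torch.nn.Linear(16, 3)
student = torch.nn.Linear(16, 3)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
print(pseudo_label_step(teacher, student, opt, torch.randn(32, 16), conf_threshold=0.4))
```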
[403] Sequencing to Mitigate Catastrophic Forgetting in Continual Learning
Hesham G. Moussa, Aroosa Hameed, Arashmid Akhavain
Main category: cs.LG
TL;DR: The paper proposes using intelligent task sequencing to mitigate catastrophic forgetting in continual learning, leveraging zero-shot scoring algorithms inspired by neural architecture search.
Details
Motivation: Catastrophic forgetting is a major challenge in continual learning where learning new tasks dramatically reduces performance on previously learned ones. Most existing approaches fall into five categories, but the authors approach the problem from a different angle by considering task sequencing.Method: The method focuses on determining optimal task sequencing order using zero-shot scoring algorithms inspired by neural architecture search (NAS). This approach considers how tasks should be presented to models to minimize forgetting.
Result: Results show that intelligent task sequencing can substantially reduce catastrophic forgetting. When combined with traditional continual learning strategies, sequencing offers enhanced performance and improved robustness against forgetting.
Conclusion: Task sequencing is an effective approach to mitigate catastrophic forgetting in continual learning, and the proposed methods can also find applications in other fields like curriculum learning.
Abstract: To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as Continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting is a major challenge to the progress of Continual Learning approaches, where learning a new task usually results in a dramatic performance drop on previously learned ones. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.
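A sketch of the sequencing idea: each candidate task order receives a cheap proxy score and the highest-scoring order is used for training. The difficulty-based scorer below is purely hypothetical; the paper's actual approach uses zero-shot, NAS-inspired scoring of the model rather than hand-assigned difficulties.

```python
from itertools import permutations

def order_score(order, task_difficulty):
    """Hypothetical proxy: favour orders that progress from easier to harder tasks.

    A real system would score candidate orders with data-free, NAS-style zero-shot
    metrics evaluated on the model, not with hand-assigned difficulty values.
    """
    return -sum(max(0.0, task_difficulty[a] - task_difficulty[b])
                for a, b in zip(order, order[1:]))

task_difficulty = {"digits": 1.0, "fashion": 2.0, "natural_images": 3.5}
best_order = max(permutations(task_difficulty),
                 key=lambda o: order_score(o, task_difficulty))
print(best_order)   # the order in which tasks would then be presented to the learner
```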
[404] Wrist Photoplethysmography Predicts Dietary Information
Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro
Main category: cs.LG
TL;DR: Wearable PPG signals contain dietary information that can predict meal content, improving dietary intake and satiety prediction by 11% AUC.
Details
Motivation: To determine if wearable photoplethysmography (PPG) contains dietary information that could enable passive dietary monitoring, as current dietary tracking methods are burdensome and inaccurate.Method: Trained a language model on 1.1 million meals to predict meal descriptions from PPG signals, aligning PPG data to text descriptions. Analyzed how predictability changes with temporal distance from meals.
Result: PPG nontrivially predicts meal content; predictability decreases for PPGs farther from meals. PPG increases AUC by 11% for dietary intake and satiety prediction across held-out and independent cohorts, with gains robust to text degradation.
Conclusion: Wearable PPG may enable passive dietary monitoring, offering a less burdensome alternative to current dietary tracking methods.
Abstract: Whether wearable photoplethysmography (PPG) contains dietary information remains unknown. We trained a language model on 1.1M meals to predict meal descriptions from PPG, aligning PPG to text. PPG nontrivially predicts meal content; predictability decreases for PPGs farther from meals. This transfers to dietary tasks: PPG increases AUC by 11% for intake and satiety across held-out and independent cohorts, with gains robust to text degradation. Wearable PPG may enable passive dietary monitoring.
[405] Training Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies
Astrid Brull, Sara Aguti, Véronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Joaquin Del-Rio, Oleksii Sliusarenko, Haiyan Zhou, Francesco Muntoni, Carsten G. Bönnemann, Xabi Uribe-Etxebarria
Main category: cs.LG
TL;DR: Federated Learning enables collaborative ML for rare disease diagnosis across institutions while preserving data privacy, achieving superior performance over single-institution models for collagen VI-related dystrophies classification.
Details
Motivation: ML diagnosis of rare diseases like collagen VI-related dystrophies is limited by data scarcity and fragmentation across institutions. Privacy regulations and logistical barriers prevent data sharing, creating a need for privacy-preserving collaborative approaches.Method: Used Federated Learning (Sherpa.ai FL platform) to train ML models across distributed datasets from two international organizations. Analyzed collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures to classify three pathogenic mechanism groups.
Result: Achieved F1-score of 0.82 for classifying collagen VI images into three pathogenic groups (exon skipping, glycine substitution, pseudoexon insertion), outperforming single-organization models (0.57-0.75). FL substantially improved diagnostic utility and generalizability.
Conclusion: Federated Learning enables accurate rare disease diagnosis while maintaining data privacy, with potential applications for variant interpretation and sequencing strategy prioritization. The approach demonstrates superior performance over isolated institutional models.
Abstract: The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
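The privacy-preserving training pattern here is standard federated averaging: each site trains locally and only model weights are aggregated. A minimal sketch (two clients, sample-size-weighted averaging) is given below, independent of the Sherpa.ai platform actually used in the study.

```python
import torch

def fedavg(client_state_dicts, client_sizes):
    """Sample-size-weighted average of client model weights (FedAvg aggregation)."""
    total = sum(client_sizes)
    avg = {}
    for key in client_state_dicts[0]:
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(client_state_dicts, client_sizes))
    return avg

# Toy usage: two institutions holding different numbers of labeled images.
model_a = torch.nn.Linear(10, 3)
model_b = torch.nn.Linear(10, 3)
global_weights = fedavg([model_a.state_dict(), model_b.state_dict()], client_sizes=[120, 80])
global_model = torch.nn.Linear(10, 3)
global_model.load_state_dict(global_weights)
```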
[406] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto
Main category: cs.LG
TL;DR: Gated KalmaNet (GKA) is a new layer that maintains full past context like Kalman Filter while achieving SSM efficiency, outperforming existing SSMs on both short and long-context tasks.
Details
Motivation: Linear State-Space Models (SSMs) are efficient but maintain only lossy summaries of the past, leading to inferior performance in recall-oriented tasks. There's a need for methods that account for the full past while maintaining SSM-style efficiency.Method: GKA is grounded in Kalman Filter framework for optimal inference. It computes exact Kalman gain by maintaining full error covariance (unlike existing SSMs that assume identity covariance). Uses steady-state assumption for parallelization, reducing to online ridge regression. Addresses numerical instability with adaptive regularization via input-dependent gating and Chebyshev Iteration for low-precision stability. Includes hardware-aware chunk-wise kernels.
Result: GKA outperforms existing SSM layers (Mamba2, Gated DeltaNet) on short-context tasks and achieves >10% relative improvement on long-context RAG and LongQA tasks up to 128k tokens.
Conclusion: GKA successfully bridges the gap between maintaining full past context and computational efficiency, providing a principled Kalman Filter-based approach that outperforms existing SSM approximations while maintaining linear compute and constant memory.
Abstract: As efficient alternatives to softmax Attention, linear State-Space Models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that accounts for the full past while maintaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF) framework, which provides a principled solution for optimal inference in dynamical systems. We show that several existing SSM layers (DeltaNet, Gated DeltaNet, and Kimi Delta Attention) are approximations to the KF recurrence that assume identity error covariance, thereby ignoring how past measurements (keys and values) should optimally influence state updates. In contrast, GKA computes the exact Kalman gain by maintaining the full error covariance. Under a steady-state assumption that enables parallelization, this reduces to solving an online ridge regression problem with constant memory and linear compute cost. A critical insight is that standard KF equations are numerically unstable in low-precision environments (like bfloat16) and hard to parallelize on modern hardware. We address this through: (1) adaptive regularization with input-dependent gating to control the condition number of the ridge regression for numerical stability, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low-precision settings. We further develop hardware-aware chunk-wise kernels to enable efficient training. Empirically, GKA outperforms existing SSM layers (like Mamba2 and Gated DeltaNet) on short-context tasks and achieves more than 10% relative improvement on long-context RAG and LongQA tasks up to 128k tokens.
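The steady-state reduction described above amounts to online ridge regression from keys to values; the sketch below is in that spirit (single head, direct linear solve rather than the paper's Chebyshev iteration or chunk-wise kernels), with a fixed ridge strength standing in for the input-dependent gating.

```python
import torch

def ridge_recall(keys, values, queries, lam=1e-2):
    """Read out values by solving a ridge regression from keys to values.

    keys: (T, d_k), values: (T, d_v), queries: (Q, d_k).
    Equivalent to attending with weights q @ (K^T K + lam I)^{-1} K^T.
    """
    d_k = keys.shape[-1]
    gram = keys.t() @ keys + lam * torch.eye(d_k)
    w = torch.linalg.solve(gram, keys.t() @ values)   # (d_k, d_v) regression weights
    return queries @ w

# Toy usage: approximately recall stored values from noisy copies of their keys
# (the recall error shrinks as lam -> 0 when T < d_k).
T, d_k, d_v = 12, 16, 8
keys, values = torch.randn(T, d_k), torch.randn(T, d_v)
queries = keys[:4] + 0.01 * torch.randn(4, d_k)
print((ridge_recall(keys, values, queries) - values[:4]).norm())
```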
[407] Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning
Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
Main category: cs.LG
TL;DR: Posterior Behavioral Cloning (PostBC) improves RL finetuning by modeling demonstrator’s posterior distribution instead of exact action matching, ensuring better coverage for effective RL initialization.
Details
Motivation: Standard pretraining with behavioral cloning often fails to provide effective initialization for RL finetuning due to insufficient coverage of demonstrator's actions, limiting downstream performance improvements.Method: Instead of exact action matching (BC), train policy to model posterior distribution of demonstrator’s behavior given dataset. This ensures coverage over demonstrator’s actions while maintaining pretraining performance.
Result: PostBC enables significantly improved RL finetuning performance on robotic control benchmarks and real-world manipulation tasks compared to standard BC, while being implementable with modern generative models.
Conclusion: Modeling posterior distribution during pretraining provides better initialization for RL finetuning than standard behavioral cloning, addressing coverage issues without sacrificing pretraining performance.
Abstract: Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) – which trains a policy to directly match the actions played by the demonstrator – can fail to ensure coverage over the demonstrator’s actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator’s behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator’s actions, enabling more effective finetuning. Furthermore, this policy – which we refer to as the posterior behavioral cloning (PostBC) policy – achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains – relying only on standard supervised learning – and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
[408] Bandits with Preference Feedback: A Stackelberg Game Perspective
Barna PĂĄsztor, Parnian Kassraie, Andreas Krause
Main category: cs.LG
TL;DR: MAXMINLCB algorithm for bandits with preference feedback in infinite domains with nonlinear rewards, using Stackelberg game formulation and novel confidence sequences for kernelized logistic estimators.
Details
Motivation: Bandits with preference feedback are useful for optimization when only pairwise comparisons are available (e.g., human feedback for fine-tuning LLMs), but existing methods are limited to linear functions or finite small domains, restricting practical applications.
Method: Proposed MAXMINLCB algorithm that treats the exploration-exploitation trade-off as a zero-sum Stackelberg game, selecting action pairs that are both informative and yield favorable rewards. Introduces novel preference-based confidence sequences for kernelized logistic estimators.
Result: MAXMINLCB consistently outperforms existing algorithms and provides an anytime-valid rate-optimal regret guarantee.
Conclusion: The approach successfully extends bandits with preference feedback to infinite domains with nonlinear rewards, addressing the dual-level exploration-exploitation challenge through game-theoretic formulation and novel statistical tools.
Abstract: Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.
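A toy caricature of the Stackelberg-style pair selection, assuming precomputed mean estimates `mu[i, j]` and confidence widths `beta[i, j]` for the probability that arm i is preferred to arm j: the leader maximizes its pessimistic (LCB) value against the follower's best response. This shows only the selection rule in miniature, not the kernelized confidence sequences from the paper.

```python
import numpy as np

def maxmin_lcb_pair(mu, beta):
    """Pick an action pair via a max-min over lower confidence bounds.

    mu[i, j]:   estimated probability that arm i is preferred to arm j
    beta[i, j]: confidence width of that estimate
    """
    lcb = mu - beta                         # pessimistic preference estimates
    follower_value = lcb.min(axis=1)        # follower best-responds (worst case for leader)
    leader = int(follower_value.argmax())   # leader maximizes its worst-case value
    follower = int(lcb[leader].argmin())    # follower's response to the chosen leader
    return leader, follower

rng = np.random.default_rng(1)
n = 5
mu = rng.uniform(0.3, 0.7, size=(n, n))
beta = rng.uniform(0.05, 0.2, size=(n, n))
print(maxmin_lcb_pair(mu, beta))  # pair of arm indices to query for a preference
```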
[409] Neural networks for dengue forecasting: a systematic review
Luiza Lober, Francisco A. Rodrigues, Kirstin Roster
Main category: cs.LG
TL;DR: Systematic review finds neural networks provide competitive dengue forecasts, with shallow feed-forward networks using historical incidence and climate data being most common, though deep networks remain underexplored.
Details
Motivation: Early dengue forecasting is crucial for disease mitigation, and neural networks have shown promise in public health applications. This review aims to inform future model design by systematically examining how neural networks have been applied in dengue forecasting literature.
Method: Conducted PRISMA-guided systematic review of 62 studies using neural networks for dengue forecasting in human populations. Analyzed relative performance against comparators, network architectures, hyperparameters, input features, geographic distribution, and model transparency.
Result: Most studies used shallow feed-forward networks with historical dengue incidence and climate variables at city/subdivision level with weekly data. 63% included comparator models (like tree-based models), with neural networks performing best in about half of those cases. Prediction horizons and evaluation approaches varied widely.
Conclusion: Neural networks offer competitive dengue forecasts and should be included in candidate models for future prediction efforts. Deep networks, broader input features, and adaptation to structural data changes present promising research avenues.
Abstract: Background: Early forecasts of dengue are an important tool for disease mitigation. Neural networks are powerful predictive models that have made contributions to many areas of public health. In this study, we reviewed the application of neural networks in the dengue forecasting literature, with the objective of informing model design for future work. Methods: Following PRISMA guidelines, we conducted a systematic search of studies that use neural networks to forecast dengue in human populations. We summarized the relative performance of neural networks and comparator models, architectures and hyper-parameters, choices of input features, geographic spread, and model transparency. Results: Sixty two papers were included. Most studies implemented shallow feed-forward neural networks, using historical dengue incidence and climate variables. Prediction horizons varied greatly, as did the model selection and evaluation approach. Building on the strengths of neural networks, most studies used granular observations at the city level, or on its subdivisions, while also commonly employing weekly data. Performance of neural networks relative to comparators, such as tree-based supervised models, varied across study contexts, and we found that 63% of all studies do include at least one such model as a baseline, and in those cases about half of the studies report neural networks as the best performing model. Conclusions: The studies suggest that neural networks can provide competitive forecasts for dengue, and can reliably be included in the set of candidate models for future dengue prediction efforts. The use of deep networks is relatively unexplored but offers promising avenues for further research, as does the use of a broader set of input features and prediction in light of structural changes in the data generation mechanism.
[410] From Logits to Hierarchies: Hierarchical Clustering made Simple
Emanuele Palumbo, Moritz Vandenhirtz, Alain Ryser, Imant Daunhawer, Julia E. Vogt
Main category: cs.LG
TL;DR: A lightweight method that builds on pre-trained non-hierarchical clustering models outperforms specialized deep hierarchical clustering methods in efficiency, scalability and performance.
Details
Motivation: Hierarchical structures are crucial in real-world datasets, but existing deep hierarchical clustering methods face scalability and performance limitations on realistic datasets.
Method: A lightweight approach that builds on pre-trained non-hierarchical clustering models, applicable to any model that outputs logits without requiring fine-tuning. Also extends to supervised settings using pre-trained classifiers.
Result: The approach outperforms specialized deep models for hierarchical clustering, demonstrates broad applicability, and can recover meaningful hierarchies from pre-trained ImageNet classifiers.
Conclusion: Provides a practical and effective alternative to existing deep hierarchical clustering methods with significant advantages in efficiency, scalability and performance.
Abstract: The hierarchical structure inherent in many real-world datasets makes the modeling of such hierarchies a crucial objective in both unsupervised and supervised machine learning. While recent advancements have introduced deep architectures specifically designed for hierarchical clustering, we adopt a critical perspective on this line of research. Our findings reveal that these methods face significant limitations in scalability and performance when applied to realistic datasets. Given these findings, we present an alternative approach and introduce a lightweight method that builds on pre-trained non-hierarchical clustering models. Remarkably, our approach outperforms specialized deep models for hierarchical clustering, and it is broadly applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our approach, we extend its application to a supervised setting, demonstrating its ability to recover meaningful hierarchies from a pre-trained ImageNet classifier. Our results establish a practical and effective alternative to existing deep hierarchical clustering methods, with significant advantages in efficiency, scalability and performance.
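One plausible reading of building a hierarchy on top of a flat clustering model's logits (an assumption for illustration, not necessarily the authors' exact procedure): summarize each predicted cluster by its average logit profile and run standard agglomerative linkage over those profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def hierarchy_from_logits(logits):
    """Build a cluster hierarchy from the logits of a flat clustering model.

    logits: (n_samples, n_clusters) raw outputs of any pre-trained clustering model.
    Returns a SciPy linkage matrix (usable with scipy.cluster.hierarchy.dendrogram).
    """
    labels = logits.argmax(axis=1)
    n_clusters = logits.shape[1]
    # Each predicted cluster is summarized by its mean logit profile.
    profiles = np.stack([
        logits[labels == k].mean(axis=0) if (labels == k).any() else logits.mean(axis=0)
        for k in range(n_clusters)
    ])
    return linkage(profiles, method="average")  # agglomerative merge tree over clusters

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(200, 8)) + 3.0 * np.eye(8)[rng.integers(0, 8, size=200)]
Z = hierarchy_from_logits(fake_logits)
print(Z.shape)  # (n_clusters - 1, 4) merge table
```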
[411] Optimization with Access to Auxiliary Information
El Mahdi Chayti, Sai Praneeth Karimireddy
Main category: cs.LG
TL;DR: Paper proposes two generic algorithms for optimizing expensive target functions using cheap auxiliary functions, applicable to settings like SGD batch reuse, transfer learning, federated learning, and compressed training.
Details
Motivation: Many practical optimization problems involve target functions with expensive or limited gradient computations, while having access to cheaper auxiliary functions. This occurs in scenarios like re-using batches in SGD, transfer learning, federated learning, and training with compressed models/dropout.
Method: Two generic new algorithms that leverage auxiliary side functions with cheap gradients to optimize expensive target functions. The approach relies on Hessian similarity assumptions between target and side functions.
Result: The framework provides benefits when the Hessian similarity measure between target and side functions is small. Additionally, stochasticity can provide benefits when auxiliary noise is correlated with target function noise.
Conclusion: The proposed generic algorithms enable efficient optimization of expensive target functions by leveraging cheap auxiliary functions, with proven benefits under Hessian similarity assumptions and potential advantages from correlated stochastic noise.
Abstract: We investigate the fundamental optimization question of minimizing a target function $f$, whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, et cetera. We propose two generic new algorithms that apply in all these settings; we also prove that we can benefit from this framework under the Hessian similarity assumption between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
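A hedged sketch of one natural algorithm in this setting, my illustration of the Hessian-similarity intuition rather than either of the paper's two methods: take many cheap steps on the side function h, re-anchored each round by a bias correction from a single expensive gradient of the target f.

```python
import numpy as np

def optimize_with_auxiliary(grad_f, grad_h, w0, outer_steps=20, inner_steps=10, lr=0.1):
    """Minimize f using mostly gradients of a cheap side function h.

    Each outer step spends ONE expensive gradient of f at an anchor point, then takes
    several corrected cheap steps  w <- w - lr * (grad_h(w) + grad_f(a) - grad_h(a)),
    which is exact at the anchor and accurate when the Hessians of f and h are similar.
    """
    w = np.array(w0, dtype=float)
    for _ in range(outer_steps):
        anchor = w.copy()
        correction = grad_f(anchor) - grad_h(anchor)   # the single expensive gradient
        for _ in range(inner_steps):
            w -= lr * (grad_h(w) + correction)
    return w

# Toy check: f and h are nearby quadratics; the minimizer of f is recovered.
A = np.diag([1.0, 3.0])
grad_f = lambda w: A @ (w - np.array([1.0, -2.0]))
grad_h = lambda w: (A + 0.1 * np.eye(2)) @ (w - np.array([0.5, 0.0]))  # cheap, biased proxy
print(optimize_with_auxiliary(grad_f, grad_h, w0=[0.0, 0.0]))          # close to [1.0, -2.0]
```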
[412] Four-hour thunderstorm nowcasting using a deep diffusion model of satellite data
Kuai Dai, Xutao Li, Junying Fang, Yunming Ye, Demin Yu, Hui Su, Di Xian, Danyu Qin, Jingsong Wang
Main category: cs.LG
TL;DR: Deep diffusion model for satellite data (DDMS) achieves 4-hour convection nowcasting with planetary-scale coverage using AI and geostationary satellite data.
Details
Motivation: Convection (thunderstorms) develop rapidly and cause significant damage, but current AI-based nowcasting methods have limited lead time and coverage that don't meet disaster emergency response needs.
Method: Proposed DDMS uses diffusion processes to simulate spatiotemporal evolution of convective clouds, combining geostationary satellite brightness temperature data with meteorological domain knowledge for planetary-scale forecasting.
Result: Achieves effective convection nowcasting up to 4 hours with broad coverage (~20M km²), high accuracy, and high resolution (15 min; 4 km), outperforming existing models and reaching new performance heights.
Conclusion: DDMS demonstrates diffusion models’ remarkable capabilities for convective cloud forecasting and highlights the value of geostationary satellite data when empowered by AI, with potential for global nowcasting through multi-satellite collaboration.
Abstract: Convection (thunderstorm) develops rapidly within hours and is highly destructive, posing a significant challenge for nowcasting and resulting in substantial losses to infrastructure and society. After the emergence of artificial intelligence (AI)-based methods, convection nowcasting has experienced rapid advancements, with its performance surpassing that of physics-based numerical weather prediction and other conventional approaches. However, the lead time and coverage of it still leave much to be desired and hardly meet the needs of disaster emergency response. Here, we propose a deep diffusion model for satellite data (DDMS) to establish an AI-based convection nowcasting system. Specifically, DDMS employs diffusion processes to effectively simulate complicated spatiotemporal evolution patterns of convective clouds, achieving more accurate forecasts of convective growth and dissipation over longer lead times. Additionally, it combines geostationary satellite brightness temperature data and domain knowledge from meteorological experts, thereby achieving planetary-scale forecast coverage. During long-term tests and objective validation based on the FengYun-4A satellite, our system achieves, for the first time, effective convection nowcasting up to 4 hours, with broad coverage (about 20,000,000 km2), remarkable accuracy, and high resolution (15 minutes; 4 km). Its performance reaches a new height in convection nowcasting compared to the existing models. In terms of application, our system is highly transferable with the potential to collaborate with multiple satellites for global convection nowcasting. Furthermore, our results highlight the remarkable capabilities of diffusion models in convective clouds forecasting, as well as the significant value of geostationary satellite data when empowered by AI technologies.
[413] Proactive Model Adaptation Against Concept Drift for Online Time Series Forecasting
Lifan Zhao, Yanyan Shen
Main category: cs.LG
TL;DR: Proactive model adaptation framework (Proceed) addresses concept drift in online time series forecasting by estimating drift between training and test samples and generating parameter adjustments before forecasting.
Details
Motivation: Existing online learning methods for time series forecasting overlook the temporal gap between training samples and test samples caused by delayed ground-truth availability, which can introduce concept drift and cause models to adapt to outdated concepts.
Method: Proceed estimates concept drift between recently used training samples and current test sample, then uses an adaptation generator to translate estimated drift into parameter adjustments, proactively adapting the model. The framework is trained on synthetic diverse concept drifts to enhance generalization.
Result: Extensive experiments on five real-world datasets across various forecast models show Proceed brings more performance improvements than state-of-the-art online learning methods, significantly enhancing forecast models’ resilience against concept drifts.
Conclusion: Proceed effectively addresses the temporal gap problem in online time series forecasting by proactively adapting models to concept drift before forecasting, outperforming existing methods and improving model robustness against distribution shifts.
Abstract: Time series forecasting always faces the challenge of concept drift, where data distributions evolve over time, leading to a decline in forecast model performance. Existing solutions are based on online learning, which continually organize recent time series observations as new training samples and update model parameters according to the forecasting feedback on recent data. However, they overlook a critical issue: obtaining ground-truth future values of each sample should be delayed until after the forecast horizon. This delay creates a temporal gap between the training samples and the test sample. Our empirical analysis reveals that the gap can introduce concept drift, causing forecast models to adapt to outdated concepts. In this paper, we present Proceed, a novel proactive model adaptation framework for online time series forecasting. Proceed first estimates the concept drift between the recently used training samples and the current test sample. It then employs an adaptation generator to efficiently translate the estimated drift into parameter adjustments, proactively adapting the model to the test sample. To enhance the generalization capability of the framework, Proceed is trained on synthetic diverse concept drifts. Extensive experiments on five real-world datasets across various forecast models demonstrate that Proceed brings more performance improvements than the state-of-the-art online learning methods, significantly facilitating forecast models’ resilience against concept drifts. Code is available at https://github.com/SJTU-DMTai/OnlineTSF.
[414] Online Bandits with (Biased) Offline Data: Adaptive Learning under Distribution Mismatch
Wang Chi Cheung, Lixing Lyu
Main category: cs.LG
TL;DR: The paper studies how offline historical data can enhance online learning in stochastic bandits, proposing MIN-UCB and MIN-COMB-UCB policies that adaptively use offline data when informative, with tight regret bounds.
Details
Motivation: Real-world applications often have access to historical datasets that could potentially improve online learning processes, but traditional online learning models are typically initialized from scratch. The authors want to leverage offline data to facilitate online learning in stochastic bandits, even when the offline and online distributions differ.
Method: The authors propose MIN-UCB for multi-armed bandits and MIN-COMB-UCB for combinatorial bandits. These policies adaptively choose to utilize offline data when deemed informative (based on an upper bound on distribution differences) and ignore it otherwise. They establish theoretical regret bounds for both policies.
Result: The authors show that without a non-trivial upper bound on distribution differences, no non-anticipatory policy can outperform classical UCB even with offline data. However, with such a bound, MIN-UCB outperforms UCB and achieves tight regret bounds in both instance-independent and instance-dependent settings. They also provide corresponding bounds for MIN-COMB-UCB in combinatorial bandits.
Conclusion: Offline data can significantly enhance online learning in bandit problems when there’s a known bound on distribution differences between offline and online environments. The proposed adaptive policies effectively leverage informative offline data while ignoring uninformative data, with validated theoretical guarantees and practical applications.
Abstract: Traditional online learning models are typically initialized from scratch. By contrast, contemporary real-world applications often have access to historical datasets that can potentially enhance the online learning processes. We study how offline data can be leveraged to facilitate online learning in stochastic multi-armed bandits and combinatorial bandits. In our study, the probability distributions that govern the offline data and the online rewards can be different. We first show that, without a non-trivial upper bound on their difference, no non-anticipatory policy can outperform the classical Upper Confidence Bound (UCB) policy, even with access to offline data. In complement, we propose an online policy MIN-UCB for multi-armed bandits. MIN-UCB outperforms the UCB when such an upper bound is available. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. We establish that MIN-UCB achieves tight regret bounds, in both instance independent and dependent settings. We generalize our approach to the combinatorial bandit setting by introducing MIN-COMB-UCB, and we provide corresponding instance dependent and instance independent regret bounds. We illustrate how various factors, such as the biases and the size of offline datasets, affect the utility of offline data in online learning. We discuss several applications and conduct numerical experiments to validate our findings.
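A compact sketch of the min-of-two-indices idea for a single arm, assuming a known upper bound `V` on how far the offline mean can deviate from the online mean (the non-trivial bias bound the paper requires): the arm index is the smaller, hence tighter, of the online-only UCB and a pooled UCB inflated by `V`. The confidence radii and names are illustrative.

```python
import numpy as np

def min_ucb_index(online_rewards, offline_rewards, V, delta=1e-2):
    """Index of one arm: min of the online-only UCB and a pooled UCB plus bias bound V."""
    n_on, n_all = len(online_rewards), len(online_rewards) + len(offline_rewards)
    radius = lambda n: np.sqrt(2 * np.log(1 / delta) / max(n, 1))  # Hoeffding-style width
    ucb_online = np.mean(online_rewards) + radius(n_on) if n_on else np.inf
    pooled = np.concatenate([online_rewards, offline_rewards])
    ucb_pooled = np.mean(pooled) + radius(n_all) + V               # inflate by the known bias bound
    return min(ucb_online, ucb_pooled)

# With few online pulls and a small bias bound, the pooled index is the tighter one.
rng = np.random.default_rng(0)
print(min_ucb_index(rng.normal(0.5, 1.0, 3), rng.normal(0.45, 1.0, 500), V=0.1))
```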
[415] Multi-Modality Collaborative Learning for Sentiment Analysis
Shanmin Wang, Chengguang Liu, Qingshan Liu
Main category: cs.LG
TL;DR: MMCL framework improves multimodal sentiment analysis by decoupling modalities into common and specific components, using reinforcement learning-inspired policy models to mine complementary features and intra-modal attention to enhance common representations.
Details
Motivation: Existing multimodal sentiment analysis methods struggle with modality heterogeneity, limiting effective capture of interactive sentiment features across visual, audio, and text modalities.
Method: Multi-Modality Collaborative Learning (MMCL) framework with: 1) parameter-free decoupling module separating uni-modality into modality-common and modality-specific components via cross-modal semantics assessment; 2) reinforcement learning-inspired policy models to adaptively mine complementary sentiment features from modality-specific representations; 3) intra-modal attention to highlight crucial components in modality-common representations.
Result: Superior performance on four databases, effectiveness verification of each module, and successful assessment of complementary features, demonstrating significant performance improvement.
Conclusion: MMCL successfully learns collaborative features across modalities by effectively handling modality heterogeneity through decoupling common and specific components, leading to enhanced multimodal sentiment analysis performance.
Abstract: Multimodal sentiment analysis (MSA) identifies individuals’ sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture of interactive sentiment features across modalities. In this paper, by introducing a Multi-Modality Collaborative Learning (MMCL) framework, we facilitate cross-modal interactions and capture enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module and separate uni-modality into modality-common and modality-specific components through semantics assessment of cross-modal elements. For modality-specific representations, inspired by the act-reward mechanism in reinforcement learning, we design policy models to adaptively mine complementary sentiment features under the guidance of a joint reward. For modality-common representations, intra-modal attention is employed to highlight crucial components, playing enhanced roles among modalities. Experimental results, including superiority evaluations on four databases, effectiveness verification of each module, and assessment of complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code is available at https://github.com/smwanghhh/MMCL.
[416] Models That Prove Their Own Correctness
Noga Amit, Shafi Goldwasser, Orr Paradise, Guy Rothblum
Main category: cs.LG
TL;DR: Self-Proving models that generate interactive proofs of their output correctness, with verifier detection of all incorrect outputs.
Details
Motivation: Traditional model accuracy only provides average guarantees over input distributions, leaving uncertainty about correctness on any specific input of interest. There's a need for models that can prove correctness of individual outputs.
Method: Two learning approaches: 1) Transcript Learning (TL) using accepting interaction transcripts, and 2) Reinforcement Learning from Verifier Feedback (RLVF) training by emulating verifier interactions.
Result: Self-Proving models that with high probability generate correct outputs and successfully prove their correctness to a verifier, while the verifier detects all incorrect outputs from any model.
Conclusion: Theoretical framework for training models that provide verifiable proofs of correctness for individual outputs, addressing the limitation of average-case accuracy guarantees.
Abstract: How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to $V$. The soundness property of $V$ guarantees that, for every input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by $V$. We devise and analyze two generic methods for learning Self-Proving models: Transcript Learning (TL) which relies on access to transcripts of accepting interactions, and Reinforcement Learning from Verifier Feedback (RLVF) which trains a model by emulating interactions with the verifier.
[417] PILA: Physics-Informed Low Rank Augmentation for Interpretable Earth Observation
Yihang She, Andrew Blake, Clement Atzberger, Adriano Gualandi, Srinivasan Keshav
Main category: cs.LG
TL;DR: PILA (Physics-Informed Low-Rank Augmentation) is a framework that augments incomplete physical models using learnable low-rank residuals to improve flexibility while staying close to governing physics, achieving better accuracy and interpretability in Earth Observation inverse problems.
Details
Motivation: Existing physical models for Earth Observation are often simplified and incomplete, leading to discrepancies between simulation and observations that hinder reliable forward model inversion. Current approaches either ignore this incompleteness, rely on case-specific preprocessing, or use physics-informed autoencoders with difficult-to-interpret auxiliary variables and hard-to-balance regularizers.
Method: PILA augments incomplete physical models using a learnable low-rank residual. This approach improves model flexibility while remaining close to the governing physics, without requiring difficult-to-interpret auxiliary variables or multiple hard-to-balance regularizers.
Result: PILA was evaluated on two EO inverse problems: forest radiative transfer inversion from optical remote sensing, and volcanic deformation inversion from GNSS displacement data. For forest spectral inversion, it improved tree species separation and reduced prediction errors by 40-71% relative to state-of-the-art. For volcanic deformation, it captured a major inflation event at Akutan volcano in 2008 and estimated source parameters consistent with prior studies that required substantial additional preprocessing.
Conclusion: PILA offers an effective general pathway for inverting incomplete physical models, yielding more accurate and interpretable physical variables across different domains. The framework may be applicable beyond Earth Observation, and analysis of model rank, observability, and physical priors provides insights for broader applications.
Abstract: Physically meaningful representations are essential for Earth Observation (EO), yet existing physical models are often simplified and incomplete. This leads to discrepancies between simulation and observations that hinder reliable forward model inversion. Common approaches to EO inversion either ignore this incompleteness or rely on case-specific preprocessing. More recent methods use physics-informed autoencoders but depend on auxiliary variables that are difficult to interpret and multiple regularizers that are difficult to balance. We propose Physics-Informed Low-Rank Augmentation (PILA), a framework that augments incomplete physical models using a learnable low-rank residual to improve flexibility, while remaining close to the governing physics. We evaluate PILA on two EO inverse problems involving diverse physical processes: forest radiative transfer inversion from optical remote sensing; and volcanic deformation inversion from Global Navigation Satellite Systems (GNSS) displacement data. Across different domains, PILA yields more accurate and interpretable physical variables. For forest spectral inversion, it improves the separation of tree species and, compared to ground measurements, reduces prediction errors by 40-71% relative to the state-of-the-art. For volcanic deformation, PILA’s recovery of variables captures a major inflation event at the Akutan volcano in 2008, and estimates source depth, volume change, and displacement patterns that are consistent with prior studies, which however required substantial additional preprocessing. Finally, we analyse the effects of model rank, observability, and physical priors, and suggest that PILA may offer an effective general pathway for inverting incomplete physical models even beyond the domain of Earth Observation. The code is available at https://github.com/yihshe/PILA.git.
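A schematic sketch of augmenting an incomplete forward model with a learnable low-rank residual, as described above. The stand-in physics function, the rank, and the way the residual enters the prediction are my illustrative assumptions; the paper embeds this idea inside a full physics-informed inversion pipeline.

```python
import torch
import torch.nn as nn

class LowRankAugmentedForward(nn.Module):
    """An incomplete physics forward model plus a learnable low-rank residual."""

    def __init__(self, physics_forward, in_dim, out_dim, rank=1):
        super().__init__()
        self.physics_forward = physics_forward           # fixed, differentiable physics
        self.U = nn.Parameter(0.01 * torch.randn(out_dim, rank))
        self.V = nn.Parameter(0.01 * torch.randn(in_dim, rank))

    def forward(self, x):
        residual = (x @ self.V) @ self.U.T               # rank-limited correction term
        return self.physics_forward(x) + residual

# Toy usage: fit the residual so the augmented model matches observations.
torch.manual_seed(0)
physics = lambda x: torch.sin(x).sum(dim=-1, keepdim=True)   # stand-in "simplified physics"
model = LowRankAugmentedForward(physics, in_dim=4, out_dim=1)
x = torch.randn(32, 4)
y_obs = physics(x) + 0.3 * x[:, :1]                          # reality = physics + structured gap
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(x) - y_obs) ** 2).mean()
    loss.backward()
    opt.step()
print(float(loss))  # small residual after fitting the low-rank correction
```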
[418] Efficient Zero-Order Federated Finetuning of Language Models for Resource-Constrained Devices
Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Ramin Khalili, Heba Khdr, JĂśrg Henkel
Main category: cs.LG
TL;DR: An efficient zero-order federated fine-tuning method for LLMs on edge devices, achieving faster convergence and reduced computation overhead.
Details
Motivation: Federated fine-tuning of LLMs on edge devices faces challenges with high memory, communication, and computational demands. Zero-order optimization helps with memory but has slow convergence.
Method: Proposes a method that divides the network into two blocks and applies different numbers of perturbations per block in a computationally effective way to achieve faster convergence.
Result: Achieves 1.6-3× reduction in computation overhead compared to state-of-the-art zero-order techniques in federated learning.
Conclusion: The proposed method enables efficient federated fine-tuning of LLMs on edge devices with faster convergence and significantly reduced computational requirements.
Abstract: Federated fine-tuning offers a promising approach for tuning Large Language Models (LLMs) on edge devices while preserving data privacy. However, fine-tuning these models on edge devices remains challenging due to high memory, communication, and computational demands. Zero-order optimization with task alignment provides a potential solution, enabling fine-tuning with inference-level memory requirements but requires a longer convergence time. In this paper, we propose a method that divides the network into two blocks, applying a different number of perturbations per block in a computationally effective way, achieving faster convergence. Our evaluation shows a $1.6-3\times$ reduction in computation overhead compared to state-of-the-art zero-order techniques in federated learning.
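A minimal sketch of the core computational idea, zero-order (SPSA-style) gradient estimation with a different number of perturbations per parameter block; the block split, perturbation counts, and smoothing scale are illustrative assumptions rather than the paper's schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_order_grad(loss, w, idx, n_perturb, eps=1e-3):
    """Estimate the gradient of `loss` w.r.t. the coordinates in `idx`
    from n_perturb two-sided random perturbations (no backpropagation)."""
    g = np.zeros_like(w)
    for _ in range(n_perturb):
        u = np.zeros_like(w)
        u[idx] = rng.normal(size=len(idx))
        diff = loss(w + eps * u) - loss(w - eps * u)
        g += diff / (2 * eps) * u
    return g / n_perturb

# Two-block scheme: spend more perturbations on one block than the other.
loss = lambda w: np.sum((w - 1.0) ** 2)
w = np.zeros(10)
block_a, block_b = np.arange(5), np.arange(5, 10)
for _ in range(300):
    g = zero_order_grad(loss, w, block_a, n_perturb=4) \
        + zero_order_grad(loss, w, block_b, n_perturb=1)
    w -= 0.05 * g
print(np.round(w, 2))  # approaches the all-ones minimizer
```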
[419] DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs
Dongyuan Li, Shiyin Tan, Ying Zhang, Ming Jin, Shirui Pan, Manabu Okumura, Renhe Jiang
Main category: cs.LG
TL;DR: DyG-Mamba translates dynamic graph modeling into long-term sequence modeling using state space models, incorporating Ebbinghaus’ forgetting curve and review cycle concepts to handle irregular timespans and filter noise.
Details
Motivation: Dynamic graph modeling is crucial for understanding evolutionary patterns in real-world systems like social recommendation and cancer detection, but existing approaches struggle with efficiently capturing long-term dependencies and handling irregular temporal patterns.
Method: Proposes DyG-Mamba that treats dynamic graphs as long-term sequences using state space models. Incorporates Ebbinghaus’ forgetting curve to use irregular timespans as control signals for adjusting historical information forgetting, and Ebbinghaus’ review cycle to selectively review history and filter noise.
Result: Achieves state-of-the-art performance on most of the 12 datasets covering dynamic link prediction and node classification tasks, while demonstrating significantly improved computational and memory efficiency.
Conclusion: DyG-Mamba effectively translates dynamic graph modeling into sequence modeling with state space models, leveraging psychological memory principles to handle irregular temporal patterns and improve both performance and efficiency.
Abstract: Dynamic graph modeling aims to uncover evolutionary patterns in real-world systems, enabling accurate social recommendation and early detection of cancer cells. Inspired by the success of recent state space models in efficiently capturing long-term dependencies, we propose DyG-Mamba by translating dynamic graph modeling into a long-term sequence modeling problem. Specifically, inspired by Ebbinghaus’ forgetting curve, we treat the irregular timespans between events as control signals, allowing DyG-Mamba to dynamically adjust the forgetting of historical information. This mechanism ensures effective usage of irregular timespans, thereby improving both model effectiveness and inductive capability. In addition, inspired by Ebbinghaus’ review cycle, we redefine core parameters to ensure that DyG-Mamba selectively reviews historical information and filters out noisy inputs, further enhancing the model’s robustness. Through exhaustive experiments on 12 datasets covering dynamic link prediction and node classification tasks, we show that DyG-Mamba achieves state-of-the-art performance on most datasets, while demonstrating significantly improved computational and memory efficiency. Code is available at https://github.com/Clearloveyuan/DyG-Mamba.
[420] Preference-Guided Diffusion for Multi-Objective Offline Optimization
Yashas Annadani, Syrine Belakaria, Stefano Ermon, Stefan Bauer, Barbara E Engelhardt
Main category: cs.LG
TL;DR: A preference-guided diffusion model for offline multi-objective optimization that generates diverse Pareto-optimal solutions using a classifier-based guidance mechanism.
Details
Motivation: Offline multi-objective optimization needs to identify Pareto-optimal solutions from existing datasets, but existing methods struggle to generate diverse, well-distributed solutions beyond the training distribution.
Method: Uses a preference-guided diffusion model with a classifier trained to predict dominance probabilities between designs. Introduces diversity-aware preference guidance that combines Pareto dominance with diversity criteria to ensure well-distributed solutions.
Result: Consistently outperforms other inverse/generative approaches and remains competitive with forward/surrogate-based optimization methods across various continuous offline multi-objective optimization tasks.
Conclusion: Classifier-guided diffusion models are effective for generating diverse, high-quality solutions that approximate the Pareto front well, addressing limitations of prior generative methods in offline multi-objective optimization.
Abstract: Offline multi-objective optimization aims to identify Pareto-optimal solutions given a dataset of designs and their objective values. In this work, we propose a preference-guided diffusion model that generates Pareto-optimal designs by leveraging a classifier-based guidance mechanism. Our guidance classifier is a preference model trained to predict the probability that one design dominates another, directing the diffusion model toward optimal regions of the design space. Crucially, this preference model generalizes beyond the training distribution, enabling the discovery of Pareto-optimal solutions outside the observed dataset. We introduce a novel diversity-aware preference guidance, augmenting Pareto dominance preference with diversity criteria. This ensures that generated solutions are optimal and well-distributed across the objective space, a capability absent in prior generative methods for offline multi-objective optimization. We evaluate our approach on various continuous offline multi-objective optimization tasks and find that it consistently outperforms other inverse/generative approaches while remaining competitive with forward/surrogate-based optimization methods. Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions that approximate the Pareto front well.
[421] Unsupervised discovery of the shared and private geometry in multi-view data
Sai Koukuntla, Joshua B. Julian, Jesse C. Kaminsky, Manuel Schottdorf, David W. Tank, Carlos D. Brody, Adam S. Charles
Main category: cs.LG
TL;DR: SPLICE is a neural network method that learns disentangled, interpretable representations of shared and private latent variables from paired multi-view data, outperforming existing methods in disentanglement, interpretability, and robustness.
Details
Motivation: Real-world phenomena often involve multi-view data (e.g., different brain regions or sensor modalities), but existing methods lack expressivity for nonlinear relationships, only capture shared variance, or discard crucial geometric information needed for insights.
Method: SPLICE uses neural networks to infer disentangled representations of private and shared latent variables from paired high-dimensional views, preserving geometric structure for interpretability.
Result: SPLICE outperforms competing methods in three key areas: 1) better disentanglement of shared and private representations, 2) more interpretable representations through geometry preservation, and 3) greater robustness to incorrect latent dimensionality estimates.
Conclusion: SPLICE is proposed as a general-purpose method for finding succinct, interpretable descriptions of paired datasets using disentangled shared and private latent variables, particularly valuable for neuroscience and multi-view data analysis.
Abstract: Studying complex real-world phenomena often involves data from multiple views (e.g. sensor modalities or brain regions), each capturing different aspects of the underlying system. Within neuroscience, there is growing interest in large-scale simultaneous recordings across multiple brain regions. Understanding the relationship between views (e.g., the neural activity in each region recorded) can reveal fundamental insights into each view and the system as a whole. However, existing methods to characterize such relationships lack the expressivity required to capture nonlinear relationships, describe only shared sources of variance, or discard geometric information that is crucial to drawing insights from data. Here, we present SPLICE: a neural network-based method that infers disentangled, interpretable representations of private and shared latent variables from paired samples of high-dimensional views. Compared to competing methods, we demonstrate that SPLICE 1) disentangles shared and private representations more effectively, 2) yields more interpretable representations by preserving geometry, and 3) is more robust to incorrect a priori estimates of latent dimensionality. We propose our approach as a general-purpose method for finding succinct and interpretable descriptions of paired data sets in terms of disentangled shared and private latent variables.
[422] Ensembles provably learn equivariance through data augmentation
Oskar Nordenfors, Axel Flinth
Main category: cs.LG
TL;DR: Group equivariance emerges in neural network ensembles with full augmentation, independent of neural tangent kernel limit, extending to stochastic settings and general architectures.
Details
Motivation: Previous work showed group equivariance emerges in neural network ensembles with full augmentation in the neural tangent kernel (NTK) limit. This paper aims to extend this result by showing the emergence is independent of the NTK limit and applies more broadly.
Method: Theoretical proof that group equivariance emergence doesn’t depend on NTK limit, extension to stochastic settings, analysis of general architectures with a simple sufficient condition relating architecture to group action, and validation through numeric experiments.
Result: Group equivariance emerges in neural network ensembles with full augmentation regardless of NTK limit, applicable to stochastic settings and general architectures meeting the sufficient condition.
Conclusion: The emergence of group equivariance in neural network ensembles is more general than previously thought - independent of NTK limit, applicable to stochastic settings, and holds for broad architectural classes with a simple condition.
Abstract: Recently, it was proved that group equivariance emerges in ensembles of neural networks as the result of full augmentation in the limit of infinitely wide neural networks (neural tangent kernel limit). In this paper, we extend this result significantly. We provide a proof that this emergence does not depend on the neural tangent kernel limit at all. We also consider stochastic settings, and furthermore general architectures. For the latter, we provide a simple sufficient condition on the relation between the architecture and the action of the group for our results to hold. We validate our findings through simple numeric experiments.
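The mechanism can be illustrated with the simplest proxy: averaging a non-equivariant predictor over the group, which is what a fully augmented ensemble computes in expectation. The cyclic-shift group, random map, and averaging formula below are illustrative choices for a quick numeric check in the spirit of the paper's experiments.

```python
import numpy as np

def shift(x, g):
    """Action of the cyclic group C_n on vectors: circular shift by g positions."""
    return np.roll(x, g)

def symmetrize(f, n):
    """Group-average a map f; the result is C_n-equivariant by construction."""
    return lambda x: np.mean([shift(f(shift(x, -g)), g) for g in range(n)], axis=0)

rng = np.random.default_rng(0)
n = 6
W = rng.normal(size=(n, n))
f = lambda x: np.tanh(W @ x)        # an arbitrary, non-equivariant "network"
f_sym = symmetrize(f, n)            # proxy for the fully augmented ensemble average

x = rng.normal(size=n)
gap = np.max(np.abs(f_sym(shift(x, 2)) - shift(f_sym(x), 2)))
print(gap)  # ~1e-16: equivariance holds after group averaging
```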
[423] Provable optimal transport with transformers: The essence of depth and prompt engineering
Hadi Daneshmand
Main category: cs.LG
TL;DR: Transformers align tokens via attention layers that approximate Optimal Transport between word embeddings, with depth controlling alignment accuracy.
Details
Motivation: Despite empirical success, the internal mechanism of token alignment in transformers remains poorly understood. The paper aims to provide a mechanistic and theoretical explanation of how transformers align tokens during language processing.
Method: 1) Present empirical evidence showing attention weights progressively align translated word pairs across layers, approximating Optimal Transport between word embeddings. 2) Prove that softmax self-attention layers can simulate gradient descent on the dual of entropy-regularized OT problem. 3) Derive constructive convergence bound showing transformer depth controls OT approximation accuracy.
Result: Transformers can sort lists of varying lengths without parameter adjustment, with error term vanishing with transformer depth. Attention weights in machine translation closely approximate Optimal Transport between word embeddings.
Conclusion: The paper provides a theoretical foundation showing transformers perform token alignment through attention mechanisms that approximate Optimal Transport, with depth controlling alignment precision. This explains transformers’ ability to handle variable-length sequences and perform tasks like list sorting without parameter tuning.
Abstract: Despite their empirical success, the internal mechanism by which transformer models align tokens during language processing remains poorly understood. This paper provides a mechanistic and theoretical explanation of token alignment in LLMs. We first present empirical evidence showing that, in machine translation, attention weights progressively align translated word pairs across layers, closely approximating Optimal Transport (OT) between word embeddings. Building on this observation, we prove that softmax self-attention layers can simulate gradient descent on the dual of the entropy-regularized OT problem, providing a theoretical foundation for the alignment. Our analysis yields a constructive convergence bound showing that transformer depth controls OT approximation accuracy. A direct implication is that standard transformers can sort lists of varying lengths without any parameter adjustment, up to an error term vanishing with transformer depth.
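For context, a standalone sketch of the optimization object the paper says attention can simulate: block-coordinate ascent (log-domain Sinkhorn) on the dual of entropy-regularized OT between two point clouds. This is the reference problem only, not the transformer construction; the embeddings and regularization strength are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def entropic_ot_plan(X, Y, eps=0.5, iters=300):
    """Log-domain Sinkhorn: block-coordinate ascent on the entropic OT dual,
    between uniform measures on the rows of X and Y."""
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)       # squared-distance cost
    log_a, log_b = -np.log(n) * np.ones(n), -np.log(m) * np.ones(m)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # Each potential update is a logsumexp (softmax) over the other point cloud,
        # the operation the paper shows attention layers can emulate.
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)       # transport plan

rng = np.random.default_rng(0)
P = entropic_ot_plan(rng.normal(size=(5, 3)), rng.normal(size=(5, 3)) + 1.0)
print(P.sum(axis=0))  # each column sums to 1/5 after the final g-update
```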
[424] Scaling Laws for Black box Adversarial Attacks
Chuan Liu, Huanran Chen, Yichi Zhang, Jun Zhu, Yinpeng Dong
Main category: cs.LG
TL;DR: Large-scale study shows attack success rate scales linearly with logarithm of ensemble size, enabling devastating 80%+ transfer attacks on proprietary models like GPT-4o.
Details
Motivation: Prior studies use small, fixed ensembles for adversarial attacks, leaving open whether scaling the number of surrogate models can further improve black-box attack transferability.
Method: Resolve gradient conflict with advanced optimizers, conduct large-scale empirical study, discover log-linear scaling law through theoretical analysis and empirical evaluations across standard classifiers, SOTA defenses, and MLLMs.
Result: Attack Success Rate (ASR) scales linearly with logarithm of ensemble size T. Scaling distills robust, semantic features of target class. Achieved 80%+ transfer attack success rate on proprietary models like GPT-4o, revealing Claude-3.5-Sonnet’s exceptional resilience.
Conclusion: Findings urge shift in robustness evaluation focus: from designing intricate algorithms on small ensembles to understanding principled, powerful threat of scaling. Reveals clear robustness hierarchy among models.
Abstract: Adversarial examples exhibit cross-model transferability, enabling threatening black-box attacks on commercial models. Model ensembling, which attacks multiple surrogate models, is a known strategy to improve this transferability. However, prior studies typically use small, fixed ensembles, which leaves open an intriguing question of whether scaling the number of surrogate models can further improve black-box attacks. In this work, we conduct the first large-scale empirical study of this question. We show that by resolving gradient conflict with advanced optimizers, we discover a robust and universal log-linear scaling law through both theoretical analysis and empirical evaluations: the Attack Success Rate (ASR) scales linearly with the logarithm of the ensemble size $T$. We rigorously verify this law across standard classifiers, SOTA defenses, and MLLMs, and find that scaling distills robust, semantic features of the target class. Consequently, we apply this fundamental insight to benchmark SOTA MLLMs. This reveals both the attack’s devastating power and a clear robustness hierarchy: we achieve 80%+ transfer attack success rate on proprietary models like GPT-4o, while also highlighting the exceptional resilience of Claude-3.5-Sonnet. Our findings urge a shift in focus for robustness evaluation: from designing intricate algorithms on small ensembles to understanding the principled and powerful threat of scaling.
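The claimed scaling law is simple to state as a fit: attack success rate (ASR) grows linearly in the logarithm of the ensemble size T. A numpy sketch with synthetic, clearly-not-measured ASR values, purely to show the functional form being fitted.

```python
import numpy as np

# Ensemble sizes and synthetic (illustrative, NOT measured) attack success rates.
T = np.array([1, 2, 4, 8, 16, 32, 64])
asr = np.array([0.18, 0.27, 0.35, 0.44, 0.52, 0.61, 0.69])

# Fit ASR = a * log(T) + b, the log-linear scaling-law form reported in the paper.
a, b = np.polyfit(np.log(T), asr, deg=1)
print(f"ASR ~ {a:.3f} * log(T) + {b:.3f}")
print("extrapolated ASR at T=128:", round(a * np.log(128) + b, 3))
```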
[425] Iterative Feature Exclusion Ranking for Deep Tabular Learning
Fathi Said Emhemed Shaninah, AbdulRahman M. A. Baraka, Mohd Halim Mohd Noor
Main category: cs.LG
TL;DR: Proposes iterative feature exclusion module for better feature importance ranking in tabular data by capturing contextual dependencies and interactions.
Details
Motivation: Existing deep learning models for tabular data have unidimensional feature selection mechanisms that fail to account for contextual dependence of feature importance, overlook feature interactions, bias of high-impact features, and attention generalization limitations.
Method: Iterative feature exclusion module that excludes each feature from input data, computes attention scores representing feature impact on prediction, and aggregates scores across iterations to generate refined feature importance representation capturing both global and local interactions.
Result: Outperforms state-of-the-art methods and baseline models on four public datasets for both feature ranking and classification tasks.
Conclusion: The proposed iterative feature exclusion module effectively addresses limitations of existing tabular deep learning models by capturing contextual feature dependencies and interactions, leading to improved feature importance ranking and classification performance.
Abstract: Tabular data is a common format for storing information in rows and columns to represent data entries and their features. Although deep neural networks have become the main approach for modeling a wide range of domains including computer vision and NLP, many of them are not well-suited for tabular data. Recently, a few deep learning models have been proposed for deep tabular learning, featuring an internal feature selection mechanism with end-to-end gradient-based optimization. However, their feature selection mechanisms are unidimensional, and hence fail to account for the contextual dependence of feature importance, potentially overlooking crucial interactions that govern complex tasks. In addition, they overlook the bias of high-impact features and the risk associated with the limitations of attention generalization. To address this limitation, this study proposes a novel iterative feature exclusion module that enhances the feature importance ranking in tabular data. The proposed module iteratively excludes each feature from the input data and computes the attention scores, which represent the impact of the features on the prediction. By aggregating the attention scores from each iteration, the proposed module generates a refined representation of feature importance that captures both global and local interactions between features. The effectiveness of the proposed module is evaluated on four public datasets. The results demonstrate that the proposed module consistently outperforms state-of-the-art methods and baseline models in feature ranking and classification tasks. The code is publicly available at https://github.com/abaraka2020/Iterative-Feature-Exclusion-Ranking-Module and https://github.com/mohalim/IFENet
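A minimal sketch of the exclude-and-score loop described above: remove one feature at a time, recompute per-feature scores on the reduced input, and aggregate across iterations. The scoring function here, a softmax over weighted absolute inputs, is a stand-in assumption for the module's attention scores.

```python
import numpy as np

def iterative_feature_exclusion_ranking(score_fn, X):
    """Aggregate per-feature scores over runs that each exclude one feature.

    score_fn(X_masked) -> (n_samples, n_features) nonnegative scores,
    a stand-in for the attention scores of a deep tabular model.
    """
    n_features = X.shape[1]
    agg = np.zeros(n_features)
    for j in range(n_features):
        X_masked = X.copy()
        X_masked[:, j] = 0.0                 # exclude feature j for this iteration
        scores = score_fn(X_masked)          # scores computed without feature j's influence
        scores[:, j] = 0.0                   # the excluded feature earns no credit this round
        agg += scores.mean(axis=0)
    return agg / agg.sum()                   # refined, normalized importance ranking

# Toy stand-in "attention": softmax over |x * w| with one fixed weight per feature.
rng = np.random.default_rng(0)
w = np.array([3.0, 0.5, 1.5, 0.2])
score_fn = lambda X: np.exp(np.abs(X * w)) / np.exp(np.abs(X * w)).sum(axis=1, keepdims=True)
X = rng.normal(size=(100, 4))
print(iterative_feature_exclusion_ranking(score_fn, X))  # feature 0 should rank highest
```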
[426] Closed-Form Feedback-Free Learning with Forward Projection
Robert O’Shea, Bipin Rajendran
Main category: cs.LG
TL;DR: Forward Projection (FP) is a single-forward-pass training method that uses randomized projections and closed-form regression without backpropagation or retrograde communication, achieving comparable performance to gradient-based methods with faster training and better interpretability.
Details
Motivation: To develop a learning method that operates without retrograde communication (feedback from neuronal outputs) for pre-synaptic weight optimization, addressing limitations of backpropagation-free methods that still require local error feedback.
Method: Forward Projection (FP) generates target values for pre-activation membrane potentials through randomized nonlinear projections of pre-synaptic inputs and labels, then optimizes local loss functions using closed-form regression without feedback from downstream layers.
Result: FP achieves generalization comparable to gradient descent-based local learning methods across biomedical datasets while requiring only a single forward propagation step, yielding significant training speedup. In few-shot learning, FP produces more generalizable models than backpropagation-optimised alternatives.
Conclusion: FP provides an efficient, interpretable alternative to backpropagation-based training that operates without retrograde communication, offering faster training, comparable performance, and clinically interpretable features in biomedical applications.
Abstract: State-of-the-art backpropagation-free learning methods employ local error feedback to direct iterative optimisation via gradient descent. Here, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. We propose Forward Projection (FP), a randomised closed-form training method requiring only a single forward pass over the dataset without retrograde communication. FP generates target values for pre-activation membrane potentials through randomised nonlinear projections of pre-synaptic inputs and labels. Local loss functions are optimised using closed-form regression without feedback from downstream layers. A key advantage is interpretability: membrane potentials in FP-trained networks encode information interpretable layer-wise as label predictions. Across several biomedical datasets, FP achieves generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, yielding significant training speedup. In few-shot learning tasks, FP produces more generalisable models than backpropagation-optimised alternatives, with local interpretation functions successfully identifying clinically salient diagnostic features.
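A schematic single-pass sketch in the spirit of the description above: each layer's targets come from a fixed random nonlinear projection of its inputs and the labels, and the layer weights are then obtained by closed-form ridge regression against those targets, with no backward pass. Layer sizes, the nonlinearity, and the linear read-out are my illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, T, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam I)^{-1} X^T T."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

def forward_projection_train(X, Y_onehot, widths, rng):
    """Fit a feed-forward net layer by layer in one forward pass, no backprop."""
    layers, H = [], X
    for width in widths:
        Z = np.hstack([H, Y_onehot])                      # pre-synaptic inputs plus labels
        R = rng.normal(size=(Z.shape[1], width)) / np.sqrt(Z.shape[1])
        targets = np.tanh(Z @ R)                          # randomized projection targets
        W = ridge_fit(H, targets)                         # closed-form local fit, no feedback
        layers.append(W)
        H = np.tanh(H @ W)                                # forward to the next layer
    readout = ridge_fit(H, Y_onehot)                      # linear read-out, also closed form
    return layers, readout

def forward_projection_predict(X, layers, readout):
    H = X
    for W in layers:
        H = np.tanh(H @ W)
    return (H @ readout).argmax(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]
layers, readout = forward_projection_train(X, Y, widths=[32, 16], rng=rng)
print((forward_projection_predict(X, layers, readout) == y).mean())  # training accuracy
```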
[427] Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, Hao Peng
Main category: cs.LG
TL;DR: RL fine-tuning of LLMs produces substantial performance gains while updating only 5-30% of parameters, revealing intrinsic parameter update sparsity across algorithms and models.
Details
Motivation: To investigate the surprising phenomenon that RL fine-tuning of LLMs achieves large performance improvements while updating only a small fraction of parameters, and to understand the nature and causes of this intrinsic sparsity.
Method: Experimental analysis across 7 RL algorithms (PPO, GRPO, DPO, etc.) and 10 different LLM families, examining parameter update patterns, subnetwork properties, and the effects of various training components like KL regularization and gradient clipping.
Result: RL consistently induces parameter update sparsity (5-30% of parameters) across all tested algorithms and models without explicit sparsity constraints. The sparse subnetworks show high overlap across different training conditions, update nearly all parameter matrices similarly, and produce nearly full-rank updates that span most representable subspaces.
Conclusion: The observed parameter update sparsity in RL fine-tuning of LLMs is intrinsic and primarily attributed to training on data near the policy distribution, rather than explicit regularization techniques. This suggests RL fine-tuning efficiently identifies and updates a small but critical subset of parameters that capture most of the necessary adjustments.
Abstract: Reinforcement learning (RL) yields substantial improvements in large language models' (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5 percent to 30 percent of the parameters, with the rest effectively unchanged. We refer to this phenomenon as parameter update sparsity induced by RL. It is observed across all 7 widely used RL algorithms (e.g., PPO, GRPO, DPO) and all 10 LLMs from different families in our experiments. This sparsity is intrinsic and occurs without any explicit sparsity-promoting regularizations or architectural constraints. Finetuning the subnetwork alone recovers the test accuracy, and, remarkably, produces a model nearly identical to the one obtained via full finetuning. The subnetworks from different random seeds, training data, and even RL algorithms show substantially greater overlap than expected by chance. Our analysis suggests that this sparsity is not due to updating only a subset of layers; instead, nearly all parameter matrices receive similarly sparse updates. Moreover, the updates to almost all parameter matrices are nearly full-rank, suggesting RL updates a small subset of parameters that nevertheless span almost the full subspaces that the parameter matrices can represent. We conjecture that this update sparsity can be primarily attributed to training on data that is near the policy distribution; techniques that encourage the policy to remain close to the pretrained model, such as KL regularization and gradient clipping, have limited impact.
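The headline quantity is straightforward to compute from two checkpoints: the fraction of parameters whose values actually change after RL finetuning. A small torch sketch; the tolerance and the toy checkpoints are illustrative, and a real measurement would load the pretrained and finetuned state dicts.

```python
import torch

def update_sparsity(state_before, state_after, atol=0.0):
    """Fraction of parameters changed by finetuning, overall and per tensor."""
    changed, total, per_tensor = 0, 0, {}
    for name, before in state_before.items():
        diff = (before - state_after[name]).abs() > atol
        per_tensor[name] = diff.float().mean().item()
        changed += diff.sum().item()
        total += diff.numel()
    return changed / total, per_tensor

# Toy checkpoints: "finetuning" touched roughly 10% of one layer's weights.
torch.manual_seed(0)
before = {"layer.weight": torch.randn(100, 100)}
after = {"layer.weight": before["layer.weight"].clone()}
mask = torch.rand(100, 100) < 0.10
after["layer.weight"][mask] += 0.01
overall, per_tensor = update_sparsity(before, after)
print(f"fraction of parameters updated: {overall:.3f}")  # ~0.10
```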
[428] Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization
Maria Matveev, Vit Fojtik, Hung-Hsu Chou, Gitta Kutyniok, Johannes Maly
Main category: cs.LG
TL;DR: The paper argues that understanding gradient descent’s generalization requires analyzing the interaction between multiple implicit biases (parameter norm minimization and low sharpness), not just focusing on one, as the learning rate balances these competing regularization effects.
Details
Motivation: To develop a comprehensive understanding of gradient descent's generalization performance by examining how different forms of implicit regularization interact, rather than studying them in isolation. Current theoretical analyses often focus on single implicit biases (like ℓ1-norm minimization or max-margin solutions) under vanishing learning rates, while empirical evidence shows moderate/large learning rates induce different biases toward low-sharpness minima.
Method: Combines empirical demonstrations showing that learning rate balances between low parameter norm and low sharpness, with theoretical analysis proving for diagonal linear networks on a simple regression task that neither implicit bias alone minimizes generalization error.
Result: Empirical results show learning rate controls trade-off between parameter norm and sharpness. Theoretical analysis proves that focusing on either norm minimization or sharpness minimization alone doesn’t minimize generalization error, demonstrating the insufficiency of single-bias explanations.
Conclusion: A single implicit bias perspective is insufficient to explain good generalization. A broader view capturing the dynamic trade-off between norm and sharpness induced by non-negligible learning rates is needed to understand gradient descent’s generalization performance.
Abstract: A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this theoretically, recent works examine gradient descent and its variants in simplified training settings, often assuming vanishing learning rates. These studies reveal various forms of implicit regularization, such as $\ell_1$-norm minimizing parameters in regression and max-margin solutions in classification. Concurrently, empirical findings show that moderate to large learning rates exceeding standard stability thresholds lead to faster, albeit oscillatory, convergence in the so-called Edge-of-Stability regime, and induce an implicit bias towards minima of low sharpness (norm of training loss Hessian). In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate balances between low parameter norm and low sharpness of the trained model. We furthermore prove for diagonal linear networks trained on a simple regression task that neither implicit bias alone minimizes the generalization error. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.
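As a toy illustration of the stability threshold behind the Edge-of-Stability regime discussed above (not the paper's diagonal-linear-network analysis), the sketch below runs plain gradient descent on a quadratic with sharpness h and shows how the dynamics change character once the learning rate crosses 2/h.

```python
# Toy illustration (not the paper's setting) of the stability threshold that
# defines the Edge-of-Stability regime: for f(w) = 0.5 * h * w^2 the sharpness
# (Hessian norm) is h, and plain gradient descent is stable only when lr < 2/h.
import numpy as np

def run_gd(h, lr, w0=1.0, steps=20):
    w, trajectory = w0, [w0]
    for _ in range(steps):
        w = w - lr * h * w          # gradient of 0.5*h*w^2 is h*w
        trajectory.append(w)
    return np.array(trajectory)

h = 4.0                             # sharpness of the quadratic
for lr in [0.1, 0.4, 0.49, 0.51]:   # stability threshold is 2 / h = 0.5
    traj = run_gd(h, lr)
    label = "stable" if abs(traj[-1]) < 1 else "oscillatory/divergent"
    print(f"lr={lr:.2f}  |w_final|={abs(traj[-1]):.3e}  {label}")
```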
[429] Uncovering Alzheimer’s Disease Progression via SDE-based Spatio-Temporal Graph Deep Learning on Longitudinal Brain Networks
Houliang Zhou, Rong Zhou, Yangying Liu, Kanhao Zhao, Li Shen, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer’s Disease Neuroimaging Initiative
Main category: cs.LG
TL;DR: An interpretable spatio-temporal graph neural network using dual Stochastic Differential Equations predicts Alzheimer’s disease progression from irregular longitudinal fMRI data, identifying key brain circuit abnormalities and novel biomarkers.
Details
Motivation: There's a need for objective neuroimaging biomarkers to forecast Alzheimer's disease progression, but existing methods overlook the complex spatio-temporal dysfunctions in brain networks, and irregularly-sampled longitudinal data poses additional challenges.Method: Developed an interpretable spatio-temporal graph neural network framework using dual Stochastic Differential Equations to model irregularly-sampled longitudinal fMRI data, learning sparse regional and connective importance probabilities.
Result: Validated on OASIS-3 and ADNI cohorts, identified key brain regions (parahippocampal cortex, prefrontal cortex, parietal lobule) and network disruptions (ventral attention, dorsal attention, default mode networks) correlating with clinical symptoms. Revealed novel neural systems-level and sex-specific biomarkers.
Conclusion: The framework demonstrates potential for early, individualized AD progression prediction using irregular longitudinal imaging data, offering new insights into neurobiological mechanisms through interpretable spatio-temporal graph-based learning.
Abstract: Identifying objective neuroimaging biomarkers to forecast Alzheimer’s disease (AD) progression is crucial for timely intervention. However, this task remains challenging due to the complex dysfunctions in the spatio-temporal characteristics of underlying brain networks, which are often overlooked by existing methods. To address these limitations, we develop an interpretable spatio-temporal graph neural network framework to predict future AD progression, leveraging dual Stochastic Differential Equations (SDEs) to model the irregularly-sampled longitudinal functional magnetic resonance imaging (fMRI) data. We validate our approach on two independent cohorts, including the Open Access Series of Imaging Studies (OASIS-3) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Our framework effectively learns sparse regional and connective importance probabilities, enabling the identification of key brain circuit abnormalities associated with disease progression. Notably, we detect the parahippocampal cortex, prefrontal cortex, and parietal lobule as salient regions, with significant disruptions in the ventral attention, dorsal attention, and default mode networks. These abnormalities correlate strongly with longitudinal AD-related clinical symptoms. Moreover, our interpretability strategy reveals both established and novel neural systems-level and sex-specific biomarkers, offering new insights into the neurobiological mechanisms underlying AD progression. Our findings highlight the potential of spatio-temporal graph-based learning for early, individualized prediction of AD progression, even in the context of irregularly-sampled longitudinal imaging data.
[430] Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study
Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen
Main category: cs.LG
TL;DR: The paper investigates the volume hypothesis, which claims neural networks can generalize well even without gradient descent, using Guess & Check (G&C). The authors find that in matrix factorization models, G&C generalization worsens with width but improves with depth, showing no simple answer to whether neural networks need gradient descent to generalize.
Details
Motivation: To challenge the conventional wisdom that gradient descent is essential for neural network generalization, and to test the volume hypothesis that claims generalization persists even with random weight selection (Guess & Check). The open question is whether this holds for wide and deep networks.Method: Theoretical investigation of matrix factorization models (with linear and non-linear activation) as a testbed for neural network theory. Compare generalization under Guess & Check (random weight selection) versus gradient descent, analyzing effects of width and depth.
Result: 1) Generalization under G&C deteriorates with increasing width - first provable case where G&C is inferior to gradient descent. 2) Generalization under G&C improves with increasing depth, showing opposite effects of width vs depth. 3) Empirical validation supports theoretical findings.
Conclusion: There’s no simple answer to whether neural networks need gradient descent to generalize well. The volume hypothesis doesn’t hold uniformly - width and depth have opposite effects on G&C generalization, suggesting complexity in understanding neural network generalization mechanisms.
Abstract: Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)–a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
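A minimal sketch of the Guess & Check procedure on a toy rank-1 matrix-factorization problem (an illustration of the general idea, not the paper's theoretical setting): factor weights are drawn at random until the observed training entries are fit, and generalization is read off the held-out entry.

```python
# Guess & Check (G&C) on a tiny rank-1 matrix-factorization task: sample random
# factor pairs until one fits the observed training entries (or the budget runs
# out), then check the held-out entry. Sizes and tolerances are illustrative.
import numpy as np

rng = np.random.default_rng(0)
u_true, v_true = np.array([1.0, -0.5]), np.array([0.5, 2.0])
M = np.outer(u_true, v_true)                 # ground-truth rank-1 matrix

train_idx = [(0, 0), (0, 1), (1, 0)]         # observed entries
test_idx = (1, 1)                            # held-out entry

best_err, best_W = np.inf, None
for _ in range(200_000):                     # G&C: draw random factorizations
    u, v = rng.normal(size=2), rng.normal(size=2)
    W = np.outer(u, v)
    train_err = max(abs(W[i, j] - M[i, j]) for i, j in train_idx)
    if train_err < best_err:
        best_err, best_W = train_err, W
    if train_err < 0.05:                     # "fits" the training data
        break

print(f"best training error: {best_err:.3f}")
print(f"held-out error:      {abs(best_W[test_idx] - M[test_idx]):.3f}")
```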
[431] Reverse Supervision at Scale: Exponential Search Meets the Economics of Annotation
Masoud Makrehchi
Main category: cs.LG
TL;DR: Reversed-supervision strategy searching over unlabeled data labelings has exponential complexity that can’t be overcome by faster computation alone, requiring human input for objectives, definitions, and seed annotations.
Details
Motivation: To understand the fundamental limits of computational approaches to learning and determine whether arbitrarily fast computation can eliminate the need for human supervision in machine learning.Method: Analyze a reversed-supervision strategy that searches over all possible labelings of a large unlabeled dataset to minimize error on a small labeled set, examining the exponential search space complexity.
Result: The search space remains exponential (2^n) even with large constant-factor speedups (quantum or parallel hardware), showing that arbitrarily fast computation cannot eliminate the need for informative labels or priors.
Conclusion: Human input remains essential for specifying objectives, defining classes, and providing seed annotations; generative AI can amplify labels but requires human-grade quality and oversight; computational speed reduces time but not fundamental supervision needs.
Abstract: We analyze a reversed-supervision strategy that searches over labelings of a large unlabeled set $B$ to minimize error on a small labeled set $A$. The search space is $2^n$, and the resulting complexity remains exponential even under large constant-factor speedups (e.g., quantum or massively parallel hardware). Consequently, arbitrarily fast – but not exponentially faster – computation does not obviate the need for informative labels or priors. In practice, the machine learning pipeline still requires an initial human contribution: specifying the objective, defining classes, and providing a seed set of representative annotations that inject inductive bias and align models with task semantics. Synthetic labels from generative AI can partially substitute, provided their quality is human-grade and anchored by a human-specified objective, seed supervision, and validation. In this view, generative models function as \emph{label amplifiers}, leveraging small human-curated cores via active, semi-supervised, and self-training loops, while humans retain oversight for calibration, drift detection, and failure auditing. Thus, extreme computational speed reduces wall-clock time but not the fundamental supervision needs of learning; initial human (or human-grade) input remains necessary to ground the system in the intended task.
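The exponential search the paper analyzes can be made concrete on a toy problem. The sketch below enumerates all 2^n labelings of a tiny unlabeled set B, fits a classifier to each, and keeps the labeling with the lowest error on the labeled set A; the synthetic data and the choice of classifier are illustrative assumptions.

```python
# Reversed-supervision search: brute-force all 2^n labelings of a small
# unlabeled set B, train on each candidate, and select by error on labeled A.
# Only feasible for tiny n, which is exactly the point about exponential cost.
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_B = rng.normal(size=(8, 2))                    # unlabeled set B (n = 8)
X_A = rng.normal(size=(20, 2))                   # small labeled set A
y_A = (X_A[:, 0] + X_A[:, 1] > 0).astype(int)    # "true" task, known only via A

best_err, best_labels = np.inf, None
for labels in itertools.product([0, 1], repeat=len(X_B)):   # 2^n candidates
    if len(set(labels)) < 2:
        continue                                 # need both classes to fit
    clf = LogisticRegression().fit(X_B, labels)
    err = np.mean(clf.predict(X_A) != y_A)
    if err < best_err:
        best_err, best_labels = err, labels

print(f"searched {2 ** len(X_B)} labelings; best error on A = {best_err:.2f}")
print(f"selected labeling of B: {best_labels}")
```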
[432] Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes
David M. Bossens, Atsushi Nitanda
Main category: cs.LG
TL;DR: Mirror descent policy optimization for robust constrained MDPs achieves Õ(1/T^{1/3}) convergence with adversarial transition kernel optimization.
Details
Motivation: Safety is critical for RL systems, requiring policies that satisfy long-term constraints under epistemic uncertainty. Robust constrained MDPs provide guarantees but need efficient optimization methods.Method: Mirror descent policy optimization using policy gradient techniques to optimize both policy (maximizer) and adversarial transition kernel (minimizer) on the Lagrangian of constrained MDPs.
Result: Achieves Õ(1/T^{1/3}) convergence rate in sample-based robust constrained MDP setting. Also develops algorithm for approximate gradient descent in transition kernel space.
Conclusion: Proposed method shows benefits in constrained/unconstrained optimization and significant robustness improvements over baseline policy optimization algorithms.
Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes, making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained Markov decision process. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based robust constrained Markov decision process setting. The paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments in general Markov decision processes. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.
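For intuition, here is a hedged primal-dual sketch of the mirror-descent idea on a constrained bandit rather than a robust constrained MDP (no adversarial transition kernel, and not the paper's algorithm): the policy is updated by exponentiated gradient on the Lagrangian while the multiplier ascends on the constraint violation. Rewards, costs, and step sizes are made-up toy values.

```python
# Toy primal-dual mirror descent on a constrained bandit: the policy lives on
# the probability simplex and is updated multiplicatively (KL mirror map) on
# the Lagrangian, while the Lagrange multiplier ascends on constraint violation.
import numpy as np

rewards = np.array([1.0, 0.6, 0.2])      # expected reward per action
costs = np.array([0.9, 0.4, 0.1])        # expected cost per action
budget = 0.5                             # constraint: E[cost] <= budget

policy = np.ones(3) / 3
lam, eta_pi, eta_lam = 0.0, 0.2, 0.2
for _ in range(2000):
    grad = rewards - lam * costs                       # d Lagrangian / d policy
    policy = policy * np.exp(eta_pi * grad)            # exponentiated-gradient step
    policy /= policy.sum()
    lam = max(0.0, lam + eta_lam * (policy @ costs - budget))  # dual ascent

print(f"policy: {np.round(policy, 3)}  "
      f"expected cost: {policy @ costs:.3f}  expected reward: {policy @ rewards:.3f}")
```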
[433] Alternative Fairness and Accuracy Optimization in Criminal Justice
Shaolong Wu, James Blume, Geshi Yeung
Main category: cs.LG
TL;DR: The paper proposes a modified group fairness approach that minimizes weighted error loss while keeping false negative rate differences within tolerance, offering a practical framework for deploying algorithmic fairness in public decision systems.
Details
Motivation: Algorithmic fairness concepts remain unsettled in criminal justice contexts, with existing approaches facing conflicts between different fairness definitions and practical implementation challenges.Method: Develops a modified group fairness approach that minimizes weighted error loss while constraining differences in false negative rates within a small tolerance, rather than requiring exact parity across protected groups.
Result: The approach makes solutions easier to find, can improve predictive accuracy, surfaces ethical choices about error costs, and addresses critiques about biased data, latent affirmative action, and subgroup constraint explosion.
Conclusion: Proposes a practical deployment framework with three pillars: need-based decisions, transparency/accountability, and narrowly tailored definitions/solutions, linking technical design to legitimacy for public decision systems.
Abstract: Algorithmic fairness has grown rapidly as a research area, yet key concepts remain unsettled, especially in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we offer a practical framework for deployment in public decision systems built on three pillars: need-based decisions, transparency and accountability, and narrowly tailored definitions and solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.
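A small illustrative sketch of the relaxed group-fairness idea (not the paper's exact formulation): search per-group decision thresholds that minimize a weighted error loss while keeping the false-negative-rate gap within a tolerance. The synthetic scores, cost weights, and grid search are assumptions.

```python
# Pick per-group thresholds that minimize a weighted error (false negatives
# costed more heavily than false positives) subject to |FNR gap| <= tolerance.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)                      # protected attribute
y = rng.integers(0, 2, size=n)                          # true outcome
score = 0.6 * y + 0.1 * group + rng.normal(0, 0.3, n)   # synthetic risk score

def fnr(y_true, y_pred):
    pos = y_true == 1
    return np.mean(y_pred[pos] == 0) if pos.any() else 0.0

def weighted_error(y_true, y_pred, c_fn=2.0, c_fp=1.0):
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (c_fn * fn + c_fp * fp) / len(y_true)

tol, grid = 0.02, np.linspace(score.min(), score.max(), 60)
best = None
for t0 in grid:                                         # threshold for group 0
    for t1 in grid:                                     # threshold for group 1
        pred = np.where(group == 0, score >= t0, score >= t1).astype(int)
        gap = abs(fnr(y[group == 0], pred[group == 0]) -
                  fnr(y[group == 1], pred[group == 1]))
        if gap <= tol:
            loss = weighted_error(y, pred)
            if best is None or loss < best[0]:
                best = (loss, t0, t1, gap)

print(f"best weighted error {best[0]:.3f} at thresholds "
      f"({best[1]:.2f}, {best[2]:.2f}), FNR gap {best[3]:.3f}")
```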
[434] Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations
Anthony G. Chesebro, David Hofmann, Vaibhav Dixit, Earl K. Miller, Richard H. Granger, Alan Edelman, Christopher V. Rackauckas, Lilianne R. Mujica-Parodi, Helmut H. Strey
Main category: cs.LG
TL;DR: PEM-UDE method combines prediction-error method with universal differential equations to extract interpretable equations from chaotic systems, even with noisy data, and applies it to discover novel neural circuit equations.
Details
Motivation: Traditional methods fail to discover governing equations for complex chaotic systems, especially in neuroscience where biological constraints like network sparsity are important but not captured in existing models.Method: PEM-UDE combines prediction-error method with universal differential equations to smooth optimization landscapes and remove chaotic properties during fitting without distorting optimal parameters, enabling extraction of interpretable mathematical expressions from noisy chaotic data.
Result: Method successfully recovers hidden states in Rossler system and reconstructs dynamics from noise-corrupted electrical-circuit data (5x noise magnitude). Derives novel neural governing equations respecting biological constraints like sparsity, predicting relationship between connection density and oscillation frequency/synchrony. Validated on three intracranial electrode datasets.
Conclusion: Provides pathway to develop mechanistic, multi-scale brain models that generalize across neural architectures, bridging single-neuron dynamics with macroscale brain activity, outperforming traditional symbolic regression methods.
Abstract: Discovering governing equations that describe complex chaotic systems remains a fundamental challenge in physics and neuroscience. Here, we introduce the PEM-UDE method, which combines the prediction-error method with universal differential equations to extract interpretable mathematical expressions from chaotic dynamical systems, even with limited or noisy observations. This approach succeeds where traditional techniques fail by smoothing optimization landscapes and removing the chaotic properties during the fitting process without distorting optimal parameters. We demonstrate its efficacy by recovering hidden states in the Rossler system and reconstructing dynamics from noise-corrupted electrical-circuit data, in which the correct functional form of the dynamics is recovered even when one of the observed time series is corrupted by noise 5x the magnitude of the true signal. We demonstrate that this method can recover the correct dynamics, whereas direct symbolic regression methods, such as STLSQ, fail to do so with the available data and noise. Importantly, when applied to neural populations, our method derives novel governing equations that respect biological constraints such as network sparsity - a constraint necessary for cortical information processing yet not captured in next-generation neural mass models - while preserving microscale neuronal parameters. These equations predict an emergent relationship between connection density and both oscillation frequency and synchrony in neural circuits. We validate these predictions using three intracranial electrode recording datasets from the medial entorhinal cortex, prefrontal cortex, and orbitofrontal cortex. Our work provides a pathway to develop mechanistic, multi-scale brain models that generalize across diverse neural architectures, bridging the gap between single-neuron dynamics and macroscale brain activity.
[435] Learning Collective Variables for Enhanced Sampling from BioEmu with Time-Lagged Generation
Seonghyun Park, Kiyoung Seong, Soojung Yang, Rafael Gómez-Bombarelli, Sungsoo Ahn
Main category: cs.LG
TL;DR: BioEmu-CV learns collective variables automatically from BioEmu foundation model for enhanced molecular dynamics sampling of protein folding.
Details
Motivation: Enhanced sampling in molecular dynamics requires effective collective variables (CVs) to capture slow dynamics like protein folding, but identifying good CVs is difficult and time-consuming.Method: Repurpose BioEmu foundation model to learn time-lagged generation conditioned on learned CVs, forcing CVs to encode only slow, long-term information while filtering out fast fluctuations.
Result: Validated on fast-folding proteins for two applications: estimating free energy differences using on-the-fly probability enhanced sampling and sampling transition paths with steered molecular dynamics.
Conclusion: BioEmu-CV provides an automated framework for learning effective CVs from foundation models, enabling better enhanced sampling and serving as a new benchmark for machine learning CVs on larger proteins.
Abstract: Molecular dynamics is crucial for understanding molecular systems but its applicability is often limited by the vast timescales of rare events like protein folding. Enhanced sampling techniques overcome this by accelerating the simulation along key reaction pathways, which are defined by collective variables (CVs). However, identifying effective CVs that capture the slow, macroscopic dynamics of a system remains a major bottleneck. This work proposes a novel framework coined BioEmu-CV that learns these essential CVs automatically from BioEmu, a recently proposed foundation model for generating protein equilibrium samples. In particular, we re-purpose BioEmu to learn time-lagged generation conditioned on the learned CV, i.e., predict the distribution of molecular states after a certain amount of time. This training process promotes the CV to encode only the slow, long-term information while disregarding fast, random fluctuations. We validate our learned CV on fast-folding proteins with two key applications: (1) estimating free energy differences using on-the-fly probability enhanced sampling and (2) sampling transition paths with steered molecular dynamics. Our empirical study also serves as a new systematic and comprehensive benchmark for MLCVs on fast-folding proteins larger than Alanine Dipeptide.
[436] Online Continual Graph Learning
Giovanni Donghi, Luca Pasa, Daniele Zambon, Cesare Alippi, Nicolò Navarin
Main category: cs.LG
TL;DR: The paper introduces Online Continual Graph Learning (OCGL), a new setting for node-level continual learning on evolving graphs under strict memory/computational constraints, with a comprehensive benchmark and competitive baseline.
Details
Motivation: Real-world networks evolve over time and require timely online predictions, but existing continual graph learning methods violate online efficiency constraints by assuming access to entire graph snapshots or multiple passes.Method: Introduces OCGL setting formalizing node-level continual learning on evolving graphs under strict budgets, establishes benchmark with 7 datasets and 9 adapted CL strategies, and presents a minimalistic yet competitive baseline.
Result: Created comprehensive OCGL benchmark enabling standardized evaluation, and developed baseline achieving strong empirical performance with high efficiency.
Conclusion: OCGL addresses the gap in online continual learning for graph-structured data, providing formal framework, benchmark, and competitive baseline for timely, efficient learning on evolving networks.
Abstract: Continual Learning (CL) aims to incrementally acquire new knowledge while mitigating catastrophic forgetting. Within this setting, Online Continual Learning (OCL) focuses on updating models promptly and incrementally from single or small batches of observations from a data stream. Extending OCL to graph-structured data is crucial, as many real-world networks evolve over time and require timely, online predictions. However, existing continual or streaming graph learning methods typically assume access to entire graph snapshots or multiple passes over tasks, violating the efficiency constraints of the online setting. To address this gap, we introduce the Online Continual Graph Learning (OCGL) setting, which formalizes node-level continual learning on evolving graphs under strict memory and computational budgets. OCGL defines how a model incrementally processes a stream of node-level information while maintaining anytime inference and respecting resource constraints. We further establish a comprehensive benchmark comprising seven datasets and nine CL strategies, suitably adapted to the OCGL setting, enabling a standardized evaluation setup. Finally, we present a minimalistic yet competitive baseline for OCGL, inspired by our benchmarking results, that achieves strong empirical performance with high efficiency.
[437] EnviSAgE: A Survey of Environment Scaling for Qualitative Agentic Experience Collection
Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, Yi R. Fung
Main category: cs.LG
TL;DR: Survey paper proposing Generation-Execution-Feedback (GEF) loop framework for LLM-based agent training, advocating for environment scaling over static datasets, and reviewing methods for task generation, execution, and feedback.
Details
Motivation: Static datasets for LLM agent training are insufficient for developing adaptive behavior and long-term decision-making capabilities. They are costly, lack dynamism and realism, and rely on human-level knowledge. The paper argues for training agents through direct environment interaction using reinforcement learning.Method: Proposes the Generation-Execution-Feedback (GEF) loop framework where environments: 1) generate tasks to challenge agents, 2) return observations during task execution, and 3) provide evaluative feedback on rollouts. Systematically reviews environment scaling methods from an environment-centric perspective, organized along GEF loop stages.
Result: Presents a comprehensive survey organizing fragmented advances in environment scaling for LLM agents. Identifies key implementation frameworks, challenges, and applications. Establishes the GEF loop as a formal paradigm for iterative agent training through environment interaction.
Conclusion: Environments are essential producers of experiential data for LLM agent training. Scaling environments toward greater complexity, realism, and interactivity is crucial for advancing agent intelligence. The survey consolidates current methods and outlines future research directions for agent development through the GEF loop framework.
Abstract: LLM-based agents can autonomously accomplish complex tasks across various domains. However, to further cultivate capabilities such as adaptive behavior and long-term decision-making, training on static datasets built from human-level knowledge is insufficient. These datasets are costly to construct and lack both dynamism and realism. A growing consensus is that agents should instead interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents’ actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity. In this survey, we systematically review representative methods for environment scaling from a pioneering environment-centric perspective and organize them along the stages of the GEF loop, namely task generation, task execution, and feedback. We further analyze implementation frameworks, challenges, and applications, consolidating fragmented advances and outlining future research directions for agent intelligence.
[438] Trust Me, I Know This Function: Hijacking LLM Static Analysis using Bias
Shir Bernstein, David Beste, Daniel Ayzenshteyn, Lea Schonherr, Yisroel Mirsky
Main category: cs.LG
TL;DR: LLMs have an abstraction bias that causes them to overlook small bugs in familiar code patterns, enabling “Familiar Pattern Attacks” (FPAs) where minimal edits hijack LLM interpretation without affecting runtime behavior.
Details
Motivation: As LLMs are increasingly trusted for automated code review and static analysis, identifying vulnerabilities in their analysis capabilities is crucial for security and reliability. The paper aims to expose a critical blind spot where LLMs overgeneralize familiar programming patterns.Method: Developed a fully automated, black-box algorithm that discovers and injects FPAs into target code. Evaluated attacks across multiple model families (OpenAI, Anthropic, Google) and programming languages (Python, C, Rust, Go), including testing with robust system prompts warning about attacks.
Result: FPAs are effective against both basic and reasoning models, transferable across model families, universal across programming languages, and remain effective even when models are explicitly warned about the attack via robust system prompts.
Conclusion: The paper identifies a critical vulnerability in LLM-based code analysis and demonstrates its practical exploitation. It explores defensive uses of FPAs and discusses broader implications for the reliability and safety of code-oriented LLMs, highlighting the need for more robust analysis systems.
Abstract: Large Language Models (LLMs) are increasingly trusted to perform automated code review and static analysis at scale, supporting tasks such as vulnerability detection, summarization, and refactoring. In this paper, we identify and exploit a critical vulnerability in LLM-based code analysis: an abstraction bias that causes models to overgeneralize familiar programming patterns and overlook small, meaningful bugs. Adversaries can exploit this blind spot to hijack the control flow of the LLM’s interpretation with minimal edits and without affecting actual runtime behavior. We refer to this attack as a Familiar Pattern Attack (FPA). We develop a fully automated, black-box algorithm that discovers and injects FPAs into target code. Our evaluation shows that FPAs are not only effective against basic and reasoning models, but are also transferable across model families (OpenAI, Anthropic, Google), and universal across programming languages (Python, C, Rust, Go). Moreover, FPAs remain effective even when models are explicitly warned about the attack via robust system prompts. Finally, we explore positive, defensive uses of FPAs and discuss their broader implications for the reliability and safety of code-oriented LLMs.
[439] ModalSurv: Investigating opportunities and limitations of multimodal deep survival learning in prostate and bladder cancer
Noorul Wahab, Ethar Alzaid, Jiaqi Lv, Fayyaz Minhas, Adam Shephard, Shan E Ahmed Raza
Main category: cs.LG
TL;DR: ModalSurv is a multimodal deep survival framework for cancer prognosis that integrates clinical, MRI, histopathology, and RNA-seq data, achieving top performance on prostate cancer but showing limited generalization on external validation.
Details
Motivation: Accurate survival prediction is essential for personalized cancer treatment, but current approaches may not fully leverage the complementary information across different data modalities.Method: ModalSurv uses modality-specific projections and cross-attention fusion to integrate clinical, MRI, histopathology, and RNA-sequencing data for survival prediction.
Result: Achieved C-index of 0.7402 (1st place) for prostate cancer and 0.5740 (5th) for bladder cancer on CHIMERA Grand Challenge datasets. Clinical features alone outperformed multimodal models on external tests, revealing challenges with multimodal alignment and potential overfitting.
Conclusion: ModalSurv provides systematic evaluation of multimodal survival modeling, showing promise but also highlighting current limitations in scalability and generalization for cancer prognosis applications.
Abstract: Accurate survival prediction is essential for personalised cancer treatment. We propose ModalSurv, a multimodal deep survival framework integrating clinical, MRI, histopathology, and RNA-sequencing data via modality-specific projections and cross-attention fusion. On the CHIMERA Grand Challenge datasets, ModalSurv achieved a C-index of 0.7402 (1st) for prostate and 0.5740 (5th) for bladder cancer. Notably, clinical features alone outperformed multimodal models on external tests, highlighting challenges of limited multimodal alignment and potential overfitting. Local validation showed multimodal gains but limited generalisation. ModalSurv provides a systematic evaluation of multimodal survival modelling, underscoring both its promise and current limitations for scalable, generalisable cancer prognosis.
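A hedged sketch of the fusion pattern described in the abstract: modality-specific projections followed by cross-attention over the modality embeddings. Dimensions, pooling, and the risk head below are illustrative assumptions, not ModalSurv's actual architecture.

```python
# Modality-specific projections + cross-attention fusion + a scalar risk head.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dims, d_model=64, n_heads=4):
        super().__init__()
        # One projection per modality (e.g., clinical, MRI, histopathology, RNA-seq).
        self.projections = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.risk_head = nn.Linear(d_model, 1)           # scalar risk score

    def forward(self, modalities):
        # Stack projected modality embeddings as a short "token" sequence.
        tokens = torch.stack(
            [proj(x) for proj, x in zip(self.projections, modalities)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)      # modalities attend to each other
        return self.risk_head(fused.mean(dim=1))          # pooled risk prediction

model = CrossAttentionFusion(dims=[10, 128, 256, 512])
batch = [torch.randn(4, d) for d in [10, 128, 256, 512]]
print(model(batch).shape)   # torch.Size([4, 1])
```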
[440] AuON: A Linear-time Alternative to Orthogonal Momentum Updates
Dipan Maity
Main category: cs.LG
TL;DR: AuON is a linear-time optimizer that improves on orthogonal momentum gradient updates, addressing computational complexity and exploding attention logits issues in previous methods like Muon.
Details
Motivation: Vector-based optimizers like Adam have high memory costs and ill-conditioned momentum updates. Orthogonal momentum approaches (SVD/QR) are computationally expensive and underperform SGD. Recent methods like Muon improve efficiency but suffer from exploding attention logits and cubic complexity.Method: Proposes AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximate orthogonal matrices. Preserves structural alignment and reconditions ill-posed updates with an automatic “emergency brake” for exploding attention logits. Also introduces Hybrid-AuON that applies linear transformations with Newton-Schulz iterations.
Result: AuON achieves strong performance while being computationally efficient (linear-time). Hybrid-AuON outperforms Muon in language modeling tasks.
Conclusion: AuON provides an effective solution to the limitations of orthogonal momentum gradient updates, offering linear-time computation, handling of exploding attention logits, and improved performance over previous methods like Muon in language modeling tasks.
Abstract: Orthogonal momentum gradient updates have emerged to overcome the limitations of vector-based optimizers like Adam. The vector-based optimizer Adam suffers from high memory costs and ill-conditioned momentum gradient updates. However, traditional orthogonal momentum approaches, such as SVD/QR decomposition, suffer from high computational and memory costs and underperform compared to well-tuned SGD with momentum. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and approximating orthogonal matrices via Newton-Schulz iterations, which gives better GPU utilization, high achieved TFLOPS, and reduces memory usage by up to 3x. Nevertheless, vanilla Muon suffers from exploding attention logits and has cubic computation complexity. In this paper, we take a deep dive into orthogonal momentum gradient updates to find the main properties that help Muon achieve remarkable performance. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without approximate orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. AuON has an automatic “emergency brake” to handle exploding attention logits. We further introduce a hybrid variant, Hybrid-AuON, that applies the linear transformations with Newton-Schulz iterations, which outperforms Muon in language modeling tasks. Code is available at: https://github.com/ryyzn9/AuON
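As background for the orthogonalization step the abstract contrasts against, the sketch below applies a cubic Newton-Schulz iteration to a momentum matrix (Muon uses a tuned higher-order variant, and AuON's normalized nonlinear scaling is not reproduced here).

```python
# Cubic Newton-Schulz orthogonalization of a momentum matrix: iterates converge
# to the orthogonal polar factor U V^T of the input once its spectral norm is
# bounded by 1 (enforced here via Frobenius normalization).
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 15) -> torch.Tensor:
    """Approximately map G to its orthogonal polar factor (U V^T of its SVD)."""
    X = G / (G.norm() + 1e-12)           # Frobenius norm bounds the spectral norm by 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic Newton-Schulz iteration
    return X

momentum = torch.randn(256, 128)
update = newton_schulz_orthogonalize(momentum)
deviation = (update.T @ update - torch.eye(128)).abs().max()
print(f"max deviation from orthonormal columns: {deviation:.2e}")
```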
[441] Forking-Sequences
Willa Potosnak, Malcolm Wolff, Boris Oreshkin, Mengfei Cao, Michael W. Mahoney, Dmitry Efimov, Kin G. Olivares
Main category: cs.LG
TL;DR: Forking-sequences technique improves forecast stability across forecast creation dates by jointly encoding/decoding time series across all dates, reducing erratic revisions and improving training stability, forecast variance, and computational efficiency.
Details
Motivation: Current time series forecasting models focus heavily on accuracy but overlook forecast stability across different forecast creation dates. Even accurate models can produce erratic revisions between forecast dates, which undermines stakeholder trust and disrupts downstream decision-making processes.Method: The paper formalizes the forking-sequences approach used by models like MQCNN, MQT, and SPADE. Unlike standard methods that treat each forecast creation date independently, forking-sequences jointly encodes and decodes the entire time series across all forecast creation dates, mirroring time series cross-validation techniques.
Result: Experiments on 16 datasets from M1, M3, M4, and Tourism competitions show significant improvements: forecast percentage change stability improved by 28.8% for MLP, 28.8% for RNN, 37.9% for LSTM, 31.3% for CNN, and 8.8% on average for Transformer-based architectures. The method provides three key benefits: more stable gradient updates during training, reduced forecast variance through ensembling, and improved inference computational efficiency.
Conclusion: Forking-sequences is a highly effective but underutilized technique that should be more widely adopted in neural forecasting. It addresses the critical need for forecast stability across different forecast creation dates while maintaining accuracy, making models more reliable for practical applications.
Abstract: While accuracy is a critical requirement for time series forecasting models, an equally important (yet often overlooked) desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, undermining stakeholder trust and disrupting downstream decision-making. To improve forecast stability, models like MQCNN, MQT, and SPADE employ a little-known but highly effective technique: forking-sequences. Unlike standard statistical and neural forecasting methods that treat each FCD independently, the forking-sequences method jointly encodes and decodes the entire time series across all FCDs, in a way mirroring time series cross-validation. Since forking sequences remains largely unknown in the broader neural forecasting community, in this work, we formalize the forking-sequences approach, and we make a case for its broader adoption. We demonstrate three key benefits of forking-sequences: (i) more stable and consistent gradient updates during training; (ii) reduced forecast variance through ensembling; and (iii) improved inference computational efficiency. We validate forking-sequences’ benefits using 16 datasets from the M1, M3, M4, and Tourism competitions, showing improvements in forecast percentage change stability of 28.8%, 28.8%, 37.9%, and 31.3%, and 8.8%, on average, for MLP, RNN, LSTM, CNN, and Transformer-based architectures, respectively.
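To show the data layout the technique implies, the sketch below materializes one (history, horizon) pair per forecast creation date from a single series, mirroring rolling-origin cross-validation; the joint encoder/decoder architecture itself is not reproduced, and the toy series and window sizes are assumptions.

```python
# Build forking-sequence-style training targets: every forecast creation date
# (FCD) of a series contributes a (history, horizon) pair, so one pass over the
# series covers all FCDs jointly, as in time series cross-validation.
import numpy as np

def forking_windows(series, min_history, horizon):
    """Return (history, future) pairs for every valid FCD in a series."""
    histories, futures, fcds = [], [], []
    for fcd in range(min_history, len(series) - horizon + 1):
        histories.append(series[:fcd])             # everything observed up to the FCD
        futures.append(series[fcd:fcd + horizon])  # the horizon to be forecast
        fcds.append(fcd)
    return histories, np.stack(futures), fcds

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 6 * np.pi, 48)) + 0.1 * rng.normal(size=48)
histories, futures, fcds = forking_windows(series, min_history=24, horizon=6)
print(f"{len(fcds)} forecast creation dates, each with a {futures.shape[1]}-step target")
```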
[442] Developing Distance-Aware, and Evident Uncertainty Quantification in Dynamic Physics-Constrained Neural Networks for Robust Bearing Degradation Estimation
Waleed Razzaq, Yun-Bo Zhao
Main category: cs.LG
TL;DR: Two distance-aware uncertainty methods (PG-SNGP and PG-SNER) for physics-guided neural networks improve degradation prediction accuracy and uncertainty calibration for bearing health monitoring, outperforming Monte Carlo and Deep Ensemble methods.
Details
Motivation: Existing uncertainty methods for predictive maintenance lack confidence calibration, are computationally expensive, not distance-aware, and fail to generalize under out-of-distribution data, which is critical for safety-critical systems like rotating machinery with rolling-element bearings.Method: Two distance-aware uncertainty methods: PG-SNGP (Spectral Normalization Gaussian Process) replaces final dense layer with Gaussian Process layer; PG-SNER (Deep Evidential Regression) outputs Normal Inverse Gamma parameters. Both apply spectral normalization to preserve input-to-latent distances, use dynamic weighting to balance data fidelity and physical consistency, and evaluate with new distance-aware Pearson Correlation Coefficient metric.
Result: PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise when tested on PRONOSTIA, XJTU-SY and HUST bearing datasets, outperforming Monte Carlo and Deep Ensemble PGNNs.
Conclusion: The proposed distance-aware uncertainty methods provide accurate, calibrated uncertainty estimation for bearing degradation prediction, addressing key limitations of existing approaches and enhancing reliability for safety-critical predictive maintenance applications.
Abstract: Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA, XJTU-SY and HUST datasets and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
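A minimal sketch of the distance-preserving ingredient both proposed heads share: spectral normalization on the hidden layers. The Gaussian-process output layer of PG-SNGP and the Normal-Inverse-Gamma head of PG-SNER are replaced by a plain linear placeholder, and all sizes are illustrative assumptions.

```python
# Hidden layers wrapped with spectral normalization so the feature extractor
# approximately preserves input distances; the output head is a placeholder.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class SpectralNormMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.body = nn.Sequential(
            spectral_norm(nn.Linear(in_dim, hidden_dim)), nn.ReLU(),
            spectral_norm(nn.Linear(hidden_dim, hidden_dim)), nn.ReLU(),
        )
        # Placeholder head; PG-SNGP would use a Gaussian-process layer here,
        # PG-SNER a head emitting Normal-Inverse-Gamma parameters.
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.head(self.body(x))

model = SpectralNormMLP(in_dim=16, hidden_dim=64, out_dim=1)
print(model(torch.randn(4, 16)).shape)   # torch.Size([4, 1])
```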
[443] UAMDP: Uncertainty-Aware Markov Decision Process for Risk-Constrained Reinforcement Learning from Probabilistic Forecasts
Michal Koren, Or Peretz, Tai Dinh, Philip S. Yu
Main category: cs.LG
TL;DR: UAMDP framework combines Bayesian forecasting, Thompson sampling, and CVaR-constrained planning for risk-aware sequential decision-making in volatile environments.
Details
Motivation: Sequential decisions in volatile, high-stakes settings require principled uncertainty management beyond just maximizing expected return. There's a need for robust approaches that handle structural uncertainty and economic volatility.Method: Uncertainty-Aware Markov Decision Process (UAMDP) framework that couples Bayesian forecasting, posterior-sampling reinforcement learning (Thompson sampling), and planning under conditional value-at-risk (CVaR) constraints in a closed loop.
Result: Established regret bounds converging to Bayes-optimal benchmark. In high-frequency trading and retail inventory control, UAMDP improved forecasting accuracy (RMSE down by up to 25%, sMAPE by 32%), increased Sharpe ratio from 1.54 to 1.74, and halved maximum drawdown.
Conclusion: Integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making in volatile environments.
Abstract: Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains including high-frequency equity trading and retail inventory control, both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25% and sMAPE by 32%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
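Two ingredients of the framework can be illustrated compactly: a Thompson-sampling draw from a posterior over latent dynamics, and an empirical CVaR estimate used to screen candidate actions against a preset risk tolerance. The distributions and the toy "exposure" decisions below are assumptions, not the paper's trading or inventory environments.

```python
# Thompson sampling + empirical CVaR screening (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(0)

def cvar(returns, alpha=0.05):
    """Expected return over the worst alpha-fraction of outcomes."""
    cutoff = np.quantile(returns, alpha)
    return returns[returns <= cutoff].mean()

# Posterior over a latent drift parameter (e.g., from Bayesian forecasting).
posterior_mean, posterior_std = 0.02, 0.01
sampled_drift = rng.normal(posterior_mean, posterior_std)   # Thompson sample

# Evaluate candidate exposures under the sampled dynamics; keep only those
# whose CVaR clears the preset risk tolerance.
risk_tolerance = -0.05
for exposure in [0.5, 1.0, 2.0]:
    simulated_returns = exposure * rng.normal(sampled_drift, 0.05, size=10_000)
    c = cvar(simulated_returns)
    verdict = "accept" if c >= risk_tolerance else "reject"
    print(f"exposure {exposure:.1f}: CVaR_5% = {c:+.3f} -> {verdict}")
```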
[444] nanoTabPFN: A Lightweight and Educational Reimplementation of TabPFN
Alexander Pfefferle, Johannes Hog, Lennart Purucker, Frank Hutter
Main category: cs.LG
TL;DR: nanoTabPFN is a simplified, lightweight implementation of TabPFN v2 that makes tabular foundation models accessible with minimal computational resources, achieving comparable performance to traditional ML baselines in one minute of pre-training on a single GPU.
Details
Motivation: Existing tabular foundation models like TabPFN are implemented in complex pipelines with over 10,000 lines of code, lacking documentation and code quality, making them hard to understand, not beginner-friendly, and difficult to adapt for new experiments.Method: Introduces nanoTabPFN, a simplified implementation of TabPFN v2 architecture with a corresponding training loop that uses pre-generated training data, designed to be lightweight and accessible.
Result: In small data settings, nanoTabPFN achieves performance comparable to traditional machine learning baselines within one minute of pre-training on a single GPU (160,000x faster than TabPFN v2 pretraining).
Conclusion: nanoTabPFN makes tabular foundation models more accessible to students and researchers by eliminating the requirement of large computational resources, enabling educational use and experimentation.
Abstract: Tabular foundation models such as TabPFN have revolutionized predictive machine learning for tabular data. At the same time, the driving factors of this revolution are hard to understand. Existing open-source tabular foundation models are implemented in complicated pipelines boasting over 10,000 lines of code, lack architecture documentation or code quality. In short, the implementations are hard to understand, not beginner-friendly, and complicated to adapt for new experiments. We introduce nanoTabPFN, a simplified and lightweight implementation of the TabPFN v2 architecture and a corresponding training loop that uses pre-generated training data. nanoTabPFN makes tabular foundation models more accessible to students and researchers alike. For example, restricted to a small data setting it achieves a performance comparable to traditional machine learning baselines within one minute of pre-training on a single GPU (160,000x faster than TabPFN v2 pretraining). This eliminated requirement of large computational resources makes pre-training tabular foundation models accessible for educational purposes. Our code is available at https://github.com/automl/nanoTabPFN.
[445] Forgetting is Everywhere
Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey
Main category: cs.LG
TL;DR: The paper proposes a unified theory of forgetting in learning algorithms, defining it as loss of predictive information due to lack of self-consistency, and shows Bayesian learners can adapt without forgetting.
Details
Motivation: A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of forgetting.Method: Propose an algorithm- and task-agnostic theory that characterizes forgetting as a lack of self-consistency in a learner’s predictive distribution over future experiences, manifesting as a loss of predictive information. This theory yields a general measure of an algorithm’s propensity to forget.
Result: The theory shows that Bayesian learners are capable of adapting without forgetting. Comprehensive experiments across classification, regression, generative modeling, and reinforcement learning empirically demonstrate that forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency.
Conclusion: The results establish a principled understanding of forgetting and lay the foundation for analyzing and improving the information retention capabilities of general learning algorithms.
Abstract: A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner’s predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm’s propensity to forget and shows that Bayesian learners are capable of adapting without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.
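In the spirit of the paper's definition (though not its exact formalism), one can measure how much a learner's predictive distribution on old-task probes drifts after an update on new data. The sketch below uses an incrementally updated linear classifier and a KL-based score; the model, data, and score are illustrative assumptions, and the "log_loss" option assumes a recent scikit-learn.

```python
# Proxy for forgetting: divergence between predictive distributions on held-out
# probes before and after an incremental update on data from a new task.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old = rng.normal(loc=-1.0, size=(500, 2)); y_old = (X_old[:, 0] > -1.0).astype(int)
X_new = rng.normal(loc=+2.0, size=(500, 2)); y_new = (X_new[:, 1] > 2.0).astype(int)
X_probe = rng.normal(loc=-1.0, size=(200, 2))     # held-out probes from the old task

model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X_old, y_old, classes=[0, 1])
p_before = model.predict_proba(X_probe)

model.partial_fit(X_new, y_new)                   # incremental update on new data
p_after = model.predict_proba(X_probe)

kl = np.mean(np.sum(p_before * np.log((p_before + 1e-12) / (p_after + 1e-12)), axis=1))
print(f"mean KL(before || after) on old-task probes: {kl:.3f}")
```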
[446] Error Estimate and Convergence Analysis for Data Valuation
Zhangyong Liang, Huanhuan Gao, Ji Zhang
Main category: cs.LG
TL;DR: NDDV enables data valuation in a single training process with proven error bounds and convergence guarantees.
Details
Motivation: Existing data valuation methods cannot ensure validity in a single training process, creating a need for methods that provide theoretical guarantees for data importance quantification.Method: Neural Dynamic Data Valuation (NDDV) with error estimation and convergence analysis under Lipschitz and smoothness assumptions, deriving quadratic error bounds and proving asymptotic vanishing of gradient norms.
Result: Quadratic error bounds scale inversely with time steps and quadratically with control variations, ensuring stability; expected squared gradient norm vanishes asymptotically; meta loss converges sublinearly over iterations.
Conclusion: NDDV achieves sublinear convergence and provides the first theoretical guarantees for data valuation with error bounds and convergence analysis in a single training process.
Abstract: Data valuation quantifies data importance, but existing methods cannot ensure validity in a single training process. The neural dynamic data valuation (NDDV) method [3] addresses this limitation. Based on NDDV, we are the first to explore error estimation and convergence analysis in data valuation. Under Lipschitz and smoothness assumptions, we derive quadratic error bounds for loss differences that scale inversely with time steps and quadratically with control variations, ensuring stability. We also prove that the expected squared gradient norm for the training loss vanishes asymptotically, and that the meta loss converges sublinearly over iterations. In particular, NDDV achieves sublinear convergence.
[447] Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling
Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He
Main category: cs.LG
TL;DR: HSO is a training-free bi-level optimization framework that accelerates diffusion models by finding optimal timestep schedules in under 8 seconds, achieving state-of-the-art FID scores with just 5 function evaluations.
Details
Motivation: Diffusion models have excellent generative quality but suffer from slow iterative sampling. Existing schedule optimization methods fail to simultaneously satisfy four key principles: effectiveness, adaptivity, practical robustness, and computational efficiency.Method: HSO uses a bi-level optimization framework with two levels: upper-level global search for optimal initialization strategy and lower-level local optimization for schedule refinement. Key innovations include Midpoint Error Proxy (MEP) for stable local optimization and Spacing-Penalized Fitness (SPF) function for robustness against pathologically close timesteps.
Result: HSO achieves state-of-the-art performance in low-NFE regime: with just 5 function evaluations, it achieves FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. The optimization cost is less than 8 seconds, making it highly practical.
Conclusion: HSO presents an efficient, training-free paradigm for diffusion model acceleration that simultaneously satisfies all four core principles of schedule optimization, offering a practical solution for real-world deployment of diffusion models.
Abstract: Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
[448] SEED: Spectral Entropy-Guided Evaluation of SpatialTemporal Dependencies for Multivariate Time Series Forecasting
Feng Xiong, Zongxia Xie, Yanru Sun, Haoyu Wang, Jianhong Lin
Main category: cs.LG
TL;DR: SEED is a spectral entropy-guided framework for multivariate time series forecasting that addresses limitations in existing attention/graph methods by dynamically evaluating spatial-temporal dependencies, preserving negative correlations, and enhancing temporal position awareness.
Details
Motivation: Existing attention- or graph-based methods for multivariate time series forecasting face three key issues: (1) strong temporal self-dependencies are disrupted by irrelevant variables, (2) softmax normalization ignores/reverses negative correlations, and (3) variables struggle to perceive their temporal positions.Method: SEED introduces four key components: (1) Dependency Evaluator using spectral entropy to dynamically assess spatial-temporal dependencies and balance Channel Independence vs Channel Dependence strategies; (2) Spectral Entropy-based Fuser to refine dependency weights by separating temporal regularities from external influences; (3) Signed Graph Constructor with signed edge weights to preserve negative correlations; (4) Context Spatial Extractor using local contextual windows to help variables perceive temporal positions and extract spatial features.
Result: Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
Conclusion: SEED provides an effective spectral entropy-guided framework for multivariate time series forecasting that addresses key limitations in existing methods through dynamic dependency evaluation, negative correlation preservation, and enhanced temporal position awareness, achieving superior performance across diverse datasets.
Abstract: Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose SEED, a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.
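As a rough illustration of the spectral-entropy signal that drives the Dependency Evaluator, the sketch below computes a normalized spectral entropy per variable and treats it as a hypothetical gate between channel-independent and channel-dependent modeling; the gating rule is an assumption for illustration, not SEED's actual mechanism.

```python
import numpy as np

def spectral_entropy(x, eps=1e-12):
    """Normalized spectral entropy of a 1-D series: low values indicate strong
    periodic regularity, high values indicate noise-like behaviour."""
    spectrum = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    p = spectrum / (spectrum.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return entropy / np.log(len(p))        # scale to [0, 1]

# Hypothetical gate: regular variables lean on their own history (CI-leaning),
# noisy variables borrow more information from other channels (CD-leaning).
def channel_dependence_weight(x):
    return spectral_entropy(x)

t = np.arange(512)
print(channel_dependence_weight(np.sin(0.1 * t)))      # low: CI-leaning
print(channel_dependence_weight(
    np.random.default_rng(0).standard_normal(512)))    # high: CD-leaning
```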
[449] Tuning for Two Adversaries: Enhancing the Robustness Against Transfer and Query-Based Attacks using Hyperparameter Tuning
Pascal Zimmer, Ghassan Karame
Main category: cs.LG
TL;DR: Training hyperparameters have opposite effects on robustness against transfer-based vs query-based attacks: lower learning rate helps against transfer attacks, higher learning rate helps against query attacks.
Details
Motivation: To understand how training hyperparameters (learning rate, weight decay, momentum, batch size) influence model robustness against different types of adversarial attacks in practical deployment settings.Method: Theoretical analysis and experimental study across various practical settings including centralized training, ensemble learning, and distributed training. Explores hyperparameter space to jointly enhance robustness against both attack types.
Result: Found striking dichotomy: decreasing learning rate enhances robustness against transfer-based attacks by up to 64%, while increasing learning rate improves robustness against query-based attacks by up to 28%. Distributed models benefit most from hyperparameter tuning, achieving best tradeoff against both attack types.
Conclusion: Hyperparameter tuning can significantly impact model robustness, with opposite effects for different attack types. Distributed training setups offer the best opportunity for achieving balanced robustness against both transfer-based and query-based attacks through careful hyperparameter optimization.
Abstract: In this paper, we present the first detailed analysis of how training hyperparameters – such as learning rate, weight decay, momentum, and batch size – influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to 64%. In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to 28% across various settings and data distributions. Leveraging these findings, we explore – for the first time – the training hyperparameter space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.
[450] Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect
Yuwen Zhang, Viet Tran, Paul Weng
Main category: cs.LG
TL;DR: The paper proposes two tools (Intervention Efficiency and Perturbation Validation Framework) to address the Rashomon Effect in clinical ML, where multiple models have comparable performance but differ in clinical utility and robustness.
Details
Motivation: Clinical ML faces the Rashomon Effect - multiple models with similar performance metrics but different clinical utility. Small, imbalanced, noisy datasets with high-dimensional features make conventional validation unreliable. Resource constraints and operational priorities aren't captured by metrics like F1 score.Method: Two complementary tools: 1) Intervention Efficiency (IE) - a capacity-aware metric quantifying how efficiently a model identifies actionable true positives when interventions are limited, linking prediction to clinical utility. 2) Perturbation Validation Framework (PVF) - assesses model stability under data perturbations to identify models with invariant performance across noisy/shifted validation sets.
Result: Empirical results on synthetic and real-world healthcare datasets show these tools help select models that generalize more robustly and align with capacity constraints, offering a new approach to tackle the Rashomon Effect in clinical settings.
Conclusion: The proposed tools provide a practical framework for robust model assessment and selection in clinical ML, addressing the fundamental challenge of model multiplicity when conventional metrics fail to capture clinical utility and operational constraints.
Abstract: In clinical machine learning, the coexistence of multiple models with comparable performance – a manifestation of the Rashomon Effect – poses fundamental challenges for trustworthy deployment and evaluation. Small, imbalanced, and noisy datasets, coupled with high-dimensional and weakly identified clinical features, amplify this multiplicity and make conventional validation schemes unreliable. As a result, selecting among equally performing models becomes uncertain, particularly when resource constraints and operational priorities are not considered by conventional metrics like F1 score. To address these issues, we propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, thereby linking predictive performance with clinical utility. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets. Empirical results on synthetic and real-world healthcare datasets show that using these tools facilitates the selection of models that generalize more robustly and align with capacity constraints, offering a new direction for tackling the Rashomon Effect in clinical settings.
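A minimal sketch of the two ideas under assumed definitions (the paper's exact formulations may differ): an intervention-efficiency score restricted to a fixed intervention budget, and a crude stability probe that re-scores the metric under small perturbations as a loose analogue of PVF, which perturbs validation data rather than scores.

```python
import numpy as np

def intervention_efficiency(y_true, y_score, capacity):
    """Hypothetical capacity-aware metric: of the `capacity` cases a clinic can
    act on (the highest-risk predictions), what fraction are true positives?"""
    order = np.argsort(-np.asarray(y_score))
    selected = np.asarray(y_true)[order[:capacity]]
    return selected.mean()

def perturbation_stability(y_score, y_true, capacity, rng, n_trials=50, noise=0.05):
    """Crude stability probe: re-evaluate the metric under perturbations and
    report the spread; a tighter spread suggests a more robust model."""
    vals = [intervention_efficiency(y_true,
                                    y_score + rng.normal(0, noise, len(y_score)),
                                    capacity)
            for _ in range(n_trials)]
    return np.mean(vals), np.std(vals)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
scores = y * 0.6 + rng.random(200) * 0.4
print(intervention_efficiency(y, scores, capacity=20))
print(perturbation_stability(scores, y, capacity=20, rng=rng))
```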
[451] OceanForecastBench: A Benchmark Dataset for Data-Driven Global Ocean Forecasting
Haoming Jia, Yi Han, Xiang Wang, Huizan Wang, Wei Wu, Jianming Zheng, Peikun Xiao
Main category: cs.LG
TL;DR: OceanForecastBench is an open-source benchmarking framework for data-driven ocean forecasting models, providing standardized training data, evaluation observations, and baseline models to address the lack of consistent benchmarks in the field.
Details
Motivation: The absence of open-source, standardized benchmarks in data-driven ocean forecasting has led to inconsistent data usage and evaluation methods, hindering model development, fair performance comparison, and interdisciplinary collaboration.Method: Proposes OceanForecastBench with three core components: 1) 28-year global ocean reanalysis data with 4 ocean variables across 23 depth levels and 4 sea surface variables for training, 2) high-reliability satellite and in-situ observations covering ~100 million locations for evaluation, and 3) an evaluation pipeline with 6 baseline models for comprehensive benchmarking.
Result: OceanForecastBench represents the most comprehensive benchmarking framework currently available for data-driven ocean forecasting, providing an open-source platform for model development, evaluation, and comparison.
Conclusion: The benchmark addresses critical gaps in the field by standardizing data usage and evaluation methods, enabling more efficient model development, fair performance comparison, and enhanced interdisciplinary collaboration in ocean forecasting research.
Abstract: Global ocean forecasting aims to predict key ocean variables such as temperature, salinity, and currents, which is essential for understanding and describing oceanic phenomena. In recent years, data-driven deep learning-based ocean forecast models, such as XiHe, WenHai, LangYa and AI-GOMS, have demonstrated significant potential in capturing complex ocean dynamics and improving forecasting efficiency. Despite these advancements, the absence of open-source, standardized benchmarks has led to inconsistent data usage and evaluation methods. This gap hinders efficient model development, impedes fair performance comparison, and constrains interdisciplinary collaboration. To address this challenge, we propose OceanForecastBench, a benchmark offering three core contributions: (1) High-quality global ocean reanalysis data over 28 years for model training, including 4 ocean variables across 23 depth levels and 4 sea surface variables. (2) High-reliability satellite and in-situ observations for model evaluation, covering approximately 100 million locations in the global ocean. (3) An evaluation pipeline and a comprehensive benchmark with 6 typical baseline models, leveraging observations to evaluate model performance from multiple perspectives. OceanForecastBench represents the most comprehensive benchmarking framework currently available for data-driven ocean forecasting, offering an open-source platform for model development, evaluation, and comparison. The dataset and code are publicly available at: https://github.com/Ocean-Intelligent-Forecasting/OceanForecastBench.
[452] Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation
Jiacheng Li, Songhe Feng
Main category: cs.LG
TL;DR: BriMPR is a multimodal test-time adaptation framework that addresses distribution shifts across modalities via progressive re-alignment using prompt tuning and contrastive learning.
Details
Motivation: Multimodal TTA faces challenges due to varying distribution shifts across modalities, creating complex coupling effects between unimodal feature shifts and cross-modal semantic misalignment that existing TTA methods cannot handle.Method: BriMPR uses a divide-and-conquer strategy with two progressive modules: 1) Prompt tuning to calibrate unimodal global feature distributions to source distributions for initial semantic re-alignment, and 2) Inter-modal instance-wise contrastive learning with pseudo-labels on masked/complete modality combinations to enhance information interaction and refine alignment.
Result: Extensive experiments on both corruption-based and real-world domain shift benchmarks demonstrate the superiority of the proposed method over existing approaches.
Conclusion: BriMPR effectively addresses the multimodal TTA challenge by progressively re-aligning modalities through feature calibration and contrastive learning, providing a robust solution for adapting models to multimodal distribution shifts.
Abstract: Test-time adaptation (TTA) enables online model adaptation using only unlabeled test data, aiming to bridge the gap between source and target distributions. However, in multimodal scenarios, varying degrees of distribution shift across different modalities give rise to a complex coupling effect of unimodal shallow feature shift and cross-modal high-level semantic misalignment, posing a major obstacle to extending existing TTA methods to the multimodal field. To address this challenge, we propose a novel multimodal test-time adaptation (MMTTA) framework, termed as Bridging Modalities via Progressive Re-alignment (BriMPR). BriMPR, consisting of two progressively enhanced modules, tackles the coupling effect with a divide-and-conquer strategy. Specifically, we first decompose MMTTA into multiple unimodal feature alignment sub-problems. By leveraging the strong function approximation ability of prompt tuning, we calibrate the unimodal global feature distributions to their respective source distributions, so as to achieve the initial semantic re-alignment across modalities. Subsequently, we assign the credible pseudo-labels to combinations of masked and complete modalities, and introduce inter-modal instance-wise contrastive learning to further enhance the information interaction among modalities and refine the alignment. Extensive experiments on MMTTA tasks, including both corruption-based and real-world domain shift benchmarks, demonstrate the superiority of our method. Our source code is available at https://github.com/Luchicken/BriMPR.
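The inter-modal instance-wise contrastive step can be pictured as a standard InfoNCE objective between masked-modality and complete-modality embeddings of the same test instances; the sketch below is that generic loss, not BriMPR's exact formulation, and the embeddings are random stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce(masked_emb, complete_emb, temperature=0.07):
    """Instance-wise contrastive loss: embeddings of the masked-modality and
    complete-modality views of the same instance are positives (the diagonal),
    all other pairs in the batch are negatives."""
    a = F.normalize(masked_emb, dim=-1)
    b = F.normalize(complete_emb, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

x = torch.randn(8, 128)                       # complete-modality embeddings
print(info_nce(x + 0.1 * torch.randn_like(x), x).item())
```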
[453] Masked Diffusion for Generative Recommendation
Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins
Main category: cs.LG
TL;DR: Proposes masked diffusion for generative recommendation with semantic IDs, outperforming autoregressive models with parallel decoding and better data efficiency.
Details
Motivation: Autoregressive generative recommendation models with semantic IDs suffer from expensive sequential inference, inefficient training data use, and bias toward short-context relationships. Inspired by NLP breakthroughs, the authors aim to overcome these limitations.Method: Uses masked diffusion modeling instead of autoregressive modeling. Employs discrete masking noise to learn sequence distributions, enabling parallel decoding of masked tokens as conditionally independent given unmasked tokens.
Result: The proposed method consistently outperforms autoregressive modeling, especially in data-constrained settings and coarse-grained recall. Allows flexible parallel prediction of multiple semantic IDs while maintaining superior performance.
Conclusion: Masked diffusion provides a more effective alternative to autoregressive modeling for generative recommendation with semantic IDs, offering better performance, parallel decoding capabilities, and improved data efficiency.
Abstract: Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing GR with SIDs works frame the probability of a sequence of SIDs corresponding to a user’s interaction history using autoregressive modeling. While this has led to impressive next item prediction performances in certain settings, these autoregressive GR with SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data and bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user’s sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.
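A toy training step in the spirit of masked diffusion over semantic-ID sequences: mask a random fraction of tokens at a sampled noise level, then predict all masked positions in parallel as conditionally independent given the unmasked tokens. The model, vocabulary, and mask token below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_step(model, sid_seq, mask_id):
    """One toy step: corrupt a random fraction of SID tokens with a mask token
    and compute cross-entropy only on the masked positions."""
    ratio = torch.rand(())                          # sampled masking level
    mask = torch.rand(sid_seq.shape) < ratio
    if not mask.any():
        mask[..., 0] = True                         # keep at least one target
    corrupted = sid_seq.masked_fill(mask, mask_id)
    logits = model(corrupted)                       # (batch, seq, vocab)
    return F.cross_entropy(logits[mask], sid_seq[mask])

vocab, mask_id = 256, 255
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
seq = torch.randint(0, mask_id, (4, 16))            # toy SID sequences
print(masked_diffusion_step(model, seq, mask_id).item())
```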
[454] Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation
Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang
Main category: cs.LG
TL;DR: Multimodal traffic accident prediction using road network graphs and satellite images improves accuracy by 3.7% over graph-only models and enables causal analysis of contributing factors.
Details
Motivation: Previous accident prediction models rely only on road network structural features, missing important physical and environmental information from road surfaces and surroundings that could improve prediction accuracy.Method: Constructed large multimodal dataset with 9M accident records, 1M satellite images aligned to road nodes, plus weather statistics, road types, and traffic volume. Evaluated multimodal learning methods integrating visual and network embeddings.
Result: Multimodal integration achieved 90.1% AUROC, 3.7% gain over graph-only models. Causal analysis showed accident rates increase by 24% with higher precipitation, 22% on higher-speed roads, and 29% due to seasonal patterns.
Conclusion: Integrating satellite imagery with road network data significantly improves accident prediction accuracy and enables identification of key contributing factors through causal analysis, with satellite features being essential for accurate predictions.
Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region’s weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, which is a 3.7% gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
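An illustrative late fusion of the two embedding types for node-level accident prediction, with synthetic stand-ins for the GNN and satellite-image features; the actual models, data, and matching-based causal estimator are far richer than this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-ins for per-node embeddings from a graph encoder and an image encoder.
rng = np.random.default_rng(0)
n = 2000
graph_emb = rng.normal(size=(n, 32))
image_emb = rng.normal(size=(n, 64))
accident = (graph_emb[:, 0] + image_emb[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Late fusion by concatenation, then a simple node-level classifier.
fused = np.concatenate([graph_emb, image_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(fused, accident, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```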
[455] DCFO: Density-Based Counterfactuals for Outliers - Additional Material
Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent
Main category: cs.LG
TL;DR: DCFO is a novel method that generates counterfactual explanations specifically for Local Outlier Factor (LOF) outlier detection, outperforming existing methods on proximity and validity metrics across 50 datasets.
Details
Motivation: LOF is widely used but lacks interpretability, while existing counterfactual explanation methods don't address the unique challenges of outlier detection or target classical algorithms like LOF.Method: DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimization to generate counterfactual explanations for LOF outliers.
Result: Extensive experiments on 50 OpenML datasets show DCFO consistently outperforms benchmark competitors, offering superior proximity and validity of generated counterfactuals.
Conclusion: DCFO successfully addresses the interpretability gap in LOF by providing effective counterfactual explanations tailored to the unique characteristics of outlier detection.
Abstract: Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF’s widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.
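DCFO itself performs gradient-based optimisation over regions where LOF is smooth; as a hedged stand-in for the overall goal, the sketch below finds a counterfactual for a LOF outlier by interpolating it toward the data until the detector flips its decision, which illustrates the "minimal change that alters the prediction" idea without the paper's partitioning machinery.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_counterfactual(x, X_train, steps=50):
    """Naive counterfactual for a LOF outlier: slide the point toward its nearest
    training neighbour until LOF no longer flags it as an outlier."""
    lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
    target = X_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]
    for alpha in np.linspace(0.0, 1.0, steps):
        candidate = (1 - alpha) * x + alpha * target
        if lof.predict(candidate[None])[0] == 1:   # 1 = inlier
            return candidate, alpha
    return target, 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
cf, alpha = lof_counterfactual(np.array([6.0, 6.0]), X)
print("moved fraction:", alpha, "counterfactual:", cf)
```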
[456] Cornserve: Efficiently Serving Any-to-Any Multimodal Models
Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury
Main category: cs.LG
TL;DR: Cornserve is an efficient online serving system for Any-to-Any multimodal models that automatically optimizes deployment and execution, achieving significant throughput and latency improvements.
Details
Motivation: Any-to-Any models introduce serving challenges due to heterogeneity in request types, computation paths, and scaling requirements, requiring specialized serving systems beyond traditional approaches.Method: Cornserve allows developers to describe computation graphs of Any-to-Any models, then automatically plans optimized deployments (including disaggregation decisions) and provides a distributed runtime for efficient execution.
Result: Cornserve delivers up to 3.81× throughput improvement and up to 5.79× tail latency reduction compared to existing serving solutions for diverse Any-to-Any models and workloads.
Conclusion: Cornserve effectively addresses the serving challenges of Any-to-Any multimodal models through automated planning and optimized distributed execution, significantly improving serving efficiency.
Abstract: We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve’s planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve’s distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ throughput improvement and up to 5.79$\times$ tail latency reduction over existing solutions.
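One way to picture the computation-graph abstraction a developer would hand to the planner; the field names and structure here are hypothetical, not Cornserve's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str          # e.g. "image_encoder", "llm", "diffusion_decoder"
    kind: str          # "encoder" | "autoregressive" | "generator"
    gpus: int = 1      # resources if the planner disaggregates this component

@dataclass
class AnyToAnyGraph:
    components: dict[str, Component] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)   # data flow

    def add(self, comp: Component):
        self.components[comp.name] = comp

    def connect(self, src: str, dst: str):
        self.edges.append((src, dst))

g = AnyToAnyGraph()
g.add(Component("image_encoder", "encoder"))
g.add(Component("llm", "autoregressive", gpus=4))
g.add(Component("dit", "generator", gpus=2))
g.connect("image_encoder", "llm")
g.connect("llm", "dit")
print(g.edges)
```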
[457] Synthetic Electrogram Generation with Variational Autoencoders for ECGI
Miriam Gutiérrez-Fernández, Karen López-Linares, Carlos Fambuena-Santos, María S. Guillem, Andreu M. Climent, Óscar Barquero-Pérez
Main category: cs.LG
TL;DR: VAE-based generative models create synthetic atrial electrograms to address data scarcity in deep learning-based electrocardiographic imaging for atrial fibrillation assessment.
Details
Motivation: Atrial fibrillation assessment requires accurate characterization of atrial electrical activity, but progress in deep learning-based ECGI is hindered by limited availability of paired body surface potential and intracardiac electrogram datasets.Method: Proposed two variational autoencoder models: VAE-S (sinus rhythm-specific) and VAE-C (class-conditioned for both sinus rhythm and AF signals). Generated synthetic multichannel atrial EGMs evaluated using morphological, spectral, and distributional similarity metrics.
Result: VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. Generated EGMs used for data augmentation moderately improves performance in downstream noninvasive EGM reconstruction tasks.
Conclusion: VAE-based generative modeling shows potential to alleviate data scarcity and enhance deep learning-based ECGI pipelines for atrial fibrillation assessment.
Abstract: Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.
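A minimal class-conditioned VAE in the spirit of VAE-C, conditioning on a rhythm label and optimizing reconstruction plus KL divergence; real EGMs are multichannel time series and the published architectures are more elaborate, so treat this as a structural sketch only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    """Toy class-conditioned VAE for fixed-length signals, conditioning on a
    rhythm label (e.g. sinus vs AF) via an embedded class vector."""
    def __init__(self, sig_dim, latent_dim=32, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 16)
        self.enc = nn.Linear(sig_dim + 16, 2 * latent_dim)   # -> mu, logvar
        self.dec = nn.Linear(latent_dim + 16, sig_dim)

    def forward(self, x, y):
        c = self.embed(y)
        mu, logvar = self.enc(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        x_hat = self.dec(torch.cat([z, c], dim=-1))
        recon = F.mse_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return x_hat, recon + kl

model = ConditionalVAE(sig_dim=2048)
x = torch.randn(8, 2048)
y = torch.randint(0, 2, (8,))
_, loss = model(x, y)
print(loss.item())
```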
[458] Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example
Arno Appenzeller, Nick Terzer, André Homeyer, Jan-Philipp Redlich, Sabine Luttmann, Friedrich Feuerhake, Nadine S. Schaadt, Timm Intemann, Sarah Teuber-Hanselmann, Stefan Nikolin, Joachim Weis, Klaus Kraywinkel, Pascal Birnstill
Main category: cs.LG
TL;DR: Automated generation of Synthea rules from cancer report statistics to create synthetic glioblastoma patient data that preserves statistical properties while ensuring privacy compliance.
Details
Motivation: Synthetic medical data generation enables privacy-compliant secondary use of healthcare data, but creating meaningful rules for tools like Synthea requires expert knowledge and realistic sample data, making the process complex and resource-intensive.Method: Developed an approach to automatically generate Synthea rules from statistics extracted from tabular cancer report data. Created a Synthea module for glioblastoma using real-world dataset statistics and generated synthetic patient data based on these automated rules.
Result: The synthetic glioblastoma dataset reproduced known disease courses and mostly retained the statistical properties of the original dataset. Synthetic data showed potential for privacy-preserving research while maintaining realistic medical patterns.
Conclusion: Automated rule generation for Synthea enables efficient creation of synthetic medical data that preserves privacy while maintaining statistical fidelity. Synthetic data is valuable for hypothesis formulation and prototype development, though medical interpretation should consider current methodological limitations.
Abstract: The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.
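A toy version of the rule-extraction idea: estimate stratified incidence from a tabular registry extract and emit a rule-like structure. The data are invented and the emitted field names only loosely gesture at a Synthea-style module; they are not the real schema.

```python
import pandas as pd

# Toy registry extract: age group, sex, and whether a glioblastoma diagnosis occurred.
records = pd.DataFrame({
    "age_group": ["40-59", "40-59", "60+", "60+", "60+", "40-59"],
    "sex":       ["M", "F", "M", "F", "M", "M"],
    "diagnosis": [0, 0, 1, 1, 0, 1],
})

# Estimate incidence per stratum, then emit a rule-like structure that a
# generator module could consume (illustrative keys, not the Synthea format).
incidence = records.groupby(["age_group", "sex"])["diagnosis"].mean()
module = {
    "name": "glioblastoma_auto",
    "states": {
        f"onset_{age}_{sex}": {"transition_probability": round(p, 3)}
        for (age, sex), p in incidence.items()
    },
}
print(module)
```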
cs.MA
[459] Ev-Trust: A Strategy Equilibrium Trust Mechanism for Evolutionary Games in LLM-Based Multi-Agent Services
Shiduo Yang, Jiye Wang, Jiayu Qin, Jianbin Li, Yu Wang, Yuanhe Zhao, Kenan Guo
Main category: cs.MA
TL;DR: Ev-Trust is an evolutionary game theory-based trust mechanism for LLM-driven multi-agent systems that integrates direct/indirect trust and expected revenue to guide agents toward cooperative equilibria while excluding malicious participants.
Details
Motivation: The Web's shift toward agent-centric paradigms using LLMs creates decentralized environments where openness and heterogeneity amplify risks of deception, fraud, and misinformation, posing challenges to trust establishment and system robustness.Method: Proposes Ev-Trust, a strategy-equilibrium trust mechanism based on evolutionary game theory. It integrates direct trust, indirect trust, and expected revenue into a dynamic feedback structure within a “Request-Response-Payment-Evaluation” service framework, using replicator dynamics equations to prove equilibrium existence and stability.
Result: Experimental results show Ev-Trust effectively reflects agent trustworthiness in LLM-driven open service interactions, reduces malicious strategies, and increases collective revenue.
Conclusion: Ev-Trust provides a new perspective on trust modeling for agentic service web in group evolutionary game scenarios, enabling adaptive strategy adjustment and natural exclusion of malicious participants while reinforcing high-quality collaboration.
Abstract: The rapid evolution of the Web toward an agent-centric paradigm, driven by large language models (LLMs), has enabled autonomous agents to reason, plan, and interact in complex decentralized environments. However, the openness and heterogeneity of LLM-based multi-agent systems also amplify the risks of deception, fraud, and misinformation, posing severe challenges to trust establishment and system robustness. To address this issue, we propose Ev-Trust, a strategy-equilibrium trust mechanism grounded in evolutionary game theory. This mechanism integrates direct trust, indirect trust, and expected revenue into a dynamic feedback structure that guides agents’ behavioral evolution toward equilibria. Within a decentralized “Request-Response-Payment-Evaluation” service framework, Ev-Trust enables agents to adaptively adjust strategies, naturally excluding malicious participants while reinforcing high-quality collaboration. Furthermore, our theoretical derivation based on replicator dynamics equations proves the existence and stability of local evolutionary equilibria. Experimental results indicate that our approach effectively reflects agent trustworthiness in LLM-driven open service interaction scenarios, reduces malicious strategies, and increases collective revenue. We hope Ev-Trust can provide a new perspective on trust modeling for the agentic service web in group evolutionary game scenarios.
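The replicator-dynamics backbone of the analysis can be sketched directly; the payoff matrix below is a toy cooperate/defect game rather than the paper's trust-adjusted payoffs, and it simply shows how above-average strategies grow in the population.

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One Euler step of replicator dynamics: strategies earning above-average
    payoff gain population share, below-average ones lose it."""
    fitness = payoff @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

# Toy 2-strategy game (cooperate vs defect); a trust mechanism would reshape
# these payoffs so that cooperation becomes the stable outcome.
payoff = np.array([[3.0, 0.5],
                   [1.0, 1.0]])
x = np.array([0.3, 0.7])            # initial shares of cooperators / defectors
for _ in range(2000):
    x = replicator_step(x, payoff)
print(x)                            # converges toward all-cooperate from this start
```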
[460] Don’t Guess, Escalate: Towards Explainable Uncertainty-Calibrated AI Forensic Agents
Giulia Boato, Andrea Montibeller, Edward Delp, Luisa Verdoliva, Daniele Miorandi
Main category: cs.MA
TL;DR: AI forensic agents as orchestrators that select/combine forensic detectors, identify provenance/context, and provide uncertainty-aware assessments for multimedia authenticity verification.
Details
Motivation: AI is transforming multimedia forensics, but current solutions have pitfalls that need addressing. There's a need for more reliable and comprehensive forensic analysis systems.Method: Propose AI forensic agents as orchestrators that: 1) select and combine multiple forensic detectors, 2) identify provenance and context, and 3) provide uncertainty-aware assessments. Introduce a unified framework to improve the authenticity verification process.
Result: A proposed framework for AI forensic agents that addresses current limitations in multimedia forensics by providing more reliable, comprehensive, and uncertainty-aware authenticity verification.
Conclusion: AI forensic agents represent a promising approach to improve multimedia forensics by orchestrating multiple detection methods, considering provenance/context, and incorporating uncertainty assessment for more reliable authenticity verification.
Abstract: AI is reshaping the landscape of multimedia forensics. We propose AI forensic agents: reliable orchestrators that select and combine forensic detectors, identify provenance and context, and provide uncertainty-aware assessments. We highlight pitfalls in current solutions and introduce a unified framework to improve the authenticity verification process.
cs.MM
[461] A Tri-Dynamic Preprocessing Framework for UGC Video Compression
Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang, Xiaodong Xie
Main category: cs.MM
TL;DR: Proposes Tri-Dynamic Preprocessing framework for UGC video encoding optimization using three adaptive components to handle variability in user-generated content.
Details
Motivation: UGC videos have become dominant in internet traffic but exhibit high variability and diverse characteristics compared to traditional test videos, challenging data-driven ML algorithms for encoding optimization in UGC scenarios.Method: Tri-Dynamic Preprocessing framework with three adaptive components: 1) adaptive factor to regulate preprocessing intensity, 2) adaptive quantization level to fine-tune codec simulator, and 3) adaptive lambda tradeoff to adjust rate-distortion loss function.
Result: Experimental results on large-scale test sets demonstrate that the method attains exceptional performance.
Conclusion: The proposed framework effectively addresses the variability challenges in UGC video encoding optimization through adaptive preprocessing techniques.
Abstract: In recent years, user generated content (UGC) has become the dominant force in internet traffic. However, UGC videos exhibit a higher degree of variability and diverse characteristics compared to traditional encoding test videos. This variance challenges the effectiveness of data-driven machine learning algorithms for optimizing encoding in the broader context of UGC scenarios. To address this issue, we propose a Tri-Dynamic Preprocessing framework for UGC. Firstly, we employ an adaptive factor to regulate preprocessing intensity. Secondly, an adaptive quantization level is employed to fine-tune the codec simulator. Thirdly, we utilize an adaptive lambda tradeoff to adjust the rate-distortion loss function. Experimental results on large-scale test sets demonstrate that our method attains exceptional performance.
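The adaptive lambda component amounts to a rate-distortion objective whose trade-off weight depends on content statistics; the scaling rule and the "activity" statistic below are guesses for illustration only, not the paper's formulation.

```python
import torch

def rd_loss(preprocessed, reference, est_bits, content_stats, base_lambda=0.01):
    """Rate-distortion objective with a content-adaptive trade-off weight:
    busier clips get a different lambda than flat ones."""
    distortion = torch.mean((preprocessed - reference) ** 2)
    lam = base_lambda * (1.0 + content_stats["activity"])   # illustrative scaling
    return distortion + lam * est_bits

frames = torch.rand(4, 3, 64, 64)
loss = rd_loss(frames * 0.98, frames, est_bits=torch.tensor(1200.0),
               content_stats={"activity": 0.7})
print(loss.item())
```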
eess.AS
[462] Learning Recursive Attenuation Filters Under Noisy Conditions
Gloria Dal Santo, Karolina Prawda, Sebastian J. Schlecht, Vesa Välimäki
Main category: eess.AS
TL;DR: The paper proposes a method for robustly tuning recursive attenuation filters in feedback delay networks when target room impulse responses contain noise, addressing issues with gradient-based optimization under low signal-to-noise conditions.
Details
Motivation: Existing differentiable digital signal processing methods for tuning recursive audio systems are sensitive to background noise in real measurements, leading to incorrect optimization results and spurious loss minima.Method: The authors examine loss landscapes of different optimization objectives and propose a method that explicitly models noise to ensure correct minima under low signal-to-noise conditions.
Result: Statistical analysis on 80 optimization examples shows that explicit noise modeling restores correct minima, and reveals sensitivity of attenuation filter parameters to frequency-independent parameter perturbations.
Conclusion: The findings provide practical guidelines for more robust and reproducible gradient-based optimization of feedback delay networks in noisy real-world conditions.
Abstract: Recursion is a fundamental concept in the design of filters and audio systems. In particular, artificial reverberation systems that use delay networks depend on recursive paths to control both echo density and the decay rate of modal components. The differentiable digital signal processing framework has shown promise in automatically tuning both recursive and non-recursive elements given a target room impulse response. This is done by applying gradient descent to loss functions based on energy-decay or spectrogram differences. However, these representations are highly sensitive to background noise, which is ubiquitous in real measurements, producing spurious loss minima and leading to incorrect attenuation. This paper addresses the problem of tuning recursive attenuation filters of a feedback delay network when targets are noisy. We examine the loss landscape associated with different optimization objectives and propose a method that ensures correct minima under low signal-to-noise conditions. We demonstrate the effectiveness of the proposed approach through statistical analysis on 80 individual optimization examples. The results reveal that explicitly modeling the noise restores correct minima. Furthermore, we identify the sensitivity of attenuation filter parameters tuning to perturbations in frequency-independent parameters. These findings provide practical guidelines for more robust and reproducible gradient-based optimization of feedback delay networks.
[463] BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Anup Singh, Kris Demuynck, Vipul Arora
Main category: eess.AS
TL;DR: Proposed noise-augmented training and optimal transport regularization for token-based spoken term detection to improve robustness and token efficiency while maintaining fast retrieval.
Details
Motivation: Token-based STD systems are efficient for voice search but struggle with robustness to noise/reverberation and inefficient token utilization, limiting their practical effectiveness.Method: 1) Noise and reverberation-augmented training strategy to improve tokenizer robustness; 2) Optimal transport-based regularization for balanced token usage and efficiency; 3) TF-IDF-based search mechanism for faster retrieval.
Result: Outperforms STD baselines across various distortion levels while maintaining high search efficiency in empirical evaluations.
Conclusion: The proposed approach effectively addresses robustness and efficiency challenges in token-based STD systems, making them more practical for real-world voice search applications.
Abstract: Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
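The TF-IDF retrieval step can reuse standard text machinery once discrete speech tokens are rendered as whitespace-separated "words"; a minimal sketch with invented token values follows (the tokenizer itself, and its noise-robust training, are not shown).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Discrete speech tokens rendered as strings so standard TF-IDF applies.
database = [
    "t12 t48 t7 t7 t91 t3",      # segment 1
    "t5 t5 t22 t48 t7 t91",      # segment 2
    "t64 t64 t2 t19 t19 t33",    # segment 3
]
query = "t48 t7 t91"

vec = TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 2))
db_mat = vec.fit_transform(database)
scores = cosine_similarity(vec.transform([query]), db_mat).ravel()
print(scores.argsort()[::-1])    # ranked segment indices for the spoken query
```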
[464] Optimizing tiny colorless feedback delay networks
Gloria Dal Santo, Karolina Prawda, Sebastian J. Schlecht, Vesa Välimäki
Main category: eess.AS
TL;DR: Optimization framework for tiny feedback delay networks (4 delay lines) reduces spectral coloration in artificial reverberation while maintaining temporal density.
Details
Motivation: Artificial reverberation algorithms suffer from spectral coloration (metallic ringing), especially with fewer delay lines, degrading perceived sound quality. Need to reduce coloration while keeping computational efficiency.Method: Uses tiny differentiable feedback delay network (4 delay lines) with optimization framework. Learns parameters including feedback matrix, input/output gains. Objective: maximize spectral flatness via spectral loss while penalizing parameter sparseness to maintain temporal density.
Result: Achieves favorable narrow distribution of modal excitation while maintaining impulse response density. Subjective assessment shows effective reduction of perceptual coloration in late reverberation. Computational savings compared to previous time-domain sparsity loss method.
Conclusion: Proposed method effectively reduces coloration in artificial reverberation with minimal delay lines. Demonstrated through applications with attenuation filters and optimizable scattering feedback matrix for smooth-sounding synthetic room impulse responses.
Abstract: A common bane of artificial reverberation algorithms is spectral coloration in the synthesized sound, typically manifesting as metallic ringing, leading to a degradation in the perceived sound quality. In delay network methods, coloration is more pronounced when fewer delay lines are used. This paper presents an optimization framework in which a tiny differentiable feedback delay network, with as few as four delay lines, is used to learn a set of parameters to iteratively reduce coloration. The parameters under optimization include the feedback matrix, as well as the input and output gains. The optimization objective is twofold: to maximize spectral flatness through a spectral loss while maintaining temporal density by penalizing sparseness in the parameter values. A favorable narrow distribution of modal excitation is achieved while maintaining the desired impulse response density. In a subjective assessment, the new method proves effective in reducing perceptual coloration of late reverberation. Compared to the author’s previous work, which serves as the baseline and utilizes a sparsity loss in the time domain, the proposed method achieves computational savings while maintaining performance. The effectiveness of this work is demonstrated through two application scenarios where smooth-sounding synthetic room impulse responses are obtained via the introduction of attenuation filters and an optimizable scattering feedback matrix.
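A rough numpy rendering of the two objective terms: spectral flatness of the impulse response and a penalty on parameter sparseness (here a Hoyer-style measure). The paper's differentiable losses and the way parameters enter the FDN differ in detail; this only illustrates the trade-off being optimized.

```python
import numpy as np

def spectral_flatness(ir, eps=1e-12):
    """Ratio of geometric to arithmetic mean of the power spectrum;
    values closer to 1 indicate less coloration."""
    p = np.abs(np.fft.rfft(ir)) ** 2 + eps
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def hoyer_sparsity(params, eps=1e-12):
    """Hoyer-style sparsity in [0, 1]: 0 for uniform magnitudes, ~1 for one-hot."""
    p = np.abs(params) + eps
    n = p.size
    return (np.sqrt(n) - p.sum() / np.sqrt((p ** 2).sum())) / (np.sqrt(n) - 1)

def colorless_loss(ir, params, sparsity_weight=0.5):
    # Maximize flatness; penalize sparse gain/feedback parameters, which would
    # thin out the temporal echo density.
    return -spectral_flatness(ir) + sparsity_weight * hoyer_sparsity(params)

rng = np.random.default_rng(0)
ir = rng.normal(size=4096) * np.exp(-np.arange(4096) / 800.0)
print(colorless_loss(ir, rng.normal(size=16)))
```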
[465] Similarity Metrics For Late Reverberation
Gloria Dal Santo, Karolina Prawda, Sebastian J. Schlecht, Vesa Välimäki
Main category: eess.AS
TL;DR: Novel differentiable metrics for late reverberation similarity in room impulse responses, outperforming general audio metrics in optimization tasks.
Details
Motivation: General audio similarity metrics are not optimized for the specific statistical properties of reverberation in rooms, creating a need for specialized metrics for reverberation algorithm tuning.Method: Developed two novel differentiable metrics based on averaged power and frequency-band energy decay for assessing late reverberation similarity in room impulse responses. These metrics can be used within machine-learning frameworks.
Result: The proposed metrics outperform two popular audio metrics on a large dataset of room impulse responses across various room configurations and microphone positions. The averaged power-based function shows the most suitable profile toward the minimum.
Conclusion: The proposed metrics offer promising improvements for the design and evaluation of reverberation similarity metrics, enhancing automatic tuning of reverberation algorithms.
Abstract: Automatic tuning of reverberation algorithms relies on the optimization of a cost function. While general audio similarity metrics are useful, they are not optimized for the specific statistical properties of reverberation in rooms. This paper presents two novel metrics for assessing the similarity of late reverberation in room impulse responses. These metrics are differentiable and can be utilized within a machine-learning framework. We compare the performance of these metrics to two popular audio metrics using a large dataset of room impulse responses encompassing various room configurations and microphone positions. The results indicate that the proposed functions based on averaged power and frequency-band energy decay outperform the baselines with the former exhibiting the most suitable profile towards the minimum. The proposed work holds promise as an improvement to the design and evaluation of reverberation similarity metrics.
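A plain-numpy sketch of a frequency-band energy-decay comparison via Schroeder backward integration per band; the proposed metrics are differentiable and defined differently in detail, and the band choices here are arbitrary.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def edc_db(ir, eps=1e-12):
    """Schroeder backward integration, in dB, normalised to 0 dB at the start."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / (energy[0] + eps) + eps)

def band_edc_distance(ir_a, ir_b, fs=48000,
                      bands=((125, 250), (500, 1000), (2000, 4000))):
    """Toy late-reverberation dissimilarity: mean squared difference between
    frequency-band energy decay curves of two room impulse responses."""
    dists = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        dists.append(np.mean((edc_db(sosfilt(sos, ir_a))
                              - edc_db(sosfilt(sos, ir_b))) ** 2))
    return float(np.mean(dists))

rng = np.random.default_rng(0)
decay = np.exp(-np.arange(48000) / 8000.0)
print(band_edc_distance(rng.normal(size=48000) * decay,
                        rng.normal(size=48000) * decay))
```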
eess.IV
[466] Keep the Core: Adversarial Priors for Significance-Preserving Brain MRI Segmentation
Feifei Zhang, Zhenhong Jia, Sensen Song, Fei Shi, Aoxue Chen, Dayong Ren
Main category: eess.IV
TL;DR: “Keep the Core” introduces a data-centric paradigm using adversarial priors to guide augmentation and masking, preserving critical diagnostic features in medical image segmentation.
Details
Motivation: Medical image segmentation suffers from sparse pathological annotations. Existing augmentation strategies are feature-agnostic, often corrupting critical diagnostic semantics or failing to prioritize essential features needed for accurate segmentation.Method: The approach uses SAGE (Sparse Adversarial Gated Estimator) to identify minimal tokens whose perturbation flips segmentation boundaries, creating a Token Importance Map. The online KEEP module then applies two augmentation strategies: Semantic-Preserving Augmentation (restoring original values of high-importance tokens) and Guided-Masking Augmentation (masking low-importance tokens for MAE-style reconstruction).
Result: Extensive experiments show the method achieves state-of-the-art segmentation robustness and generalization on 2D medical datasets. The approach is backbone-agnostic with no inference overhead.
Conclusion: “Keep the Core” provides a novel data-centric paradigm that effectively preserves critical diagnostic features during augmentation, improving segmentation performance through structured adversarial priors and region-selective mechanisms.
Abstract: Medical image segmentation is constrained by sparse pathological annotations. Existing augmentation strategies, from conventional transforms to random masking for self-supervision, are feature-agnostic: they often corrupt critical diagnostic semantics or fail to prioritize essential features. We introduce “Keep the Core,” a novel data-centric paradigm that uses adversarial priors to guide both augmentation and masking in a significance-preserving manner. Our approach uses SAGE (Sparse Adversarial Gated Estimator), an offline module identifying minimal tokens whose micro-perturbation flips segmentation boundaries. SAGE forges the Token Importance Map $W$ by solving an adversarial optimization problem to maximally degrade performance, while an $\ell_1$ sparsity penalty encourages a compact set of sensitive tokens. The online KEEP (Key-region Enhancement & Preservation) module uses $W$ for a two-pronged augmentation strategy: (1) Semantic-Preserving Augmentation: High-importance tokens are augmented, but their original pixel values are strictly restored. (2) Guided-Masking Augmentation: Low-importance tokens are selectively masked for an $\text{MAE}$-style reconstruction, forcing the model to learn robust representations from preserved critical features. “Keep the Core” is backbone-agnostic with no inference overhead. Extensive experiments show SAGE’s structured priors and KEEP’s region-selective mechanism are highly complementary, achieving state-of-the-art segmentation robustness and generalization on 2D medical datasets.
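The guided-masking half of KEEP can be pictured as masking only the lowest-importance tokens according to the importance map; the sketch below assumes ViT-style patch tokens and an externally supplied map standing in for SAGE's output, and the masking ratio is arbitrary.

```python
import torch

def keep_guided_masking(tokens, importance, mask_ratio=0.4, mask_value=0.0):
    """Guided-masking augmentation: mask the lowest-importance tokens so that
    reconstruction must rely on the preserved, diagnostically critical regions."""
    n_mask = int(mask_ratio * tokens.size(1))
    low_idx = importance.argsort(dim=1)[:, :n_mask]          # least important tokens
    masked = tokens.clone()
    masked.scatter_(1, low_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)), mask_value)
    return masked, low_idx

tokens = torch.randn(2, 196, 64)        # (batch, tokens, dim), e.g. ViT patch tokens
importance = torch.rand(2, 196)         # stand-in for the token importance map W
masked, idx = keep_guided_masking(tokens, importance)
print(masked.shape, idx.shape)
```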
[467] BioimageAIpub: a toolbox for AI-ready bioimaging data publishing
Stefan Dvoretskii, Anwai Archit, Constantin Pape, Josh Moore, Marco Nolden
Main category: eess.IV
TL;DR: BioimageAIpub is a workflow that simplifies bioimaging data conversion and enables seamless upload to HuggingFace for easier sharing and consumption by ML tools.
Details
Motivation: Bioimage analysis requires large datasets, but existing repositories like IDR and BioImage Archive require substantial data wrangling to make their data usable by analysis tools, creating a significant time burden for researchers.Method: The paper introduces BioimageAIpub, a workflow that streamlines bioimaging data conversion and enables seamless upload to HuggingFace platform.
Result: The workflow reduces the tedious assembly and conversion of metadata that previously hindered development of more powerful bioimage analysis tools.
Conclusion: BioimageAIpub addresses the data accessibility gap in bioimaging by providing a streamlined pipeline for converting and sharing bioimaging datasets on HuggingFace, facilitating easier consumption by machine learning tools.
Abstract: Modern bioimage analysis approaches are data hungry, making it necessary for researchers to scavenge data beyond those collected within their (bio)imaging facilities. In addition to scale, bioimaging datasets must be accompanied with suitable, high-quality annotations and metadata. Although established data repositories such as the Image Data Resource (IDR) and BioImage Archive offer rich metadata, their contents typically cannot be directly consumed by image analysis tools without substantial data wrangling. Such a tedious assembly and conversion of (meta)data can account for a dedicated amount of time investment for researchers, hindering the development of more powerful analysis tools. Here, we introduce BioimageAIpub, a workflow that streamlines bioimaging data conversion, enabling a seamless upload to HuggingFace, a widely used platform for sharing machine learning datasets and models.
[468] SNIC: Synthesized Noisy Images using Calibration
Nik Bhatt
Main category: eess.IV
TL;DR: The paper presents SNIC dataset - a synthesized noisy image dataset created using calibrated heteroscedastic noise models, available in both RAW and TIFF formats, outperforming manufacturer noise models.
Details
Motivation: Advanced denoising algorithms need large, high-quality datasets, but there's limited information on calibrating heteroscedastic noise models and a lack of published datasets using them.Method: Developed methods for building high-quality heteroscedastic noise models that produce realistic synthesized noisy images in RAW and TIFF formats through proper calibration and tuning.
Result: Created SNIC dataset with over 6000 noisy images from 30 scenes across four sensors. Synthesized images achieve comparable LPIPS to real noisy images and outperform manufacturer DNG noise models in LPIPS and SOTA denoising tests.
Conclusion: The paper successfully demonstrates a process for creating realistic synthesized noisy images using calibrated heteroscedastic models, producing the first synthesized noisy image dataset available in both RAW and TIFF formats.
Abstract: Advanced denoising algorithms require large, high-quality datasets. Physics-based, statistical noise models can create such datasets by realistically simulating noise in digital images. However, there is little information on the correct way to calibrate and tune these heteroscedastic models, and a lack of published datasets using them. In this paper, we explore the process of building high-quality heteroscedastic noise models. Our methods produce realistic synthesized noisy images, in both RAW and TIFF formats. Our synthesized noisy images achieve comparable LPIPS results to real noisy images, and greatly outperform those created with manufacturer-provided DNG noise models both in LPIPS and when tested with a state-of-the-art (SOTA) denoising model. Using our approach, we created the Synthesized Noisy Images using Calibration dataset (SNIC) containing over 6000 noisy images, comprising 30 scenes from four sensors, including two smartphone sensors, a point-and-shoot, and a DSLR. SNIC is the first synthesized noisy image dataset provided in both RAW and TIFF format.
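Heteroscedastic camera noise is commonly modelled in a Poisson-Gaussian (shot plus read noise) form; a minimal synthesis sketch follows, with made-up calibration constants in place of the per-sensor values the paper derives from calibration.

```python
import numpy as np

def synthesize_noisy_raw(clean, gain=0.01, read_sigma=0.002, rng=None):
    """Heteroscedastic noise synthesis for linear RAW-like data: signal-dependent
    shot noise (variance proportional to the signal) plus signal-independent
    read noise. The gain and read-noise values would come from calibration."""
    rng = rng or np.random.default_rng()
    variance = gain * clean + read_sigma ** 2
    noisy = clean + rng.normal(0.0, np.sqrt(variance))
    return np.clip(noisy, 0.0, 1.0)

clean = np.linspace(0.0, 1.0, 10000).reshape(100, 100)   # synthetic linear ramp
noisy = synthesize_noisy_raw(clean, rng=np.random.default_rng(0))
print(float(np.mean((noisy - clean) ** 2)))
```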
[469] In search of truth: Evaluating concordance of AI-based anatomy segmentation models
Lena Giebeler, Deepa Krishnaswamy, David Clunie, Jakob Wasserthal, Lalith Kumar Shiyam Sundar, Andres Diaz-Pinto, Klaus H. Maier-Hein, Murong Xu, Bjoern Menze, Steve Pieper, Ron Kikinis, Andrey Fedorov
Main category: eess.IV
TL;DR: A framework for evaluating AI anatomy segmentation models without ground truth using standardized representations and visualization tools.
Details
Motivation: The growing number of AI models for anatomy segmentation creates challenges for evaluating them on datasets without ground truth annotations, necessitating practical evaluation frameworks.Method: Harmonize segmentation results into standard interoperable representations, extend 3D Slicer for loading/comparison, and use interactive summary plots and OHIF Viewer for visualization. Applied to evaluate 6 models on 31 anatomical structures in CT scans from NLST dataset.
Result: Framework enables automated loading, structure-wise inspection, and model comparison. Shows excellent agreement for some structures (lungs) but poor for others (vertebrae, ribs). Allows quick detection of problematic results.
Conclusion: Provides open-source resources (scripts, plots, tools) to assist model evaluation without ground truth, enabling informed model selection for anatomy segmentation tasks.
Abstract: Purpose: AI-based methods for anatomy segmentation can help automate characterization of large imaging datasets. The growing number of functionally similar models raises the challenge of evaluating them on datasets that do not contain ground truth annotations. We introduce a practical framework to assist in this task. Approach: We harmonize the segmentation results into a standard, interoperable representation, which enables consistent, terminology-based labeling of the structures. We extend 3D Slicer to streamline loading and comparison of these harmonized segmentations, and demonstrate how standard representation simplifies review of the results using interactive summary plots and browser-based visualization using OHIF Viewer. To demonstrate the utility of the approach we apply it to evaluating segmentation of 31 anatomical structures (lungs, vertebrae, ribs, and heart) by six open-source models - TotalSegmentator 1.5 and 2.6, Auto3DSeg, MOOSE, MultiTalent, and CADS - for a sample of Computed Tomography (CT) scans from the publicly available National Lung Screening Trial (NLST) dataset. Results: We demonstrate the utility of the framework in automating loading, structure-wise inspection, and comparison across models. Preliminary results ascertain practical utility of the approach in allowing quick detection and review of problematic results. The comparison shows excellent agreement segmenting some (e.g., lung) but not all structures (e.g., some models produce invalid vertebrae or rib segmentations). Conclusions: The resources developed are linked from https://imagingdatacommons.github.io/segmentation-comparison/ including segmentation harmonization scripts, summary plots, and visualization tools. This work assists in model evaluation in the absence of ground truth, ultimately enabling informed model selection.
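Without ground truth, agreement between models is itself the signal; a small sketch of a pairwise Dice concordance matrix over per-structure masks follows, using synthetic masks for illustration (the framework's harmonized, terminology-based representations are not reproduced here).

```python
import numpy as np
from itertools import combinations

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + eps)

def concordance_matrix(masks):
    """Pairwise Dice between segmentations of the same structure from different
    models; low rows flag models (or structures) that deserve manual review."""
    names = list(masks)
    m = np.eye(len(names))
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        m[i, j] = m[j, i] = dice(masks[a], masks[b])
    return names, m

rng = np.random.default_rng(0)
base = rng.random((64, 64, 64)) > 0.7
masks = {
    "model_a": base,
    "model_b": base ^ (rng.random(base.shape) > 0.98),   # near-identical
    "model_c": rng.random(base.shape) > 0.7,             # unrelated
}
names, m = concordance_matrix(masks)
print(names)
print(np.round(m, 3))
```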
[470] MCR-VQGAN: A Scalable and Cost-Effective Tau PET Synthesis Approach for Alzheimer’s Disease Imaging
Jin Young Kim, Jeremy Hudson, Jeongchul Kim, Qing Lyu, Christopher T. Whitlow
Main category: eess.IV
TL;DR: MCR-VQGAN synthesizes high-fidelity tau PET images from T1-weighted MRI to overcome limitations of conventional tau PET imaging in Alzheimer’s disease diagnosis.
Details
Motivation: Tau PET imaging is crucial for Alzheimer's disease diagnosis but faces barriers including radiation exposure, limited availability, high costs, and clinical workload. There's a need for an alternative approach to make tau imaging more accessible.Method: Proposed Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) that enhances standard VQGAN with multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM). Trained on 222 paired T1-weighted MRI and tau PET scans from ADNI and compared against cGAN, WGAN-GP, CycleGAN, and VQGAN.
Result: MCR-VQGAN achieved superior synthesis performance: MSE 0.0056±0.0061, PSNR 24.39±4.49 dB, SSIM 0.9000±0.0453. A CNN-based AD classifier showed comparable accuracy on real (63.64%) and synthetic (65.91%) images, indicating preserved diagnostic features.
Conclusion: MCR-VQGAN provides a reliable, scalable surrogate for conventional tau PET imaging, potentially improving accessibility and scalability of tau imaging biomarkers for Alzheimer’s disease research and clinical workflows.
Abstract: Tau positron emission tomography (PET) is a critical diagnostic modality for Alzheimer’s disease (AD) because it visualizes and quantifies neurofibrillary tangles, a hallmark of AD pathology. However, its widespread clinical adoption is hindered by significant challenges, such as radiation exposure, limited availability, high clinical workload, and substantial financial costs. To overcome these limitations, we propose Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) to synthesize high-fidelity tau PET images from structural T1-weighted MRI scans. MCR-VQGAN improves standard VQGAN by integrating three key architectural enhancements: multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM). Using 222 paired structural T1-weighted MRI and tau PET scans from Alzheimer’s Disease Neuroimaging Initiative (ADNI), we trained and compared MCR-VQGAN with cGAN, WGAN-GP, CycleGAN, and VQGAN. Our proposed model achieved superior image synthesis performance across all metrics: MSE of 0.0056 +/- 0.0061, PSNR of 24.39 +/- 4.49 dB, and SSIM of 0.9000 +/- 0.0453. To assess the clinical utility of the synthetic images, we trained and evaluated a CNN-based AD classifier. The classifier achieved comparable accuracy when tested on real (63.64%) and synthetic (65.91%) images. This result indicates that our synthesis process successfully preserves diagnostically relevant features without significant information loss. Our results demonstrate that MCR-VQGAN can offer a reliable and scalable surrogate for conventional tau PET imaging, potentially improving the accessibility and scalability of tau imaging biomarkers for AD research and clinical workflows.
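For readers unfamiliar with CBAM, one of the three architectural enhancements named above, here is a generic 2D PyTorch sketch of the standard channel-then-spatial attention block; the paper's multi-scale, VQGAN-integrated variant (and any 3D adaptation) will differ from this illustration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)          # channel reweighting
        spatial_in = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))        # spatial reweighting
```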
[471] Single-View Tomographic Reconstruction Using Learned Primal Dual
Sean Breckling, Matthew Swan, Keith D. Tan, Derek Wingard, Brandon Baldonado, Yoohwan Kim, Ju-Yeon Jo, Evan Scott, Jordan Pillow
Main category: eess.IV
TL;DR: LPD method evaluated for single-view tomographic reconstruction of axially-symmetric targets in two X-ray modalities, outperforming traditional numerical methods.
Details
Motivation: To investigate the performance of Learned Primal Dual (LPD) method in extreme single-view tomographic reconstruction scenarios, particularly for axially-symmetric targets where traditional methods struggle with limited data.Method: Evaluated LPD on two X-ray modalities: 1) low-divergence/parallel X-rays, 2) cone-beam X-ray imaging. Generated training data using closed-form integral transforms or physics-based ray-tracing software, then corrupted with blur and noise. Compared against common numerical inversion methodologies.
Result: LPD showed promising results for single-view tomographic reconstructions of axially-symmetric targets, demonstrating effectiveness even in extreme limited-view scenarios where traditional methods face challenges.
Conclusion: The Learned Primal Dual method is effective for single-view tomographic reconstruction of axially-symmetric targets, outperforming conventional numerical inversion methods in challenging acquisition scenarios.
Abstract: The Learned Primal Dual (LPD) method has shown promising results in various tomographic reconstruction modalities, particularly under challenging acquisition restrictions such as limited viewing angles or a limited number of views. We investigate the performance of LPD in a more extreme case: single-view tomographic reconstructions of axially-symmetric targets. This study considers two modalities: the first assumes low-divergence or parallel X-rays. The second models a cone-beam X-ray imaging testbed. For both modalities, training data is generated using closed-form integral transforms, or physics-based ray-tracing software, then corrupted with blur and noise. Our results are then compared against common numerical inversion methodologies.
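For orientation, the sketch below shows an unrolled Learned Primal Dual network in PyTorch, simplified to single-channel primal/dual memory for brevity; `forward_op` and `adjoint_op` stand in for differentiable projection and back-projection operators and are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LearnedPrimalDual(nn.Module):
    """Unrolled Learned Primal Dual, simplified to single-channel primal/dual iterates."""

    def __init__(self, forward_op, adjoint_op, n_iters: int = 10, width: int = 32):
        super().__init__()
        self.A, self.At, self.n_iters = forward_op, adjoint_op, n_iters

        def block(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, width, 3, padding=1), nn.PReLU(),
                nn.Conv2d(width, 1, 3, padding=1))

        self.dual_nets = nn.ModuleList([block(3) for _ in range(n_iters)])
        self.primal_nets = nn.ModuleList([block(2) for _ in range(n_iters)])

    def forward(self, y):
        h = torch.zeros_like(y)            # dual iterate (projection space)
        f = torch.zeros_like(self.At(y))   # primal iterate (image space)
        for dual_net, primal_net in zip(self.dual_nets, self.primal_nets):
            h = h + dual_net(torch.cat([h, self.A(f), y], dim=1))    # learned dual update
            f = f + primal_net(torch.cat([f, self.At(h)], dim=1))    # learned primal update
        return f
```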
[472] Convergent Primal-Dual Plug-and-Play Image Restoration: A General Algorithm and Applications
Yodai Suzuki, Ryosuke Isono, Shunsuke Ono
Main category: eess.IV
TL;DR: A deep plug-and-play algorithm with theoretical convergence guarantees is proposed, combining PnP with primal-dual splitting to efficiently handle various image restoration problems while ensuring convergence under reasonable assumptions.
Details
Motivation: Existing PnP methods for image restoration often lack theoretical convergence guarantees under realistic assumptions, leading to inconsistent behavior. Even when convergence is guaranteed, methods are typically designed for specific settings or require high computational costs for handling non-quadratic data-fidelity terms and constraints common in image restoration.Method: Integrates the PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology for solving convex optimization problems. Develops a general convergent PnP framework that establishes theoretical conditions for convergence under reasonable assumptions, showing the problem is a monotone inclusion problem rather than standard convex optimization.
Result: The proposed approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence. Numerical experiments on specific image restoration tasks validate the practicality and effectiveness of the theoretical results.
Conclusion: The paper presents a general deep PnP algorithm with theoretical convergence guarantees that addresses limitations of existing PnP methods, providing both theoretical rigor and practical effectiveness for image restoration tasks.
Abstract: We propose a general deep plug-and-play (PnP) algorithm with a theoretical convergence guarantee. PnP strategies have demonstrated outstanding performance in various image restoration tasks by exploiting the powerful priors underlying Gaussian denoisers. However, existing PnP methods often lack theoretical convergence guarantees under realistic assumptions due to their ad-hoc nature, resulting in inconsistent behavior. Moreover, even when convergence guarantees are provided, they are typically designed for specific settings or require a considerable computational cost in handling non-quadratic data-fidelity terms and additional constraints, which are key components in many image restoration scenarios. To tackle these challenges, we integrate the PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology for solving a wide range of convex optimization problems, and develop a general convergent PnP framework. Specifically, we establish theoretical conditions for the convergence of the proposed PnP algorithm under a reasonable assumption. Furthermore, we show that the problem solved by the proposed PnP algorithm is not a standard convex optimization problem but a more general monotone inclusion problem, where we provide a mathematical representation of the solution set. Our approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence. Numerical experiments on specific image restoration tasks validate the practicality and effectiveness of our theoretical results.
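To make the PnP-PDS idea concrete, here is a generic sketch of a primal-dual splitting loop in which the proximal step of the prior is replaced by a Gaussian denoiser. The step sizes, operators, and `prox_h_conj` are placeholders; the paper's exact scheme, convergence conditions, and monotone-inclusion analysis are not reproduced here.

```python
import numpy as np

def pnp_pds(y, A, At, denoiser, prox_h_conj, tau, gamma, n_iters=200):
    """Generic PnP primal-dual splitting loop (sketch only).

    Conceptually targets  min_x  prior(x) + h(A x), with the proximal step of the
    prior replaced by a Gaussian denoiser; A/At are the measurement operator and
    its adjoint, and prox_h_conj is the prox of the convex conjugate of h.
    """
    x = At(y)                      # primal iterate (image)
    v = np.zeros_like(y)           # dual iterate (measurement space)
    for _ in range(n_iters):
        x_next = denoiser(x - tau * At(v))                      # denoiser in place of a prox
        v = prox_h_conj(v + gamma * A(2 * x_next - x), gamma)   # dual ascent step
        x = x_next
    return x
```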
[473] StructDiff: Structure-aware Diffusion Model for 3D Fine-grained Medical Image Synthesis
Jiahao Xia, Yutao Hu, Yaolei Qi, Zhenliang Li, Wenqi Shao, Junjun He, Ying Fu, Longjiang Zhang, Guanyu Yang
Main category: eess.IV
TL;DR: StructDiff is a structure-aware diffusion model for fine-grained 3D medical image synthesis that addresses data scarcity by generating anatomically accurate images with complex topological details.
Details
Motivation: Existing generative models for medical imaging focus on synthesizing whole-organ or large-tissue structures but struggle with reproducing fine-grained anatomical details due to topological consistency requirements and complex 3D morphological heterogeneity.Method: Proposes StructDiff with: 1) Paired image-mask template guidance for structural constraints and mask-to-image correspondence, 2) Mask Generation Module to enrich mask diversity, and 3) Confidence-aware Adaptive Learning strategy based on Skip-Sampling Variance to mitigate uncertainty from synthetic data.
Result: Extensive experiments show StructDiff achieves state-of-the-art performance in topological consistency and visual realism, and significantly boosts downstream segmentation performance.
Conclusion: StructDiff effectively addresses fine-grained 3D medical image synthesis challenges and improves downstream task performance, with code to be released upon acceptance.
Abstract: Solving medical imaging data scarcity through semantic image generation has attracted growing attention in recent years. However, existing generative models mainly focus on synthesizing whole-organ or large-tissue structures, showing limited capability in reproducing fine-grained anatomical details. Due to the stringent requirement of topological consistency and the complex 3D morphological heterogeneity of medical data, accurately reconstructing fine-grained anatomical details remains a significant challenge. To address these limitations, we propose StructDiff, a Structure-aware Diffusion Model for fine-grained 3D medical image synthesis, which enables precise generation of topologically complex anatomies. In addition to the conventional mask-based guidance, StructDiff further introduces a paired image-mask template to guide the generation process, providing structural constraints and offering explicit knowledge of mask-to-image correspondence. Moreover, a Mask Generation Module (MGM) is designed to enrich mask diversity and alleviate the scarcity of high-quality reference masks. Furthermore, we propose a Confidence-aware Adaptive Learning (CAL) strategy based on Skip-Sampling Variance (SSV), which mitigates uncertainty introduced by imperfect synthetic data when transferring to downstream tasks. Extensive experiments demonstrate that StructDiff achieves state-of-the-art performance in terms of topological consistency and visual realism, and significantly boosts downstream segmentation performance. Code will be released upon acceptance.
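As background for the diffusion component, the sketch below shows a generic conditioned noise-prediction training loss in which the conditioning channels (e.g., the target mask plus an image-mask template pair) are concatenated to the noisy input; this is illustrative only and is not StructDiff's exact objective or guidance mechanism.

```python
import torch
import torch.nn.functional as F

def conditional_diffusion_loss(eps_net, x0, cond, alphas_cumprod):
    """Generic conditioned noise-prediction loss (illustrative, not StructDiff's objective).

    x0   : (B, 1, D, H, W) clean volumes
    cond : (B, C, D, H, W) conditioning channels, e.g. target mask plus an
           image-mask template pair concatenated along the channel axis
    """
    alphas_cumprod = alphas_cumprod.to(x0.device)
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise      # forward (noising) process
    return F.mse_loss(eps_net(torch.cat([x_t, cond], dim=1), t), noise)
```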
[474] TransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation
Akwasi Asare, Mary Sagoe, Justice Williams Asare, Stephen Edward Moore
Main category: eess.IV
TL;DR: TransUNet-based model for diabetic foot ulcer segmentation achieves strong performance on internal validation (Dice=0.8886) and demonstrates robust zero-shot transferability to external datasets.
Details
Motivation: Automated DFU segmentation is crucial for clinical diagnosis and monitoring, but challenging due to heterogeneous appearance, irregular morphology, and complex backgrounds. Traditional CNNs like U-Net struggle with long-range dependencies due to limited receptive fields.Method: Used TransUNet architecture combining Vision Transformers’ global attention with U-Net’s localization. Trained on FUSeg dataset with robust augmentation and hybrid loss to address class imbalance.
Result: Internal validation: Dice=0.8886. External validation: AZH dataset Dice=0.6209, Medetec dataset Dice=0.7850 (zero-shot transfer). Strong clinical utility with Pearson r=0.9749 between predicted and ground-truth wound areas.
Conclusion: TransUNet effectively integrates global and local feature extraction, providing reliable, effective, and explainable automated foot ulcer assessment with strong generalizability to unseen clinical domains.
Abstract: Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we employ the TransUNet architecture, a hybrid framework that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net structure. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution. We trained the model on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset using a robust augmentation pipeline and a hybrid loss function to mitigate class imbalance. On the internal validation set, the model achieved a Dice Similarity Coefficient (F1-score) of 0.8886 using an optimized threshold of 0.4843. Crucially, to assess generalizability, we performed external validation on two independent datasets: the AZH Wound Care Center dataset (n=278) and the Medetec dataset (n=152). Without any retraining, the model achieved Dice scores of 0.6209 and 0.7850, respectively, demonstrating robust zero-shot transferability to unseen clinical domains. Furthermore, clinical utility analysis revealed a strong correlation (Pearson r = 0.9749) between predicted and ground-truth wound areas. These outcomes demonstrate that our approach effectively integrates global and local feature extraction, offering a reliable, effective, and explainable solution for automated foot ulcer assessment.
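The abstract mentions a hybrid loss for class imbalance without specifying it; a common choice in wound segmentation is a weighted sum of binary cross-entropy and soft Dice, sketched below as an assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, bce_weight=0.5, eps=1e-6):
    """Weighted sum of BCE and soft Dice loss, a common remedy for foreground/background imbalance.

    logits, targets : tensors of shape (B, 1, H, W); targets are binary masks.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets.float())
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2 * intersection + eps) / (union + eps)
    return bce_weight * bce + (1 - bce_weight) * (1 - dice.mean())
```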
[475] Team Westwood Solution for MIDOG 2025 Challenge: An Ensemble-CNN-Based Approach For Mitosis Detection And Classification
Tengyou Xu, Haochen Yang, Xiang ‘Anthony’ Chen, Hongyan Gu, Mohammad Haeri
Main category: eess.IV
TL;DR: Team Westwood’s solution for MIDOG 2025 challenge uses nnUNetV2 for initial mitosis detection screening, followed by random forest ensembles of EfficientNet models for both mitosis detection and atypical mitosis classification.
Details
Motivation: To develop an effective solution for the MItosis DOmain Generalization (MIDOG) 2025 challenge, addressing both mitosis detection and atypical mitosis classification tasks in histopathology images.Method: Two-stage approach: 1) For mitosis detection: nnUNetV2 for initial candidate screening with high sensitivity, followed by random forest classifier ensembling predictions from three CNNs (EfficientNet-b3, EfficientNet-b5, EfficientNetV2-s). 2) For atypical mitosis classification: random forest classifier ensembling predictions from three CNNs (EfficientNet-b3, EfficientNet-b5, InceptionV3).
Result: Preliminary test set: F1 score of 0.7450 for mitosis detection, balanced accuracy of 0.8722 for atypical mitosis classification. Final test set: F1 score of 0.6972 for mitosis detection, balanced accuracy of 0.8242 for atypical mitosis classification.
Conclusion: The proposed ensemble approach combining nnUNetV2 for initial screening and random forest classifiers with multiple CNN backbones demonstrates competitive performance in both mitosis detection and atypical mitosis classification tasks in the MIDOG 2025 challenge.
Abstract: This abstract presents our solution (Team Westwood) for mitosis detection and atypical mitosis classification in the MItosis DOmain Generalization (MIDOG) 2025 challenge. For mitosis detection, we trained an nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by a random forest classifier ensembling predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For the atypical mitosis classification, we trained another random forest classifier ensembling the predictions of three CNNs: EfficientNet-b3, EfficientNet-b5, and InceptionV3. On the preliminary test set, our solution achieved an F1 score of 0.7450 for track 1 mitosis detection, and a balanced accuracy of 0.8722 for track 2 atypical mitosis classification. On the final test set, our solution achieved an F1 score of 0.6972 for track 1 mitosis detection, and a balanced accuracy of 0.8242 for track 2 atypical mitosis classification.
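The second stage of both tracks is a random forest over CNN scores, i.e., a stacking ensemble. A minimal scikit-learn sketch is shown below, where the feature matrix is simply each CNN's per-candidate probability; the hyperparameters and function names are illustrative, not the team's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_ensemble(cnn_probs: dict[str, np.ndarray], labels: np.ndarray) -> RandomForestClassifier:
    """Stack per-candidate probabilities from several CNNs into a random forest.

    cnn_probs : maps model name -> (n_candidates,) probability vector
    labels    : (n_candidates,) binary ground-truth labels for training candidates
    """
    features = np.column_stack([cnn_probs[name] for name in sorted(cnn_probs)])
    forest = RandomForestClassifier(n_estimators=300, random_state=0)
    forest.fit(features, labels)
    return forest
```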
[476] Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements
Tom Sprunck, Marcelo Pereyra, Tobias Liaudat
Main category: eess.IV
TL;DR: Proposes unsupervised model selection and misspecification detection for Bayesian imaging using Bayesian cross-validation with data fission, compatible with modern samplers like diffusion and plug-and-play.
Details
Motivation: Existing unsupervised model evaluation methods are unsuitable for computational imaging due to high computational cost and incompatibility with modern image priors defined via machine learning models.Method: Combines Bayesian cross-validation with data fission (randomized measurement splitting) for unsupervised model selection and misspecification detection, compatible with any Bayesian imaging sampler including diffusion and plug-and-play samplers.
Result: Demonstrates excellent selection and detection accuracy with low computational cost through experiments with various scoring rules and types of model misspecification.
Conclusion: Provides a general methodology for unsupervised model evaluation in Bayesian imaging sciences that addresses limitations of existing methods while maintaining compatibility with modern imaging techniques.
Abstract: Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
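Data fission, the measurement-splitting step, has a simple closed form in the Gaussian case: a single noisy measurement is split into two independent, noisier copies, one used for inference and one held out for validation. A sketch under the i.i.d. Gaussian noise assumption follows (the paper treats more general settings); the parameter tau trades noise between the two copies.

```python
import numpy as np

def gaussian_data_fission(y, sigma, tau=1.0, rng=None):
    """Split a measurement y ~ N(mu, sigma^2 I) into two independent copies.

    With Z ~ N(0, sigma^2 I):  y1 = y + tau*Z  and  y2 = y - Z/tau  are independent,
    with y1 ~ N(mu, (1 + tau^2) sigma^2 I) and y2 ~ N(mu, (1 + 1/tau^2) sigma^2 I).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(0.0, sigma, size=np.shape(y))
    return y + tau * z, y - z / tau
```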
[477] Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images
Pandega Abyan Zumarsyah, Igi Ardiyanto, Hanung Adi Nugroho
Main category: eess.IV
TL;DR: Developed meta-learners for few-shot weakly-supervised segmentation of optic disc and cup using Omni meta-training and sparsification techniques, achieving state-of-the-art performance with minimal labeled data.
Details
Motivation: Address the challenge of optic disc and optic cup segmentation for glaucoma diagnosis with limited labeled fundus images, overcoming data scarcity in medical imaging.Method: Introduced Omni meta-training to balance data usage and diversify shot numbers, developed efficient versions to reduce computational costs, and created sparsification techniques for customizable scribbles and sparse labels.
Result: Efficient Omni ProtoSeg (EO-ProtoSeg) achieved IoU scores of 88.15% for OD and 71.17% for OC on REFUGE with just one sparsely labeled image, outperforming few-shot and semi-supervised methods requiring more labeled data.
Conclusion: EO-ProtoSeg provides an effective, lightweight solution for few-shot weakly-supervised segmentation that’s comparable to unsupervised domain adaptation methods but with fewer parameters and no retraining requirements.
Abstract: This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78% for OC on DRISHTI-GS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.
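ProtoSeg-style meta-learners classify query pixels by similarity to class prototypes pooled from the sparsely labeled support image. Below is a minimal single-shot sketch, assuming a shared feature encoder and a binary scribble-derived support mask; all names are illustrative and the paper's Omni meta-training and efficiency modifications are not reflected here.

```python
import torch
import torch.nn.functional as F

def prototype_segment(support_feats, support_mask, query_feats):
    """Prototype-based binary segmentation (ProtoSeg-style sketch, illustrative only).

    support_feats, query_feats : (C, H, W) feature maps from a shared encoder
    support_mask               : (H, W) float mask of the target structure (from sparse labels)
    """
    # Masked average pooling gives one foreground and one background prototype.
    fg = (support_feats * support_mask).sum(dim=(1, 2)) / (support_mask.sum() + 1e-6)
    bg = (support_feats * (1 - support_mask)).sum(dim=(1, 2)) / ((1 - support_mask).sum() + 1e-6)
    prototypes = torch.stack([bg, fg])                       # (2, C)
    # Classify each query pixel by cosine similarity to the prototypes.
    q = F.normalize(query_feats.flatten(1).t(), dim=1)       # (H*W, C)
    p = F.normalize(prototypes, dim=1)                       # (2, C)
    logits = q @ p.t()                                       # (H*W, 2)
    return logits.argmax(dim=1).reshape(query_feats.shape[1:])  # (H, W) label map
```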